- I finally cleaned up and published my latest evaluation of [JPEG, WebP, and AVIF](https://alanorth.github.io/improved-dspace-thumbnails/evaluating-jpeg-webp-avif.html)
- I [filed an issue on DSpace](https://github.com/DSpace/DSpace/issues/8849) to track this
## 2023-05-17
- Re-sync CGSpace to DSpace 7 Test
- I came up with a naive patch to use WebP instead of JPEG in the DSpace ImageMagick filter, and it works, but doesn't replace existing JPEGs... hmmm
- Also, it does PDF to WebP to WebP haha
## 2023-05-18
- I created a [pull request](https://github.com/DSpace/DSpace/pull/8850) to improve some minor documentation, typo, and logic issues in the DSpace ImageMagick thumbnail filters
- I realized that there is a quick win to the generation loss issue with ImageMagickThumbnailFilter
- We can use ImageMagick's internal MIFF instead of JPEG when writing the intermediate image
- According to the libvips author, [PNG is very slow](https://github.com/libvips/libvips/issues/571)!
- I re-ran my `generation-loss.sh` script using MIFF and found that it had essentially the same results as PNG: about 1.1 points higher on the ssimulacra2 (v2.1) scoring scale than with the JPEG intermediate
- Also, my tests with the cosmo `rusage.com` utility show that MIFF is indeed much faster than PNG
- I updated my pull request to add this quick win
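The idea behind the quick win can be sketched on the command line. This is only an illustration, not the DSpace filter itself: it assumes ImageMagick is installed, uses ImageMagick's built-in `rose:` test image as a stand-in for a rendered PDF page, and the temporary paths and the 50x50 size are made up for the example.

```shell
# Pick whichever ImageMagick entry point is installed (v7 `magick`, v6 `convert`)
if command -v magick >/dev/null 2>&1; then
    im=magick
elif command -v convert >/dev/null 2>&1; then
    im=convert
else
    im=""
fi

workdir=$(mktemp -d)
intermediate="$workdir/intermediate.miff"   # hypothetical path
thumbnail="$workdir/thumbnail.jpg"          # hypothetical path

if [ -n "$im" ]; then
    # Step 1: write the intermediate image as MIFF, so no lossy encoding happens here
    # (rose: is a built-in test image, standing in for the first page of a PDF)
    "$im" rose: "$intermediate"
    # Step 2: the only lossy encode is the final JPEG thumbnail
    "$im" "$intermediate" -thumbnail 50x50 "$thumbnail"
    ls -l "$workdir"
else
    echo "ImageMagick not found, skipping" >&2
fi
```

Writing the intermediate as JPEG instead would mean the final thumbnail is encoded lossily twice, which is where the generation loss comes from.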
- Weekly CG Core types meeting
- Low attendance so I just kept working on the spreadsheet
```console
| sed 1d > /tmp/dois-for-cc-items-with-old-bitstreams.txt
```
- This results in 262 items that have DOIs that are CC-BY (but not ND)
- This is a good starting point, but misses some that had low-quality thumbnails uploaded after they were added (i.e., there's no record of a bitstream in the provenance field)
- I ran the list through my Sci-Hub download script and filtered out a few that downloaded invalid PDFs (manually), then generated thumbnails for all of them:
- Then I joined the CSVs on the DOI column, filtered out any that we didn't find PDFs for, and formatted the resulting CSV with id, filename, and bundle columns:
```console
$ csvjoin -c doi bitstreams.csv /tmp/items-with-old-bitstreams.csv \
| csvgrep -c filename --invert-match -r '^$' \
| sed '1s/dspace_object_id/id/' \
| csvcut -c id,filename \
| sed -e '1s/^\(.*\)$/\1,bundle/' -e '2,$s/^\(.*\)$/\1.jpg__description:libvips thumbnail,THUMBNAIL/' > new-thumbnails.csv
```
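To make the two `sed` expressions at the end of that pipeline easier to follow, here is a small sketch that runs them on a made-up two-line CSV. The id, the file name, and the temporary paths are hypothetical; only the `sed` expressions come from the pipeline above.

```shell
workdir=$(mktemp -d)
sample="$workdir/sample.csv"            # hypothetical stand-in for the csvcut output
result="$workdir/new-thumbnails.csv"    # hypothetical output path

# A header plus one made-up data row
printf 'id,filename\n0000-aaaa,example.pdf\n' > "$sample"

# Expression 1: append a "bundle" column name to the header (line 1)
# Expression 2: on every data row (line 2 to the end), append the .jpg suffix,
#               the description text, and the THUMBNAIL bundle name
sed -e '1s/^\(.*\)$/\1,bundle/' \
    -e '2,$s/^\(.*\)$/\1.jpg__description:libvips thumbnail,THUMBNAIL/' \
    "$sample" > "$result"

cat "$result"
```

With this input the header gains a third column and the data row ends in `example.pdf.jpg__description:libvips thumbnail,THUMBNAIL`.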
- I did a dry run with `ilri/post_bitstreams.py` and it seems that most (all?) already have thumbnails from the last time I did a massive Sci-Hub check
- So the provenance field is not very reliable, it seems, and that was a waste of two hours...
- I did discover, while originally posting WebP thumbnails, that the format doesn't seem to be set correctly when uploading WebP via the REST API (it is set to Unknown), but it does work when uploading via XMLUI
- POSTing a JPG to the THUMBNAIL bundle sets the format to JPEG...
- I am guessing that is a bug that I won't bother troubleshooting since the DSpace 6.x REST API is deprecated
- Export CGSpace again to do some major cleanups in OpenRefine
- I found a few countries that are in the ISO 3166-1 and UN M.49 lists but not in ours, so I added them to the list in `input-forms.xml` and regenerated the controlled vocabularies for the CGSpace Submission Guidelines
- There were a handful of issues with ISSNs, ISBNs, DOIs, access status, licenses, and missing CGIAR Trust Fund donors for Initiatives outputs
- This was about 455 items
- Helping the Alliance web team understand the DSpace REST API for determining which collection an item belongs to