mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-16 11:57:03 +01:00
9.1 KiB
9.1 KiB
title | date | author | categories | |
---|---|---|---|---|
May, 2023 | 2023-05-03T08:53:36+03:00 | Alan Orth |
|
2023-05-03
- Alliance's TIP team emailed me to ask about issues authenticating on CGSpace
- It seems their password expired, which is annoying
- I continued looking at the CGSpace subjects for the FAO / AGROVOC exercise that I started last week
- There are many of our subjects that would match if they added a "-" like "high yielding varieties" or used singular...
- Also I found at least two spelling mistakes, for example "decison support systems", which would match if it was spelled correctly
- Work on cleaning, proofing, and uploading twenty-seven records for IFPRI to CGSpace
- I notice there are a few dozen locks from the
dspaceWeb
pool that are five days old on CGSpace so I killed them
$ psql < locks-age.sql | grep " days " | awk -F"|" '{print $10}' | sort -u | xargs kill
2023-05-04
- Sync DSpace Test with CGSpace
- I replaced one item's thumbnail with a WebP version and XMLUI displays it fine
- I spent some time checking the CMYK issue with Arch's ImageMagick 7 and the Docker container and I think ImageMagick 7 just handles CMYK wrong...
- libvips does it correctly automatically and looks closer to the PDF
- Meeting about CG Core types
2023-05-10
- Write a script to find the
metadata_field_id
values associated with the non-AGROVOC subjects I am working on for Sara- This is useful because we want to know who to contact for a definition
- The script was:
while read -r subject; do
metadata_field_id=$(psql -h localhost -U postgres -d dspacetest -qtAX <<SQL
SELECT DISTINCT(metadata_field_id) FROM metadatavalue WHERE LOWER(text_value)='$subject'
SQL
)
metadata_field_id=$(echo $metadata_field_id | sed 's/[[:space:]]/||/g')
echo "$subject,$metadata_field_id"
done < <(csvcut -c 1 ~/Downloads/2023-04-26\ CGIAR\ non-AGROVOC\ subjects.csv | sed 1d)
- I also realized that Bernard Bett didn't have any items on CGSpace tagged with his ORCID identifier, so I tagged 230!
2023-05-11
- CG Core meeting
- Finalize looking at the CGSpace non-AGROVOC subjects for FAO
2023-05-12
- Export the Alliance community to do some country/region fixes
- I also sent Maria and Francesca the export because they want to add more regions and subregions
- Export the entire CGSpace to check for missing Initiative collection mappings
- I also adding missing regions
2023-05-16
- I finally cleaned up and published my latest evaluation of JPEG, WebP, and AVIF
- I filed an issue on DSpace to track this
2023-05-17
- Re-sync CGSpace to DSpace 7 Test
- I came up with a naive patch to use WebP instead of JPEG in the DSpace ImageMagick filter, and it works, but doesn't replace existing JPEGs... hmmm
- Also, it does PDF to WebP to WebP haha
2023-05-18
- I created a pull request to improve some minor documentation, typo, and logic issues in the DSpace ImageMagick thumbnail filters
- I realized that there is a quick win to the generation loss issue with ImageMagickThumbnailFilter
- We can use ImageMagick's internal MIFF instead of JPEG when writing the intermediate image
- According to the libvips author PNG is very slow!
- I re-ran my
generation-loss.sh
script using MIFF and found that it had essentially the same results as PNG, which is about 1.1 points higher on the ssimulacra2 (v2.1) scoring scale - Also, according to my tests with the cosmo rusage.com utility, I see that MIFF is indeed much faster than PNG
- I updated my pull request to add this quick win
- Weekly CG Core types meeting
- Low attendance so I just kept working on the spreadsheet
- We are at the stage of voting on definitions
2023-05-19
- I ported a few of the minor ImageMagick Thumbnail Filter improvements to our
6_x-prod
branch
2023-05-20
- I deployed the latest thumbnail changes on CGSpace, ran all updates, and rebooted it
- I exported CGSpace to check for missing Initiative mappings
- Then I started a harvest on AReS
2023-05-23
- Help Francesca with an import of a journal article with a few hundred authors
- I used the DSpace 7 live import from PubMed
- I also noticed a bug in the CrossRef live import if you change the DOI field, so I filed an issue
2023-05-25
- Meeting on output types
- Make a pull request on DSpace to capture publisher during live import from Crossref
2023-05-26
- Make a pull request on DSpace to update checkstyle
- Make a pull request on DSpace-angular to fix an incorrect i18n UI string
- I'm experimenting with replacing old thumbnails
- In the past we used to upload thumbnails for journal covers, but those were low quality and look horrible now
- Using the provenance field I want to identify items with 1 bitstream of type gif or jpg, then extract the item IDs along with DOIs:
\COPY (SELECT
text_value,
dspace_object_id
FROM
metadatavalue
WHERE
dspace_object_id IN (
SELECT
dspace_object_id
FROM
metadatavalue
WHERE
metadata_field_id = 28
AND place = 0
AND (text_value LIKE '%No. of bitstreams: 1%'
AND text_value SIMILAR TO '%.(gif|jpg|jpeg)%'))
AND metadata_field_id = 220) TO /tmp/items-with-old-bitstreams.csv WITH CSV HEADER;
- I extract the DOIs and look them up on CrossRef to see which are CC-BY, then extract those:
$ csvcut -c text_value /tmp/items-with-old-bitstreams.csv | sed 1d > /tmp/dois.txt
$ ./ilri/crossref_doi_lookup.py -i /tmp/dois.txt -e fuuu@example.com -o /tmp/dois-resolved.csv
$ csvgrep -c license -m 'creativecommons' /tmp/dois-resolved.csv \
| csvgrep -c license -m 'by-nc-nd' --invert-match \
| csvcut -c doi \
| sed '2,$s_^\(.*\)$_https://doi.org/\1_' \
| sed 1d > /tmp/dois-for-cc-items-with-old-bitstreams.txt
- This results in 262 items that have DOIs that are CC-BY (but not ND)
- This is a good starting point, but misses some that had low-quality thumbnails uploaded after they were added (ie, there's no record of a bitstream in the provenance field)
- I ran the list through my Sci-Hub download script and filtered out a few that downloaded invalid PDFs (manually), then generated thumbnails for all of them:
$ ~/src/git/DSpace/ilri/get_scihub_pdfs.py -i /tmp/dois-for-cc-items-with-old-bitstreams.txt -o bitstreams.csv
$ chrt -b 0 vipsthumbnail *.pdf --export-profile srgb -s 600x600 -o './%s.pdf.jpg[Q=02,optimize_coding,strip]'
- Then I joined the CSVs on the DOI column, filtered out any that we didn't find PDFs for, and formatted the resulting CSV with an id, filename, and bundle column:
$ csvjoin -c doi bitstreams.csv /tmp/items-with-old-bitstreams.csv \
| csvgrep -c filename --invert-match -r '^$' \
| sed '1s/dspace_object_id/id/' \
| csvcut -c id,filename \
| sed -e '1s/^\(.*\)$/\1,bundle/' -e '2,$s/^\(.*\)$/\1.jpg__description:libvips thumbnail,THUMBNAIL/' > new-thumbnails.csv
- I did a dry run with
ilri/post_bitstreams.py
and it seems that most (all?) already have thumbnails from the last time I did a massive Sci-Hub check- So relying on the provenance field is not very reliable it seems, and that was a waste of two hours...
- I did discover, while originally posting WebP thumbnails, that the format doesn't seem to be set correctly when uploading WebP via the REST API, but it does work when uploading via XMLUI—the format is set to Unknown
- POSTing a JPG to the THUMBNAIL bundle sets the format to JPEG...
- I am guessing that is a bug that I won't bother troubleshooting since the DSpace 6.x REST API is deprecated
2023-05-27
- Export CGSpace to check for missing Initiative collection mappings
- Then I also ran the csv-metadata-quality tool on the Initiatives to do some easy fixes like country/region mapping and whitespace fixes
- Start a havest on AReS
2023-05-29
- Re-create my local PostgreSQL 14 container:
$ podman rm dspacedb14
$ podman pull docker.io/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d docker.io/postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
- Export CGSpace again to do some major cleanups in OpenRefine
- I found a few countries that are in the ISO 3166-1 and UN M.49 lists, but not in ours so I added them to the list in
input-forms.xml
and regenerated the controlled vocabularies for the CGSpace Submission Guidelines - There were a handful of issues with ISSNs, ISBNs, DOIs, access status, licenses, and missing CGIAR Trust Fund donors for Initiatives outputs
- This was about 455 items
- I found a few countries that are in the ISO 3166-1 and UN M.49 lists, but not in ours so I added them to the list in
- Helping the Alliance web team understand the DSpace REST API for determining which collection an item belongs to