December, 2022

Thu Dec 01, 2022 in Notes

2022-12-01

Fix some incorrect regions on CGSpace
- I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
Replace “East Asia” with “Eastern Asia” region on CGSpace (UN M.49 region)

CGSpace and PRMS information session with Enrico and a bunch of researchers
I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP so I sent a message to Daniel from Alliance
I startd a harvest on AReS since we’ve updated so much metadata recently

2022-12-02

File some issues related to metadata on the MEL issue tracker

2022-12-03

I downloaded a fresh copy of CLARISA’s institutions list as well as ROR’s latest dump from 2022-12-01 to check how many are matching:

$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt

Out of the box they match 26.4%, but there are many institutions with multiple languages in the text value, as well as countries in parentheses so I think it could be higher
If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:

$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt

I checked CGSpace’s top 1,000 affiliations too, first exporting from PostgreSQL:

localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;

Then cutting (tab is the default delimeter):

$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542

So that’s a 54% match for our top affiliations
I realized we should actually check affiliations and sponsors, since those are stored in separate fields
- When I add those the matches go down a bit to 45%
Oh man, I realized institutions like Université d'Abomey Calavi don’t match in ROR because they are like this in the JSON:

"name": "Universit\u00e9 d'Abomey-Calavi"

So we likely match a bunch more than 50%…
I exported a list of affiliations and donors from CGSpace for Peter to look over and send corrections

2022-12-05

First day of PRMS technical workshop in Rome
Last night I submitted a CSV import with changes to 1,500 Alliance items (adding regions) and it hadn’t completed after twenty-four hours so I canceled it
- Not sure if there is some rollback that will happen or what state the database will be in, so I will wait a few hours to see what happens before trying to modify those items again
- I started it again a few hours later with a subset of the items and 4GB of RAM instead of 2
- It completed successfully…

2022-12-07

I found a bug in my csv-metadata-quality script regarding the regions
- I was accidentally checking cg.coverage.subregion due to a sloppy regex
- This means I’ve added a few thousand UN M.49 regions to the cg.coverage.subregion field in the last few days
- I had to extract them from CGSpace and delete them using delete-metadata-values.py
My DSpace 7.x pull request to tell ImageMagick about the PDF CropBox was merged
Start a harvest on AReS

2022-12-08

While on the plane I decided to fix some ORCID identifiers, as I had seen some poorly formatted ones
- I couldn’t remember the XPath syntax so this was kinda ghetto:

$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE 'label=".*"' | sed -e 's/label="//' -e 's/"$//' > /tmp/orcid-names.txt
$ ./ilri/update-orcids.py -i /tmp/orcid-names.txt -db dspace -u dspace -p 'fuuu' -m 247

After that there were still some poorly formatted ones that my script didn’t fix, so perhaps these are new ones not in our list
- I dumped them and combined with the existing ones to resolve later:

localhost/dspace= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=247 AND text_value LIKE '%http%') to /tmp/orcid-formatting.txt;
COPY 36

I think there are really just some new ones…

$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/orcid-formatting.txt| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2022-12-08-orcids.txt 
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1907
$ wc -l /tmp/2022-12-08-orcids.txt
1939 /tmp/2022-12-08-orcids.txt

Then I applied these updates on CGSpace
Maria mentioned that she was getting a lot more items in her daily subscription emails
- I had a hunch it was related to me updating the last_modified timestamp after updating a bunch of countries, regions, etc in items
- Then today I noticed this option in dspace.cfg: eperson.subscription.onlynew
- By default DSpace sends notifications for modified items too! I’ve disabled it now…