CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

December, 2022

2022-12-01

  • Fix some incorrect regions on CGSpace
    • I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
  • Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
  • Replace “East Asia” with “Eastern Asia” region on CGSpace (UN M.49 region)
  • CGSpace and PRMS information session with Enrico and a bunch of researchers
  • I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP so I sent a message to Daniel from Alliance
  • I startd a harvest on AReS since we’ve updated so much metadata recently

2022-12-02

2022-12-03

  • I downloaded a fresh copy of CLARISA’s institutions list as well as ROR’s latest dump from 2022-12-01 to check how many are matching:
$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
  • Out of the box they match 26.4%, but there are many institutions with multiple languages in the text value, as well as countries in parentheses so I think it could be higher
  • If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
  • I checked CGSpace’s top 1,000 affiliations too, first exporting from PostgreSQL:
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
  • Then cutting (tab is the default delimeter):
$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542
  • So that’s a 54% match for our top affiliations
  • I realized we should actually check affiliations and sponsors, since those are stored in separate fields
    • When I add those the matches go down a bit to 45%
  • Oh man, I realized institutions like Université d'Abomey Calavi don’t match in ROR because they are like this in the JSON:
"name": "Universit\u00e9 d'Abomey-Calavi"
  • So we likely match a bunch more than 50%…
  • I exported a list of affiliations and donors from CGSpace for Peter to look over and send corrections

2022-12-05

  • First day of PRMS technical workshop in Rome
  • Last night I submitted a CSV import with changes to 1,500 Alliance items (adding regions) and it hadn’t completed after twenty-four hours so I canceled it
    • Not sure if there is some rollback that will happen or what state the database will be in, so I will wait a few hours to see what happens before trying to modify those items again
    • I started it again a few hours later with a subset of the items and 4GB of RAM instead of 2
    • It completed successfully…

2022-12-07

  • I found a bug in my csv-metadata-quality script regarding the regions
    • I was accidentally checking cg.coverage.subregion due to a sloppy regex
    • This means I’ve added a few thousand UN M.49 regions to the cg.coverage.subregion field in the last few days
    • I had to extract them from CGSpace and delete them using delete-metadata-values.py
  • My DSpace 7.x pull request to tell ImageMagick about the PDF CropBox was merged
  • Start a harvest on AReS

2022-12-08

  • While on the plane I decided to fix some ORCID identifiers, as I had seen some poorly formatted ones
    • I couldn’t remember the XPath syntax so this was kinda ghetto:
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE 'label=".*"' | sed -e 's/label="//' -e 's/"$//' > /tmp/orcid-names.txt
$ ./ilri/update-orcids.py -i /tmp/orcid-names.txt -db dspace -u dspace -p 'fuuu' -m 247
  • After that there were still some poorly formatted ones that my script didn’t fix, so perhaps these are new ones not in our list
    • I dumped them and combined with the existing ones to resolve later:
localhost/dspace= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=247 AND text_value LIKE '%http%') to /tmp/orcid-formatting.txt;
COPY 36
  • I think there are really just some new ones…
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/orcid-formatting.txt| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2022-12-08-orcids.txt 
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1907
$ wc -l /tmp/2022-12-08-orcids.txt
1939 /tmp/2022-12-08-orcids.txt
  • Then I applied these updates on CGSpace
  • Maria mentioned that she was getting a lot more items in her daily subscription emails
    • I had a hunch it was related to me updating the last_modified timestamp after updating a bunch of countries, regions, etc in items
    • Then today I noticed this option in dspace.cfg: eperson.subscription.onlynew
    • By default DSpace sends notifications for modified items too! I’ve disabled it now…
  • I applied 498 fixes and two deletions to affiliations sent by Peter
  • I applied 206 fixes and eighty-one deletions to donors sent by Peter
  • I tried to figure out how to authenticate to the DSpace 7 REST API
    • First you need a CSRF token, before you can even try to authenticate
    • Then you can authenticate, but I can’t get it to work:
$ curl -v https://dspace7test.ilri.org/server/api
...
dspace-xsrf-token: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4
...
$ curl -v -X POST --data "user=aorth@omg.com&password=myPassword" "https://dspace7test.ilri.org/server/authn/login" -H "X-XSRF-TOKEN: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4"
  • Start a harvest on AReS

2022-12-09

2022-12-11

  • I got LDAP authentication working on DSpace 7

2022-12-12

2022-12-13

dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count  
-----------+---------
 en_US     | 3050302
 en        |     618
           |     605
 fr        |       2
 vi        |       2
 es        |       1
           |       0
(7 rows)

dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '', NULL);
UPDATE 1223
dspace=# COMMIT;
COMMIT
  • I wrote an initial version of a script to map CGSpace items to Initiative collections based on their cg.contributor.initiative metadata
    • I am still considering if I want to add a mode to un-map items that are mapped to collections, but do not have the corresponding metadata tag

2022-12-14

  • Lots of work on PRMS related metadata issues with CGSpace
    • We noticed that PRMS uses cg.identifier.dataurl for the FAIR score, but not cg.identifier.url
    • We don’t use these consistently for datasets in CGSpace so I decided to move them to the dataurl field, but we will also ask the PRMS team to consider the normal URL field, as there are commonly other external resources related to the knowledge product there
  • I updated the move-metadata-values.py script to use the latest best practices from my other scripts and some of the helper functions from util.py
    • Then I exported a list of text values pointing to Dataverse instances from cg.identifier.url:
localhost/dspace= ☘ \COPY (SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=219 AND (text_value LIKE '%persistentId%' OR text_value LIKE '%20.500.11766.1/%')) to /tmp/data.txt;
COPY 61
  • Then I moved them to cg.identifier.dataurl on CGSpace:
$ ./ilri/move-metadata-values.py -i /tmp/data.txt -db dspace -u dspace -p 'dom@in34sniper' -f cg.identifier.url -t cg.identifier.dataurl
  • I still need to add a note to the CGSpace submission form to inform submitters about the correct field for dataset URLs
  • I finalized work on my new fix-initiative-mappings.py script
    • It has two modes:
      1. Check item metadata to see which Initiatives are tagged and then map the item if it is not yet mapped to the corresponding Initiative collection
      2. Check item collections to see which Initiatives are mapped and then unmap the item if the corresponding Initiative metadata is missing
    • The second one is disabled by default until I can get more feedback from Abenet, Michael, and others
  • After I applied a handful of collection mappings I started a harvest on AReS

2022-12-15

  • I did some metadata quality checks on the Initiatives collection, adding some missing regions and removing a few duplicate ones

2022-12-18

  • Load on the server is a bit high
    • Looking at the nginx logs I see someone from the University of Chicago (128.135.98.29) is using RStudio Desktop to query and scrape CGSpace
# grep -c 'RStudio Desktop' /var/log/nginx/access.log
5570
  • RStudio is already in the ILRI bot overrides for DSpace so it shouldn’t be causing any extra hits, but I’ll put an HTTP 403 in the nginx config to tell the user to use the REST API
  • Start a harvest on AReS

2022-12-21

  • I saw that load on CGSpace was over 20.0 for several hours
    • I saw there were some stuck locks in PostgreSQL:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    948 dspaceApi
     30 dspaceCli
   1237 dspaceWeb
  • Ah, it’s likely there is something stuck because I see the load high since yesterday at 6AM, which is 24 hours now:

CPU load day PostgreSQL locks week

  • I ran all updates and restarted the server

2022-12-22

  • I exported the Initiatives collection to check the mappings
    • My fix-initiative-mappings.py script found six items that could be mapped to new collections based on metadata
    • I am still not doing automatic unmappings though…

2022-12-23

  • I exported the Initiatives collection to check the metadata quality
    • I fixed a few errors and missing regions using csv-metadata-quality
  • Abenet and Bizu noticed some strange characters in affiliations submitted by MEL
    • They appear like so in four items currently Instituto Nacional de Investigaci�n y Tecnolog�a Agraria y Alimentaria, Spain
    • I submitted an issue on MEL’s GitHub repository

2022-12-24

  • Export the ILRI community to try to see if there were any items with Initiative metadata that are not mapped to Initiative collections
    • I found about twenty…
    • Then I did the same for the AICCRA community

2022-12-25

  • The load on the server is high and I see some seemingly stuck PostgreSQL locks from dspaceCli:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
     44 dspaceApi
     58 dspaceCli
SELECT pl.pid FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceCli'
  • And the SQL queries themselves:
postgres=# SELECT pid, state, usename, query, query_start 
FROM pg_stat_activity 
WHERE pid IN (
  SELECT pl.pid FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceCli'
);
  • For these fifty-eight locks there are only six queries running
    • Interestingly, they all started at either 04:00 or 05:00 this morning…
  • I canceled one using SELECT pg_cancel_backend(1098749); and then two of the other PIDs died, perhaps they were dependent?
    • Then I canceled the next one and the remaining ones died also
  • I exported the entire CGSpace and then ran the fix-initiative-mappings.py script, which found 124 items to be mapped
    • Getting only the items that have new mappings from the output file is currently tricky because you have to change the file to unix encoding, capture the diff output from the original, and re-add the column headers, but at least this makes the DSpace batch import have to check WAY fewer items
    • For the record, I used grep to get only the new lines:
$ grep -xvFf /tmp/orig.csv /tmp/cgspace-mappings.csv > /tmp/2022-12-25-fix-mappings.csv
  • Then I imported to CGSpace, and will start an AReS harvest once its done
    • The import process was quick but it triggered a lot of Solr updates and I see locks rising from dspaceCli again
    • After five hours the Solr updating from the metadata import wasn’t finished, so I cancelled it, and I see that the items were not mapped…
    • I split the CSV into multiple files, each with ten items, and the first one imported, but the second went on to do Solr updating stuff forever…
    • All twelve files worked except the second one, so it must be something with one of those items…
  • Now I started a harvest on AReS

2022-12-28

  • I got a notice from UptimeRobot that CGSpace was down
    • I look at the server and the load is only 3 or 4.x and looking at Munin I don’t see any system statistics that are alarming
    • PostgreSQL locks look fine, memory and DSpace sessions look fine…
    • There were a strangely high number of tuple accesses half an hour ago, and high CPU going up to then

PostgreSQL tuple access CPU day

  • And I can access the website just fine, so I guess everything is OK
  • I exported the Initiatives collection to tag missing regions…

2022-12-29

  • I exported the Initiatives collection again and I’m wondering why we have so many items with text_lang set to NULL and others when I have been periodically resetting them
    • It turns out that doing ... text_lang IN ('en', '', NULL) doesn’t properly check for values with NULL
    • We actually need to do:
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IS NULL OR text_lang IN ('en', '');
  • I updated the text lang values on CGSpace and re-exported the community
    • I fixed a bunch of invalid licenses in these items
    • Then I added mappings for another handful of items
  • I tagged ORCID identifiers for another thirty items or so
  • At 8PM I got a notice from UptimeRobot again that CGSpace was down
    • The load is still only around 2.x or 3.x, but there are a lot (and increasing) number of PostgreSQL connections and locks
    • They appear to be all from the frontend:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
   2892 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
   2950 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
   3792 dspaceWeb
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
   4460 dspaceWeb
  • I don’t see any other system statistics that look out of order…
    • DSpace sessions, network throughput, CPU, etc all seem sane…
    • And then all of a sudden, I didn’t do anything, but all the locks disappeared and I was able to access the website… WTF

2022-12-30

  • Start a harvest on AReS

2022-12-31

  • I found a bunch of items on AReS that have issue dates in 2023 which made me curious
    • Looking closer, I think all of these have been tagged incorrectly because they were published online already in 2022
    • I sent a mail to Abenet and Bizu to ask, but for sure I know that PRMS will be considering first published date as first published date, no matter if that is online or in print
    • I also added some ORCID identifiers to our list and generated thumbnails for some journal articles that were Creative Commons