--- title: "March, 2024" date: 2024-03-01T09:55:00+03:00 author: "Alan Orth" categories: ["Notes"] --- ## 2024-03-01 - Last week Bizu reported an issue with the "browse by issue date" drop down - I verified it, and suspect it could be due to missing issue dates... - It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808 - I spent some time trying to reproduce the bug affecting `onebox` fields that are configured to use external vocabularies and are not repeatable - I filed an issue: https://github.com/DSpace/dspace-angular/issues/2846 ## 2024-03-03 - I did some cleanups on abstracts, licenses, and dates from CrossRef - I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list ## 2024-03-05 - I tried a new technique to get some affiliations from Crossref using OpenRefine - First I split them and clustered, resolving a few hundred clusters out of 1500 (!) - Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work - Then I joined them with our affiliations, paying no attention to duplicates - Then I deduped them using the Jython technique I learned in 2023-02 ## 2024-03-06 - Peter sent me some more corrections for the authors that I had sent him in 2023-12 ## 2024-03-08 - IFPRI sent me their 2023 records from CONTENTdm so I started working on those - I found a way to match their ORCID identifiers in our list using Jython in OpenRefine: ```python import re with open(r"/tmp/cg-creator-identifier.txt",'r') as f : orcid_ids = [orcid_id.strip() for orcid_id in f] matched = False for orcid_id in orcid_ids: if re.search(r'.+: {}'.format(value), orcid_id): matched = True break if matched: return orcid_id else: return value ``` - I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata: ```sql UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF'); ``` - Note the use of two single quotes to escape the one in the name ## 2024-03-11 - Experimenting with moving some of my Python scripts to the DSpace 7 REST API - I need a way to get UUIDs for Handles... - Seems that I can use a Discovery query like: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864 - Then just take the first result...? - I spent some time working on the script get abstracts from CGSpace, and found a bug in my logic - I also noticed that one item had two abstracts, but the first one was blank! - Looking deeper, I found 113 blank metadata values so I deleted those: ```sql BEGIN; DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value=''; COMMIT; ``` - I also found a few dozen items with "N/A" for their citation, so I deleted those too: ```sql BEGIN; DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146; COMMIT; ``` - I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time - Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her: ```console $ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \ | csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv ``` ## 2024-03-12 - Run the duplicate checker for IFPRI 2023 batch upload ## 2024-03-13 - I found about 428 duplicates in the IFPRI 2023 batch records - Alarmingly, I found about 18 that are duplicated on CGSpace as well! - I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones - Alliance asked me to get him the Handles for items submitted by TIP that are not discoverable - I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance: ```sql SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%'; ``` ## 2024-03-14 - Looking in to reports of rate limiting of Altmetric's bot on CGSpace - I don't see any HTTP 429 responses for their user agents in any of our logs... - I tried myself on an item page and never hit a limit... ```console $ for num in {1..60}; do echo -n "Request ${num}: "; curl -s -o /dev/null -w "%{http_code}" https://dspace7test.ilri.org/items/c9b8999d-3001-42ba-a267-14f4bfa90b53 && echo; done Request 1: 200 Request 2: 200 Request 3: 200 Request 4: 200 ... Request 60: 200 ``` - All responses were HTTP 200... - In any case, I whitelisted their production IPs and told them to try again - I imported 468 of IFPRI's 2023 records that were confirmed to not be duplicates to CGSpace - I also spent some time merging metadata from 415 of the remaining 432 duplicates with the metadata for the existing items on CGSpace - This was a bit of dirty work using csvkit, xsv, and OpenRefine ## 2024-03-17 - There are 17 records from IFPRI's 2023 batch that are remaining from the 432 that I identified as already being on CGSpace - These are different in that they are duplicates on CGSpace as well, so the csvjoin failed and the metadata got messed up in my migration - I looked closer and whittled this down to 14 actual records, and spent some time working on them - I isolated 12 of these items that existed on CGSpace and added publication ranks, project identifiers, and provenance links - Now there only remain two confusing records about the Inkomati catchment ## 2024-03-18 - Checking to see how many IFPRI records we have migrated so far: ```console $ csvgrep -c 'dc.description.provenance[en_US]' -m 'Original URL from IFPRI CONTENTdm' cgspace.csv \ | csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],dc.description.provenance[en_US],dcterms.type[en_US]' \ | tee /tmp/ifpri-records.csv \ | csvstat --count 898 ``` - I finalized the remaining two on Inkomati catchment and now we are at 900! # 2024-03-19 - IWMI sent me some new author ORCID identifiers so I updated our list - Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC - First extracting all unique subjects on CGSpace: ``` localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER; COPY 28024 ``` - Then I extracted the subjects and looked them up against AGROVOC: ```console $ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv | sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt $ ./ilri/agrovoc_lookup.py -i /tmp/2024-03-19-cgspace-subjects.txt -o /tmp/2024-03-19-cgspace-subjects-results.csv ``` ## 2024-03-20 - Identify seven duplicates on CGSpace from the PRMS results and withdraw them from CGSpace ## 2024-03-21 - Look more closely at duplicates on CGSpace based on a fresh export - Using DOIs I found ~842 that occur more than once for journal articles alone, so probably around 400 duplicates - I did a handful of them, merging the metadata and withdrawing the duplicate, and decided to add `dcterms.replaces` with the handle in the original ## 2024-03-22 - Look at duplicate DOIs on CGSpace and address a dozen or so ## 2024-03-23 - Look at duplicate DOIs on CGSpace and address a dozen or so - Update Tomcat and Solr to latest versions - I had done some tests with these last week, and did a last minute test on DSpace 7 Test to make sure submission and searching worked ## 2024-03-24 - Slowly process several dozen more duplicate DOIs on CGSpace, sigh... ## 2024-03-26 - File an issue on dspace-angular about improving withdrawn item tombstones: https://github.com/DSpace/dspace-angular/issues/2880 - Merge metadata and withdraw more duplicates on CGSpace