CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2024

2024-03-01

2024-03-03

  • I did some cleanups on abstracts, licenses, and dates from Crossref
  • I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list

2024-03-05

  • I tried a new technique to get some affiliations from Crossref using OpenRefine
    • First I split them and clustered, resolving a few hundred clusters out of 1500 (!)
    • Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work
    • Then I joined them with our affiliations, paying no attention to duplicates
    • Then I deduped them using the Jython technique I learned in 2023-02
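The join-then-dedupe step can also be sketched outside OpenRefine. This is a minimal Python sketch and not the Jython technique referred to above, which these notes do not reproduce. The function name `dedupe_affiliations` and the sample values are made up for illustration, and treating values as duplicates when they match after collapsing whitespace and ignoring case is an assumption.

```python
def dedupe_affiliations(values):
    """Return values with duplicates removed, keeping first occurrences.

    Two values count as duplicates when they are equal after collapsing
    whitespace and ignoring case (an assumption for this sketch).
    """
    seen = set()
    unique = []
    for value in values:
        key = " ".join(value.split()).casefold()
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(value.strip())
    return unique


# Hypothetical sample: existing affiliations followed by new ones from Crossref
joined = [
    "International Food Policy Research Institute",
    "Wageningen University",
    "international food policy research institute ",
    "Wageningen  University",
]
print(dedupe_affiliations(joined))
```

Keeping the first occurrence means that values already in the existing list win over later spellings, provided the existing list comes first in the join.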

2024-03-06

  • Peter sent me some more corrections for the authors that I had sent him in 2023-12

2024-03-08

  • IFPRI sent me their 2023 records from CONTENTdm so I started working on those
    • I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:
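import re

# One "Name: ORCID" entry per line (format implied by the regex below)
with open(r"/tmp/cg-creator-identifier.txt", "r") as f:
    orcid_ids = [orcid_id.strip() for orcid_id in f]

# Return the full entry whose ORCID matches this cell, else leave the cell as-is
for orcid_id in orcid_ids:
    # Escape the cell value so it is matched literally, not as a regex
    if re.search(r'.+: {}'.format(re.escape(value)), orcid_id):
        return orcid_id

return value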
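For testing the same lookup outside OpenRefine, here is a minimal standalone sketch. The function name `match_orcid` and the sample entries are hypothetical, and the "Name: ORCID" line format is inferred from the regex rather than confirmed. Unlike the expression above, it anchors the pattern at the end of the entry, which is a deliberate tightening.

```python
import re


def match_orcid(value, known_identifiers):
    """Return the full "Name: ORCID" entry ending with value, else value."""
    pattern = re.compile(r".+: {}$".format(re.escape(value)))
    for identifier in known_identifiers:
        if pattern.match(identifier):
            return identifier
    return value


# Made-up entries in the "Name: ORCID" form implied by the regex
known = [
    "Jane Doe: 0000-0001-2345-6789",
    "John Smith: 0000-0002-9876-5432",
]
print(match_orcid("0000-0002-9876-5432", known))
print(match_orcid("0000-0003-0000-0000", known))
```

An identifier that is not in the list is returned unchanged, so unmatched cells keep their original value.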
  • I also normalized the variants of UNICEF's name in our item metadata:
UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'UNICEF');
  • Note the use of two single quotes to escape the one in the name
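Doubling the quote is needed when the literal is typed by hand. From a script, bound parameters avoid the manual escaping. The sketch below uses Python's built-in sqlite3 with a one-column stand-in table so that it runs anywhere; the table layout is a simplification and not the DSpace schema. With PostgreSQL the same idea applies through a driver such as psycopg, which uses `%s` placeholders instead of `?`.

```python
import sqlite3

# In-memory stand-in for the real table; the single column is a
# simplification, not the DSpace schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (text_value TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue (text_value) VALUES (?)",
    [
        ("UNICEF",),
        ("United Nations International Children's Emergency Fund",),
        ("World Bank",),
    ],
)

new_value = "United Nations Children's Fund"
old_values = [
    "United Nations International Children's Emergency Fund",
    "UNICEF",
]

# One "?" placeholder per old value; the driver handles the apostrophes
placeholders = ", ".join("?" for _ in old_values)
cursor = conn.execute(
    "UPDATE metadatavalue SET text_value = ? "
    "WHERE text_value IN ({})".format(placeholders),
    [new_value] + old_values,
)
conn.commit()
print(cursor.rowcount)

query = "SELECT text_value FROM metadatavalue ORDER BY text_value"
rows = [row[0] for row in conn.execute(query)]
print(rows)
```

Only the placeholders are formatted into the SQL string; the values themselves are always passed separately.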

2024-03-11

  • Experimenting with moving some of my Python scripts to the DSpace 7 REST API
  • I spent some time working on the script to get abstracts from CGSpace, and found a bug in my logic
    • I also noticed that one item had two abstracts, but the first one was blank!
    • Looking deeper, I found 113 blank metadata values so I deleted those:
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
COMMIT;
  • I also found a few dozen items with “N/A” for their citation, so I deleted those too:
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146;
COMMIT;
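The blank first abstract is the kind of case a script reading item metadata has to tolerate. Here is a minimal sketch of skipping blank values, not the actual script discussed above. It assumes an item shaped like a DSpace 7 REST API response, where each field maps to a list of entries with a `value` key, and it assumes `dcterms.abstract` as the field name. The function name `first_nonblank` and the sample item are made up.

```python
def first_nonblank(item, field="dcterms.abstract"):
    """Return the first non-blank value of a metadata field, or None."""
    for entry in item.get("metadata", {}).get(field, []):
        value = (entry.get("value") or "").strip()
        if value:
            return value
    return None


# Made-up item shaped like a DSpace 7 REST API item response
item = {
    "metadata": {
        "dcterms.abstract": [
            {"value": "", "language": "en_US"},
            {"value": "A study of maize yields.", "language": "en_US"},
        ]
    }
}
print(first_nonblank(item))
```

Taking the first entry without checking it would return an empty string for this item even though a real abstract exists.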
  • I deployed the change to disable Angular SSR’s inlineCriticalCss on production because we had heavy load on the frontend and I’ve been meaning to do this permanently for some time
  • Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:
$ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \
  | csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv
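The same select-and-filter step can be sketched with Python's standard csv module, for cases where csvkit is not installed. The column names are taken from the command above. The sample rows, the example.org URLs, and the function name `rows_with_permalink` are made up for illustration.

```python
import csv
import io

COLUMNS = [
    "id",
    "dc.title[en_US]",
    "dc.identifier.uri[en_US]",
    "cg.link.permalink[en_US]",
]


def rows_with_permalink(reader):
    """Yield the selected columns for rows whose permalink is non-empty."""
    for row in reader:
        if row.get("cg.link.permalink[en_US]"):
            yield {column: row.get(column, "") for column in COLUMNS}


# Tiny made-up export standing in for the real CSV
sample = io.StringIO(
    "id,dc.title[en_US],dc.identifier.uri[en_US],"
    "cg.link.permalink[en_US],dcterms.issued[en_US]\n"
    "1,First title,https://example.org/handle/1,"
    "https://example.org/permalink/1,2020\n"
    "2,Second title,https://example.org/handle/2,,2021\n"
)
kept = list(rows_with_permalink(csv.DictReader(sample)))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(kept)
print(out.getvalue())
```

As with the `^.+$` pattern, a row is kept whenever the permalink column holds at least one character.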

2024-03-12

  • Run the duplicate checker for IFPRI 2023 batch upload

2024-03-13

  • I found about 428 duplicates in the IFPRI 2023 batch records
    • Alarmingly, I found about 18 that are duplicated on CGSpace as well!
    • I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones
  • Alliance asked me for the Handles of items submitted by TIP that are not discoverable:
SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';
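As a rough illustration of the duplicate checks on the IFPRI batch mentioned above, here is a minimal sketch that flags batch titles already present in the repository. It is not the duplicate checker used in these notes, whose matching rules are not described here. It swaps in a simple technique: exact matching on normalized titles. The function names and sample titles are hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[\W_]+", " ", title.casefold()).strip()


def find_duplicates(batch_titles, existing_titles):
    """Map each batch title to the existing titles it matches, if any."""
    existing = {}
    for title in existing_titles:
        existing.setdefault(normalize_title(title), []).append(title)
    duplicates = {}
    for title in batch_titles:
        matches = existing.get(normalize_title(title))
        if matches:
            duplicates[title] = matches
    return duplicates


# Made-up titles for illustration
batch = [
    "Food Security in East Africa: A Review",
    "Irrigation and Nutrition",
]
repository = [
    "Food security in East Africa - a review",
    "Climate-smart agriculture",
]
print(find_duplicates(batch, repository))
```

Matches found this way are only candidates. As the entry above shows, each one still needs a closer look before records are merged or withdrawn.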