CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

June, 2023

2023-06-02

  • Spend some time testing my post_bitstreams.py script to update thumbnails for items on CGSpace
    • Interestingly I found an item with a JFIF thumbnail and another with a WebP thumbnail…
  • Meeting with Valentina, Stefano, and Sara about MODS metadata in CGSpace
    • They have experience with improving the MODS interface in MELSpace’s OAI-PMH for use with AGRIS and were curious if we could do the same in CGSpace
    • From what I can see we need to upgrade the MODS schema from 3.1 to 3.7 and then just add a bunch of our fields to the crosswalk

2023-06-04

  • Upgrade CGSpace to Ubuntu 22.04
    • The upgrade was mostly normal, but I had to unhold the openjdk package in order for do-release-upgrade to run:
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
  • In 2022-11 an upstream Java update broke the DSpace 6 Handle server so we will have to pin this again after the upgrade to Ubuntu 22.04
  • After the upgrade I made sure CGSpace was working, then proceeded to upgrade PostgreSQL from 12 to 14, like I did on DSpace Test in 2023-03
  • Then I had to downgrade OpenJDK to fix the Handle server using the ones I had previously downloaded for Ubuntu 20.04 because they no longer exist on Launchpad:
# dpkg -i openjdk-8-j*8u342-b07*.deb
  • Export CGSpace to fix missing Initiative collection mappings
  • Start a harvest on AReS
  • Work on the DSpace 7 migration a bit more
    • I decided to rebase and drop all the submission form edits because they conflict every time upstream changes!

2023-06-06

  • Fix some incorrect ORCID identifiers for an Alliance author on CGSpace
  • Export our list of ORCID identifiers, resolve them, and update the records in CGSpace:
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-06-06-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-06-06-orcids.txt -o /tmp/2023-06-06-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2023-06-06-orcids-names.txt -db dspacetest -u dspace -p 'ffff' -m 247
  • Start working on updating the MODS schema in CGSpace from 3.1 to 3.8 based on Stefano and Salem’s work last year

2023-06-08

  • Continue working on the MODS schema mapping
  • Export CGSpace to check and update dcterms.extent fields
    • I normalized about 1,500 to use either “p. 1-6” or “5 p.” format
    • Also, I used this GREL expression to extract missing pages from the citation field: cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*(pp?\.\s?\d+[-–]\d+).*/)[0]
    • This was over 4,000 items with a format like “p. 1-6” and “pp. 1-6” in the citation
    • I used another GREL expression to extract another 5,000: cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*?(\d+\s+?[Pp]+\.).*/)[0]
    • This was for the format like “1 p.” (note we had to protect against the greedy .* in the beginning)
  • I also did some work to capture a handful of missing DOIs and ISSNs, but it was only about 100 items and I will have to wait until the 10,000+ above finish importing