CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

August, 2022

2022-08-01

2022-08-02

  • Resume working on the MARLO Innovations
    • Last week Jose had sent me an updated CSV with UTF-8 formatting, which was missing the filename column
    • I joined it with the older file (stripped down to just the cg.number and filename columns) and then did the same cleanups I had done last week
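    • A join along these lines with csvkit would do it (the file names here are hypothetical; the join key is cg.number):
$ csvcut -c 'cg.number,filename' Innovations-old.csv > /tmp/innovations-ids-filenames.csv
$ csvjoin -c 'cg.number' Innovations-utf8.csv /tmp/innovations-ids-filenames.csv > /tmp/innovations-with-filenames.csv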
    • I noticed there are six unused PDFs, so I asked Jose about them
  • Spent some time trying to understand the REST API submission issues that Rafael from CIAT is having with tip-approve and tip-submit
    • First, according to my notes in 2020-10, a user must be a collection admin in order to submit via the REST API
    • Second, a collection must have an “Accept/Reject/Edit Metadata” step defined in the workflow
    • Also, I referenced my notes from this gist I had made for exactly this purpose! https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a
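    • For reference, a rough sketch of the legacy DSpace 6 REST API flow based on the upstream docs (the collection UUID and credentials here are made up; the login sets a JSESSIONID cookie that the item POST re-uses, and the account must be a collection admin as noted above):
$ curl -s -c /tmp/cookies.txt -X POST -H 'Content-Type: application/json' -d '{"email":"user@example.com","password":"fuuu"}' 'https://dspacetest.cgiar.org/rest/login'
$ curl -s -b /tmp/cookies.txt -X POST -H 'Content-Type: application/json' -d '{"metadata":[{"key":"dc.title","value":"Test item via the REST API"}]}' 'https://dspacetest.cgiar.org/rest/collections/11111111-2222-3333-4444-555555555555/items'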

2022-08-03

  • I came up with an interesting idea to add missing countries and AGROVOC terms to the MARLO Innovation metadata
    • I copied the abstract column to two new fields, countrytest and agrovoctest, and then used this Jython code as a transform to drop terms that don’t match (using CGSpace’s country list and a list of 1,400 AGROVOC terms):
# Build a lowercase list of country names from the CGSpace country list
with open(r"/tmp/cgspace-countries.txt", "r") as f:
    countries = [name.rstrip().lower() for name in f]

# Keep only the space-separated tokens in the cell that match a country name
return "||".join([x for x in value.split(' ') if x.lower() in countries])
  • Then I joined them with the other country and AGROVOC columns
    • I had originally tried to use csv-metadata-quality to look up and drop invalid AGROVOC terms, but it was timing out every dozen or so requests
    • Then I briefly tried to use lightrdf to export a text file of labels from AGROVOC’s RDF, but I couldn’t figure it out
    • I just realized this will not match countries with spaces in our cell values, ugh… and Jython has odd syntax and errors, so I can’t get normal Python code to work here; I must be missing something
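    • An untested sketch of a workaround: match each country name against the whole lowercased cell instead of splitting on spaces, so multi-word names like “South Africa” can match too (though plain substring matching can over-match, for example “Niger” inside “Nigeria”):
with open(r"/tmp/cgspace-countries.txt", "r") as f:
    countries = [name.rstrip() for name in f]

# Substring match against the whole cell so multi-word country names can match
lowered = value.lower()
return "||".join([c for c in countries if c.lower() in lowered])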
  • Then I extracted the titles, dates, and types and added IDs, then ran them through check-duplicates.py to find the existing items on CGSpace so I could add them as dcterms.relation links
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-08-03-Innovations-Cleaned.csv | sed '1s/line_number/id/' > /tmp/innovations-temp.csv
$ ./ilri/check-duplicates.py -i /tmp/innovations-temp.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/ccafs-duplicates.csv
  • About 115 of them matched existing items on CGSpace
  • Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
$ csvjoin --left -c dc.title ~/Downloads/2022-08-03-Innovations-Cleaned.csv ~/Downloads/2022-08-03-Innovations-relations.csv > /tmp/innovations-with-relations.csv
  • Then I used SAFBuilder to create a SimpleItemArchive and import to DSpace Test:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map
  • Meeting with Mohammed Salem about harmonizing MEL and CGSpace metadata fields
    • I still need to share our results and recommendations with Peter, Enrico, Sara, Svetlana, et al.
  • I made some minor fixes to csv-metadata-quality while working on the MARLO CRP Innovations

2022-08-05

  • I discussed issues with the DSpace 7 submission forms on Slack and Mark Wood found that the migration tool creates a non-working submission form
    • After updating the class name of the collection step and removing the “complete” and “sample” steps, the submission form was working
    • Now the issue is that the controlled vocabularies show up like this:

Controlled vocabulary bug in DSpace 7

  • I think we need to add IDs; I will have to check what the implications of that are
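    • For reference, DSpace’s controlled vocabulary XML supports an id as well as a label on each node, along these lines (the values here are made up):
<node id="soil-conservation" label="Soil conservation">
    <isComposedBy>
        <node id="terracing" label="Terracing"/>
    </isComposedBy>
</node>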
  • Emilio contacted me last week to say they have re-worked their harvester on Hetzner to use a new user agent: AICCRA website harvester
    • I verified that I see it in the REST API logs, but I don’t see any new stats hits for it
    • I do see 11,000 hits from that IP last month, from when I had the incorrect nginx configuration that was sending a literal $http_user_agent, so I purged those
    • It is lucky that we have “harvest” in the DSpace spider agent example file so Solr doesn’t log these hits; nothing needed to be done in nginx

2022-08-13

  • I noticed there was high load on CGSpace, around 9 or 10
    • Looking at the Munin graphs it seems to just be the last two hours or so, with a slight increase in PostgreSQL connections, firewall traffic, and a more noticeable increase in CPU
    • DSpace sessions are normal
    • The number of unique hosts making requests to nginx is pretty low, though it’s only 6 AM in the server’s time zone
  • I see one IP in Sweden making a lot of requests with a normal user agent: 80.248.237.167
    • This host is on Internet Vikings (INTERNETBOLAGET), and I see 140,000 requests from them in Solr
    • I see reports of excessive scraping on AbuseIPDB.com
    • I’m going to add their 80.248.224.0/20 network to the bot-networks.conf in nginx
    • I will also purge all the hits from this IP in Solr statistics
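    • One way to do that by hand is a delete-by-query on the statistics core (the Solr host and port here are assumptions; adjust to wherever Solr is listening):
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>ip:80.248.237.167</query></delete>'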
  • I also see the core.ac.uk bot making tens of thousands of requests today, but we are already tagging that as a bot in Tomcat’s Crawler Session Manager valve, so they should be sharing a Tomcat session with other bots and not creating too many sessions

2022-08-15

  • Start indexing on AReS
  • Add CONSERVATION to ILRI subjects on CGSpace
    • I see that AGROVOC has conservation agriculture and I suggested that we use that instead

2022-08-17

  • Peter and Jose sent more feedback about the CRP Innovation records from MARLO
    • We expanded the CRP names in the citation and removed the cg.identifier.url URLs because they are ugly and will stop working eventually
    • The mappings of MARLO links will be done internally with the cg.number IDs like “IN-1119” and the Handle URIs

2022-08-18

  • I talked to Jose about the CCAFS MARLO records
    • He still hasn’t finished re-processing the PDFs to update the internal MARLO links
    • I started looking at the other records (MELIAs, OICRs, Policies) and found some minor issues in the MELIAs so I sent feedback to Jose
    • On second thought, I opened the MELIAs file in OpenRefine and it looks OK, so this must have been a parsing issue in LibreOffice when I was checking the file (or perhaps I didn’t use the correct quoting when importing)
  • Import the original MELIA v2 CSV file into OpenRefine to fix encoding before processing with csvcut/csvjoin
    • Then extract the IDs and filenames from the original V2 file and join with the UTF-8 file:
$ csvcut -c 'cg.number (series/report No.)',File ~/Downloads/MELIA-Metadata-v2-csv.csv > MELIA-v2-IDs-Files.csv
$ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM.csv MELIA-v2-IDs-Files.csv > MELIAs-UTF-8-with-files.csv
  • Then I imported them into OpenRefine to start metadata cleaning and enrichment
  • Make some minor changes to cgspace-submission-guidelines
    • Upgrade to Bootstrap v5.2.0
    • Dedupe value pairs and controlled vocabularies before writing them
    • Sort the controlled vocabularies before writing them (we don’t do this for value pairs because some are added in specific order, like CRPs)
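    • The gist of the change, as a rough Python sketch with made-up variable names:
# Value pairs: dedupe, but keep the original order because some lists (like CRPs) are ordered on purpose
unique_value_pairs = list(dict.fromkeys(value_pairs))

# Controlled vocabularies: dedupe and sort alphabetically before writing
unique_vocabulary_terms = sorted(set(vocabulary_terms))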

2022-08-19

  • Peter Ballantyne sent me metadata for 311 Gender items that need to be duplicate checked on CGSpace before uploading
    • I spent half an hour in OpenRefine fixing the dates because they only had YYYY, while most abstracts and titles had more specific date information
    • Then I checked for duplicates:
$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
  • I sent the list of ~130 possible duplicates to Peter to check
  • Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
    • The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR SharePoint and require a login
    • I asked them why they don’t just use the original links in the first place in case tinyurl.com disappears
  • I continued working on the MARLO MELIA v2 UTF-8 metadata
    • I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
    • It helps to replace some characters with spaces first with this GREL: value.replace(/[.\/;(),]/, " ")
    • This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
    • Then I checked for existing items on CGSpace matching these MELIAs using my duplicate checker:
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
  • Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
  • I had to use xsv because csvjoin was throwing an error detecting the dialect of the input CSVs (?)
  • I created a SAF bundle and imported the 749 MELIAs to DSpace Test
  • I found thirteen items on CGSpace with dates in “DD/MM/YYYY” format, so I fixed those
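    • Assuming the values really are day-first, a GREL transform in OpenRefine along these lines would normalize them to ISO format:
value.toDate("dd/MM/yyyy").toString("yyyy-MM-dd")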