CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2022

2022-11-01

  • Last night I re-synced DSpace 7 Test from CGSpace
    • I also updated all my local 7_x-dev branches on the latest upstreams
  • I spent some time updating the authorizations in Alliance collections
    • I want to make sure they use groups instead of individuals where possible!
  • I reverted the Cocoon autosave change because it prevented Peter from uploading CSVs from the web interface, which is more of a nuisance than the very low-severity security issue it was meant to address

2022-11-02

  • I joined the FAO–CGIAR AGROVOC results sharing meeting
    • From June to October, 2022 we suggested 39 new keywords: 27 were added to AGROVOC, 4 were rejected, and 9 are still under discussion
  • While doing a duplicate check on IFPRI’s batch upload I found one duplicate that was uploaded by IWMI earlier this year
    • I will update the metadata of that item and map it to the correct Initiative collection

2022-11-03

  • I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
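  • The Jython expression itself isn’t shown here, but the general idea — scanning the title and abstract text for country names — can be sketched in plain Python (the country list, inputs, and function name below are illustrative, not the actual ones used):

```python
import re

# Illustrative subset of country names; the real run would use a full controlled list
COUNTRIES = ["Kenya", "Ethiopia", "Nigeria", "India", "Bangladesh"]

def match_countries(text, countries=COUNTRIES):
    """Return country names that appear as whole words in the given text."""
    found = []
    for country in countries:
        if re.search(r"\b" + re.escape(country) + r"\b", text, re.IGNORECASE):
            found.append(country)
    return found

title = "Smallholder dairy production in Kenya and Ethiopia"
print(match_countries(title))  # ['Kenya', 'Ethiopia']
```

  • The whole-word match matters so that, for example, “Indiana” does not match “India”; the resulting values still need a human sanity check before upload.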
  • I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
COPY 1268
  • Then I started a test run on DSpace Test:
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
  • I’ll be curious to check the log after it’s all done.
    • After a few hours I see:
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log 
626
  • Not bad, because last week I did a more manual selection of collections and deleted ~200
    • I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
  • I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
    • Export a list of items with PDFs linked there:
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
COPY 4621
  • After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones…
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
2752
  • I’m not sure how we’ll handle the duplicates, because many items are book chapters or similar that share a single PDF
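  • One way to see the scale of the sharing is to normalize the URLs (dropping any “#page=…” fragment, as in the csvcut pipeline above) and group items by the resulting file, so each PDF only needs to be fetched once — a sketch with made-up item IDs and URLs:

```python
from collections import defaultdict
from urllib.parse import urldefrag

def group_by_pdf(items):
    """Map each unique PDF URL (fragment stripped) to the item IDs sharing it."""
    groups = defaultdict(list)
    for item_id, url in items:
        base, _fragment = urldefrag(url)  # strips '#page=123' and similar anchors
        groups[base].append(item_id)
    return dict(groups)

items = [
    ("item-1", "https://ciat-library.example.org/book.pdf#page=10"),
    ("item-2", "https://ciat-library.example.org/book.pdf#page=55"),
    ("item-3", "https://ciat-library.example.org/report.pdf"),
]
groups = group_by_pdf(items)
print(len(groups))  # 2 unique files
```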

2022-11-04

  • I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description “Generated Thumbnail”:
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
COPY 1147
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
$ wc -l /tmp/old-thumbnail-handles.txt 
987 /tmp/old-thumbnail-handles.txt
  • A bunch of these returned \N for some reason when I used the ds6_bitstream2itemhandle function to get their handles, so I had to exclude those…
    • I forced the media-filter for these items on CGSpace:
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
  • Upload some batch records via CSV for Peter
  • Update the about page on CGSpace with new text from Peter
  • Add a few more ORCID identifiers and names to my growing file 2022-09-22-add-orcids.csv
    • I tagged fifty-four new authors using this list
  • I deleted and mapped one duplicate item for Maria Garruccio
  • I updated the CG Core website from Bootstrap v4.6 to v5.2

2022-11-07

  • I did a harvest on AReS last night but it seems that MELSpace’s sitemap is broken again because we have 10,000 fewer records
  • I filed an issue on the iso-3166-1 npm package to update the name of Turkey to Türkiye
    • I also filed an issue and a pull request on the pycountry package
    • I also filed an issue and a pull request on the country-converter package
    • I also changed one item on CGSpace that had been submitted since the name was changed
    • I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
    • I also updated it in the DSpace input-forms.xml
    • I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
  • Since I was making all these pull requests I also made one on country-converter for the UN M.49 region “South-eastern Asia”
  • Port the ImageMagick PDF cropbox fix to DSpace 6.x
    • I deployed it on CGSpace, ran all updates, and rebooted the host
    • I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
  • But looking at the items it processed, I’m not sure it’s working as expected
    • I looked at a few dozen
  • I found some links to the Bioversity website on CGSpace that are not redirecting properly:
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html 
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.bioversityinternational.org
User-Agent: HTTPie/3.2.1

HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 275
Content-Type: text/html; charset=iso-8859-1
Date: Mon, 07 Nov 2022 16:35:21 GMT
Keep-Alive: timeout=15, max=100
Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
Server: Apache
  • The Location header is clearly wrong, and if I try https directly I get an HTTP 500
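  • A quick way to catch this kind of broken redirect programmatically is to compare the Location header’s host against the host that was requested — a standard-library sketch, using the URLs from the response above:

```python
from urllib.parse import urlsplit

def redirect_host_ok(request_host, location):
    """True if the redirect Location points back at the host we requested."""
    return urlsplit(location).netloc == request_host

host = "www.bioversityinternational.org"
# The Location from the 302 above: the missing slash fuses the host and path
bad = "https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html"
good = "https://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html"

print(redirect_host_ok(host, bad))   # False
print(redirect_host_ok(host, good))  # True
```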

2022-11-08

  • Looking at the Solr statistics hits on CGSpace for 2022-11
    • I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
    • I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
    • I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
    • I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
    • I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
    • I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
    • I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
    • I will purge all these hits and probably add China Unicom’s subnet mask to my nginx bot-network.conf file to tag them as bots, since there are SO many bad and malicious requests coming from there
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 8975 hits from 221.219.100.42 in statistics
Purging 7577 hits from 122.10.101.60 in statistics
Purging 6536 hits from 135.125.21.38 in statistics
Purging 23950 hits from 163.237.216.11 in statistics
Purging 4093 hits from 51.254.154.148 in statistics
Purging 2797 hits from 221.219.103.211 in statistics
Purging 2618 hits from 216.218.223.53 in statistics

Total number of bot hits purged: 56546
  • Also interesting to see a few new user agents:
    • RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)
    • rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)
    • MEL
    • Gov employment data scraper ([[your email]])
    • RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)
  • I will purge all these:
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics

Total number of bot hits purged: 10632
  • Work on the CIAT Library items a bit again in OpenRefine
    • I flagged items with:
      • URL containing “#page” at the end (these are linking to book chapters, but we don’t want to upload the PDF multiple times)
      • Same URL used by more than one item (“Duplicates” facet in OpenRefine, these are some corner case I don’t want to handle right now)
      • URL containing “:8080” to CIAT’s old DSpace (this server is no longer live)
    • I want to try to handle the simple cases that should cover most of the items first

2022-11-09

  • Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
    • I got the basic functionality working
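  • Roughly, the script needs to derive a clean filename from each CIAT Library URL and build the REST endpoint for adding a bitstream to the matching item — a sketch of those two helpers (the endpoint shape follows the DSpace 6 REST API, but the hostname, UUID, and URLs below are placeholders, and authentication and the actual POST are omitted):

```python
import os
from urllib.parse import quote, unquote, urlsplit

def pdf_filename(ciat_url):
    """Derive a clean bitstream filename from a CIAT Library PDF URL."""
    path = urlsplit(ciat_url).path
    return unquote(os.path.basename(path))

def bitstream_endpoint(rest_base, item_uuid, filename):
    """Build the DSpace 6 REST endpoint for adding a bitstream to an item.

    Endpoint shape assumed from the DSpace 6 REST API; verify before use.
    """
    return f"{rest_base}/items/{item_uuid}/bitstreams?name={quote(filename)}"

url = "https://ciat-library.example.org/articulos_ciat/some%20report.pdf"
print(pdf_filename(url))
print(bitstream_endpoint("https://dspacetest.cgiar.org/rest", "uuid-1234", pdf_filename(url)))
```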