CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2022

2022-11-01

  • Last night I re-synced DSpace 7 Test from CGSpace
    • I also updated all my local 7_x-dev branches against the latest upstreams
  • I spent some time updating the authorizations in Alliance collections
    • I want to make sure they use groups instead of individuals where possible!
  • I reverted the Cocoon autosave change because it is a very low severity security issue, and it was more of a nuisance that it prevented Peter from uploading CSVs from the web interface

2022-11-02

  • I joined the FAO–CGIAR AGROVOC results sharing meeting
    • From June to October 2022 we suggested 39 new keywords: 27 were added to AGROVOC, 4 were rejected, and 9 are still under discussion
  • Doing duplicate check on IFPRI’s batch upload and I found one duplicate uploaded by IWMI earlier this year
    • I will update the metadata of that item and map it to the correct Initiative collection

2022-11-03

  • I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
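    • The Jython trick is roughly a custom text transform like this on a new column (a minimal sketch; the column names and country list are illustrative, not the exact expression I used):
# OpenRefine custom text transform (Python / Jython): collect the names of
# countries that appear in an item's title or abstract cells.
countries = ['Kenya', 'Ethiopia', 'Nigeria', 'India', 'Viet Nam']  # illustrative

matches = []
for country in countries:
    for column in ['dc.title', 'dcterms.abstract']:  # assumed column names
        if cells[column] is not None and country in cells[column].value:
            matches.append(country)
            break

return '||'.join(matches)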
  • I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
COPY 1268
  • Then I started a test run on DSpace Test:
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
  • I’ll be curious to check the log after it’s all done.
    • After a few hours I see:
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log 
626
  • Not bad, considering that last week I did a more manual selection of collections and only deleted ~200
    • I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
  • I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
    • Export a list of items with PDFs linked there:
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
COPY 4621
  • After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones…
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
2752
  • I’m not sure how we’ll handle the duplicates because many items are book chapters or something where they share a PDF
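    • A quick way to gauge the scale of the sharing is to count items per URL; a rough sketch in Python (assuming the url column from the export above):
# Count how many CIAT Library items share each PDF URL after stripping the
# "#page=..." anchors, to see how many duplicates we're dealing with.
import csv
from collections import Counter

urls = Counter()
with open('2022-11-03-CIAT-Library-items.csv') as f:
    for row in csv.DictReader(f):
        url = row['url'].split('#')[0]  # drop page anchors like #page=12
        if 'jspui' in url:
            continue  # skip links to the dead JSPUI server
        urls[url] += 1

for url, count in urls.most_common():
    if count > 1:
        print(count, url)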

2022-11-04

  • I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description “Generated Thumbnail”:
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM bitstream) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
COPY 1147
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
$ wc -l /tmp/old-thumbnail-handles.txt 
987 /tmp/old-thumbnail-handles.txt
  • A bunch of these have \N for some reason when I use the ds6_bitstream2itemhandle function to get their handles, so I had to exclude those…
    • I forced the media-filter for these items on CGSpace:
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
  • Upload some batch records via CSV for Peter
  • Update the about page on CGSpace with new text from Peter
  • Add a few more ORCID identifiers and names to my growing file 2022-09-22-add-orcids.csv
    • I tagged fifty-four new authors using this list
  • I deleted and mapped one duplicate item for Maria Garruccio
  • I updated the CG Core website from Bootstrap v4.6 to v5.2

2022-11-07

  • I did a harvest on AReS last night but it seems that MELSpace’s sitemap is broken again because we have 10,000 fewer records
  • I filed an issue on the iso-3166-1 npm package to update the name of Turkey to Türkiye
    • I also filed an issue and a pull request on the pycountry package
    • I also filed an issue and a pull request on the country-converter package
    • I also changed one item on CGSpace that had been submitted since the name was changed
    • I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
    • I also updated it in the DSpace input-forms.xml
    • I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
  • Since I was making all these pull requests I also made one on country-converter for the UN M.49 region “South-eastern Asia”
  • Port the ImageMagick PDF cropbox fix to DSpace 6.x
    • I deployed it on CGSpace, ran all updates, and rebooted the host
    • I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
  • But looking at a few dozen of the items it processed, I’m not sure it’s working as expected
  • I found some links to the Bioversity website on CGSpace that are not redirecting properly:
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html 
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.bioversityinternational.org
User-Agent: HTTPie/3.2.1

HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 275
Content-Type: text/html; charset=iso-8859-1
Date: Mon, 07 Nov 2022 16:35:21 GMT
Keep-Alive: timeout=15, max=100
Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
Server: Apache
  • The Location header is clearly wrong, and if I try https directly I get an HTTP 500
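    • To check more of these links in bulk, a sketch like this would work (the input file is hypothetical):
# Print the status code and Location header for each Bioversity URL, without
# following the (broken) redirects.
import requests

with open('/tmp/bioversity-links.txt') as f:  # hypothetical list of URLs
    for url in (line.strip() for line in f):
        r = requests.head(url, allow_redirects=False, timeout=10)
        print(r.status_code, url, r.headers.get('Location', '-'))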

2022-11-08

  • Looking at the Solr statistics hits on CGSpace for 2022-11
    • I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
    • I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
    • I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
    • I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
    • I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
    • I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
    • I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
    • I will purge all these hits and probably add China Unicom’s subnet mask to my nginx bot-network.conf file to tag them as bots, since there are SO many bad and malicious requests coming from there
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 8975 hits from 221.219.100.42 in statistics
Purging 7577 hits from 122.10.101.60 in statistics
Purging 6536 hits from 135.125.21.38 in statistics
Purging 23950 hits from 163.237.216.11 in statistics
Purging 4093 hits from 51.254.154.148 in statistics
Purging 2797 hits from 221.219.103.211 in statistics
Purging 2618 hits from 216.218.223.53 in statistics

Total number of bot hits purged: 56546
  • Also interesting to see a few new user agents:
    • RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)
    • rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)
    • MEL
    • Gov employment data scraper ([[your email]])
    • RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)
  • I will purge all these:
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics

Total number of bot hits purged: 10632
  • Work on the CIAT Library items a bit again in OpenRefine
    • I flagged items with:
      • URL containing “#page” at the end (these are linking to book chapters, but we don’t want to upload the PDF multiple times)
      • Same URL used by more than one item (the “Duplicates” facet in OpenRefine; these are corner cases I don’t want to handle right now)
      • URL containing “:8080” to CIAT’s old DSpace (this server is no longer live)
    • I want to try to handle the simple cases that should cover most of the items first
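    • The download step for the simple cases could look something like this sketch (the URL list file is hypothetical):
# Download each unique CIAT Library PDF before the server is shut down,
# skipping files we already have so the script can be re-run safely.
import os

import requests

with open('/tmp/ciat-library-urls.txt') as f:  # hypothetical list of unique URLs
    for url in (line.strip() for line in f):
        filename = os.path.basename(url)
        if os.path.exists(filename):
            continue  # already downloaded on a previous run
        r = requests.get(url, timeout=30)
        if r.status_code == 200:
            with open(filename, 'wb') as out:
                out.write(r.content)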

2022-11-09

  • Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
    • I got the basic functionality working
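    • The core of it is logging in to the DSpace 6 REST API and POSTing each PDF to the matching item’s bitstreams endpoint; a minimal sketch (the credentials and UUID are placeholders):
# Log in to the DSpace 6 REST API (which sets a session cookie) and upload
# a PDF bitstream to an existing item.
import requests

rest_base = 'https://dspacetest.cgiar.org/rest'
session = requests.Session()
session.post(rest_base + '/login',
             json={'email': 'user@example.com', 'password': 'secret'})

item_uuid = '00000000-0000-0000-0000-000000000000'  # placeholder
with open('report.pdf', 'rb') as f:
    r = session.post(rest_base + '/items/' + item_uuid + '/bitstreams',
                     params={'name': 'report.pdf'}, data=f)
r.raise_for_status()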

2022-11-12

  • Start a harvest on AReS

2022-11-15

  • Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
    • We agreed to continue adding the feedback for each of the proposed actions
    • The others will start filling in definitions for the types
    • Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
      • We need to be especially careful with regard to outputs that will be reported in the PRMS

2022-11-16

  • Maria asked if we can extend the timeout for XMLUI sessions
    • According to this issue it seems to be 30 minutes by default, coming from Tomcat’s default session-timeout
    • I think we could extend this to an hour, as there is no real security risk (we’re not a bank) and most users’ lock screens would have activated after ten minutes or so anyway

2022-11-20

  • Start a harvest on AReS

2022-11-22

  • Check and upload some items to CGSpace for Peter
    • I am waiting for some feedback from him about some duplicates and metadata issues for the rest

2022-11-23

  • Fix some authorization issues for ABC’s TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)
  • Peter sent me feedback about the duplicates and metadata questions from yesterday
    • I uploaded the eight items for COHESA and sixty-two for Gender
  • I ran the script to tag ORCID identifiers with my 2022-09-22-add-orcids.csv file and tagged twenty-seven
  • Maria asked for help uploading a large PDF to CGSpace
    • The PDF is only two pages, but it is 139MB!
    • I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-screen.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-ebook.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf

2022-11-24

  • My script finished downloading the CIAT Library PDFs
    • I did some more work on my post-ciat-pdfs.py script and tested uploading the items to my local DSpace and DSpace Test
    • Then I ran the script on CGSpace, uploading ~1,500 PDFs to existing items

2022-11-25

  • Tony Murray, who is working on IFPRI’s CGSpace integration, emailed me to ask some questions about the REST API
  • Oh no, I realized there is a logic issue with the PDFBox cropbox code I added a few weeks ago:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/77010
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
Loading @mire database changes for module MQM
Changes have been processed
IM Thumbnail tropentag2016_marshall.pdf is replacable.
File: tropentag2016_marshall.pdf.jpg
ERROR filtering, skipping bitstream:

        Item Handle: 10568/77010
        Bundle Name: ORIGINAL
        File Size: 1486580
        Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
        Asset Store: 0
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
        at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
        at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
        at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
        at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
        at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
        at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
        at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
        at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
        at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
        at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
  • Salem gave me a list of CGSpace collections that have double spaces in the names
    • Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!
    • He sent me a list of about ten collection UUIDs so I fixed them
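    • For reference, finding these is easy in PostgreSQL; a sketch with psycopg2 (assuming dc.title is metadata field 64, per the default registry, and a placeholder DSN):
# Find collections whose names contain consecutive spaces. In DSpace 6 the
# collection names live in metadatavalue like all other metadata.
import psycopg2

conn = psycopg2.connect('dbname=dspace user=dspace')  # placeholder DSN
with conn.cursor() as cursor:
    cursor.execute(
        "SELECT dspace_object_id, text_value FROM metadatavalue "
        "WHERE dspace_object_id IN (SELECT uuid FROM collection) "
        "AND metadata_field_id=64 AND text_value LIKE '%  %'"
    )
    for uuid, name in cursor.fetchall():
        print(uuid, repr(name))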
  • I found a bunch of LIVES presentations on CGSpace that link to copies on SlideShare with incorrect licenses… I updated about fifty of them

2022-11-26

  • Sync DSpace Test with CGSpace
  • I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
  • I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
    • Then, after the host came back up, the handle server wouldn’t start
    • The handle-server.log file shows:
Shutting down...
"2022/11/26 02:12:17 CET" 25 Rotating log files
Error: null
       (see the error log for details.)
  • In the error.log file I see:
"2022/11/26 02:12:18 CET" 25 Started new run.
java.lang.UnsupportedOperationException
        at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
        at java.lang.System.runFinalizersOnExit(System.java:1059)
        at net.handle.server.Main.initialize(Main.java:124)
        at net.handle.server.Main.main(Main.java:75)
Shutting down...
  • In the apt history log I see that OpenJDK was upgraded on 2022-11-10:
Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
End-Date: 2022-11-10  04:10:45
  • The release notes for the new OpenJDK version mention a change that matches this error:
    • JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
  • I downloaded the previous versions of the packages from Launchpad:
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# dpkg -i openjdk-8-j*8u342-b07*.deb
  • Then the handle-server process starts up fine, so I held these OpenJDK versions for now:
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
openjdk-8-jdk-headless set on hold.
openjdk-8-jre-headless set on hold.
  • Start a harvest on AReS

2022-11-27

  • I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
    • For PDFs with only one page I was seeing this in the filter-media output:
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
  • It turns out that PDDocument’s getPage() is zero-based, so my code was requesting a page beyond the end of single-page documents
  • I also updated PDFBox from 2.0.24 to 2.0.27
  • I synced DSpace 7 Test with CGSpace
    • I had to follow my notes from 2022-03 to delete the missing Atmire migrations

2022-11-28

  • Update ilri/fix-metadata-values.py to update the last_modified date for items when it updates metadata
    • This should allow us to use the normal index-discovery (without -b), and the REST API responses will show a correct last modified date
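    • The change itself is small; a sketch of the idea (table and column names per the DSpace 6 schema, connection details and UUID are placeholders):
# After changing an item's metadata, bump its last_modified timestamp so
# that index-discovery (without -b) and the REST API see the change.
import psycopg2

def update_item_last_modified(cursor, item_uuid):
    cursor.execute('UPDATE item SET last_modified=NOW() WHERE uuid=%s',
                   (item_uuid,))

conn = psycopg2.connect('dbname=dspace user=dspace')  # placeholder DSN
with conn.cursor() as cursor:
    update_item_last_modified(cursor, '00000000-0000-0000-0000-000000000000')
conn.commit()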
  • Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary
    • I also updated the add-orcid-identifiers-csv.py to update the last_modified timestamp of the item
  • I re-factored my CGSpace Python scripts to use a helper util.py module with common functions
    • For now it only has the one for updating an item’s last_modified timestamp but I will gradually add more
  • I also ran our list of ORCID identifiers against ORCID’s API to see if anyone changed their name format
    • Then I ran them on CGSpace with ilri/update-orcids.py to fix them
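    • The name check is a simple GET against the public ORCID API; a sketch using the example iD from ORCID’s documentation:
# Fetch the current display name for an ORCID iD from the public API so it
# can be compared with the name in our controlled vocabulary.
import requests

orcid = '0000-0002-1825-0097'  # example iD from ORCID's documentation
r = requests.get('https://pub.orcid.org/v3.0/' + orcid + '/person',
                 headers={'Accept': 'application/json'})
r.raise_for_status()
name = r.json()['name']
print(name['family-name']['value'] + ', ' + name['given-names']['value'])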
  • Normalize the text_lang values for CGSpace metadata again:
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang │  count
───────────┼─────────
 en_US     │ 2912429
           │  108387
 en        │   12457
 fr        │       2
 vi        │       2
 es        │       1
 ␀         │       0
(7 rows)

Time: 624.651 ms
localhost/dspacetest= ☘ BEGIN;
BEGIN
Time: 0.130 ms
localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '');
UPDATE 120844
Time: 4074.879 ms (00:04.075)
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang │  count  
───────────┼─────────
 en_US     │ 3033273
 fr        │       2
 vi        │       2
 es        │       1
 ␀         │       0
(5 rows)

Time: 346.913 ms
localhost/dspacetest= ☘ COMMIT;
  • Discussing the UN M.49 regions on CGSpace with Valentina and Abenet
  • I exported a CSV of the Initiatives and ran the csv-metadata-quality script to add missing UN M.49 regions
    • To make sure everything was correct I got a list of the changes from csv-metadata-quality and checked them all manually on the UN M.49 site, just in case there was another bug in country_converter
    • This fixed regions for about fifty items
  • I dumped the UN M.49 regions from the CSV on the UNSD website:
$ csvcut -d";" -c 'Region Name,Sub-region Name,Intermediate Region Name' ~/Downloads/UNSD\ \ Methodology.csv | sed -e 1d -e 's/,/\n/g' | sort -u

Africa
Americas
Asia
Australia and New Zealand
Caribbean
Central America
Central Asia
Channel Islands
Eastern Africa
Eastern Asia
Eastern Europe
Europe
Latin America and the Caribbean
Melanesia
Micronesia
Middle Africa
Northern Africa
Northern America
Northern Europe
Oceania
Polynesia
South America
South-eastern Asia
Southern Africa
Southern Asia
Southern Europe
Sub-Saharan Africa
Western Africa
Western Asia
Western Europe
  • For now I will combine it with our existing list, which contains a few legacy regions, while we discuss a long-term plan with Peter and Abenet
  • Peter wrote to ask me to change the PIM CRP’s full name from Policies, Institutions and Markets to Policies, Institutions, and Markets
    • It’s apparently the only CRP with an Oxford comma…?
    • I updated them all on CGSpace
  • Also, I ran index-discovery without the -b, since my metadata update scripts now update the last_modified timestamp as well; it finished in fifteen minutes and I see the changes in the Discovery search and facets

2022-11-29

  • Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about dcterms.type for CG Core
    • We discussed some of the feedback from Peter
  • Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
    • I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
    • These are UN M.49 regions

2022-11-30

  • I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
    • It fixed some whitespace issues and added missing regions to about 1,200 items
  • I thought of a way to delete duplicate metadata values, since the CSV upload method can’t detect them correctly
    • First, I wrote a SQL query to identify metadata values with the same text_value, metadata_field_id, and dspace_object_id:
\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id 
    FROM metadatavalue a
    JOIN (
        SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
    ) b
    ON a.dspace_object_id = b.dspace_object_id
    AND a.text_value = b.text_value
    AND a.metadata_field_id = b.metadata_field_id
    ORDER BY a.text_value) TO /tmp/duplicates.txt
  • (This query excludes metadata for accession and available dates, provenance, format, etc)
  • Then, I sorted the file by fields four and one (dspace_object_id and text_value) so that the duplicate metadata for each item were next to each other, used awk to print the second field (metadata_value_id) from every other line, and created a SQL script to delete that metadata
$ sort -k4,1 /tmp/duplicates.txt    | \
    awk -F'\t' 'NR%2==0 {print $2}' | \
    sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
  • This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
    • I just ran it again two more times to catch the remaining duplicates; now we have none!
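  • A one-pass alternative would be to keep only the first metadata_value_id for each (object, field, value) tuple; a sketch that reads the same /tmp/duplicates.txt dump (PostgreSQL’s COPY text format is tab-delimited):
# Generate DELETE statements for every duplicate metadata value beyond the
# first, which also handles the tripled and quadrupled values in one pass.
import csv

seen = set()
with open('/tmp/duplicates.txt') as f, \
        open('/tmp/delete-duplicates.sql', 'w') as out:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for text_value, value_id, field_id, object_id in reader:
        key = (object_id, field_id, text_value)
        if key in seen:
            out.write('DELETE FROM metadatavalue '
                      'WHERE metadata_value_id=%s;\n' % value_id)
        seen.add(key)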