---
title: "November, 2022"
date: 2022-11-01T09:11:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-11-01

- Last night I re-synced DSpace 7 Test from CGSpace
  - I also updated all my local `7_x-dev` branches on the latest upstreams
- I spent some time updating the authorizations in Alliance collections
  - I want to make sure they use groups instead of individuals where possible!
- I reverted the Cocoon autosave change because it is a very low severity security issue and it was more of a nuisance that Peter couldn't upload CSVs from the web interface
- I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!
- I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test
- Tim merged my [pull request to override the ImageMagick PDF density in DSpace 7](https://github.com/DSpace/DSpace/pull/8553)
  - I ported it to DSpace 6.x and submitted a pull request: https://github.com/DSpace/DSpace/pull/8560

## 2022-11-02

- I joined the FAO–CGIAR AGROVOC results sharing meeting
  - From June to October, 2022 we suggested 39 new keywords, added 27 to AGROVOC, 4 rejected, and 9 still under discussion
- Doing duplicate check on IFPRI's batch upload and I found one duplicate uploaded by IWMI earlier this year
  - I will update the metadata of that item and map it to the correct Initiative collection

## 2022-11-03

- I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
- I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:

```console
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
COPY 1268
```

- Then I
started a test run on DSpace Test:

```console
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
```

- I'll be curious to check the log after it's all done.
  - After a few hours I see:

```console
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log
626
```

  - Not bad, because last week I did a more manual selection of collections and deleted ~200
  - I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
- I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
  - Export a list of items with PDFs linked there:

```console
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
COPY 4621
```

  - After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones...

```console
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
2752
```

  - I'm not sure how we'll handle the duplicates because many items are book chapters or something where they share a PDF

## 2022-11-04

- I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description "Generated Thumbnail":

```console
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
COPY 1147
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
$ wc -l /tmp/old-thumbnail-handles.txt
987 /tmp/old-thumbnail-handles.txt
```

  - A bunch of these have `\N` for some reason when I use the `ds6_bitstream2itemhandle` function to get their handles so I had to exclude those...
  - I forced the media-filter for these items on CGSpace:

```console
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
```

- Upload some batch records via CSV for Peter
- Update the about page on CGSpace with new text from Peter
- Add a few more ORCID identifiers and names to my growing file `2022-09-22-add-orcids.csv`
  - I tagged fifty-four new authors using this list
- I deleted and mapped one duplicate item for Maria Garruccio
- I updated the CG Core website from Bootstrap v4.6 to v5.2

## 2022-11-07

- I did a harvest on AReS last night but it seems that MELSpace's sitemap is broken again because we have 10,000 fewer records
- I filed [an issue](https://github.com/ecrmnn/iso-3166-1/issues/10) on the iso-3166-1 npm package to update the name of Turkey to Türkiye
  - I also filed [an issue](https://github.com/flyingcircusio/pycountry/issues/148) and [a pull request](https://github.com/flyingcircusio/pycountry/pull/149) on the
pycountry package
  - I also filed [an issue](https://github.com/konstantinstadler/country_converter/issues/121) and [a pull request](https://github.com/konstantinstadler/country_converter/pull/122) on the country-converter package
  - I also changed one item on CGSpace that had been submitted since the name was changed
  - I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
  - I also updated it in the DSpace `input-forms.xml`
  - I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
    - I submitted a [pull request](https://github.com/ecrmnn/iso-3166-1/pull/11) to update this upstream
- Since I was making all these pull requests I also made [one on country-converter for the UN M.49 region "South-eastern Asia"](https://github.com/konstantinstadler/country_converter/pull/123)
- Port the [ImageMagick PDF cropbox fix](https://github.com/DSpace/DSpace/pull/8550) to DSpace 6.x
  - I deployed it on CGSpace, ran all updates, and rebooted the host
  - I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:

```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
```

  - But looking at the items it processed, I'm not sure it's working as expected
    - I looked at a few dozen
- I found some links to the Bioversity website on CGSpace that are not redirecting properly:

```console
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.bioversityinternational.org
User-Agent: HTTPie/3.2.1

HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 275
Content-Type: text/html; charset=iso-8859-1
Date: Mon, 07 Nov 2022 16:35:21 GMT
Keep-Alive: timeout=15, max=100
Location:
https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
Server: Apache
```

- The `Location` header is clearly wrong, and if I try https directly I get an HTTP 500

## 2022-11-08

- Looking at the Solr statistics hits on CGSpace for 2022-11
  - I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
  - I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
  - I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
  - I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
  - I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
  - I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
  - I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
  - I will purge all these hits and probably add China Unicom's subnet to my nginx `bot-network.conf` file to tag them as bots since there are SO many bad and malicious requests coming from there

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 8975 hits from 221.219.100.42 in statistics
Purging 7577 hits from 122.10.101.60 in statistics
Purging 6536 hits from 135.125.21.38 in statistics
Purging 23950 hits from 163.237.216.11 in statistics
Purging 4093 hits from 51.254.154.148 in statistics
Purging 2797 hits from 221.219.103.211 in statistics
Purging 2618 hits from 216.218.223.53 in statistics

Total number of bot hits purged: 56546
```

- Also interesting to see a few new user agents:
  - `RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)`
  - `rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)`
  - `MEL`
  - `Gov employment data scraper ([[your email]])`
  -
    `RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)`
- I will purge all these:

```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics

Total number of bot hits purged: 10632
```

- Work on the CIAT Library items a bit again in OpenRefine
  - I flagged items with:
    - URL containing "#page" at the end (these are linking to book chapters, but we don't want to upload the PDF multiple times)
    - Same URL used by more than one item ("Duplicates" facet in OpenRefine, these are some corner case I don't want to handle right now)
    - URL containing ":8080" to CIAT's old DSpace (this server is no longer live)
  - I want to try to handle the simple cases that should cover most of the items first

## 2022-11-09

- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
  - I got the basic functionality working

## 2022-11-12

- Start a harvest on AReS

## 2022-11-15

- Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
  - We agreed to continue adding the feedback for each of the proposed actions
  - The others will start filling in definitions for the types
- Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
  - We need to be careful especially with regard to authors' outputs that will be reported in the PRMS

## 2022-11-16

- Maria asked if we can extend the timeout for XMLUI sessions
  - According to [this issue](https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44) it seems to be 30 minutes by default, as a Tomcat default
  - I think we could extend this to an hour, as there is no real security risk (we're not a bank) and most users' lock screens would have activated after ten
minutes or so anyways

## 2022-11-20

- Start a harvest on AReS

## 2022-11-22

- Check and upload some items to CGSpace for Peter
  - I am waiting for some feedback from him about some duplicates and metadata issues for the rest

## 2022-11-23

- Fix some authorization issues for ABC's TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)
- Peter sent me feedback about the duplicates and metadata questions from yesterday
  - I uploaded the eight items for COHESA and sixty-two for Gender
- I ran the script to tag ORCID identifiers with my `2022-09-22-add-orcids.csv` file and tagged twenty-seven
- Maria asked for help uploading a large PDF to CGSpace
  - The PDF is only two pages, but it is 139MB!
  - I decided to compress it with Ghostscript, first with the screen profile (72dpi), then with the ebook profile (150dpi):

```console
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-screen.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-ebook.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
```

  - The ebook one looks really good and is only 2.4MB...
  - But for reference, this free Adobe tool seems to work: https://www.adobe.com/acrobat/online/compress-pdf.html

## 2022-11-24

- My script finished downloading the CIAT Library PDFs
  - I did some more work on my `post-ciat-pdfs.py` script and tested uploading the items to my local DSpace and DSpace Test
  - Then I ran the script on CGSpace, uploading ~1,500 PDFs to existing items

## 2022-11-25

- Tony Murray, who is working on IFPRI's CGSpace integration, emailed me to ask some questions about the REST API
- Oh no, I realized there is a logic issue with the PDFBox cropbox code I added a few weeks ago:

```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/77010
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
Loading @mire database changes for module MQM
Changes have been processed
IM Thumbnail tropentag2016_marshall.pdf is replacable.
File: tropentag2016_marshall.pdf.jpg
ERROR filtering, skipping bitstream:

    Item Handle: 10568/77010
    Bundle Name: ORIGINAL
    File Size: 1486580
    Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
    Asset Store: 0
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
    at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
    at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
    at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
    at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
    at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
    at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
    at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
    at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
    at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
    at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```

- Salem gave me a list of CGSpace collections that have double spaces in the names
  - Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!
  - He sent me a list of about ten collection UUIDs so I fixed them
- I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses... I updated about fifty of them

## 2022-11-26

- Sync DSpace Test with CGSpace
- I increased the session timeout in Tomcat from thirty minutes to sixty, as Maria requested on 2022-11-16
  - See: https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44
- I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
  - After it came back up the handle server wouldn't start
  - The `handle-server.log` file shows:

```console
Shutting down...
"2022/11/26 02:12:17 CET" 25 Rotating log files
Error: null (see the error log for details.)
```

  - In the `error.log` file I see:

```console
"2022/11/26 02:12:18 CET" 25 Started new run.
java.lang.UnsupportedOperationException
    at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
    at java.lang.System.runFinalizersOnExit(System.java:1059)
    at net.handle.server.Main.initialize(Main.java:124)
    at net.handle.server.Main.main(Main.java:75)
Shutting down...
```

  - Ah, it seems to be due to an [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1)
  - I see the server upgraded to the new JDK version on 2022-11-10:

```console
Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
End-Date: 2022-11-10 04:10:45
```

  - As highlighted in the dspace-tech mailing list thread above, [this OpenJDK release retired `Runtime.runFinalizersOnExit`](https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html), so it now always throws `UnsupportedOperationException`:

```console
- JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
```

  - I downloaded the previous versions of the packages from Launchpad:

```console
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# dpkg -i openjdk-8-j*8u342-b07*.deb
```

  - Then the handle-server process starts up fine, so I held these OpenJDK versions for now:

```console
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
openjdk-8-jdk-headless set on hold.
openjdk-8-jre-headless set on hold.
```

- Start a harvest on AReS

## 2022-11-27

- I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
  - For PDFs with only one page I was seeing this in the filter-media output:

```console
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
```

  - It turns out that [PDDocument's getPage() is zero-based](https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-)
- I also updated PDFBox from 2.0.24 to 2.0.27
- I synced DSpace 7 Test with CGSpace
  - I had to follow my notes from 2022-03 to delete the missing Atmire migrations
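
For reference, the off-by-one in the CropBox code noted above is easy to reproduce outside Java. What follows is only a rough Python illustration, not the dspace-api code: `FakeDocument` and its `get_page()` are hypothetical stand-ins for a zero-based page accessor such as PDFBox's `PDDocument.getPage()`, and the error text merely imitates the message from the filter-media log, which appears to report the requested index plus one.

```python
class FakeDocument:
    """Hypothetical stand-in for a PDF document with a zero-based page accessor."""

    def __init__(self, pages):
        self._pages = list(pages)

    def get_page(self, index):
        # Zero-based lookup; the message imitates the log by reporting index + 1.
        if not 0 <= index < len(self._pages):
            raise IndexError(f"1-based index out of bounds: {index + 1}")
        return self._pages[index]


def first_page(document):
    """Return the first page using a zero-based index."""
    return document.get_page(0)


single_page_pdf = FakeDocument(["page one"])

# Correct: index 0 is the first (and only) page.
print(first_page(single_page_pdf))

# Buggy: index 1 asks for a second page that does not exist.
try:
    single_page_pdf.get_page(1)
except IndexError as error:
    print(error)
```

If this reading is right, the same mistake on a multi-page PDF would not fail at all; it would presumably just read the second page instead of the first, which may be why it only showed up on single-page files.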