371 lines
19 KiB
Markdown
371 lines
19 KiB
Markdown
---
|
||
title: "November, 2022"
|
||
date: 2022-11-01T09:11:36+03:00
|
||
author: "Alan Orth"
|
||
categories: ["Notes"]
|
||
---
|
||
|
||
## 2022-11-01
|
||
|
||
- Last night I re-synced DSpace 7 Test from CGSpace
|
||
- I also updated all my local `7_x-dev` branches on the latest upstreams
|
||
- I spent some time updating the authorizations in Alliance collections
|
||
- I want to make sure they use groups instead of individuals where possible!
|
||
- I reverted the Cocoon autosave change because it was more of a nuissance that Peter can't upload CSVs from the web interface and is a very low severity security issue
|
||
|
||
<!--more-->
|
||
|
||
- I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!
|
||
- I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test
|
||
- Tim merged my [pull request to override the ImageMagick PDF density in DSpace 7](https://github.com/DSpace/DSpace/pull/8553)
|
||
- I ported it to DSpace 6.x and submitted a pull request: https://github.com/DSpace/DSpace/pull/8560
|
||
|
||
## 2022-11-02
|
||
|
||
- I joined the FAO–CGIAR AGROVOC results sharing meeting
|
||
- From June to October, 2022 we suggested 39 new keywords, added 27 to AGROVOC, 4 rejected, and 9 still under discussion
|
||
- Doing duplicate check on IFPRI's batch upload and I found one duplicate uploaded by IWMI earlier this year
|
||
- I will update the metadata of that item and map it to the correct Initiative collection
|
||
|
||
## 2022-11-03
|
||
|
||
- I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
|
||
- I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:
|
||
|
||
```console
|
||
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
|
||
COPY 1268
|
||
```
|
||
|
||
- Then I started a test run on DSpace Test:
|
||
|
||
```console
|
||
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
|
||
```
|
||
|
||
- I'll be curious to check the log after it's all done.
|
||
- After a few hours I see:
|
||
|
||
```console
|
||
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log
|
||
626
|
||
```
|
||
|
||
- Not bad, because last week I did a more manual selection of collections and deleted ~200
|
||
- I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
|
||
- I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
|
||
- Export a list of items with PDFs linked there:
|
||
|
||
```console
|
||
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
|
||
COPY 4621
|
||
```
|
||
|
||
- After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones...
|
||
|
||
```console
|
||
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
|
||
2752
|
||
```
|
||
|
||
- I'm not sure how we'll handle the duplicates because many items are book chapters or something where they share a PDF
|
||
|
||
## 2022-11-04
|
||
|
||
- I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description "Generated Thumbnail":
|
||
|
||
```console
|
||
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
|
||
COPY 1147
|
||
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
|
||
$ wc -l /tmp/old-thumbnail-handles.txt
|
||
987 /tmp/old-thumbnail-handles.txt
|
||
```
|
||
|
||
- A bunch of these have `\N` for some reason when I use the `ds6_bitstream2itemhandle` function to get their handles so I had to exclude those...
|
||
- I forced the media-filter for these items on CGSpace:
|
||
|
||
```console
|
||
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
|
||
```
|
||
|
||
- Upload some batch records via CSV for Peter
|
||
- Update the about page on CGSpace with new text from Peter
|
||
- Add a few more ORCID identifiers and names to my growing file `2022-09-22-add-orcids.csv`
|
||
- I tagged fifty-four new authors using this list
|
||
- I deleted and mapped one duplicate item for Maria Garruccio
|
||
- I updated the CG Core website from Bootstrap v4.6 to v5.2
|
||
|
||
## 2022-11-07
|
||
|
||
- I did a harvest on AReS last night but it seems that MELSpace's sitemap is broken again because we have 10,000 fewer records
|
||
- I filed [an issue](https://github.com/ecrmnn/iso-3166-1/issues/10) on the iso-3166-1 npm package to update the name of Turkey to Türkiye
|
||
- I also filed [an issue](https://github.com/flyingcircusio/pycountry/issues/148) and [a pull request](https://github.com/flyingcircusio/pycountry/pull/149) on the pycountry package
|
||
- I also filed [an issue](https://github.com/konstantinstadler/country_converter/issues/121) and [a pull request](https://github.com/konstantinstadler/country_converter/pull/122) on the country-converter package
|
||
- I also changed one item on CGSpace that had been submitted since the name was changed
|
||
- I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
|
||
- I also updated it in the DSpace `input-forms.xml`
|
||
- I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
|
||
- I submitted a [pull request](https://github.com/ecrmnn/iso-3166-1/pull/11) to update this upstream
|
||
- Since I was making all these pull requests I also made [one on country-converter for the UN M.49 region "South-eastern Asia"](https://github.com/konstantinstadler/country_converter/pull/123)
|
||
- Port the [ImageMagick PDF cropbox fix](https://github.com/DSpace/DSpace/pull/8550) to DSpace 6.x
|
||
- I deployed it on CGSpace, ran all updates, and rebooted the host
|
||
- I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:
|
||
|
||
```console
|
||
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
|
||
```
|
||
|
||
- But looking at the items it processed, I'm not sure it's working as expected
|
||
- I looked at a few dozen
|
||
- I found some links to the Bioversity website on CGSpace that are not redirecting properly:
|
||
|
||
```console
|
||
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
|
||
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
|
||
Accept: */*
|
||
Accept-Encoding: gzip, deflate
|
||
Connection: keep-alive
|
||
Host: www.bioversityinternational.org
|
||
User-Agent: HTTPie/3.2.1
|
||
|
||
HTTP/1.1 302 Found
|
||
Connection: Keep-Alive
|
||
Content-Length: 275
|
||
Content-Type: text/html; charset=iso-8859-1
|
||
Date: Mon, 07 Nov 2022 16:35:21 GMT
|
||
Keep-Alive: timeout=15, max=100
|
||
Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
|
||
Server: Apache
|
||
```
|
||
|
||
- The `Location` header is clearly wrong, and if I try https directly I get an HTTP 500
|
||
|
||
## 2022-11-08
|
||
|
||
- Looking at the Solr statistics hits on CGSpace for 2022-11
|
||
- I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
|
||
- I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
|
||
- I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
|
||
- I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
|
||
- I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
|
||
- I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
|
||
- I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
|
||
- I will purge all these hits and proably add China Unicom's subnet mask to my nginx `bot-network.conf` file to tag them as bots since there are SO many bad and malicious requests coming from there
|
||
|
||
```console
|
||
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
|
||
Purging 8975 hits from 221.219.100.42 in statistics
|
||
Purging 7577 hits from 122.10.101.60 in statistics
|
||
Purging 6536 hits from 135.125.21.38 in statistics
|
||
Purging 23950 hits from 163.237.216.11 in statistics
|
||
Purging 4093 hits from 51.254.154.148 in statistics
|
||
Purging 2797 hits from 221.219.103.211 in statistics
|
||
Purging 2618 hits from 216.218.223.53 in statistics
|
||
|
||
Total number of bot hits purged: 56546
|
||
```
|
||
|
||
- Also interesting to see a few new user agents:
|
||
- `RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)`
|
||
- `rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)`
|
||
- `MEL`
|
||
- `Gov employment data scraper ([[your email]])`
|
||
- `RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)`
|
||
- I will purge all these:
|
||
|
||
```console
|
||
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
|
||
Purging 6155 hits from RStudio in statistics
|
||
Purging 1929 hits from rstudio in statistics
|
||
Purging 1454 hits from MEL in statistics
|
||
Purging 1094 hits from Gov employment data scraper in statistics
|
||
|
||
Total number of bot hits purged: 10632
|
||
```
|
||
|
||
- Work on the CIAT Library items a bit again in OpenRefine
|
||
- I flagged items with:
|
||
- URL containing "#page" at the end (these are linking to book chapters, but we don't want to upload the PDF multiple times)
|
||
- Same URL used by more than one item ("Duplicates" facet in OpenRefine, these are some corner case I don't want to handle right now)
|
||
- URL containing ":8080" to CIAT's old DSpace (this server is no longer live)
|
||
- I want to try to handle the simple cases that should cover most of the items first
|
||
|
||
## 2022-11-09
|
||
|
||
- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
|
||
- I got the basic functionality working
|
||
|
||
## 2022-11-12
|
||
|
||
- Start a harvest on AReS
|
||
|
||
## 2022-11-15
|
||
|
||
- Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
|
||
- We agreed to continue adding the feedback for each of the proposed actions
|
||
- The others will start filling in definitions for the types
|
||
- Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
|
||
- We need to be careful especially with regards to author's outputs that will be reported in the PRMS
|
||
|
||
## 2022-11-16
|
||
|
||
- Maria asked if we can extend the timeout for XMLUI sessions
|
||
- According to [this issue](https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44) it seems to be 30 minutes by default, as a Tomcat default
|
||
- I think we could extend this to an hour, as there is no real security risk (we're not a bank) and most user's lock screens would have activated after ten minutes or so anyways
|
||
|
||
## 2022-11-20
|
||
|
||
- Start a harvest on AReS
|
||
|
||
## 2022-11-22
|
||
|
||
- Check and upload some items to CGSpace for Peter
|
||
- I am waiting for some feedback from him about some duplicates and metadata issues for the rest
|
||
|
||
## 2022-11-23
|
||
|
||
- Fix some authorization issues for ABC's TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)
|
||
- Peter sent me feedback about the duplicates and metadata questions from yesterday
|
||
- I uploaded the eight items for COHESA and sixty-two for Gender
|
||
- I ran the script to tag ORCID identifiers with my `2022-09-22-add-orcids.csv` file and tagged twenty-seven
|
||
- Maria asked for help uploading a large PDF to CGSpace
|
||
- The PDF is only two pages, but it is 139MB!
|
||
- I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):
|
||
|
||
```console
|
||
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-screen.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
|
||
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-ebook.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
|
||
```
|
||
|
||
- The ebook one looks really good and is only 2.4MB...
|
||
- But for reference, this free Adobe tool seems to work: https://www.adobe.com/acrobat/online/compress-pdf.html
|
||
|
||
## 2022-11-24
|
||
|
||
- My script finished downloading the CIAT Library PDFs
|
||
- I did some more work on my `post-ciat-pdfs.py` script and tested uploading the items to my local DSpace and DSpace Test
|
||
- Then I ran the script on CGSpace, uploading ~1,500 PDFs to to existing items
|
||
|
||
## 2022-11-25
|
||
|
||
- Tony Murray, who is working on IFPRI's CGSpace integration, emailed me to ask some questions about the REST API
|
||
- Oh no, I realized there is a logic issue with the PDFbox cropbox code I added a few weeks ago:
|
||
|
||
```console
|
||
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/77010
|
||
The following MediaFilters are enabled:
|
||
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
|
||
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
|
||
Loading @mire database changes for module MQM
|
||
Changes have been processed
|
||
IM Thumbnail tropentag2016_marshall.pdf is replacable.
|
||
File: tropentag2016_marshall.pdf.jpg
|
||
ERROR filtering, skipping bitstream:
|
||
|
||
Item Handle: 10568/77010
|
||
Bundle Name: ORIGINAL
|
||
File Size: 1486580
|
||
Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
|
||
Asset Store: 0
|
||
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
|
||
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
|
||
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
|
||
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
|
||
at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
|
||
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
|
||
at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
|
||
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
|
||
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
|
||
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
|
||
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
|
||
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
|
||
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
|
||
at java.lang.reflect.Method.invoke(Method.java:498)
|
||
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
|
||
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
|
||
```
|
||
|
||
- Salem gave me a list of CGSpace collections that have double spaces in the names
|
||
- Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!
|
||
- He sent me a list of about ten collection UUIDs so I fixed them
|
||
- I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses... I updated about fifty of them
|
||
|
||
## 2022-11-26
|
||
|
||
- Sync DSpace Test with CGSpace
|
||
- I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
|
||
- See: https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44
|
||
- I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
|
||
- Then after coming back up the handle server won't start
|
||
- The `handle-server.log` file shows:
|
||
|
||
```console
|
||
Shutting down...
|
||
"2022/11/26 02:12:17 CET" 25 Rotating log files
|
||
Error: null
|
||
(see the error log for details.)
|
||
```
|
||
|
||
- In the `error.log` file I see:
|
||
|
||
```console
|
||
"2022/11/26 02:12:18 CET" 25 Started new run.
|
||
java.lang.UnsupportedOperationException
|
||
at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
|
||
at java.lang.System.runFinalizersOnExit(System.java:1059)
|
||
at net.handle.server.Main.initialize(Main.java:124)
|
||
at net.handle.server.Main.main(Main.java:75)
|
||
Shutting down...
|
||
```
|
||
|
||
- Ah, it seems to be due to an [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1)
|
||
- I see the server upgraded to the new JDK version on 2022-11-10:
|
||
|
||
```console
|
||
Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
|
||
End-Date: 2022-11-10 04:10:45
|
||
```
|
||
|
||
- As highlighted in the dspace-tech mailing list thread above, [this OpenJDK release deprecated `Runtime.runFinalizersOnExit`](https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html):
|
||
|
||
```console
|
||
- JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
|
||
```
|
||
|
||
- I downloaded the previous versions of the packages from Launchpad:
|
||
|
||
```console
|
||
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
|
||
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
|
||
# dpkg -i openjdk-8-j*8u342-b07*.deb
|
||
```
|
||
|
||
- Then the handle-server process starts up fine, so I held these OpenJDK versions for now:
|
||
|
||
```console
|
||
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
|
||
openjdk-8-jdk-headless set on hold.
|
||
openjdk-8-jre-headless set on hold.
|
||
```
|
||
|
||
- Start a harvest on AReS
|
||
|
||
## 2022-11-27
|
||
|
||
- I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
|
||
- For PDFs with only one page I was seeing this in the filter-media output:
|
||
|
||
```console
|
||
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
|
||
```
|
||
|
||
- It turns out that [PDDocument's getPage() is zero-based](https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-)
|
||
- I also updated PDFBox from 2.0.24 to 2.0.27
|
||
- I synced DSpace 7 Test with CGSpace
|
||
- I had to follow my notes from 2022-03 to delete the missing Atmire migrations
|
||
|
||
<!-- vim: set sw=2 ts=2: -->
|