2022-11-01
- Last night I re-synced DSpace 7 Test from CGSpace
- I also updated all my local
7_x-dev
branches on the latest upstreams
- I spent some time updating the authorizations in Alliance collections
- I want to make sure they use groups instead of individuals where possible!
- I reverted the Cocoon autosave change because it was more of a nuissance that Peter can’t upload CSVs from the web interface and is a very low severity security issue
2022-11-02
- I joined the FAO–CGIAR AGROVOC results sharing meeting
- From June to October, 2022 we suggested 39 new keywords, added 27 to AGROVOC, 4 rejected, and 9 still under discussion
- Doing duplicate check on IFPRI’s batch upload and I found one duplicate uploaded by IWMI earlier this year
- I will update the metadata of that item and map it to the correct Initiative collection
2022-11-03
- I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
- I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
COPY 1268
- Then I started a test run on DSpace Test:
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
- I’ll be curious to check the log after it’s all done.
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log
626
- Not bad, because last week I did a more manual selection of collections and deleted ~200
- I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
- I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
- Export a list of items with PDFs linked there:
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
COPY 4621
- After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones…
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
2752
- I’m not sure how we’ll handle the duplicates because many items are book chapters or something where they share a PDF
2022-11-04
- I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description “Generated Thumbnail”:
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
COPY 1147
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
$ wc -l /tmp/old-thumbnail-handles.txt
987 /tmp/old-thumbnail-handles.txt
- A bunch of these have
\N
for some reason when I use the ds6_bitstream2itemhandle
function to get their handles so I had to exclude those…
- I forced the media-filter for these items on CGSpace:
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
- Upload some batch records via CSV for Peter
- Update the about page on CGSpace with new text from Peter
- Add a few more ORCID identifiers and names to my growing file
2022-09-22-add-orcids.csv
- I tagged fifty-four new authors using this list
- I deleted and mapped one duplicate item for Maria Garruccio
- I updated the CG Core website from Bootstrap v4.6 to v5.2
2022-11-07
- I did a harvest on AReS last night but it seems that MELSpace’s sitemap is broken again because we have 10,000 fewer records
- I filed an issue on the iso-3166-1 npm package to update the name of Turkey to Türkiye
- I also filed an issue and a pull request on the pycountry package
- I also filed an issue and a pull request on the country-converter package
- I also changed one item on CGSpace that had been submitted since the name was changed
- I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
- I also updated it in the DSpace
input-forms.xml
- I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
- Since I was making all these pull requests I also made one on country-converter for the UN M.49 region “South-eastern Asia”
- Port the ImageMagick PDF cropbox fix to DSpace 6.x
- I deployed it on CGSpace, ran all updates, and rebooted the host
- I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
- But looking at the items it processed, I’m not sure it’s working as expected
- I found some links to the Bioversity website on CGSpace that are not redirecting properly:
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.bioversityinternational.org
User-Agent: HTTPie/3.2.1
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 275
Content-Type: text/html; charset=iso-8859-1
Date: Mon, 07 Nov 2022 16:35:21 GMT
Keep-Alive: timeout=15, max=100
Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
Server: Apache
- The
Location
header is clearly wrong, and if I try https directly I get an HTTP 500
2022-11-08
- Looking at the Solr statistics hits on CGSpace for 2022-11
- I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
- I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
- I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
- I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
- I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
- I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
- I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
- I will purge all these hits and proably add China Unicom’s subnet mask to my nginx
bot-network.conf
file to tag them as bots since there are SO many bad and malicious requests coming from there
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 8975 hits from 221.219.100.42 in statistics
Purging 7577 hits from 122.10.101.60 in statistics
Purging 6536 hits from 135.125.21.38 in statistics
Purging 23950 hits from 163.237.216.11 in statistics
Purging 4093 hits from 51.254.154.148 in statistics
Purging 2797 hits from 221.219.103.211 in statistics
Purging 2618 hits from 216.218.223.53 in statistics
Total number of bot hits purged: 56546
- Also interesting to see a few new user agents:
RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)
rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)
MEL
Gov employment data scraper ([[your email]])
RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)
- I will purge all these:
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics
Total number of bot hits purged: 10632
- Work on the CIAT Library items a bit again in OpenRefine
- I flagged items with:
- URL containing “#page” at the end (these are linking to book chapters, but we don’t want to upload the PDF multiple times)
- Same URL used by more than one item (“Duplicates” facet in OpenRefine, these are some corner case I don’t want to handle right now)
- URL containing “:8080” to CIAT’s old DSpace (this server is no longer live)
- I want to try to handle the simple cases that should cover most of the items first
2022-11-09
- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
- I got the basic functionality working
2022-11-12
2022-11-15
- Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
- We agreed to continue adding the feedback for each of the proposed actions
- The others will start filling in definitions for the types
- Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
- We need to be careful especially with regards to author’s outputs that will be reported in the PRMS
2022-11-16
- Maria asked if we can extend the timeout for XMLUI sessions
- According to this issue it seems to be 30 minutes by default, as a Tomcat default
- I think we could extend this to an hour, as there is no real security risk (we’re not a bank) and most user’s lock screens would have activated after ten minutes or so anyways
2022-11-20
2022-11-22
- Check and upload some items to CGSpace for Peter
- I am waiting for some feedback from him about some duplicates and metadata issues for the rest
2022-11-23
- Fix some authorization issues for ABC’s TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)
- Peter sent me feedback about the duplicates and metadata questions from yesterday
- I uploaded the eight items for COHESA and sixty-two for Gender
- I ran the script to tag ORCID identifiers with my
2022-09-22-add-orcids.csv
file and tagged twenty-seven
- Maria asked for help uploading a large PDF to CGSpace
- The PDF is only two pages, but it is 139MB!
- I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-screen.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-ebook.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
2022-11-24
- My script finished downloading the CIAT Library PDFs
- I did some more work on my
post-ciat-pdfs.py
script and tested uploading the items to my local DSpace and DSpace Test
- Then I ran the script on CGSpace, uploading ~1,500 PDFs to to existing items
2022-11-25
- Tony Murray, who is working on IFPRI’s CGSpace integration, emailed me to ask some questions about the REST API
- Oh no, I realized there is a logic issue with the PDFbox cropbox code I added a few weeks ago:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/77010
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
Loading @mire database changes for module MQM
Changes have been processed
IM Thumbnail tropentag2016_marshall.pdf is replacable.
File: tropentag2016_marshall.pdf.jpg
ERROR filtering, skipping bitstream:
Item Handle: 10568/77010
Bundle Name: ORIGINAL
File Size: 1486580
Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
Asset Store: 0
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
- Salem gave me a list of CGSpace collections that have double spaces in the names
- Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!
- He sent me a list of about ten collection UUIDs so I fixed them
- I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses… I updated about fifty of them
2022-11-26
- Sync DSpace Test with CGSpace
- I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
- I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
- Then after coming back up the handle server won’t start
- The
handle-server.log
file shows:
Shutting down...
"2022/11/26 02:12:17 CET" 25 Rotating log files
Error: null
(see the error log for details.)
- In the
error.log
file I see:
"2022/11/26 02:12:18 CET" 25 Started new run.
java.lang.UnsupportedOperationException
at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
at java.lang.System.runFinalizersOnExit(System.java:1059)
at net.handle.server.Main.initialize(Main.java:124)
at net.handle.server.Main.main(Main.java:75)
Shutting down...
Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
End-Date: 2022-11-10 04:10:45
- JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
- I downloaded the previous versions of the packages from Launchpad:
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
# dpkg -i openjdk-8-j*8u342-b07*.deb
- Then the handle-server process starts up fine, so I held these OpenJDK versions for now:
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
openjdk-8-jdk-headless set on hold.
openjdk-8-jre-headless set on hold.
2022-11-27
- I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
- For PDFs with only one page I was seeing this in the filter-media output:
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
- It turns out that PDDocument’s getPage() is zero-based
- I also updated PDFBox from 2.0.24 to 2.0.27
- I synced DSpace 7 Test with CGSpace
- I had to follow my notes from 2022-03 to delete the missing Atmire migrations
2022-11-28
- Update
ilri/fix-metadata-values.py
to update the last_modified
date for items when it updates metadata
- This should allow us to use the normal
index-discovery
(with out -b
) as well as having REST API responses showing a correct last modified date
- Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary
- I also updated the
add-orcid-identifiers-csv.py
to update the last_modified
timestamp of the item
- I re-factored my CGSpace Python scripts to use a helper
util.py
module with common functions
- For now it only has the one for updating an item’s
last_modified
timestamp but I will gradually add more
- I also ran our list of ORCID identifiers against ORCID’s API to see if anyone changed their name format
- Then I ran them on CGSpace with
ilri/update-orcids.py
to fix them
- Normalize the
text_lang
values for CGSpace metadata again:
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang │ count
───────────┼─────────
en_US │ 2912429
│ 108387
en │ 12457
fr │ 2
vi │ 2
es │ 1
␀ │ 0
(7 rows)
Time: 624.651 ms
localhost/dspacetest= ☘ BEGIN;
BEGIN
Time: 0.130 ms
localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '');
UPDATE 120844
Time: 4074.879 ms (00:04.075)
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang │ count
───────────┼─────────
en_US │ 3033273
fr │ 2
vi │ 2
es │ 1
␀ │ 0
(5 rows)
Time: 346.913 ms
localhost/dspacetest= ☘ COMMIT;
- Discussing the UN M.49 regions on CGSpace with Valentina and Abenet
- I exported a CSV of the Initiatives and ran the csv-metadata-quality script to add missing UN M.49 regions
- To make sure everything was correct I got a list of the changes from csv-metadata-quality and checked them all manually on the UN M.49 site, just in case there was another bug in country_converter
- This fixed regions for about fifty items
- I dumped the UN M.49 regions from the CSV on the UNSD website:
$ csvcut -d";" -c 'Region Name,Sub-region Name,Intermediate Region Name' ~/Downloads/UNSD\ —\ Methodology.csv | sed -e 1d -e 's/,/\n/g' | sort -u
Africa
Americas
Asia
Australia and New Zealand
Caribbean
Central America
Central Asia
Channel Islands
Eastern Africa
Eastern Asia
Eastern Europe
Europe
Latin America and the Caribbean
Melanesia
Micronesia
Middle Africa
Northern Africa
Northern America
Northern Europe
Oceania
Polynesia
South America
South-eastern Asia
Southern Africa
Southern Asia
Southern Europe
Sub-Saharan Africa
Western Africa
Western Asia
Western Europe
- For now I will combine it with our existing list, which contains a few legacy regions, while we discuss about a long-term plan with Peter and Abenet
- Peter wrote to ask me to change the PIM CRP’s full name from
Policies, Institutions and Markets
to Policies, Institutions, and Markets
- It’s apparently the only CRP with an Oxford comma…?
- I updated them all on CGSpace
- Also, I ran an
index-discovery
without the -b
since now my metadata update scripts update the last_modified
timestamp as well and it finished in fifteen minutes, and I see the changes in the Discovery search and facets
2022-11-29
- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about
dcterms.type
for CG Core
- We discussed some of the feedback from Peter
- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
- I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
- These are UN M.49 regions
2022-11-30
- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
- It fixed some whitespace issues and added missing regions to about 1,200 items
- I thought of a way to delete duplicate metadata values, since the CSV upload method can’t detect them correctly
- First, I wrote a SQL query to identify metadata falues with the same
text_value
, metadata_field_id
, and dspace_object_id
:
\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
FROM metadatavalue a
JOIN (
SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
) b
ON a.dspace_object_id = b.dspace_object_id
AND a.text_value = b.text_value
AND a.metadata_field_id = b.metadata_field_id
ORDER BY a.text_value) TO /tmp/duplicates.txt
- (This query excludes metadata for accession and available dates, provenance, format, etc)
- Then, I sorted the file by fields four and one (
dspace_object_id
and text_value
) so that the duplicate metadata for each item were next to each other, used awk to print the second field (metadata_field_id
) from every other line, and created a SQL script to delete the metadata
$ sort -k4,1 /tmp/duplicates.txt | \
awk -F'\t' 'NR%2==0 {print $2}' | \
sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
- This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
- I just ran it again two more times to find the last duplicates, now we have none!
- I also generated another SQL file with commands to update the last modified timestamps of these items:
$ awk -F'\t' '{print $4}' /tmp/duplicates.txt | sort -u | sed "s/^\(.*\)$/UPDATE item SET last_modified=NOW() WHERE uuid='\1';/" > /tmp/update-timestamp.sql
- Tezira said she was having trouble archiving submissions
- In the afternoon I looked and found a high number of locks:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c | sort -n
60 dspaceCli
176 dspaceApi
1194 dspaceWeb
!PostgreSQL database locks
- The timing looks suspiciously close to when I was running the batch updates on the ILRI community this morning.
- I restarted Tomcat and PostgreSQL and everything was back to normal
- I found some items on CGSpace in Dinka, Ndogo, and Bari languages, but the
dcterms.language
field was “other”
- That’s so unfortunate! These languages are not in ISO 639-1, but they are in ISO 639-3, which uses Alpha 3 and has more space for languages
- I changed them from other to use the three-letter codes, and I will suggest to the CG Core group that we use ISO 639-3 in the future