CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

August, 2021

2021-08-01

  • Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
  • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
  • First I ran all pending updates, took some backups, checked for broken packages, and rebooted:
# apt update && apt dist-upgrade
# apt autoremove && apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# dpkg -C
# dpkg -l > 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
  • … but of course it hit the libxcrypt bug
  • I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually
# apt install -f
# apt dist-upgrade
# reboot
  • After rebooting I purged all packages with residual configs and cleaned up again:
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove && apt autoclean
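The residual-config check above can also be done on a saved dpkg listing such as 2021-08-01-linode20-dpkg.txt. This is a minimal sketch, assuming the usual `dpkg -l` layout where the first column is the state code and the second is the package name. The function name and the sample rows are hypothetical.

```python
def residual_config_packages(dpkg_output):
    """Return names of packages in state 'rc' (removed, config files remain)."""
    packages = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Package rows start with a state code such as 'ii' or 'rc';
        # header rows never have 'rc' as their first field.
        if len(fields) >= 2 and fields[0] == "rc":
            packages.append(fields[1])
    return packages


# Hypothetical excerpt of `dpkg -l` output, for illustration only
SAMPLE = """\
Desired=Unknown/Install/Remove/Purge/Hold
||/ Name           Version      Architecture Description
+++-==============-============-============-=================
ii  bash           5.0-6ubuntu1 amd64        GNU Bourne Again SHell
rc  libfoo1:amd64  1.2-3        amd64        hypothetical removed library
rc  oldtool        0.9-1        all          hypothetical removed tool
"""

print(residual_config_packages(SAMPLE))
```

Running it against the listing saved before the upgrade would show what `dpkg -P` is about to purge without touching the system.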

2021-08-02

  • Help Udana with OAI validation on CGSpace

2021-08-03

  • Run fresh re-harvest on AReS

2021-08-05

  • Have a quick call with Mishell Portilla from CIP about a journal article that was flagged as being in a predatory journal (Beall’s List)
    • We agreed to unmap it from RTB’s collection for now, and I asked for advice from Peter and Abenet for what to do in the future
  • A developer from the Alliance asked for access to the CGSpace database so they can make some integration with PowerBI
    • I told them we don’t allow direct database access, and that it would be tricky anyways (that’s what APIs are for!)
  • I’m curious if there are still any requests coming in to CGSpace from the abusive Russian networks
    • I extracted all the unique IPs that nginx processed in the last week:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
  • Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
    • Indeed, I see that there are no IPs from those networks coming in now:
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
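The csvgrep and csvcut step above can be sketched in Python as well. This assumes only what the commands imply, namely that the resolved CSV has `ip` and `asn` columns. The function name and the sample rows (documentation-range addresses) are hypothetical.

```python
import csv
import io

# ASNs taken from the csvgrep regex above
ASNS_TO_CHECK = {"49453", "46844", "206485", "62282", "36352", "35913", "35624", "8100"}


def ips_in_asns(csv_text, asns=ASNS_TO_CHECK):
    """Return the sorted, unique IPs whose asn column is in the given set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row["ip"] for row in reader if row["asn"].strip() in asns})


# Hypothetical rows for illustration only
SAMPLE_CSV = (
    "ip,asn\n"
    "192.0.2.1,49453\n"
    "198.51.100.7,15169\n"
    "192.0.2.1,49453\n"
)

print(ips_in_asns(SAMPLE_CSV))
```

An empty result corresponds to the zero-line file above, meaning none of the week's IPs belong to those networks.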

2021-08-08

  • Advise IWMI colleagues on best practices for thumbnails
  • Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
    • I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:
    • WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific
    • WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea
  • Looking at the July Solr statistics to find the top IP and user agents, looking for anything strange
    • 35.174.144.154 made 11,000 requests last month with the following user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
  • That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP
  • I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
    • Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so
    • Same deal with 93.158.90.91, also in Sweden
  • 3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart
  • 61.143.40.50 is in China and uses this hilarious user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
  • 47.252.80.214 is owned by Alibaba in the US and has the same user agent
  • 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
  • 95.87.154.12 seems to be a new bot with the following user agent:
Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
  • They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
    • I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots
  • I see a new bot using this user agent:
nettle (+https://www.nettle.sk)
  • 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.
  • 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
  • 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
  • There are probably more, but that’s most of the ones with over 1,000 hits last month, so I will purge them:
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics

Total number of bot hits purged: 90485
  • Then I purged a few thousand more by user agent:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri 
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics

Total number of hits from bots: 4492
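To illustrate matching by user agent, here is a minimal sketch. It assumes the agents file holds one regular expression per line and that matching is a case-sensitive search, neither of which is confirmed here. The function name is hypothetical and this is not how check-spider-hits.sh itself works.

```python
import re

# Patterns taken from the output above; the one-regex-per-line file format is an assumption
PATTERNS = ["MaCoCu", "nettle"]


def matching_pattern(user_agent, patterns=PATTERNS):
    """Return the first pattern found in the user agent, or None if none match."""
    for pattern in patterns:
        if re.search(pattern, user_agent):
            return pattern
    return None


# User agents copied from the notes above
AGENTS = [
    "Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/",
    "nettle (+https://www.nettle.sk)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
]

for agent in AGENTS:
    print(matching_pattern(agent), "<-", agent[:40])
```

The browser-like agent is not caught by either pattern, which is why those scrapers had to be purged by IP instead.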
  • I found some CGSpace metadata in the wrong fields
    • Seven metadata in dc.subject (57) should be in dcterms.subject (187)
    • Twelve metadata in cg.title.journal (202) should be in cg.journal (251)
    • Three dc.identifier.isbn (20) should be in cg.isbn (252)
    • Three dc.identifier.issn (21) should be in cg.issn (253)
    • I re-ran the migrate-fields.sh script on CGSpace
  • I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:
$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
  • Then in OpenRefine I merged all null, blank, and en fields into the en_US one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSNs/ISBNs themselves
    • In total it was a few thousand metadata entries, so I had to split the CSV with xsv split in order to process it
    • I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes…
    • In total it was 1,195 changes to ISSN and ISBN metadata fields
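Part of the ISSN cleanup could be scripted. This is a minimal sketch, not the OpenRefine steps used above: it strips whitespace, checks the NNNN-NNNC shape, and verifies the standard ISSN check digit (weights 8 down to 2 over the first seven digits, modulo 11, with 10 written as X). The function names are hypothetical.

```python
import re

ISSN_SHAPE = re.compile(r"\d{4}-\d{3}[\dX]")


def clean_issn(raw):
    """Remove all whitespace and uppercase; return None if not ISSN-shaped."""
    value = re.sub(r"\s+", "", raw).upper()
    return value if ISSN_SHAPE.fullmatch(value) else None


def issn_check_digit_ok(issn):
    """Verify the final character of an ISSN-shaped string against its check digit."""
    digits = issn.replace("-", "")
    # Weights run 8, 7, 6, 5, 4, 3, 2 across the first seven digits
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


for raw in [" 0003-1305 ", "0003- 1305", "0003-1306", "not an issn"]:
    cleaned = clean_issn(raw)
    print(repr(raw), cleaned, issn_check_digit_ok(cleaned) if cleaned else None)
```

Values that fail the shape test or the check digit would still need a manual look, since the script cannot tell a typo from an ISBN pasted into the wrong field.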

2021-08-09

  • Extract all unique ISSNs to look up on Sherpa Romeo and Crossref:
$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
  • Then I updated the CSV headers for each and joined the CSVs on the issn column:
$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
  • In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:
if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
  • Then I exported the list of journals that differ and sent it to Peter for comments and corrections
    • I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it
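The join and comparison steps above can be sketched without csvkit and OpenRefine. This assumes the two lookups produce an `issn` column plus the renamed title columns, and that the join is an inner join on `issn` (csvjoin's default). The function name, ISSNs, and titles in the sample are hypothetical.

```python
import csv
import io


def compare_titles(sherpa_csv, crossref_csv):
    """Inner-join two title CSVs on issn and flag each pair as same or different."""
    sherpa = {r["issn"]: r["sherpa romeo journal title"]
              for r in csv.DictReader(io.StringIO(sherpa_csv))}
    crossref = {r["issn"]: r["crossref journal title"]
                for r in csv.DictReader(io.StringIO(crossref_csv))}
    rows = []
    for issn in sorted(sherpa.keys() & crossref.keys()):
        # If one source has a blank title, copy the value from the other
        s = sherpa[issn] or crossref[issn]
        c = crossref[issn] or sherpa[issn]
        rows.append({"issn": issn, "sherpa": s, "crossref": c,
                     "status": "same" if s == c else "different"})
    return rows


# Hypothetical data for illustration only
SHERPA = ("issn,sherpa romeo journal title\n"
          "0000-0001,Journal A\n0000-0002,\n0000-0003,Journal C\n")
CROSSREF = ("issn,crossref journal title\n"
            "0000-0001,Journal A\n0000-0002,Journal B\n"
            "0000-0003,Journal of C\n0000-0004,Journal D\n")

for row in compare_titles(SHERPA, CROSSREF):
    print(row)
```

Only the rows marked different would need to go to a human for review.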
  • Convert my generate-thumbnails.py script to use libvips instead of GraphicsMagick
    • It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
    • One drawback is that libvips uses Poppler instead of GraphicsMagick, which apparently means that it can’t work in CMYK
    • I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I’m not sure…
    • Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
  • I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg 
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
24736:0.04
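For repeating these measurements from a script, here is a rough sketch that approximates `/usr/bin/time -f %M:%e` with Python's standard library. It assumes a Unix system, and the function name is hypothetical. Note that `ru_maxrss` for child processes is the maximum over every child this interpreter has waited for, so each command should be measured from a fresh Python process, and the unit is kilobytes on Linux.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (peak RSS of waited-for children, elapsed seconds)."""
    start = time.monotonic()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.monotonic() - start
    # High-water mark across all children reaped so far by this process
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak, elapsed


# Stand-in command so the sketch runs anywhere; replace it with the
# vipsthumbnail, gm, or convert invocations above when they are installed
peak, elapsed = measure([sys.executable, "-c", "pass"])
print(f"{peak}:{elapsed:.2f}")
```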

2021-08-11

  • Peter got back to me about the journal title cleanup
    • From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref’s
    • Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref as well as Peter’s selections for where Sherpa Romeo and Crossref differed:
$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
  • Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
  • I exported a list of all the journal titles we have in the cg.journal field:
localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
  • I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
    • I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
    • Or instead of doing it via SQL I could use CSV and parse the values there…
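As a rough idea of what an automated first pass over the CSV could look like, here is a sketch that suggests a controlled-vocabulary title for each existing value. It uses Python's difflib similarity ratio, which is a stand-in and not the algorithm reconcile-csv uses. The function name, the 0.9 cutoff, and the sample titles are assumptions.

```python
import difflib


def suggest_title(title, vocabulary, cutoff=0.9):
    """Return the vocabulary entry closest to title, or None if nothing is close enough."""
    by_folded = {v.casefold(): v for v in vocabulary}
    folded = title.strip().casefold()
    # Exact match ignoring case and surrounding whitespace
    if folded in by_folded:
        return by_folded[folded]
    # Otherwise fall back to the single best fuzzy match above the cutoff
    matches = difflib.get_close_matches(folded, list(by_folded), n=1, cutoff=cutoff)
    return by_folded[matches[0]] if matches else None


# Sample values for illustration only
VOCABULARY = ["Agricultural Systems", "Food Policy"]

for candidate in ["agricultural systems ", "Agricultural System", "Nature"]:
    print(repr(candidate), "->", suggest_title(candidate, VOCABULARY))
```

A high cutoff keeps the automatic matches conservative, leaving the ambiguous titles for manual review or an ISSN-based lookup.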
  • A few more issues:
    • Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org’s web search (their API is invite only)
    • Some titles are different across all three datasets, for example ISSN 0003-1305:
  • I also realized that our previous controlled vocabulary came from CGSpace’s top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
    • Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
    • I requested access to the issn.org OAI-PMH API so I can use their registry…
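Merging the previous vocabulary with the new one could be partly automated. This is a minimal sketch, assuming duplicates differ only by case or spacing and that the new list's spelling should win. The function name and the sample titles are hypothetical.

```python
def merge_vocabularies(previous, new):
    """Merge two title lists, dropping duplicates that differ only by case or spacing."""
    merged = {}
    # New titles go first so their spelling is kept when both lists have an entry
    for title in list(new) + list(previous):
        normalized = " ".join(title.split())
        key = normalized.casefold()
        if key and key not in merged:
            merged[key] = normalized
    return sorted(merged.values(), key=str.casefold)


# Sample values for illustration only
PREVIOUS = ["Food Policy", "agricultural systems"]
NEW = ["Agricultural Systems", "World  Development"]

print(merge_vocabularies(PREVIOUS, NEW))
```

Near-duplicates with real spelling differences would still need to be removed by hand.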