CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

August, 2021

2021-08-01

  • Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
  • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
  • First I ran all existing updates, took some backups, checked for broken packages, and then rebooted:
# apt update && apt dist-upgrade
# apt autoremove && apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# dpkg -C
# dpkg -l > 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
  • … but of course it hit the libxcrypt bug
  • I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually
# apt install -f
# apt dist-upgrade
# reboot
  • After rebooting I purged all packages with residual configs and cleaned up again:
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove && apt autoclean

2021-08-02

  • Help Udana with OAI validation on CGSpace

2021-08-03

  • Run fresh re-harvest on AReS

2021-08-05

  • Have a quick call with Mishell Portilla from CIP about a journal article that was flagged as being in a predatory journal (Beall’s List)
    • We agreed to unmap it from RTB’s collection for now, and I asked for advice from Peter and Abenet for what to do in the future
  • A developer from the Alliance asked for access to the CGSpace database so they can make some integration with PowerBI
    • I told them we don’t allow direct database access, and that it would be tricky anyways (that’s what APIs are for!)
  • I’m curious if there are still any requests coming in to CGSpace from the abusive Russian networks
    • I extracted all the unique IPs that nginx processed in the last week:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
  • Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
    • Indeed, there are no IPs from those networks coming in now:
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
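
The csvgrep/csvcut filter above can also be sketched in Python, which is handy if the check ends up inside a script. This is only a minimal sketch: it assumes the CSV written by resolve-addresses-geoip2.py has at least the ip and asn columns (as the csvgrep and csvcut calls imply), and the sample rows are made up using documentation IP ranges.

```python
import csv
import io

# ASNs of the abusive networks, copied from the csvgrep pattern above
BLOCKED_ASNS = {"49453", "46844", "206485", "62282", "36352", "35913", "35624", "8100"}


def ips_in_blocked_asns(csv_text, blocked=BLOCKED_ASNS):
    """Return the sorted unique IPs whose asn column is in the blocked set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row["ip"] for row in reader if row["asn"].strip() in blocked})


# Hypothetical sample rows (not real log data)
sample = "ip,asn\n192.0.2.10,49453\n198.51.100.7,16509\n192.0.2.10,49453\n"
print(ips_in_blocked_asns(sample))
```

An empty result corresponds to the zero-line file above: nothing left to purge.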

2021-08-08

  • Advise IWMI colleagues on best practices for thumbnails
  • Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
    • I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:
    • WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific
    • WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea
  • Looking at the July Solr statistics to find the top IP and user agents, looking for anything strange
    • 35.174.144.154 made 11,000 requests last month with the following user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
  • That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP
  • I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
    • Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so
    • Same deal with 93.158.90.91, also in Sweden
  • 3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart
  • 61.143.40.50 is in China and uses this hilarious user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
  • 47.252.80.214 is owned by Alibaba in the US and has the same user agent
  • 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
  • 95.87.154.12 seems to be a new bot with the following user agent:
Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
  • They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
    • I will purge the hits and add them to our list of bot overrides in the mean time before I submit it to COUNTER-Robots
  • I see a new bot using this user agent:
nettle (+https://www.nettle.sk)
  • 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.
  • 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
  • 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
  • There are probably more, but those are most of the ones with over 1,000 hits last month, so I will purge them:
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics

Total number of bot hits purged: 90485
  • Then I purged a few thousand more by user agent:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri 
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics

Total number of hits from bots: 4492
  • I found some CGSpace metadata in the wrong fields
    • Seven metadata in dc.subject (57) should be in dcterms.subject (187)
    • Twelve metadata in cg.title.journal (202) should be in cg.journal (251)
    • Three dc.identifier.isbn (20) should be in cg.isbn (252)
    • Three dc.identifier.issn (21) should be in cg.issn (253)
    • I re-ran the migrate-fields.sh script on CGSpace
  • I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:
$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
  • Then in OpenRefine I merged all null, blank, and en fields into the en_US one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
    • In total it was a few thousand metadata entries, so I had to split the CSV with xsv split in order to process it
    • I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes…
    • In total it was 1,195 changes to ISSN and ISBN metadata fields
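
The cleanup itself was done by hand in OpenRefine, but the "remove everything other than the ISSNs themselves" step could be sketched like this. It is a rough sketch, not the process actually used: the function names are hypothetical, it assumes DSpace's "||" multi-value separator in the exported CSV, and it applies the standard ISSN check digit (modulus 11 with weights 8 down to 2).

```python
import re

ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dX]$")


def split_values(cell, separator="||"):
    """Split a multi-value CSV cell and remove all spaces from each value."""
    return [v.replace(" ", "") for v in cell.split(separator) if v.strip()]


def is_valid_issn(value):
    """Check the NNNN-NNNC format and the ISSN check digit."""
    issn = value.strip().upper()
    if not ISSN_PATTERN.match(issn):
        return False
    digits = issn.replace("-", "")
    # Weight the first seven digits 8, 7, ..., 2 and sum them
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


for candidate in split_values("0003-1305 || 0003-1306"):
    print(candidate, is_valid_issn(candidate))
```

A check like this only catches malformed values and typos; it says nothing about whether the ISSN is actually registered.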

2021-08-09

  • Extract all unique ISSNs to look up on Sherpa Romeo and Crossref
$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
  • Then I updated the CSV headers for each and joined the CSVs on the issn column:
$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
  • In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:
if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
  • Then I exported the list of journals that differ and sent it to Peter for comments and corrections
    • I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it
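
For reference, the OpenRefine steps (fill blanks from the other column, then flag same or different) can be approximated in Python. This is a sketch only: the column names come from the sed commands above, the sample row is hypothetical, and unlike the GREL expression it strips surrounding whitespace before comparing.

```python
def compare_titles(row):
    """Fill a blank title from the other source, then flag same/different."""
    sherpa = (row.get("sherpa romeo journal title") or "").strip()
    crossref = (row.get("crossref journal title") or "").strip()
    # Mirror the manual step: copy the value across when one side is blank
    sherpa = sherpa or crossref
    crossref = crossref or sherpa
    return "same" if sherpa == crossref else "different"


# Hypothetical row, as csv.DictReader would return it from the joined CSV
row = {
    "issn": "0000-0000",
    "sherpa romeo journal title": "Example Journal",
    "crossref journal title": "The Example Journal",
}
print(compare_titles(row))
```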
  • Convert my generate-thumbnails.py script to use libvips instead of GraphicsMagick
    • It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
    • One drawback is that libvips uses Poppler instead of GraphicsMagick, which apparently means that it can’t work in CMYK
    • I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I’m not sure…
    • Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
  • I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg 
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
24736:0.04

2021-08-11

  • Peter got back to me about the journal title cleanup
    • From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref’s
    • Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref as well as Peter’s selections for where Sherpa Romeo and Crossref differed:
$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
  • Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
  • I exported a list of all the journal titles we have in the cg.journal field:
localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
  • I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
    • I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
    • Or instead of doing it via SQL I could use CSV and parse the values there…
  • A few more issues:
    • Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org’s web search (their API is invite only)
    • Some titles are different across all three datasets, for example ISSN 0003-1305:
  • I also realized that our previous controlled vocabulary came from CGSpace’s top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
    • Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
    • I requested access to the issn.org OAI-PMH API so I can use their registry…
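
The manual duplicate removal when merging the two vocabularies could be partly automated. A minimal sketch, assuming duplicates differ only in case or whitespace (real duplicates with different punctuation or subtitles would still need a manual pass); the function name and sample titles are hypothetical.

```python
def merge_vocabularies(previous, new):
    """Merge two title lists, preferring the spelling from the new list."""
    merged = {}
    for title in list(new) + list(previous):
        cleaned = " ".join(title.split())  # collapse runs of whitespace
        key = cleaned.casefold()           # compare case-insensitively
        if key and key not in merged:
            merged[key] = cleaned
    return sorted(merged.values(), key=str.casefold)


previous = ["Food Policy", "agricultural systems"]
new = ["Agricultural Systems", "Food  Policy"]
print(merge_vocabularies(previous, new))
```

Entries from the new list win because they are processed first, which matches the idea that the freshly generated titles are the corrected ones.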

2021-08-12

  • I sent an email to Sherpa Romeo’s help contact to ask about missing ISSNs
    • They pointed me to their inclusion criteria and said that missing journals should submit their open access policies to be included
  • The contact from issn.org got back to me and said I should pay 1,000 EUR/year for 100,000 requests to their API… no thanks
  • Submit a pull request to COUNTER-Robots for the httpx bot (#45)
    • In the mean time I added it to our local ILRI overrides

2021-08-15

  • Start a fresh reindex on AReS

2021-08-16

  • Meeting with Abenet and Peter about CGSpace actions and future
    • We agreed to move three top-level Feed the Future projects into one community, so I created one and moved them:
$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
  • I made a minor fix to OpenRXV to prefix all image names with docker.io so it works with fewer changes on podman
    • Docker assumes the docker.io registry by default, but we should be explicit

2021-08-17

  • I made an initial attempt on the policy statements page on DSpace Test
    • It is modeled on Sherpa Romeo’s OpenDOAR policy statements advice
  • Sit with Moayad and discuss the future of AReS
    • We specifically discussed formalizing the API and documenting its use, so it can serve as an alternative to harvesting directly from CGSpace
    • We also discussed allowing linking to search results to enable something like “Explore this collection” links on CGSpace collection pages
  • Lower case all AGROVOC metadata, as I had noticed a few in sentence case:
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 484
  • Also update some DOIs using the dx.doi.org format, just to keep things uniform:
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 469
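
The same DOI normalization is useful before metadata ever reaches the database, for example when cleaning a CSV. A small sketch of the equivalent rewrite in Python; unlike the SQL above it anchors the pattern at the start of the value, escapes the dots, and also accepts plain http, which is an assumption beyond what the query does.

```python
import re

# Matches the legacy resolver prefix only at the start of the value
DX_PREFIX = re.compile(r"^https?://dx\.doi\.org/", re.IGNORECASE)


def normalize_doi(url):
    """Rewrite dx.doi.org URLs to the doi.org form, leaving others untouched."""
    return DX_PREFIX.sub("https://doi.org/", url.strip())


# 10.1000/xyz123 is a placeholder DOI, not a real CGSpace item
print(normalize_doi("https://dx.doi.org/10.1000/xyz123"))
```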
  • Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    322m16.917s
user    226m43.121s
sys     3m17.469s
  • I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
    -H 'Content-Type: application/json' \
    -d '{
    "size": 10,
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "repo.keyword": "CGSpace"
                }
            }
        }
    }
}'
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
  • This uses the Elasticsearch scroll ID to page through results
    • The second query doesn’t need the request body because it is saved for 1 day as part of the first request
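
The two curl calls generalize to a loop that keeps following the scroll ID until a page comes back empty. This is a sketch under assumptions: it presumes the API passes through the usual Elasticsearch response shape (a top-level _scroll_id and hits under hits.hits), which I have not verified against OpenRXV, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE = "https://cgspace.cgiar.org/explorer/api/search"


def http_post(url, body=None):
    """POST a JSON body (or an empty one) and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else b""
    request = urllib.request.Request(
        url, data=data, method="POST", headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def scroll_all(query, post=http_post, base=BASE):
    """Yield every hit, following the scroll ID until a page is empty."""
    page = post(f"{base}?scroll=1d", query)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Follow-up requests carry only the scroll ID, no request body
        page = post(f"{base}/scroll/{page['_scroll_id']}")


QUERY = {
    "size": 10,
    "query": {"bool": {"filter": {"term": {"repo.keyword": "CGSpace"}}}},
}
# Usage (performs real HTTP requests): for hit in scroll_all(QUERY): ...
```

Passing the post function as a parameter keeps the paging logic testable without touching the network.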
  • Attempt to re-do my tests with VisualVM from 2019-04
    • I found that I can’t connect to the Tomcat JMX port using SSH forwarding (visualvm gives an error about localhost already being monitored)
    • Instead, I had to create a SOCKS proxy with SSH (ssh -D 8096), then set that up as a proxy in the VisualVM network settings, and then add the JMX connection
    • See: https://dzone.com/articles/visualvm-monitoring-remote-jvm
    • I have to spend more time on this…
  • I fixed a bug in the Altmetric donuts on OpenRXV
  • I improved the quality of the “no thumbnail” placeholder image on AReS: https://github.com/ilri/OpenRXV/pull/114
  • I sent some feedback to some ILRI and CCAFS colleagues about how to use better thumbnails for publications

2021-08-24

  • In the last few days I did a lot of work on OpenRXV
    • I started exploring the Angular 9.0 to 9.1 update
    • I tested some updates to dependencies for Angular 9 that we somehow missed, like @tinymce/tinymce-angular, @nicky-lenaers/ngx-scroll-to, and @ng-select/ng-select
    • I changed the default target from ES5 to ES2015 because ES5 was released in 2009 and the only thing we lose by moving to ES2015 is IE11 support
    • I fixed a handful of issues in the Docker build and deployment process
    • I started exploring changing the Docker configuration from using volumes to COPY instructions in the Dockerfile because we are having sporadic issues with permissions in containers caused by copying the host’s frontend/backend directories and not being able to write to them
    • I tested moving from node-sass to sass, as it has been supported since Angular 8 apparently and will allow us to avoid stupid node-gyp issues

2021-08-25

  • I did a bunch of tests of the OpenRXV Angular 9.1 update and merged it to master (#115)
  • Last week Maria Garruccio sent me a handful of new ORCID identifiers for Bioversity staff
    • We currently have 1320 unique identifiers, so this adds eleven new ones:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
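
The grep/sort/uniq pipeline above is easy to mirror in Python if the merge ever moves into a script. A minimal sketch using the same pattern as the grep; the XML snippet is a hypothetical stand-in for the controlled vocabulary file, whose exact structure is not shown here.

```python
import re

# Same pattern as the grep -oE call above
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts):
    """Extract all ORCID-shaped strings from the inputs, sorted and unique."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)


# Hypothetical inputs: a vocabulary fragment and a plain list of identifiers
vocabulary = '<node id="Eric Rahn: 0000-0001-6280-7430"/>'
new_ids = "0000-0001-6280-7430\n0000-0002-1978-8833\n"
print(unique_orcids(vocabulary, new_ids))
```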
  • After I combined them and removed duplicates, I resolved all the names using my resolve-orcids.py script:
$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
  • Tag existing items from the Alliance’s new authors with ORCID iDs using add-orcid-identifiers-csv.py (181 new metadata fields added):
$ cat 2021-08-25-add-orcids.csv 
dc.contributor.author,cg.creator.identifier
"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
"Jager M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304" 
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
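
One thing worth checking before running a CSV like this is the spacing around the colon in the cg.creator.identifier column (one row above has " :" instead of ": "). A small sketch of a normalizer, assuming the intended form is "Name: 0000-0000-0000-0000"; whether add-orcid-identifiers-csv.py already tolerates the variant is not something I have checked, and the function name is hypothetical.

```python
import re

IDENTIFIER = re.compile(
    r"^(?P<name>.+?)\s*:\s*(?P<orcid>\d{4}-\d{4}-\d{4}-\d{3}[\dX])$"
)


def normalize_identifier(value):
    """Return the identifier as 'Name: ORCID', fixing spacing around the colon."""
    match = IDENTIFIER.match(value.strip())
    if match is None:
        raise ValueError(f"not a 'Name: ORCID' value: {value!r}")
    return f"{match['name']}: {match['orcid']}"


print(normalize_identifier("Ivy Kinyua :0000-0002-1978-8833"))
print(normalize_identifier("Groot, J.C.J.: 0000-0001-6516-5170"))
```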

2021-08-29

  • Run a full harvest on AReS
  • Also did more work on OpenRXV over the past few days
    • I switched the backend target from ES2017 to ES2019
    • I did a proof of concept with multi-stage builds and simplifying the Docker configuration
  • Update the list of ORCID identifiers on CGSpace
  • Run system updates and reboot CGSpace (linode18)

2021-08-31

  • Yesterday I finished the work to make OpenRXV use a new multi-stage Docker build system and use smarter COPY instructions instead of runtime volumes
    • Today I merged the changes to the master branch and re-deployed AReS on linode20
    • Because the docker-compose.yml moved to the root, the Docker volume prefix changed from docker_ to openrxv_, so I had to stop the containers and rsync the data from the old volume to the new one in /var/lib/docker