CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

January, 2023

2023-01-01

  • Apply some more ORCID identifiers to items on CGSpace using my 2022-09-22-add-orcids.csv file
    • I want to update all ORCID names and refresh them in the database
    • I see we have some new ones that aren’t in our list if I combine with this file:
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep - oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l                         
1939
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
1973
  • I will extract and process them with my resolve-orcids.py script:
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-01-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2023-01-01-orcids.txt -o /tmp/2023-01-01-orcids-names.txt -d
$ ./ilri/update-orcids.py -i /tmp/2023-01-01-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
  • Load on CGSpace is high around 9.x
    • I see there is a CIAT bot harvesting via the REST API with IP 45.5.186.2
    • Other than that I don’t see any particular system stats as alarming
    • There has been a marked increase in load in the last few weeks, perhaps due to Initiative activity…
    • Perhaps there are some stuck PostgreSQL locks from CLI tools?
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
     58 dspaceCli
     46 dspaceWeb
  • The current time on the server is 08:52 and I see the dspaceCli locks were started at 04:00 and 05:00… so I need to check which cron jobs those belong to as I think I noticed this last month too
    • I’m going to wait and see if they finish, but by tomorrow I will kill them

2023-01-02

  • The load on the server is now very low and there are no more locks from dspaceCli
    • So there was some long-running process that was running and had to finish!
    • That finally sheds some light on the “high load on Sunday” problem where I couldn’t find any other distinct pattern in the nginx or Tomcat requests

2023-01-03

  • The load from the server on Sundays, which I have noticed for a long time, seems to be coming from the DSpace checker cron job
    • This checks the checksums of all bitstreams to see if they match the ones in the database
  • I exported the entire CGSpace metadata to do country/region checks with csv-metadata-quality
    • I extracted only the items with countries, which was about 48,000, then split the file into parts of 10,000 items, but the upload found 2,000 changes in the first one and took several hours to complete…
  • IWMI sent me ORCID identifiers for new scientsts, bringing our total to 2,010

2023-01-04

  • I finally finished applying the region imports (in five batches of 10,000)
    • It was about 7,500 missing regions in total…
  • Now I will move on to doing the Initiative mappings
    • I modified my fix-initiative-mappings.py script to only write out the items that have updated mappings
    • This makes it way easier to apply fixes to the entire CGSpace because we don’t try to import 100,000 items with no changes in mappings
  • More dspaceCli locks from 04:00 this morning (current time on server is 07:33) and today is a Wednesday
    • The checker cron job runs on 0,3, which is Sunday and Wednesday, so this is from that…
    • Finally at 16:30 I decided to kill the PIDs associated with those locks…
    • I am going to disable that cron job for now and watch the server load for a few weeks
  • Start a harvest on AReS

2023-01-08

  • It’s Sunday and I see some PostgreSQL locks belonging to dspaceCli that started at 05:00
    • That’s strange because I disabled the dspace checker one last week, so I’m not sure which this is…
    • It’s currently 2:30PM on the server so these locks have been there for almost twelve hours
  • I exported the entire CGSpace to update the Initiative mappings
    • Items were mapped to ~58 new Initiative collections
  • Then I ran the ORCID import to catch any new ones that might not have been tagged
  • Then I started a harvest on AReS

2023-01-09

  • Fix some invalid Initiative names on CGSpace and then check for missing mappings
  • Check for missing regions in the Initiatives collection
  • Export a list of author affiliations from the Initiatives community for Peter to check
    • Was slightly ghetto because I did it from a CSV export of the Initiatives community, then imported to OpenRefine to split multi-value fields, then did some sed nonsense to handle the quoting:
$ csvcut -c 'cg.contributor.affiliation[en_US]' ~/Downloads/2023-01-09-initiatives.csv | \
  sed -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' | \
  sort -u | \
  sed -e 's/^\(.*\)/"\1/' -e 's/\(.*\)$/\1"/' > /tmp/2023-01-09-initiatives-affiliations.csv

2023-01-10

  • Export the CGSpace Initiatives collection to check for missing regions and collection mappings

2023-01-11

  • I’m trying the DSpace 7 REST API again
$ curl --head https://dspace7test.ilri.org/server/api
...
set-cookie: DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519; Path=/server; Secure; HttpOnly; SameSite=None
dspace-xsrf-token: 42c78c56-613d-464f-89ea-79142fc5b519
$ curl -v -X POST https://dspace7test.ilri.org/server/api/authn/login --data "user=alantest%40cgiar.org&password=dspace" -H "X-XSRF-TOKEN: 42c78c56-613d-464f-89ea-79142fc5b519" -b "DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519"
...
authorization: Bearer eyJh...9-0
$ curl -v "https://dspace7test.ilri.org/api/core/items" -H "Authorization: Bearer eyJh...9-0"
  • I created a pull request to fix the docs
  • I did quite a lot of cleanup and updates on the IFPRI batch items for the Gender Equality batch upload
    • Then I uploaded them to CGSpace
  • I added about twenty more ORCID identifiers to my list and tagged them on CGSpace

2023-01-12

  • I exported the entire CGSpace and did some cleanups on all metadata in OpenRefine
    • I was primarily interested in normalizing the DOIs, but I also normalized a bunch of publishing places
    • After this imports I will export it again to do the Initiative and region mappings
    • I ran the fix-initiative-mappings.py script and got forty-nine new mappings…
  • I added several dozen new ORCID identifiers to my list and tagged ~500 on CGSpace
  • Start a harvest on AReS

2023-01-13

  • Do a bit more cleanup on licenses, issue dates, and publishers
    • Then I started importing my large list of 5,000 items changed from yesterday
  • Help Karen add abstracts to a bunch of SAPLING items that were missing them on CGSpace
    • For now I only did open access journal articles, but I should do the reports and others too

2023-01-14

  • Export CGSpace and check for missing Initiative mappings
    • There were a total of twenty-five
    • Then I exported the Initiatives communinty to check the countries and regions

2023-01-15

  • Start a harvest on AReS

2023-01-16

  • Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems
  • Batch import another twenty-eight items for IFPRI across several Initiatives
    • On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
    • I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
    • Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values

2023-01-17

  • Batch import another twenty-three items for IFPRI across several Initiatives
    • I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
    • I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
    • Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669
    • Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
  • I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality
  • I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace
  • There is a high load on CGSpace pretty regularly
    • Looking at Munin it shows there is a marked increase in DSpace sessions the last few weeks:

DSpace sessions year

  • Is this attributable to all the PRMS harvesting?
  • I also see some PostgreSQL locks starting earlier today:

PostgreSQL locks day

  • I’m curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:
# zcat --force /var/log/nginx/{rest,access,library-access,oai}.log /var/log/nginx/{rest,access,library-access,oai}.log.1 /var/log/nginx/{rest,access,library-access,oai}.log.{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}.gz | awk '{print $1}' | sort | uniq > /tmp/2023-01-17-cgspace-ips.txt
# wc -l /tmp/2023-01-17-cgspace-ips.txt 
129446 /tmp/2023-01-17-cgspace-ips.txt
  • I ran the IPs through my resolve-addresses-geoip2.py script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):
$ csvgrep -c asn -r '^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$' \
  /tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
  sed 1d | sort | uniq > /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt 
776 /tmp/networks-to-block.txt
  • I added the list of networks to nginx’s bot-networks.conf so they will all be heavily rate limited
  • Looking at the Munin stats again I see the load has been extra high since yesterday morning:

CPU week

  • But still, it’s suspicious that there are so many PostgreSQL locks
  • Looking at the Solr stats to check the hits the last month (actually I skipped December because I was so busy)
    • I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)
    • I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request
    • I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request
    • I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request
    • I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request
    • I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamlytics, but their user agent is all lower case and it’s a data center ISP so nope
    • I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request
    • I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request
    • I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request
    • I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request
    • I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request
    • I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request
    • I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request
    • I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request
    • I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request
    • I see 176.97.210.106 is on Tube Hosting and is rate VERY BAD, malicious, scammy on everything I checked
    • I see 79.110.73.54 is on ALFA TELCOM / Serverel and is using a different, weird user agent with each request
    • There are too many to count… so I will purge these and then move on to user agents
  • I purged hits from those IPs:
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 439185 hits from 31.148.223.10 in statistics
Purging 2151 hits from 18.203.245.60 in statistics
Purging 1990 hits from 3.249.192.212 in statistics
Purging 1975 hits from 34.244.160.145 in statistics
Purging 1969 hits from 52.213.59.101 in statistics
Purging 2540 hits from 91.209.8.29 in statistics
Purging 1624 hits from 54.78.176.127 in statistics
Purging 1236 hits from 54.74.197.53 in statistics
Purging 1327 hits from 54.246.128.111 in statistics
Purging 1108 hits from 52.16.103.133 in statistics
Purging 1045 hits from 63.32.99.252 in statistics
Purging 999 hits from 176.34.141.181 in statistics
Purging 997 hits from 34.243.17.80 in statistics
Purging 985 hits from 34.240.206.16 in statistics
Purging 862 hits from 18.203.81.120 in statistics
Purging 1654 hits from 176.97.210.106 in statistics
Purging 1628 hits from 51.81.193.200 in statistics
Purging 1020 hits from 79.110.73.54 in statistics
Purging 842 hits from 35.153.105.213 in statistics
Purging 1689 hits from 54.164.237.125 in statistics

Total number of bot hits purged: 466826
  • Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
    • azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0
    • Gov employment data scraper ([[your email]])
    • Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)
    • crownpeak
    • Mozilla/5.0 (compatible)
  • Also, a ton of them are lower case, which I’ve never seen before… it might be possible, but looks super fishy to me:
    • mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0
    • mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36
    • mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36
    • mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0
    • mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36
    • mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36
    • mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36
    • mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36
  • I purged some of those:
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 1658 hits from azure-logic-apps\/1.0 in statistics
Purging 948 hits from Gov employment data scraper in statistics
Purging 786 hits from Microsoft\.Data\.Mashup in statistics
Purging 303 hits from crownpeak in statistics
Purging 332 hits from Mozilla\/5.0 (compatible) in statistics

Total number of bot hits purged: 4027
  • Then I ran all system updates on the server and rebooted it
    • Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers
    • I need to re-work how I’m doing this whitelisting and blacklisting… it’s way too complicated now
  • Export entire CGSpace to check Initiative mappings, and add nineteen…
  • Start a harvest on AReS