CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

April, 2019

2019-04-01

  • Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
    • They asked if we had plans to enable RDF support in CGSpace
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
    • I suspected that some might not be successful, because the statistics show fewer downloads, but today they were all HTTP 200!
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
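
The status-code check on the Spore PDF above can also be sketched in Python. This is only an illustration, assuming the access log uses nginx's default "combined" format. The function name `count_statuses` and the sample request path are hypothetical. One difference from the `grep -E` pipeline is that the IPs are compared exactly, so an address like 18.195.218.60 is not counted as 18.195.218.6.

```python
import re
from collections import Counter

# Assumption: nginx's default "combined" log format, where the status code
# comes right after the quoted request line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def count_statuses(lines, path_fragment, ips):
    """Count status codes for requests from `ips` whose request line
    contains `path_fragment`."""
    wanted = set(ips)
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if match["ip"] in wanted and path_fragment in match["request"]:
            counts[match["status"]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; the request path is made up for illustration.
    sample = [
        '18.196.196.108 - - [01/Apr/2019:10:00:00 +0200] '
        '"GET /example/Spore-192-EN-web.pdf HTTP/1.1" 200 1234 "-" "-"',
        '18.195.218.60 - - [01/Apr/2019:10:00:01 +0200] '
        '"GET /example/Spore-192-EN-web.pdf HTTP/1.1" 200 1234 "-" "-"',
    ]
    ips = ["18.196.196.108", "18.195.78.144", "18.195.218.6"]
    for status, count in count_statuses(sample, "Spore-192-EN-web.pdf", ips).items():
        print(count, status)
```

In practice the `lines` argument could be an open file handle for the access log, since the function only iterates over it.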

2019-04-02

  • CTA says the Amazon IPs are AWS gateways for real user traffic
  • I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
    • If I searched for “Felix” or “Shaw” I saw other matches, including one for his personal email address!
    • I ended up finding him via searching for his email address

2019-04-03

  • Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
    • First I need to extract the ones that are unique from their list compared to our existing one:
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
  • We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
  • Next I will resolve all their names using my resolve-orcids.py script:
$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
  • After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
  • One user’s name has changed so I will update the existing metadata values using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
  • I created a pull request and merged the changes to the 5_x-prod branch (#417)
  • A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:
2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
  • Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:
$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
      1 
      3 http://localhost:8081/solr//statistics-2017
   5662 http://localhost:8081/solr//statistics-2018
  • I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…
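
The ORCID merge step earlier in this entry (combining the existing vocabulary with the new list and keeping only unique identifiers) can be sketched in Python. This is a rough illustration that reuses the same regular expression as the `grep -oE` command. The function names and the sample text are hypothetical, and the real layout of cg-creator-id.xml is not assumed here. Unlike the shell pipeline, the sketch also reports which identifiers are actually new.

```python
import re

# Same pattern as the grep -oE command: four groups of four characters.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the set of ORCID-shaped identifiers found in `text`."""
    return set(ORCID_PATTERN.findall(text))


def merge_orcids(existing_text, new_text):
    """Return (combined, added): all unique identifiers, and only the new ones."""
    existing = extract_orcids(existing_text)
    incoming = extract_orcids(new_text)
    combined = sorted(existing | incoming)
    added = sorted(incoming - existing)
    return combined, added


if __name__ == "__main__":
    # Hypothetical inputs with made-up identifiers, for illustration only.
    existing_text = "Example One: 0000-0001-0000-0001\nExample Two: 0000-0002-0000-000X"
    new_text = "Example Two: 0000-0002-0000-000X\nExample Three: 0000-0003-0000-0003"
    combined, added = merge_orcids(existing_text, new_text)
    print(len(combined), "unique identifiers,", len(added), "new")
```

Writing `combined` out one identifier per line would give a file equivalent to the `sort -u` output that is then passed to resolve-orcids.py.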

2019-04-05

  • Uptime Robot reported that CGSpace (linode18) went down tonight
  • I see there are lots of PostgreSQL connections:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb
  • I still see those weird messages about updating the statistics-2018 Solr core:
2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
  • Looking at iostat 1 10 I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:

[Figure: Munin graph of CPU usage for the week]

  • The other thing visible there is that over the past few days the load has spiked to 500%, and I don’t think it’s a coincidence that the Solr updating thing is happening…
  • I ran all system updates and rebooted the server
    • The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:
statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
  • I restarted it again and all the Solr cores came up properly…
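
To keep an eye on the SolrLogger "Updating" messages seen on 2019-04-03 and again in this entry, the per-core count can be sketched in Python. This is a minimal illustration rather than an existing tool, and the function names are hypothetical. It matches the message text with a regular expression instead of relying on the field position used by `awk '{print $11}'`, so lines without a core URL are skipped rather than counted as an empty entry. What the two numbers in the message mean is not confirmed here, so they are returned as-is.

```python
import re
from collections import Counter

# Matches e.g. "SolrLogger @ Updating : 1754500/21701 docs in http://..."
UPDATE_PATTERN = re.compile(r"SolrLogger @ Updating : (\d+)/(\d+) docs in (\S+)")


def parse_update(line):
    """Return (first_number, second_number, core_url), or None if no match."""
    match = UPDATE_PATTERN.search(line)
    if match is None:
        return None
    first, second, core_url = match.groups()
    return int(first), int(second), core_url


def count_updates_by_core(lines):
    """Count "Updating" messages per Solr core URL."""
    counts = Counter()
    for line in lines:
        parsed = parse_update(line)
        if parsed is not None:
            counts[parsed[2]] += 1
    return counts


if __name__ == "__main__":
    # The two log lines quoted in these notes.
    log_lines = [
        "2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : "
        "1754500/21701 docs in http://localhost:8081/solr//statistics-2018",
        "2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : "
        "2444600/21697 docs in http://localhost:8081/solr//statistics-2018",
    ]
    for core, count in count_updates_by_core(log_lines).items():
        print(count, core)
```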

2019-04-06

  • Udana asked why item 10568/91278 didn’t have an Altmetric badge on CGSpace, even though it does on the WLE website
    • I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
    • I tweeted the item and I assume this will link the Handle with the DOI in the system
    • Twenty minutes later I see the same Altmetric score (9) on CGSpace
  • Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 18.195.78.144
    245 207.46.13.58
    303 207.46.13.194
    328 66.249.79.33
    564 207.46.13.210
    566 66.249.79.62
    575 40.77.167.66
   1803 66.249.79.59
   2834 2a01:4f8:140:3192::2
   9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     31 66.249.79.62
     41 207.46.13.210
     42 40.77.167.66
     54 42.113.50.219
    132 66.249.79.59
    785 2001:41d0:d:1990::
   1164 45.5.184.72
   2014 50.116.102.77
   4267 45.5.186.2
   4893 205.186.128.185
  • 45.5.184.72 is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:
GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
  • Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
  • They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
  22077 /handle/10568/72970/discover
  • Yesterday they made 43,000 requests and we actually blocked most of them:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
  43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
    142 200
  43489 503
  • I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover
  • Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports
    • I made a pull request to update this and merged it to the 5_x-prod branch (#418)
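
The "top IPs during the high-CPU window" pipelines earlier in this entry can be sketched in Python. This is only an illustration, assuming access-log lines in nginx's default "combined" format. The function name `top_clients` is hypothetical, and error.log lines, which use a different layout, would simply be skipped. It filters on the timestamp field itself rather than matching the date string anywhere in the line.

```python
import re
from collections import Counter

# Assumption: "combined" format, e.g.
# 45.5.184.72 - - [06/Apr/2019:06:12:01 +0200] "GET / HTTP/1.1" 200 ...
LINE_PATTERN = re.compile(r"^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):(?P<hour>\d{2}):")


def top_clients(lines, day, hours, limit=10):
    """Return the `limit` most frequent client IPs on `day` during `hours`.

    `day` looks like "06/Apr/2019" and `hours` is a collection of
    two-digit strings such as {"06", "07", "08", "09"}.
    """
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and match["day"] == day and match["hour"] in hours:
            counts[match["ip"]] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical sample lines using IPs that appear in these notes.
    sample = [
        '45.5.184.72 - - [06/Apr/2019:06:12:01 +0200] "GET /a HTTP/1.1" 200 1 "-" "-"',
        '45.5.184.72 - - [06/Apr/2019:07:30:00 +0200] "GET /b HTTP/1.1" 200 1 "-" "-"',
        '66.249.79.59 - - [06/Apr/2019:08:00:00 +0200] "GET /c HTTP/1.1" 200 1 "-" "-"',
    ]
    for ip, count in top_clients(sample, "06/Apr/2019", {"06", "07", "08", "09"}):
        print(count, ip)
```

Note that `most_common` lists the busiest client first, whereas `sort -n | tail` prints it last.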