April, 2019

Mon Apr 01, 2019 by Alan Orth in Notes

2019-04-01

Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
- They asked if we had plans to enable RDF support in CGSpace
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
- I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200

In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
Apply country and region corrections and deletions on DSpace Test and CGSpace:

$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d

2019-04-02

CTA says the Amazon IPs are AWS gateways for real user traffic
I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
- If I searched for “Felix” or “Shaw” I saw other matches, included one for his personal email address!
- I ended up finding him via searching for his email address

2019-04-03

Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
- First I need to extract the ones that are unique from their list compared to our existing one:

$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt

We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
Next I will resolve all their names using my resolve-orcids.py script:

$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d

After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
One user’s name has changed so I will update those using my fix-metadata-values.py script:

$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d

I created a pull request and merged the changes to the 5_x-prod branch (#417)
A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:

2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018

Interestingly, there are 5666 occurences, and they are mostly for the 2018 core:

$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
      1 
      3 http://localhost:8081/solr//statistics-2017
   5662 http://localhost:8081/solr//statistics-2018

I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…

2019-04-05

Uptime Robot reported that CGSpace (linode18) went down tonight
I see there are lots of PostgreSQL connections:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb

I still see those weird messages about updating the statistics-2018 Solr core:

2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018

Looking at iostat 1 10 I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:

CPU usage week

The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…
I ran all system updates and rebooted the server
- The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:

statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher

I restarted it again and all the Solr cores came up properly…