CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

April, 2019

2019-04-01

  • Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
    • They asked if we had plans to enable RDF support in CGSpace
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
    • I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d

2019-04-02

  • CTA says the Amazon IPs are AWS gateways for real user traffic
  • I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
    • If I searched for “Felix” or “Shaw” I saw other matches, included one for his personal email address!
    • I ended up finding him via searching for his email address

2019-04-03

  • Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
    • First I need to extract the ones that are unique from their list compared to our existing one:
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
  • We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
  • Next I will resolve all their names using my resolve-orcids.py script:
$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
  • After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
  • One user’s name has changed so I will update those using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
  • I created a pull request and merged the changes to the 5_x-prod branch (#417)
  • A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:
2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
  • Interestingly, there are 5666 occurences, and they are mostly for the 2018 core:
$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
      1 
      3 http://localhost:8081/solr//statistics-2017
   5662 http://localhost:8081/solr//statistics-2018
  • I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…

2019-04-05

  • Uptime Robot reported that CGSpace (linode18) went down tonight
  • I see there are lots of PostgreSQL connections:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb
  • I still see those weird messages about updating the statistics-2018 Solr core:
2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
  • Looking at iostat 1 10 I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:

CPU usage week

  • The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…
  • I ran all system updates and rebooted the server
    • The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:
statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
  • I restarted it again and all the Solr cores came up properly…

2019-04-06

  • Udana asked why item 1056891278 didn’t have an Altmetric badge on CGSpace, but on the WLE website it does
    • I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
    • I tweeted the item and I assume this will link the Handle with the DOI in the system
    • Twenty minutes later I see the same Altmetric score (9) on CGSpace
  • Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 18.195.78.144
    245 207.46.13.58
    303 207.46.13.194
    328 66.249.79.33
    564 207.46.13.210
    566 66.249.79.62
    575 40.77.167.66
   1803 66.249.79.59
   2834 2a01:4f8:140:3192::2
   9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     31 66.249.79.62
     41 207.46.13.210
     42 40.77.167.66
     54 42.113.50.219
    132 66.249.79.59
    785 2001:41d0:d:1990::
   1164 45.5.184.72
   2014 50.116.102.77
   4267 45.5.186.2
   4893 205.186.128.185
  • 45.5.184.72 is in Colombia so it’s probably CIAT, and I see they are indeed trying to get crawl the Discover pages on CIAT’s datasets collection:
GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
  • Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
  • They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
  22077 /handle/10568/72970/discover
  • Yesterday they made 43,000 requests and we actually blocked most of them:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
  43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
    142 200
  43489 503
  • I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover
  • Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports
    • I made a pull request to update this and merged it to the 5_x-prod branch (#418)

2019-04-07

  • Looking into the impact of harvesters like 45.5.184.72, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands per day
  • Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
    • I am trying to see if these are registered as downloads in Solr or not
    • I see 96,925 downloads from their AWS gateway IPs in 2019-03:
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 96925,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-03"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
  • Strangely I don’t see many hits in 2019-04:
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 38,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-04"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:38:34 GMT
Expires: Sun, 07 Apr 2019 09:38:34 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block

$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf      
HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1                                                            
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:39:01 GMT
Expires: Sun, 07 Apr 2019 09:39:01 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly                                            
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block
  • And from the server side, the nginx logs show:
78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
  • So definitely the size of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
    • After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
2019-04-07 02:05:30,966 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
  • So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
    • Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:
2019-04-07 02:08:44,186 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
  • Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are statistics_type:view… very weird
    • I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
    • I will try to re-deploy the 5_x-dev branch and test again
  • According to the DSpace 5.x Solr documentation the default commit time is after 15 minutes or 10,000 documents (see solrconfig.xml)
  • I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they do register as downloads (even though they are internal):
$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 909,
        "start": 0
    },
    "responseHeader": {
        "QTime": 0,
        "params": {
            "fq": [
                "statistics_type:view",
                "isInternal:true"
            ],
            "indent": "true",
            "q": "type:0 AND time:2019-04-07*",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
  • I confirmed the same on CGSpace itself after making one HEAD request
  • So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
    • I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
  • Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
    • This leads me to believe there is something specifically wrong with DSpace Test (linode19)
    • The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC
  • Holy shit, all this is actually because of the GeoIP1 deprecation and a missing GeoLiteCity.dat
    • For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!
    • See: DS-3986
    • See: DS-4020
    • See: DS-3832
    • DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been removed from MaxMind’s server as of 2018-04-01
    • Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this
  • UptimeRobot said CGSpace went down and up a few times tonight, and my first instict was to check iostat 1 10 and I saw that CPU steal is around 10–30 percent right now…
  • The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:
$ cat /proc/loadavg 
10.70 9.17 8.85 18/633 4198
  • According to the server logs there is actually not much going on right now:
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    118 18.195.78.144
    128 207.46.13.219
    129 167.114.64.100
    159 207.46.13.129
    179 207.46.13.33
    188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
    195 66.249.79.59
    363 40.77.167.21
    740 2a01:4f8:140:3192::2
   4823 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      3 66.249.79.62
      3 66.249.83.196
      4 207.46.13.86
      5 82.145.222.150
      6 2a01:4f9:2b:1263::2
      6 41.204.190.40
      7 35.174.176.49
     10 40.77.167.21
     11 194.246.119.6
     11 66.249.79.59
  • 45.5.184.72 is CIAT, who I already blocked and am waiting to hear from
  • 2a01:4f8:140:3192::2 is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve
  • 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142 is some stupid Chinese bot making malicious POST requests
  • There are free database connections in the pool:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     23 dspaceWeb
  • It seems that the issue with CGSpace being “down” is actually because of CPU steal again!!!
  • I opened a ticket with support and asked them to migrate the VM to a less busy host

2019-04-08

  • Start checking IITA’s last round of batch uploads from March on DSpace Test (20193rd.xls)
    • Lots of problems with affiliations, I had to correct about sixty of them
    • I used lein to host the latest CSV of our affiliations for OpenRefine to reconcile against:
$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
  • After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
    • The matched values can be accessed with cell.recon.match.name, but some of the new values don’t appear, perhaps because I edited the original cell values?
    • I ended up using this GREL expression to copy all values to a new column:
if(cell.recon.matched, cell.recon.match.name, value)
  • See the OpenRefine variables documentation for more notes about the recon object
  • I also noticed a handful of errors in our current list of affiliations so I corrected them:
$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
  • We should create a new list of affiliations to update our controlled vocabulary again
  • I dumped a list of the top 1500 affiliations:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500
  • Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):
dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural  and Livestock  Research^M%';
  • I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
  • I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
  • UptimeRobot said that CGSpace (linode18) went down tonight
    • The load average is at 9.42, 8.87, 7.87
    • I looked at PostgreSQL and see shitloads of connections there:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
    250 dspaceWeb
  • On a related note I see connection pool errors in the DSpace log:
2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
  • But still I see 10 to 30% CPU steal in iostat that is also reflected in the Munin graphs:

CPU usage week

  • Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of iostat 1 10 and asked them to move the VM to a less busy host
  • The web server logs are not very busy:
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    124 40.77.167.135
    135 95.108.181.88
    139 157.55.39.206
    190 66.249.79.133
    202 45.5.186.2
    284 207.46.13.95
    359 18.196.196.108
    457 157.55.39.164
    457 40.77.167.132
   3822 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      5 129.0.79.206
      5 41.205.240.21
      7 207.46.13.95
      7 66.249.79.133
      7 66.249.79.135
      7 95.108.181.88
      8 40.77.167.111
     19 157.55.39.164
     20 40.77.167.132
    370 51.254.16.223

2019-04-09

  • Linode sent an alert that CGSpace (linode18) was 440% CPU for the last two hours this morning
  • Here are the top IPs in the web server logs around that time:
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     18 66.249.79.139
     21 157.55.39.160
     29 66.249.79.137
     38 66.249.79.135
     50 34.200.212.137
     54 66.249.79.133
    100 102.128.190.18
   1166 45.5.184.72
   4251 45.5.186.2
   4895 205.186.128.185
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    200 144.48.242.108
    202 207.46.13.185
    206 18.194.46.84
    239 66.249.79.139
    246 18.196.196.108
    274 31.6.77.23
    289 66.249.79.137
    312 157.55.39.160
    441 66.249.79.135
    856 66.249.79.133
  • 45.5.186.2 is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
  • Database connection usage looks fine:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     11 dspaceWeb
  • Ironically I do still see some 2 to 10% of CPU steal in iostat 1 10
  • Leroy from CIAT contacted me to say he knows the team who is making all those requests to CGSpace
    • I told them how to use the REST API to get the CIAT Datasets collection and enumerate its items
  • In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night

2019-04-10

  • Abenet pointed out a possibility of validating funders against the CrossRef API
  • Note that if you use HTTPS and specify a contact address in the API request you have less likelihood of being blocked
$ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
  • Otherwise, they provide the funder data in CSV and RDF format
  • I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to go and do some manual checking and informed decision making…
  • If I want to write a script for this I could use the Python habanero library:
from habanero import Crossref
cr = Crossref(mailto="me@cgiar.org")
x = cr.funders(query = "mercator")

2019-04-11

  • Continue proofing IITA’s last round of batch uploads from March on DSpace Test (20193rd.xls)
    • One misspelled country
    • Three incorrect regions
    • Potential duplicates (same DOI, similar title, same authors):
    • 10568100580 and 10568100579
    • 10568100444 and 10568100423
    • Two DOIs with incorrect URL formatting
    • Two misspelled IITA subjects
    • Two authors with smart quotes
    • Lots of issues with sponsors
    • One misspelled “Peer review”
    • One invalid ISBN that I fixed by Googling the title
    • Lots of issues with sponsors (German Aid Agency, Swiss Aid Agency, Italian Aid Agency, Dutch Aid Agency, etc)
    • I validated all the AGROVOC subjects against our latest list with reconcile-csv
    • About 720 of the 900 terms were matched, then I checked and fixed or deleted the rest manually
  • I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
  • Answer more questions about DOIs and Altmetric scores from WLE
  • Answer more questions about DOIs and Altmetric scores from IWMI
    • They can’t seem to understand the Altmetric + Twitter flow for associating Handles and DOIs
    • To make things worse, many of their items DON’T have DOIs, so when Altmetric harvests them of course there is no link! - Then, a bunch of their items don’t have scores because they never tweeted them!
    • They added a DOI to this old item 1056797087 this morning and wonder why Altmetric’s score hasn’t linked with the DOI magically
    • We should check in a week to see if Altmetric will make the association after one week when they harvest again

2019-04-13

  • I copied the statistics and statistics-2018 Solr cores from CGSpace to my local machine and watched the Java process in VisualVM while indexing item views and downloads with my dspace-statistics-api:

Java GC during Solr indexing with CMS

  • It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear “sawtooth” pattern in the garbage collection
  • I am curious if the GC pattern would be different if I switched from the -XX:+UseConcMarkSweepGC to G1GC
  • I switched to G1GC and restarted Tomcat but for some reason I couldn’t see the Tomcat PID in VisualVM…
    • Anyways, the indexing process took much longer, perhaps twice as long!
  • I tried again with the GC tuning settings from the Solr 4.10.4 release:

Java GC during Solr indexing Solr 4.10.4 settings

2019-04-14

  • Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.14.4 startup script:
GC_TUNE="-XX:NewRatio=3 \
    -XX:SurvivorRatio=4 \
    -XX:TargetSurvivorRatio=90 \
    -XX:MaxTenuringThreshold=8 \
    -XX:+UseConcMarkSweepGC \
    -XX:+UseParNewGC \
    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
    -XX:+CMSScavengeBeforeRemark \
    -XX:PretenureSizeThreshold=64m \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:CMSInitiatingOccupancyFraction=50 \
    -XX:CMSMaxAbortablePrecleanTime=6000 \
    -XX:+CMSParallelRemarkEnabled \
    -XX:+ParallelRefProcEnabled"
  • I need to remember to check the Munin JVM graphs in a few days
  • It might be placebo, but the site does feel snappier…

2019-04-15

  • Rework the dspace-statistics-api to use the vanilla Python requests library instead of Solr client
  • Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to iostat 1 10, though I haven’t had any Linode alerts in a few days
  • Abenet sent me a list of ILRI items that don’t have CRPs added to them
    • The spreadsheet only had Handles (no IDs), so I’m experimenting with using Python in OpenRefine to get the IDs
    • I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:
import json
import re
import urllib
import urllib2

handle = re.findall('[0-9]+/[0-9]+', value)

url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
req = urllib2.Request(url)
req.add_header('User-agent', 'Alan Python bot')
res = urllib2.urlopen(req)
data = json.load(res)
item_id = data['id']

return item_id
  • Luckily none of the items already had CRPs, so I didn’t have to worry about them getting removed
    • It would have been much trickier if I had to get the CRPs for the items first, then add the CRPs…
  • I ran a full Discovery indexing on CGSpace because I didn’t do it after all the metadata updates last week:
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    82m45.324s
user    7m33.446s
sys     2m13.463s

2019-04-16

  • Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something

2019-04-17

  • Reading an interesting blog post about Solr caching
  • Did some tests of the dspace-statistics-api on my local DSpace instance with 28 million documents in a sharded statistics core (statistics and statistics-2018) and monitored the memory usage of Tomcat in VisualVM
  • 4GB heap, CMS GC, 512 filter cache, 512 query cache, with 28 million documents in two shards
    • Run 1:
    • Time: 3.11s user 0.44s system 0% cpu 13:45.07 total
    • Tomcat (not Solr) max JVM heap usage: 2.04 GiB
    • Run 2:
    • Time: 3.23s user 0.43s system 0% cpu 13:46.10 total
    • Tomcat (not Solr) max JVM heap usage: 2.06 GiB
    • Run 3:
    • Time: 3.23s user 0.42s system 0% cpu 13:14.70 total
    • Tomcat (not Solr) max JVM heap usage: 2.13 GiB
    • filterCache size: 482, cumulative_lookups: 7062712, cumulative_hits: 167903, cumulative_hitratio: 0.02
    • queryResultCache size: 2
  • 4GB heap, CMS GC, 1024 filter cache, 512 query cache, with 28 million documents in two shards
    • Run 1:
    • Time: 2.92s user 0.39s system 0% cpu 12:33.08 total
    • Tomcat (not Solr) max JVM heap usage: 2.16 GiB
    • Run 2:
    • Time: 3.10s user 0.39s system 0% cpu 12:25.32 total
    • Tomcat (not Solr) max JVM heap usage: 2.07 GiB
    • Run 3:
    • Time: 3.29s user 0.36s system 0% cpu 11:53.47 total
    • Tomcat (not Solr) max JVM heap usage: 2.08 GiB
    • filterCache size: 951, cumulative_lookups: 7062712, cumulative_hits: 254379, cumulative_hitratio: 0.04
  • 4GB heap, CMS GC, 2048 filter cache, 512 query cache, with 28 million documents in two shards
    • Run 1:
    • Time: 2.90s user 0.48s system 0% cpu 10:37.31 total
    • Tomcat max JVM heap usage: 1.96 GiB
    • filterCache size: 1901, cumulative_lookups: 2354237, cumulative_hits: 180111, cumulative_hitratio: 0.08
    • Run 2:
    • Time: 2.97s user 0.39s system 0% cpu 10:40.06 total
    • Tomcat max JVM heap usage: 2.09 GiB
    • filterCache size: 1901, cumulative_lookups: 4708473, cumulative_hits: 360068, cumulative_hitratio: 0.08
    • Run 3:
    • Time: 3.28s user 0.37s system 0% cpu 10:49.56 total
    • Tomcat max JVM heap usage: 2.05 GiB
    • filterCache size: 1901, cumulative_lookups: 7062712, cumulative_hits: 540020, cumulative_hitratio: 0.08
  • 4GB heap, CMS GC, 4096 filter cache, 512 query cache, with 28 million documents in two shards
    • Run 1:
    • Time: 2.88s user 0.35s system 0% cpu 8:29.55 total
    • Tomcat max JVM heap usage: 2.15 GiB
    • filterCache size: 3770, cumulative_lookups: 2354237, cumulative_hits: 414512, cumulative_hitratio: 0.18
    • Run 2:
    • Time: 3.01s user 0.38s system 0% cpu 9:15.65 total
    • Tomcat max JVM heap usage: 2.17 GiB
    • filterCache size: 3945, cumulative_lookups: 4708473, cumulative_hits: 829093, cumulative_hitratio: 0.18
    • Run 3:
    • Time: 3.01s user 0.40s system 0% cpu 9:01.31 total
    • Tomcat max JVM heap usage: 2.07 GiB
    • filterCache size: 3770, cumulative_lookups: 7062712, cumulative_hits: 1243632, cumulative_hitratio: 0.18
  • The biggest takeaway I have is that this workload benefits from a larger filterCache (for Solr fq parameter), but barely uses the queryResultCache (for Solr q parameter) at all
    • The number of hits goes up and the time taken decreases when we increase the filterCache, and total JVM heap memory doesn’t seem to increase much at all
    • I guess the queryResultCache size is always 2 because I’m only doing two queries: type:0 and type:2 (downloads and views, respectively)
  • Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:

VisualVM Tomcat 4096 filterCache

  • I ran one test with a filterCache of 16384 to try to see if I could make the Tomcat JVM memory balloon, but actually it drastically increased the performance and memory usage of the dspace-statistics-api indexer
  • 4GB heap, CMS GC, 16384 filter cache, 512 query cache, with 28 million documents in two shards
    • Run 1:
    • Time: 2.85s user 0.42s system 2% cpu 2:28.92 total
    • Tomcat max JVM heap usage: 1.90 GiB
    • filterCache size: 14851, cumulative_lookups: 2354237, cumulative_hits: 2331186, cumulative_hitratio: 0.99
    • Run 2:
    • Time: 2.90s user 0.37s system 2% cpu 2:23.50 total
    • Tomcat max JVM heap usage: 1.27 GiB
    • filterCache size: 15834, cumulative_lookups: 4708476, cumulative_hits: 4664762, cumulative_hitratio: 0.99
    • Run 3:
    • Time: 2.93s user 0.39s system 2% cpu 2:26.17 total
    • Tomcat max JVM heap usage: 1.05 GiB
    • filterCache size: 15248, cumulative_lookups: 7062715, cumulative_hits: 6998267, cumulative_hitratio: 0.99
  • The JVM garbage collection graph is MUCH flatter, and memory usage is much lower (not to mention a drop in GC-related CPU usage)!

VisualVM Tomcat 16384 filterCache