cgspace-notes/2019-04.md at 4a4bd34e0e1d7b7bc9fbe2e1544b67f931d63da2

alanorth/cgspace-notes

Fork 0

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-18 12:47:04 +01:00

Alan Orth 4a4bd34e0e

Add notes for 2019-04-17

2019-04-17 13:38:50 +03:00

36 KiB

Raw Blame History

title

date

author

2019-04-01

Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
- They asked if we had plans to enable RDF support in CGSpace
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
- I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200

In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
Apply country and region corrections and deletions on DSpace Test and CGSpace:

$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d

2019-04-02

CTA says the Amazon IPs are AWS gateways for real user traffic
I was trying to add Felix Shaw's account back to the Administrators group on DSpace Test, but I couldn't find his name in the user search of the groups page
- If I searched for "Felix" or "Shaw" I saw other matches, included one for his personal email address!
- I ended up finding him via searching for his email address

2019-04-03

Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
- First I need to extract the ones that are unique from their list compared to our existing one:

$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt

We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
Next I will resolve all their names using my resolve-orcids.py script:

$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d

After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
One user's name has changed so I will update those using my fix-metadata-values.py script:

$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d

I created a pull request and merged the changes to the 5_x-prod branch (#417)
A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it's still going:

2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018

Interestingly, there are 5666 occurences, and they are mostly for the 2018 core:

$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
      1 
      3 http://localhost:8081/solr//statistics-2017
   5662 http://localhost:8081/solr//statistics-2018

I will have to keep an eye on it because nothing should be updating 2018 stats in 2019...

2019-04-05

Uptime Robot reported that CGSpace (linode18) went down tonight
I see there are lots of PostgreSQL connections:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb

I still see those weird messages about updating the statistics-2018 Solr core:

2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018

Looking at iostat 1 10 I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:

The other thing visible there is that the past few days the load has spiked to 500% and I don't think it's a coincidence that the Solr updating thing is happening...
I ran all system updates and rebooted the server
- The load was lower on the server after reboot, but Solr didn't come back up properly according to the Solr Admin UI:

statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher

I restarted it again and all the Solr cores came up properly...

2019-04-06

Udana asked why item 10568/91278 didn't have an Altmetric badge on CGSpace, but on the WLE website it does
- I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
- I tweeted the item and I assume this will link the Handle with the DOI in the system
- Twenty minutes later I see the same Altmetric score (9) on CGSpace
Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:

# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 18.195.78.144
    245 207.46.13.58
    303 207.46.13.194
    328 66.249.79.33
    564 207.46.13.210
    566 66.249.79.62
    575 40.77.167.66
   1803 66.249.79.59
   2834 2a01:4f8:140:3192::2
   9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     31 66.249.79.62
     41 207.46.13.210
     42 40.77.167.66
     54 42.113.50.219
    132 66.249.79.59
    785 2001:41d0:d:1990::
   1164 45.5.184.72
   2014 50.116.102.77
   4267 45.5.186.2
   4893 205.186.128.185

45.5.184.72 is in Colombia so it's probably CIAT, and I see they are indeed trying to get crawl the Discover pages on CIAT's datasets collection:

GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014

Their user agent is the one I added to the badbots list in nginx last week: "GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1"
They made 22,000 requests to Discover on this collection today alone (and it's only 11AM):

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
  22077 /handle/10568/72970/discover

Yesterday they made 43,000 requests and we actually blocked most of them:

# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
  43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
    142 200
  43489 503

I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover
Maria from Bioversity recommended that we use the phrase "AGROVOC subject" instead of "Subject" in Listings and Reports
- I made a pull request to update this and merged it to the 5_x-prod branch (#418)

2019-04-07

Looking into the impact of harvesters like 45.5.184.72, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands per day
Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
- I am trying to see if these are registered as downloads in Solr or not
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:

$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 96925,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-03"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}

Strangely I don't see many hits in 2019-04:

$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 38,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-04"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}

Making some tests on GET vs HEAD requests on the CTA Spore 192 item on DSpace Test:

$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:38:34 GMT
Expires: Sun, 07 Apr 2019 09:38:34 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block

$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf      
HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1                                                            
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:39:01 GMT
Expires: Sun, 07 Apr 2019 09:39:01 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly                                            
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block

And from the server side, the nginx logs show:

78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"

So definitely the size of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
- After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:

2019-04-07 02:05:30,966 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818

So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
- Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log:

2019-04-07 02:08:44,186 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are statistics_type:view... very weird
- I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
- I will try to re-deploy the 5_x-dev branch and test again
According to the DSpace 5.x Solr documentation the default commit time is after 15 minutes or 10,000 documents (see solrconfig.xml)
I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they do register as downloads (even though they are internal):

$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 909,
        "start": 0
    },
    "responseHeader": {
        "QTime": 0,
        "params": {
            "fq": [
                "statistics_type:view",
                "isInternal:true"
            ],
            "indent": "true",
            "q": "type:0 AND time:2019-04-07*",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}

I confirmed the same on CGSpace itself after making one HEAD request
So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
- I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
- This leads me to believe there is something specifically wrong with DSpace Test (linode19)
- The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC
Holy shit, all this is actually because of the GeoIP1 deprecation and a missing GeoLiteCity.dat
- For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!
- See: DS-3986
- See: DS-4020
- See: DS-3832
- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been removed from MaxMind's server as of 2018-04-01
- Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this
UptimeRobot said CGSpace went down and up a few times tonight, and my first instict was to check iostat 1 10 and I saw that CPU steal is around 10–30 percent right now...
The load average is super high right now, as I've noticed the last few times UptimeRobot said that CGSpace went down:

$ cat /proc/loadavg 
10.70 9.17 8.85 18/633 4198

According to the server logs there is actually not much going on right now:

# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    118 18.195.78.144
    128 207.46.13.219
    129 167.114.64.100
    159 207.46.13.129
    179 207.46.13.33
    188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
    195 66.249.79.59
    363 40.77.167.21
    740 2a01:4f8:140:3192::2
   4823 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      3 66.249.79.62
      3 66.249.83.196
      4 207.46.13.86
      5 82.145.222.150
      6 2a01:4f9:2b:1263::2
      6 41.204.190.40
      7 35.174.176.49
     10 40.77.167.21
     11 194.246.119.6
     11 66.249.79.59

45.5.184.72 is CIAT, who I already blocked and am waiting to hear from
2a01:4f8:140:3192::2 is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve
2408:8214:7a00:868f:7c1e:e0f3:20c6:c142 is some stupid Chinese bot making malicious POST requests
There are free database connections in the pool:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     23 dspaceWeb

It seems that the issue with CGSpace being "down" is actually because of CPU steal again!!!
I opened a ticket with support and asked them to migrate the VM to a less busy host

2019-04-08

Start checking IITA's last round of batch uploads from March on DSpace Test (20193rd.xls)
- Lots of problems with affiliations, I had to correct about sixty of them
- I used lein to host the latest CSV of our affiliations for OpenRefine to reconcile against:

$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id

After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
- The matched values can be accessed with cell.recon.match.name, but some of the new values don't appear, perhaps because I edited the original cell values?
- I ended up using this GREL expression to copy all values to a new column:

if(cell.recon.matched, cell.recon.match.name, value)

See the OpenRefine variables documentation for more notes about the recon object
I also noticed a handful of errors in our current list of affiliations so I corrected them:

$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d

We should create a new list of affiliations to update our controlled vocabulary again
I dumped a list of the top 1500 affiliations:

dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500

Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):

dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural  and Livestock  Research^M%';

I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:

dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20

I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:

$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d

UptimeRobot said that CGSpace (linode18) went down tonight
- The load average is at 9.42, 8.87, 7.87
- I looked at PostgreSQL and see shitloads of connections there:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
    250 dspaceWeb

On a related note I see connection pool errors in the DSpace log:

2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].

But still I see 10 to 30% CPU steal in iostat that is also reflected in the Munin graphs:

Linode Support still didn't respond to my ticket from yesterday, so I attached a new output of iostat 1 10 and asked them to move the VM to a less busy host
The web server logs are not very busy:

# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    124 40.77.167.135
    135 95.108.181.88
    139 157.55.39.206
    190 66.249.79.133
    202 45.5.186.2
    284 207.46.13.95
    359 18.196.196.108
    457 157.55.39.164
    457 40.77.167.132
   3822 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      5 129.0.79.206
      5 41.205.240.21
      7 207.46.13.95
      7 66.249.79.133
      7 66.249.79.135
      7 95.108.181.88
      8 40.77.167.111
     19 157.55.39.164
     20 40.77.167.132
    370 51.254.16.223

2019-04-09

Linode sent an alert that CGSpace (linode18) was 440% CPU for the last two hours this morning
Here are the top IPs in the web server logs around that time:

# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     18 66.249.79.139
     21 157.55.39.160
     29 66.249.79.137
     38 66.249.79.135
     50 34.200.212.137
     54 66.249.79.133
    100 102.128.190.18
   1166 45.5.184.72
   4251 45.5.186.2
   4895 205.186.128.185
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    200 144.48.242.108
    202 207.46.13.185
    206 18.194.46.84
    239 66.249.79.139
    246 18.196.196.108
    274 31.6.77.23
    289 66.249.79.137
    312 157.55.39.160
    441 66.249.79.135
    856 66.249.79.133

45.5.186.2 is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36

Database connection usage looks fine:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     11 dspaceWeb

Ironically I do still see some 2 to 10% of CPU steal in iostat 1 10
Leroy from CIAT contacted me to say he knows the team who is making all those requests to CGSpace
- I told them how to use the REST API to get the CIAT Datasets collection and enumerate its items
In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night

2019-04-10

Abenet pointed out a possibility of validating funders against the CrossRef API
Note that if you use HTTPS and specify a contact address in the API request you have less likelihood of being blocked

$ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'

Otherwise, they provide the funder data in CSV and RDF format
I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn't match will need a human to go and do some manual checking and informed decision making...
If I want to write a script for this I could use the Python habanero library:

from habanero import Crossref
cr = Crossref(mailto="me@cgiar.org")
x = cr.funders(query = "mercator")

2019-04-11

Continue proofing IITA's last round of batch uploads from March on DSpace Test (20193rd.xls)
- One misspelled country
- Three incorrect regions
- Potential duplicates (same DOI, similar title, same authors):
  - 10568/100580 and 10568/100579
  - 10568/100444 and 10568/100423
- Two DOIs with incorrect URL formatting
- Two misspelled IITA subjects
- Two authors with smart quotes
- Lots of issues with sponsors
- One misspelled "Peer review"
- One invalid ISBN that I fixed by Googling the title
- Lots of issues with sponsors (German Aid Agency, Swiss Aid Agency, Italian Aid Agency, Dutch Aid Agency, etc)
- I validated all the AGROVOC subjects against our latest list with reconcile-csv
  - About 720 of the 900 terms were matched, then I checked and fixed or deleted the rest manually
I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA's records, so I applied them to DSpace Test and CGSpace:

$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d

Answer more questions about DOIs and Altmetric scores from WLE
Answer more questions about DOIs and Altmetric scores from IWMI
- They can't seem to understand the Altmetric + Twitter flow for associating Handles and DOIs
- To make things worse, many of their items DON'T have DOIs, so when Altmetric harvests them of course there is no link! - Then, a bunch of their items don't have scores because they never tweeted them!
- They added a DOI to this old item 10567/97087 this morning and wonder why Altmetric's score hasn't linked with the DOI magically
- We should check in a week to see if Altmetric will make the association after one week when they harvest again

2019-04-13

I copied the statistics and statistics-2018 Solr cores from CGSpace to my local machine and watched the Java process in VisualVM while indexing item views and downloads with my dspace-statistics-api:

It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear "sawtooth" pattern in the garbage collection
I am curious if the GC pattern would be different if I switched from the -XX:+UseConcMarkSweepGC to G1GC
I switched to G1GC and restarted Tomcat but for some reason I couldn't see the Tomcat PID in VisualVM...
- Anyways, the indexing process took much longer, perhaps twice as long!
I tried again with the GC tuning settings from the Solr 4.10.4 release:

2019-04-14

Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.14.4 startup script:

GC_TUNE="-XX:NewRatio=3 \
    -XX:SurvivorRatio=4 \
    -XX:TargetSurvivorRatio=90 \
    -XX:MaxTenuringThreshold=8 \
    -XX:+UseConcMarkSweepGC \
    -XX:+UseParNewGC \
    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
    -XX:+CMSScavengeBeforeRemark \
    -XX:PretenureSizeThreshold=64m \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:CMSInitiatingOccupancyFraction=50 \
    -XX:CMSMaxAbortablePrecleanTime=6000 \
    -XX:+CMSParallelRemarkEnabled \
    -XX:+ParallelRefProcEnabled"

I need to remember to check the Munin JVM graphs in a few days
It might be placebo, but the site does feel snappier...

2019-04-15

Rework the dspace-statistics-api to use the vanilla Python requests library instead of Solr client
- Tag version 1.0.0 and deploy it on DSpace Test
Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to iostat 1 10, though I haven't had any Linode alerts in a few days
Abenet sent me a list of ILRI items that don't have CRPs added to them
- The spreadsheet only had Handles (no IDs), so I'm experimenting with using Python in OpenRefine to get the IDs
- I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:

import json
import re
import urllib
import urllib2

handle = re.findall('[0-9]+/[0-9]+', value)

url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
req = urllib2.Request(url)
req.add_header('User-agent', 'Alan Python bot')
res = urllib2.urlopen(req)
data = json.load(res)
item_id = data['id']

return item_id

Luckily none of the items already had CRPs, so I didn't have to worry about them getting removed
- It would have been much trickier if I had to get the CRPs for the items first, then add the CRPs...
I ran a full Discovery indexing on CGSpace because I didn't do it after all the metadata updates last week:

$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    82m45.324s
user    7m33.446s
sys     2m13.463s

2019-04-16

Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something

2019-04-17

Reading an interesting blog post about Solr caching
Did some tests of the dspace-statistics-api on my local DSpace instance with 28 million documents in a sharded statistics core (statistics and statistics-2018) and monitored the memory usage of Tomcat in VisualVM
4GB heap, CMS GC, 512 filter cache, 512 query cache, with 28 million documents in two shards
- Run 1:
  - Time: 3.11s user 0.44s system 0% cpu 13:45.07 total
  - Tomcat (not Solr) max JVM heap usage: 2.04 GiB
- Run 2:
  - Time: 3.23s user 0.43s system 0% cpu 13:46.10 total
  - Tomcat (not Solr) max JVM heap usage: 2.06 GiB
- Run 3:
  - Time: 3.23s user 0.42s system 0% cpu 13:14.70 total
  - Tomcat (not Solr) max JVM heap usage: 2.13 GiB
  - filterCache size: 482, cumulative_lookups: 7062712, cumulative_hits: 167903, cumulative_hitratio: 0.02
  - queryResultCache size: 2
4GB heap, CMS GC, 1024 filter cache, 512 query cache, with 28 million documents in two shards
- Run 1:
  - Time: 2.92s user 0.39s system 0% cpu 12:33.08 total
  - Tomcat (not Solr) max JVM heap usage: 2.16 GiB
- Run 2:
  - Time: 3.10s user 0.39s system 0% cpu 12:25.32 total
  - Tomcat (not Solr) max JVM heap usage: 2.07 GiB
- Run 3:
  - Time: 3.29s user 0.36s system 0% cpu 11:53.47 total
  - Tomcat (not Solr) max JVM heap usage: 2.08 GiB
  - filterCache size: 951, cumulative_lookups: 7062712, cumulative_hits: 254379, cumulative_hitratio: 0.04
4GB heap, CMS GC, 2048 filter cache, 512 query cache, with 28 million documents in two shards
- Run 1:
  - Time: 2.90s user 0.48s system 0% cpu 10:37.31 total
  - Tomcat max JVM heap usage: 1.96 GiB
  - filterCache size: 1901, cumulative_lookups: 2354237, cumulative_hits: 180111, cumulative_hitratio: 0.08
- Run 2:
  - Time: 2.97s user 0.39s system 0% cpu 10:40.06 total
  - Tomcat max JVM heap usage: 2.09 GiB
  - filterCache size: 1901, cumulative_lookups: 4708473, cumulative_hits: 360068, cumulative_hitratio: 0.08
- Run 3:
  - Time: 3.28s user 0.37s system 0% cpu 10:49.56 total
  - Tomcat max JVM heap usage: 2.05 GiB
  - filterCache size: 1901, cumulative_lookups: 7062712, cumulative_hits: 540020, cumulative_hitratio: 0.08
4GB heap, CMS GC, 4096 filter cache, 512 query cache, with 28 million documents in two shards
- Run 1:
  - Time: 2.88s user 0.35s system 0% cpu 8:29.55 total
  - Tomcat max JVM heap usage: 2.15 GiB
  - filterCache size: 3770, cumulative_lookups: 2354237, cumulative_hits: 414512, cumulative_hitratio: 0.18
- Run 2:
  - Time: 3.01s user 0.38s system 0% cpu 9:15.65 total
  - Tomcat max JVM heap usage: 2.17 GiB
  - filterCache size: 3945, cumulative_lookups: 4708473, cumulative_hits: 829093, cumulative_hitratio: 0.18
- Run 3:
  - Time: 3.01s user 0.40s system 0% cpu 9:01.31 total
  - Tomcat max JVM heap usage: 2.07 GiB
  - filterCache size: 3770, cumulative_lookups: 7062712, cumulative_hits: 1243632, cumulative_hitratio: 0.18
The biggest takeaway I have is that this workload benefits from a larger filterCache (for Solr fq parameter), but barely uses the queryResultCache (for Solr q parameter) at all
- The number of hits goes up and the time taken decreases when we increase the filterCache, and total JVM heap memory doesn't seem to increase much at all
- I guess the queryResultCache size is always 2 because I'm only doing two queries: type:0 and type:2 (downloads and views, respectively)
Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:

I ran one test with a filterCache of 16384 to try to see if I could make the Tomcat JVM memory balloon, but actually it drastically increased the performance and memory usage of the dspace-statistics-api indexer
4GB heap, CMS GC, 16384 filter cache, 512 query cache, with 28 million documents in two shards
- Run 1:
  - Time: 2.85s user 0.42s system 2% cpu 2:28.92 total
  - Tomcat max JVM heap usage: 1.90 GiB
  - filterCache size: 14851, cumulative_lookups: 2354237, cumulative_hits: 2331186, cumulative_hitratio: 0.99
- Run 2:
  - Time: 2.90s user 0.37s system 2% cpu 2:23.50 total
  - Tomcat max JVM heap usage: 1.27 GiB
  - filterCache size: 15834, cumulative_lookups: 4708476, cumulative_hits: 4664762, cumulative_hitratio: 0.99
- Run 3:
  - Time: 2.93s user 0.39s system 2% cpu 2:26.17 total
  - Tomcat max JVM heap usage: 1.05 GiB
  - filterCache size: 15248, cumulative_lookups: 7062715, cumulative_hits: 6998267, cumulative_hitratio: 0.99
The JVM garbage collection graph is MUCH flatter, and memory usage is much lower (not to mention a drop in GC-related CPU usage)!

36 KiB Raw Blame History Unescape Escape

2019-04-01

2019-04-02

2019-04-03

2019-04-05

2019-04-06

2019-04-07

2019-04-08

2019-04-09

2019-04-10

2019-04-11

2019-04-13

2019-04-14

2019-04-15

2019-04-16

2019-04-17

36 KiB

Raw Blame History