--- title: "April, 2019" date: 2019-04-01T09:00:43+03:00 author: "Alan Orth" tags: ["Notes"] --- ## 2019-04-01 - Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc - They asked if we had plans to enable RDF support in CGSpace - There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today - I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200! ``` # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5 4432 200 ``` - In the last two weeks there have been 47,000 downloads of this *same exact PDF* by these three IP addresses - Apply country and region corrections and deletions on DSpace Test and CGSpace: ``` $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d ``` ## 2019-04-02 - CTA says the Amazon IPs are AWS gateways for real user traffic - I was trying to add Felix Shaw's account back to the Administrators group on DSpace Test, but I couldn't find his name in the user search of the groups page - If I searched for "Felix" or "Shaw" I saw other matches, included one for his personal email address! - I ended up finding him via searching for his email address ## 2019-04-03 - Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary - First I need to extract the ones that are unique from their list compared to our existing one: ``` $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt ``` - We currently have 1177 unique ORCID identifiers, and this brings our total to 1237! - Next I will resolve all their names using my `resolve-orcids.py` script: ``` $ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d ``` - After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim - One user's name has changed so I will update those using my `fix-metadata-values.py` script: ``` $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d ``` - I created a pull request and merged the changes to the 5_x-prod branch ([#417](https://github.com/ilri/DSpace/pull/417)) - A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it's still going: ``` 2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018 ``` - Interestingly, there are 5666 occurences, and they are mostly for the 2018 core: ``` $ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c 1 3 http://localhost:8081/solr//statistics-2017 5662 http://localhost:8081/solr//statistics-2018 ``` - I will have to keep an eye on it because nothing should be updating 2018 stats in 2019... ## 2019-04-05 - Uptime Robot reported that CGSpace (linode18) went down tonight - I see there are lots of PostgreSQL connections: ``` $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 5 dspaceApi 10 dspaceCli 250 dspaceWeb ``` - I still see those weird messages about updating the statistics-2018 Solr core: ``` 2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018 ``` - Looking at `iostat 1 10` I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs: ![CPU usage week](/cgspace-notes/2019/04/cpu-week.png) - The other thing visible there is that the past few days the load has spiked to 500% and I don't think it's a coincidence that the Solr updating thing is happening... - I ran all system updates and rebooted the server - The load was lower on the server after reboot, but Solr didn't come back up properly according to the Solr Admin UI: ``` statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher ``` - I restarted it again and all the Solr cores came up properly... ## 2019-04-06 - Udana asked why item [10568/91278](https://cgspace.cgiar.org/handle/10568/91278) didn't have an Altmetric badge on CGSpace, but on the [WLE website](https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity) it does - I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all - I tweeted the item and I assume this will link the Handle with the DOI in the system - Twenty minutes later I see the same Altmetric score (9) on CGSpace - Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time: ``` # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 222 18.195.78.144 245 207.46.13.58 303 207.46.13.194 328 66.249.79.33 564 207.46.13.210 566 66.249.79.62 575 40.77.167.66 1803 66.249.79.59 2834 2a01:4f8:140:3192::2 9623 45.5.184.72 # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 31 66.249.79.62 41 207.46.13.210 42 40.77.167.66 54 42.113.50.219 132 66.249.79.59 785 2001:41d0:d:1990:: 1164 45.5.184.72 2014 50.116.102.77 4267 45.5.186.2 4893 205.186.128.185 ``` - `45.5.184.72` is in Colombia so it's probably CIAT, and I see they are indeed trying to get crawl the Discover pages on CIAT's datasets collection: ``` GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014 ``` - Their user agent is the one I added to the badbots list in nginx last week: "GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1" - They made 22,000 requests to Discover on this collection today alone (and it's only 11AM): ``` # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 22077 /handle/10568/72970/discover ``` - Yesterday they made 43,000 requests and we actually blocked most of them: ``` # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 43631 /handle/10568/72970/discover # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 142 200 43489 503 ``` - I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover - Maria from Bioversity recommended that we use the phrase "AGROVOC subject" instead of "Subject" in Listings and Reports - I made a pull request to update this and merged it to the `5_x-prod` branch ([#418](https://github.com/ilri/DSpace/pull/418)) ## 2019-04-07 - Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day* - Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams - I am trying to see if these are registered as downloads in Solr or not - I see 96,925 downloads from their AWS gateway IPs in 2019-03: ``` $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true' { "response": { "docs": [], "numFound": 96925, "start": 0 }, "responseHeader": { "QTime": 1, "params": { "fq": [ "statistics_type:view", "bundleName:ORIGINAL", "dateYearMonth:2019-03" ], "indent": "true", "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)", "rows": "0", "wt": "json" }, "status": 0 } } ``` - Strangely I don't see many hits in 2019-04: ``` $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true' { "response": { "docs": [], "numFound": 38, "start": 0 }, "responseHeader": { "QTime": 1, "params": { "fq": [ "statistics_type:view", "bundleName:ORIGINAL", "dateYearMonth:2019-04" ], "indent": "true", "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)", "rows": "0", "wt": "json" }, "status": 0 } } ``` - Making some tests on GET vs HEAD requests on the [CTA Spore 192 item](https://dspacetest.cgiar.org/handle/10568/100289) on DSpace Test: ``` $ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1 Accept: */* Accept-Encoding: gzip, deflate Connection: keep-alive Host: dspacetest.cgiar.org User-Agent: HTTPie/1.0.2 HTTP/1.1 200 OK Connection: keep-alive Content-Language: en-US Content-Length: 2069158 Content-Type: application/pdf;charset=ISO-8859-1 Date: Sun, 07 Apr 2019 08:38:34 GMT Expires: Sun, 07 Apr 2019 09:38:34 GMT Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT Server: nginx Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly Vary: User-Agent X-Cocoon-Version: 2.2.0 X-Content-Type-Options: nosniff X-Frame-Options: SAMEORIGIN X-Robots-Tag: none X-XSS-Protection: 1; mode=block $ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1 Accept: */* Accept-Encoding: gzip, deflate Connection: keep-alive Host: dspacetest.cgiar.org User-Agent: HTTPie/1.0.2 HTTP/1.1 200 OK Connection: keep-alive Content-Language: en-US Content-Length: 2069158 Content-Type: application/pdf;charset=ISO-8859-1 Date: Sun, 07 Apr 2019 08:39:01 GMT Expires: Sun, 07 Apr 2019 09:39:01 GMT Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT Server: nginx Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly Vary: User-Agent X-Cocoon-Version: 2.2.0 X-Content-Type-Options: nosniff X-Frame-Options: SAMEORIGIN X-Robots-Tag: none X-XSS-Protection: 1; mode=block ``` - And from the server side, the nginx logs show: ``` 78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2" 78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2" ``` - So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr - After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log: ``` 2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818 2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818 ``` - So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned - Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log: ``` 2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} ``` - Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are `statistics_type:view`... very weird - I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18) - I will try to re-deploy the `5_x-dev` branch and test again - According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics) the default commit time is after 15 minutes or 10,000 documents (see `solrconfig.xml`) - I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they *do* register as downloads (even though they are internal): ``` $ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true' { "response": { "docs": [], "numFound": 909, "start": 0 }, "responseHeader": { "QTime": 0, "params": { "fq": [ "statistics_type:view", "isInternal:true" ], "indent": "true", "q": "type:0 AND time:2019-04-07*", "rows": "0", "wt": "json" }, "status": 0 } } ``` - I confirmed the same on CGSpace itself after making one HEAD request - So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week - I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace - Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded - This leads me to believe there is something specifically wrong with DSpace Test (linode19) - The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC - Holy shit, all this is actually because of the GeoIP1 deprecation and a missing `GeoLiteCity.dat` - For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error! - See: [DS-3986](https://jira.duraspace.org/browse/DS-3986) - See: [DS-4020](https://jira.duraspace.org/browse/DS-4020) - See: [DS-3832](https://jira.duraspace.org/browse/DS-3832) - DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been *removed* from MaxMind's server as of 2018-04-01 - Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this - UptimeRobot said CGSpace went down and up a few times tonight, and my first instict was to check `iostat 1 10` and I saw that CPU steal is around 10–30 percent right now... - The load average is super high right now, as I've noticed the last few times UptimeRobot said that CGSpace went down: ``` $ cat /proc/loadavg 10.70 9.17 8.85 18/633 4198 ``` - According to the server logs there is actually not much going on right now: ``` # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 118 18.195.78.144 128 207.46.13.219 129 167.114.64.100 159 207.46.13.129 179 207.46.13.33 188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142 195 66.249.79.59 363 40.77.167.21 740 2a01:4f8:140:3192::2 4823 45.5.184.72 # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 3 66.249.79.62 3 66.249.83.196 4 207.46.13.86 5 82.145.222.150 6 2a01:4f9:2b:1263::2 6 41.204.190.40 7 35.174.176.49 10 40.77.167.21 11 194.246.119.6 11 66.249.79.59 ``` - `45.5.184.72` is CIAT, who I already blocked and am waiting to hear from - `2a01:4f8:140:3192::2` is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve - `2408:8214:7a00:868f:7c1e:e0f3:20c6:c142` is some stupid Chinese bot making malicious POST requests - There are free database connections in the pool: ``` $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 5 dspaceApi 7 dspaceCli 23 dspaceWeb ``` - It seems that the issue with CGSpace being "down" is actually because of CPU steal again!!! - I opened a ticket with support and asked them to migrate the VM to a less busy host