November, 2019
2019-11-04
Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781
The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n 1 PUT 8 PROPFIND 283 OPTIONS 30102 POST 46581 HEAD 4594967 GET
Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)' 365288
Their user agent is one I’ve never seen before:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
Most of them seem to be to community or collection discover and browse results pages like
/handle/10568/103/discover
:
zcat –force /var/log/nginx/access.log..gz | grep -E “[0-9]{1,2}/Oct/2019” | grep Amazonbot | grep -o -E “GET /(bitstream|discover|handle)” | sort | uniq -c
6566 GET /bitstream 351928 GET /handle
zcat –force /var/log/nginx/access.log..gz | grep -E “[0-9]{1,2}/Oct/2019” | grep Amazonbot | grep -E “GET /(bitstream|discover|handle)” | grep -c discover
214209
zcat –force /var/log/nginx/access.log..gz | grep -E “[0-9]{1,2}/Oct/2019” | grep Amazonbot | grep -E “GET /(bitstream|discover|handle)” | grep -c browse
86874
- As far as I can tell, none of their requests are counted in the Solr statistics:
$ http –print b ‘http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
- Still, those requests are CPU intensive so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load
- After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):
$ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:“Amazonbot/0.1”
- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
- First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
- Then I made some item and bitstream requests on DSpace Test using that user agent:
$ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“iskanie” $ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“iskanie” $ http –print Hh ‘https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:“iskanie”
- A bit later I checked Solr and found three requests from my IP with that user agent this month:
$ http –print b ‘http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
<?xml version=“1.0” encoding=“UTF-8”?>
- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
$ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“celestial” $ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“celestial” $ http –print Hh ‘https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:“celestial”
- After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
- What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist...
spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
- I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
- Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
- I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for awhile, but I still don't see any items in Solr with `isBot:true`
- According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
- I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
- Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:
else if (line.hasOption(’m’)) { SolrLogger.markRobotsByIP(); }
- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
- It appears to be unimplemented...
- I sent a message to the dspace-tech mailing list to ask if I should file an issue
## 2019-11-05
- I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
$ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“alanfuuu1” $ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“alanfuuu2”
- After committing the changes in Solr I saw one request for "alanfuu1" and no requests for "alanfuu2":
$ http –print b ‘http://localhost:8081/solr/statistics/update?commit=true'
$ http –print b ‘http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint –format - | grep numFound
- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
- Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
- I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't:
$ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“www.gnip.com” $ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“www.gnyp.com”
- Then commit changes to Solr so we don't have to wait:
$ http –print b ‘http://localhost:8081/solr/statistics/update?commit=true'
$ http –print b ‘http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint –format - | grep numFound
- So the blocking seems to be working because "www\.gnip\.com" is one of the new patterns added to the spiders file...
## 2019-11-07
- CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
- They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty
- I have now merged the changes in to the `5_x-prod` branch ([#432](https://github.com/ilri/DSpace/pull/432))
- I am reconsidering the move of `cg.identifier.dataurl` to `cg.hasMetadata` in CG Core v2
- The values of this field are mostly links to data sets on Dataverse and partner sites
- I opened an [issue on GitHub](https://github.com/AgriculturalSemantics/cg-core/issues/10) to ask Marie-Angelique for clarification
- Looking into CGSpace statistics again
- I searched for hits in Solr from the BUbiNG bot and found 63,000 in the `statistics-2018` core:
$ http –print b ‘http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint –format - | grep numFound
- Similar for com.plumanalytics, Grammarly, and ltx71!
$ http –print b ‘http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
com.plumanalytics’ | xmllint –format - | grep numFound
- Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
$ http –print b ‘http://localhost:8081/solr/statistics-2018/update?stream.body=
- I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
- For years 2010 until 2019 there are 1.6 million hits from these spider user agents
- For 2019 alone there are 740,000, over half of which come from Unpaywall!
- Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:
$ curl -s ‘http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
12&q=userAgent:Unpaywall’ | xmllint –format - | less
…
- That answers Peter's question about why the stats jumped in October...
## 2019-11-08
- I saw a bunch of user agents that have the literal string `User-Agent` in their user agent HTTP header, for example:
- `User-Agent: Drupal (+http://drupal.org/)`
- `User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31`
- `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;`
- `User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)`
- `User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;`
- `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;`
- I filed [an issue](https://github.com/atmire/COUNTER-Robots/issues/27) on the COUNTER-Robots project to see if they agree to add `User-Agent:` to the list of robot user agents
## 2019-11-09
- Deploy the latest `5_x-prod` branch on CGSpace (linode19)
- This includes the updated CCAFS phase II project tags and the updated spider user agents
- Run all system updates on CGSpace and reboot the server
- After rebooting it seems that all Solr statistics cores came back up fine...
- I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace
- The script is called `check-spider-hits.sh`
- After a bunch of tests and checks I ran it for each statistics shard like so:
$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
- Open a [pull request](https://github.com/atmire/COUNTER-Robots/pull/28) against COUNTER-Robots to remove unnecessary escaping of dashes
## 2019-11-12
- Udana and Chandima emailed me to ask why [one of their WLE items](https://hdl.handle.net/10568/81236) that is mapped from IWMI only shows up in the IWMI "department" on the Altmetric dashboard
- A [search in the IWMI department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management)
- A [search in the WLE department shows no results](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management)
- I emailed Altmetric support to ask for help
- Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding `dc.rights`, fixing DOIs, adding ISSNs, etc) and noticed one issue with [an item](https://hdl.handle.net/10568/97087) that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)
- I tweeted the Handle to see if the score would get linked once Altmetric noticed it
## 2019-11-13
- The [item with a low Altmetric score for its Handle](https://hdl.handle.net/10568/97087) that I tweeted yesterday still hasn't linked with the DOI's score
- I tweeted it again with the Handle and the DOI
- Testing modifying some of the COUNTER-Robots patterns to use `[0-9]` instead of `\d` digit character type, as Solr's regex search can't use those
$ http –print Hh ‘https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:“Scrapoo/1”
$ http “http://localhost:8081/solr/statistics/update?commit=true"
$ http “http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint –format - | grep numFound
- Nice, so searching with regex in Solr with `//` syntax works for those digits!
- I realized that it's easier to search Solr from curl via POST using this syntax:
$ curl -s “http://localhost:8081/solr/statistics/select" -d “q=userAgent:Scrapoo&rows=0”)
- If the parameters include something like "[0-9]" then curl interprets it as a range and will make ten requests
- You can disable this using the `-g` option, but there are other benefits to searching with POST, for example it seems that I have less issues with escaping special parameters when using Solr's regex search:
$ curl -s ‘http://localhost:8081/solr/statistics/select' -d ‘q=userAgent:/Postgenomic(\s|+)v2/&rows=2’
- I updated the `check-spider-hits.sh` script to use the POST syntax, and I'm evaluating the feasability of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
## 2019-11-14
- IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
- I will merge them with our existing list and then resolve their names using my `resolve-orcids.py` script:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE ‘[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}’ | sort | uniq > /tmp/2019-11-14-combined-orcids.txt $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
- I created a [pull request](https://github.com/ilri/DSpace/pull/437) and merged them into the `5_x-prod` branch
- I will deploy them to CGSpace in the next few days
- Greatly improve my `check-spider-hits.sh` script to handle regular expressions in the spider agents patterns file
- This allows me to detect and purge many more hits from the Solr statistics core
- I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores
## 2019-11-15
- Run the new version of `check-spider-hits.sh` on CGSpace's Solr statistics cores one by one, starting from the oldest just in case something goes wrong
- But then I noticed that some (all?) of the hits weren't actually getting purged, all of which were using regular expressions like:
- `MetaURI[\+\s]API\/[0-9]\.[0-9]`
- `FDM(\s|\+)[0-9]`
- `Goldfire(\s|\+)Server`
- `^Mozilla\/4\.0\+\(compatible;\)$`
- `^Mozilla\/4\.0\+\(compatible;\+ICS\)$`
- `^Mozilla\/4\.5\+\[en]\+\(Win98;\+I\)$`
- Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
- Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is
- I tried to do URL encoding of the +, double escaping, etc... but nothing worked
- I'm going to ignore regular expressions that have pluses for now
- I think I might also have to ignore patterns that have percent signs, like `^\%?default\%?$`
- After I added the ignores and did some more testing I finally ran the `check-spider-hits.sh` on all CGSpace Solr statistics cores and these are the number of hits purged from each core:
- statistics-2010: 113
- statistics-2011: 7235
- statistics-2012: 0
- statistics-2013: 0
- statistics-2014: 316
- statistics-2015: 16809
- statistics-2016: 41732
- statistics-2017: 39207
- statistics-2018: 295546
- statistics: 1043373
- That's 1.4 million hits in addition to the 2 million I purged earlier this week...
- For posterity, the major contributors to the hits on the statistics core were:
- Purging 812429 hits from curl\/ in statistics
- Purging 48206 hits from facebookexternalhit\/ in statistics
- Purging 72004 hits from PHP\/ in statistics
- Purging 76072 hits from Yeti\/[0-9] in statistics
- Most of the curl hits were from CIAT in mid-2019, where they were using [GuzzleHttp](https://guzzle3.readthedocs.io/http-client/client.html) from PHP, which uses something like this for its user agent:
Guzzle/
- Run system updates on DSpace Test and reboot the server