cgspace-notes/content/posts/2019-11.md

14 KiB

title date author categories
November, 2019 2019-11-04T12:20:30+02:00 Alan Orth
Notes

2019-11-04

  • Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
    • I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
1277694
  • So 4.6 million from XMLUI and another 1.2 million from API requests
  • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456 
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781
  • The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
      1 PUT
      8 PROPFIND
    283 OPTIONS
  30102 POST
  46581 HEAD
4594967 GET
  • Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
  • Their user agent is one I've never seen before:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
  • Most of them seem to be to community or collection discover and browse results pages like /handle/10568/103/discover:

zcat --force /var/log/nginx/access.log..gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c

6566 GET /bitstream 351928 GET /handle

zcat --force /var/log/nginx/access.log..gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover

214209

zcat --force /var/log/nginx/access.log..gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse

86874


- As far as I can tell, none of their requests are counted in the Solr statistics:

$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'


- Still, those requests are CPU intensive so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load
- After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"


- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
  - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
  - Then I made some item and bitstream requests on DSpace Test using that user agent:

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"


- A bit later I checked Solr and found three requests from my IP with that user agent this month:

$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'

01ip:73.178.9.24 AND userAgent:iskaniedateYearMonth:2019-110 ```
  • Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
  • After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
    • What's strange is that the Solr spider agent configuration in dspace/config/modules/solr-statistics.cfg points to a file that doesn't exist...
spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
  • Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
  • I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
    • Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in dspace/config/spiders/example and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
    • I restarted Tomcat and ran dspace stats-util -m and it did some stuff for awhile, but I still don't see any items in Solr with isBot:true
    • According to dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java the patterns for user agents are loaded from any file in the config/spiders/agents directory
    • I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran dspace stats-util -m and still there were no new items marked as being bots in Solr, so I think there is still something wrong
    • Jesus, the code in ./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java says that stats-util -m marks spider requests by their IPs, not by their user agents... WTF:
else if (line.hasOption('m'))
{
    SolrLogger.markRobotsByIP();
}
  • WTF again, there is actually a function called markRobotByUserAgent() that is never called anywhere!
    • It appears to be unimplemented...
    • I sent a message to the dspace-tech mailing list to ask if I should file an issue

2019-11-05

  • I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
  • After committing the changes in Solr I saw one request for "alanfuu1" and no requests for "alanfuu2":
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
  <result name="response" numFound="1" start="0">
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
  <result name="response" numFound="0" start="0"/>
  • So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
    • Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
  • I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
  • Then commit changes to Solr so we don't have to wait:
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
  <result name="response" numFound="0" start="0"/>
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
  <result name="response" numFound="1" start="0">
  • So the blocking seems to be working because "www.gnip.com" is one of the new patterns added to the spiders file...

2019-11-07

  • CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
    • They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty
    • I have now merged the changes in to the 5_x-prod branch (#432)
  • I am reconsidering the move of cg.identifier.dataurl to cg.hasMetadata in CG Core v2
    • The values of this field are mostly links to data sets on Dataverse and partner sites
    • I opened an issue on GitHub to ask Marie-Angelique for clarification
  • Looking into CGSpace statistics again
    • I searched for hits in Solr from the BUbiNG bot and found 63,000 in the statistics-2018 core:
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
  <result name="response" numFound="62944" start="0">
  • Similar for com.plumanalytics, Grammarly, and ltx71!
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
*com.plumanalytics*' | xmllint --format - | grep numFound
  <result name="response" numFound="28256" start="0">
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
  <result name="response" numFound="6288" start="0">
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
  <result name="response" numFound="105663" start="0">
  • Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
  <result name="response" numFound="0" start="0"/>
  • I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
    • For years 2010 until 2019 there are 1.6 million hits from these spider user agents
    • For 2019 alone there are 740,000, over half of which come from Unpaywall!
    • Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:
$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
12&q=userAgent:*Unpaywall*' | xmllint --format - | less
...
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2019-10">198624</int>
        <int name="2019-05">88422</int>
        <int name="2019-06">79911</int>
        <int name="2019-09">67065</int>
        <int name="2019-07">39026</int>
        <int name="2019-08">36889</int>
        <int name="2019-04">36512</int>
        <int name="2019-11">760</int>
      </lst>
    </lst>
  • That answers Peter's question about why the stats jumped in October...

2019-11-08

  • I saw a bunch of user agents that have the literal string User-Agent in their user agent HTTP header, for example:
    • User-Agent: Drupal (+http://drupal.org/)
    • User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
    • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
    • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
    • User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;
    • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
  • I filed an issue on the COUNTER-Robots project to see if they agree to add User-Agent: to the list of robot user agents

2019-11-09

  • Deploy the latest 5_x-prod branch on CGSpace (linode19)
    • This includes the updated CCAFS phase II project tags and the updated spider user agents
  • Run all system updates on CGSpace and reboot the server
    • After rebooting it seems that all Solr statistics cores came back up fine...
  • I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace
    • The script is called check-spider-hits.sh
    • After a bunch of tests and checks I ran it for each statistics shard like so:
$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done