cgspace-notes/content/posts/2019-11.md

17 KiB

title date author categories
November, 2019 2019-11-04T12:20:30+02:00 Alan Orth
Notes

2019-11-04

  • Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
    • I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
1277694
  • So 4.6 million from XMLUI and another 1.2 million from API requests
  • Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456 
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781
  • The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
      1 PUT
      8 PROPFIND
    283 OPTIONS
  30102 POST
  46581 HEAD
4594967 GET
  • Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
  • Their user agent is one I've never seen before:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
  • Most of them seem to be to community or collection discover and browse results pages like /handle/10568/103/discover:

zcat --force /var/log/nginx/access.log..gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c

6566 GET /bitstream 351928 GET /handle

zcat --force /var/log/nginx/access.log..gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover

214209

zcat --force /var/log/nginx/access.log..gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse

86874


- As far as I can tell, none of their requests are counted in the Solr statistics:

$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'


- Still, those requests are CPU intensive so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load
- After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"


- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
  - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
  - Then I made some item and bitstream requests on DSpace Test using that user agent:

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"


- A bit later I checked Solr and found three requests from my IP with that user agent this month:

$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'

01ip:73.178.9.24 AND userAgent:iskaniedateYearMonth:2019-110 ```
  • Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
  • After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
    • What's strange is that the Solr spider agent configuration in dspace/config/modules/solr-statistics.cfg points to a file that doesn't exist...
spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
  • Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
  • I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
    • Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in dspace/config/spiders/example and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
    • I restarted Tomcat and ran dspace stats-util -m and it did some stuff for awhile, but I still don't see any items in Solr with isBot:true
    • According to dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java the patterns for user agents are loaded from any file in the config/spiders/agents directory
    • I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran dspace stats-util -m and still there were no new items marked as being bots in Solr, so I think there is still something wrong
    • Jesus, the code in ./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java says that stats-util -m marks spider requests by their IPs, not by their user agents... WTF:
else if (line.hasOption('m'))
{
    SolrLogger.markRobotsByIP();
}
  • WTF again, there is actually a function called markRobotByUserAgent() that is never called anywhere!
    • It appears to be unimplemented...
    • I sent a message to the dspace-tech mailing list to ask if I should file an issue

2019-11-05

  • I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
  • After committing the changes in Solr I saw one request for "alanfuu1" and no requests for "alanfuu2":
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
  <result name="response" numFound="1" start="0">
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
  <result name="response" numFound="0" start="0"/>
  • So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
    • Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
  • I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
  • Then commit changes to Solr so we don't have to wait:
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
  <result name="response" numFound="0" start="0"/>
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
  <result name="response" numFound="1" start="0">
  • So the blocking seems to be working because "www.gnip.com" is one of the new patterns added to the spiders file...

2019-11-07

  • CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
    • They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty
    • I have now merged the changes in to the 5_x-prod branch (#432)
  • I am reconsidering the move of cg.identifier.dataurl to cg.hasMetadata in CG Core v2
    • The values of this field are mostly links to data sets on Dataverse and partner sites
    • I opened an issue on GitHub to ask Marie-Angelique for clarification
  • Looking into CGSpace statistics again
    • I searched for hits in Solr from the BUbiNG bot and found 63,000 in the statistics-2018 core:
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
  <result name="response" numFound="62944" start="0">
  • Similar for com.plumanalytics, Grammarly, and ltx71!
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
*com.plumanalytics*' | xmllint --format - | grep numFound
  <result name="response" numFound="28256" start="0">
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
  <result name="response" numFound="6288" start="0">
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
  <result name="response" numFound="105663" start="0">
  • Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
  <result name="response" numFound="0" start="0"/>
  • I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
    • For years 2010 until 2019 there are 1.6 million hits from these spider user agents
    • For 2019 alone there are 740,000, over half of which come from Unpaywall!
    • Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:
$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
12&q=userAgent:*Unpaywall*' | xmllint --format - | less
...
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2019-10">198624</int>
        <int name="2019-05">88422</int>
        <int name="2019-06">79911</int>
        <int name="2019-09">67065</int>
        <int name="2019-07">39026</int>
        <int name="2019-08">36889</int>
        <int name="2019-04">36512</int>
        <int name="2019-11">760</int>
      </lst>
    </lst>
  • That answers Peter's question about why the stats jumped in October...

2019-11-08

  • I saw a bunch of user agents that have the literal string User-Agent in their user agent HTTP header, for example:
    • User-Agent: Drupal (+http://drupal.org/)
    • User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
    • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
    • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
    • User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;
    • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
  • I filed an issue on the COUNTER-Robots project to see if they agree to add User-Agent: to the list of robot user agents

2019-11-09

  • Deploy the latest 5_x-prod branch on CGSpace (linode19)
    • This includes the updated CCAFS phase II project tags and the updated spider user agents
  • Run all system updates on CGSpace and reboot the server
    • After rebooting it seems that all Solr statistics cores came back up fine...
  • I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace
    • The script is called check-spider-hits.sh
    • After a bunch of tests and checks I ran it for each statistics shard like so:
$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
  • Open a pull request against COUNTER-Robots to remove unnecessary escaping of dashes

2019-11-12

  • Udana and Chandima emailed me to ask why one of their WLE items that is mapped from IWMI only shows up in the IWMI "department" on the Altmetric dashboard
  • Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding dc.rights, fixing DOIs, adding ISSNs, etc) and noticed one issue with an item that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)
    • I tweeted the Handle to see if the score would get linked once Altmetric noticed it

2019-11-13

  • The item with a low Altmetric score for its Handle that I tweeted yesterday still hasn't linked with the DOI's score
    • I tweeted it again with the Handle and the DOI
  • Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr's regex search can't use those
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
$ http "http://localhost:8081/solr/statistics/update?commit=true"
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
  <result name="response" numFound="1" start="0">
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
  <result name="response" numFound="1" start="0">
  • Nice, so searching with regex in Solr with // syntax works for those digits!
  • I realized that it's easier to search Solr from curl via POST using this syntax:
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
  • If the parameters include something like "[0-9]" then curl interprets it as a range and will make ten requests
    • You can disable this using the -g option, but there are other benefits to searching with POST, for example it seems that I have less issues with escaping special parameters when using Solr's regex search:
$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
  • I updated the check-spider-hits.sh script to use the POST syntax, and I'm evaluating the feasability of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling