CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2019

2019-11-04

  • Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics

    • I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:

      # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
      4671942
      # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
      1277694
      
  • So 4.6 million from XMLUI and another 1.2 million from API requests

  • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    1183456 
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    106781
    
  • The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
      1 PUT
      8 PROPFIND
    283 OPTIONS
    30102 POST
    46581 HEAD
    4594967 GET
    
  • Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
    365288
    
  • Their user agent is one I’ve never seen before:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
    
  • Most of them seem to be to community or collection discover and browse results pages like /handle/10568/103/discover:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
    6566 GET /bitstream
    351928 GET /handle
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
    214209
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
    86874
    
  • As far as I can tell, none of their requests are counted in the Solr statistics:

    $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
    
  • Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load

  • After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
    
  • On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in config/spiders/agents, perhaps by dropping a new list in from Atmire’s COUNTER-Robots project

    • First I checked for a user agent that is in COUNTER-Robots, but NOT in the current dspace/config/spiders/example list
    • Then I made some item and bitstream requests on DSpace Test using that user agent:

      $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
      $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
      $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
      
  • A bit later I checked Solr and found three requests from my IP with that user agent this month:

    $ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
    </response>
    
  • Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
    $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
    
  • After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…

    • What’s strange is that the Solr spider agent configuration in dspace/config/modules/solr-statistics.cfg points to a file that doesn’t exist…

      spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
      
  • Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…

  • I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr

    • Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in dspace/config/spiders/example and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr
    • I restarted Tomcat and ran dspace stats-util -m and it did some stuff for awhile, but I still don’t see any items in Solr with isBot:true
    • According to dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java the patterns for user agents are loaded from any file in the config/spiders/agents directory
    • I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran dspace stats-util -m and still there were no new items marked as being bots in Solr, so I think there is still something wrong
    • Jesus, the code in ./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java says that stats-util -m marks spider requests by their IPs, not by their user agents… WTF:

      else if (line.hasOption('m'))
      {
      SolrLogger.markRobotsByIP();
      }
      
  • WTF again, there is actually a function called markRobotByUserAgent() that is never called anywhere!

    • It appears to be unimplemented…
    • I sent a message to the dspace-tech mailing list to ask if I should file an issue

2019-11-05

  • I added “alanfuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
    
  • After committing the changes in Solr I saw one request for “alanfuu1” and no requests for “alanfuu2”:

    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    <result name="response" numFound="1" start="0">
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    <result name="response" numFound="0" start="0"/>
    
  • So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list

    • Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
  • I’m curious how the special character matching is in Solr, so I will test two requests: one with “www.gnip.com” which is in the spider list, and one with “www.gnyp.com” which isn’t:

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
    
  • Then commit changes to Solr so we don’t have to wait:

    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
    <result name="response" numFound="0" start="0"/>
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    <result name="response" numFound="1" start="0">
    
  • So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…

2019-11-07

  • CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
    • They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty
    • I have now merged the changes in to the 5_x-prod branch (#432)
  • I am reconsidering the move of cg.identifier.dataurl to cg.hasMetadata in CG Core v2
    • The values of this field are mostly links to data sets on Dataverse and partner sites
    • I opened an issue on GitHub to ask Marie-Angelique for clarification
  • Looking into CGSpace statistics again

    • I searched for hits in Solr from the BUbiNG bot and found 63,000 in the statistics-2018 core:

      $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
      <result name="response" numFound="62944" start="0">
      
    • Similar for com.plumanalytics, Grammarly, and ltx71!

      $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
      *com.plumanalytics*' | xmllint --format - | grep numFound
      <result name="response" numFound="28256" start="0">
      $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
      <result name="response" numFound="6288" start="0">
      $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
      <result name="response" numFound="105663" start="0">
      
  • Deleting these seems to work, for example the 105,000 ltx71 records from 2018:

    $ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
    <result name="response" numFound="0" start="0"/>
    
  • I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores

    • For years 2010 until 2019 there are 1.6 million hits from these spider user agents
    • For 2019 alone there are 740,000, over half of which come from Unpaywall!
    • Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:

      $ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
      12&q=userAgent:*Unpaywall*' | xmllint --format - | less
      ...
      <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
      <lst name="dateYearMonth">
      <int name="2019-10">198624</int>
      <int name="2019-05">88422</int>
      <int name="2019-06">79911</int>
      <int name="2019-09">67065</int>
      <int name="2019-07">39026</int>
      <int name="2019-08">36889</int>
      <int name="2019-04">36512</int>
      <int name="2019-11">760</int>
      </lst>
      </lst>
      
  • That answers Peter’s question about why the stats jumped in October…

2019-11-08

  • I saw a bunch of user agents that have the literal string User-Agent in their user agent HTTP header, for example:
    • User-Agent: Drupal (+http://drupal.org/)
    • User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
    • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
    • User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
    • User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;
    • User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
  • I filed an issue on the COUNTER-Robots project to see if they agree to add User-Agent: to the list of robot user agents

2019-11-09

  • Deploy the latest 5_x-prod branch on CGSpace (linode19)
    • This includes the updated CCAFS phase II project tags and the updated spider user agents
  • Run all system updates on CGSpace and reboot the server
    • After rebooting it seems that all Solr statistics cores came back up fine…
  • I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace

    • The script is called check-spider-hits.sh
    • After a bunch of tests and checks I ran it for each statistics shard like so:

      $ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
      istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
      
  • Open a pull request against COUNTER-Robots to remove unnecessary escaping of dashes

2019-11-12

  • Udana and Chandima emailed me to ask why one of their WLE items that is mapped from IWMI only shows up in the IWMI “department” on the Altmetric dashboard
  • Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding dc.rights, fixing DOIs, adding ISSNs, etc) and noticed one issue with an item that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)
    • I tweeted the Handle to see if the score would get linked once Altmetric noticed it

2019-11-13

  • The item with a low Altmetric score for its Handle that I tweeted yesterday still hasn’t linked with the DOI’s score
    • I tweeted it again with the Handle and the DOI
  • Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of \d digit character type, as Solr’s regex search can’t use those

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
    $ http "http://localhost:8081/solr/statistics/update?commit=true"
    $ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
    <result name="response" numFound="1" start="0">
    $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
    <result name="response" numFound="1" start="0">
    
  • Nice, so searching with regex in Solr with // syntax works for those digits!

  • I realized that it’s easier to search Solr from curl via POST using this syntax:

    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
    
  • If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests

    • You can disable this using the -g option, but there are other benefits to searching with POST, for example it seems that I have less issues with escaping special parameters when using Solr’s regex search:

      $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
      
  • I updated the check-spider-hits.sh script to use the POST syntax, and I’m evaluating the feasability of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling

2019-11-14

  • IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
  • I will merge them with our existing list and then resolve their names using my resolve-orcids.py script:

    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
    $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
    # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    
  • I created a pull request and merged them into the 5_x-prod branch

    • I will deploy them to CGSpace in the next few days
  • Greatly improve my check-spider-hits.sh script to handle regular expressions in the spider agents patterns file

    • This allows me to detect and purge many more hits from the Solr statistics core
    • I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores

2019-11-15

  • Run the new version of check-spider-hits.sh on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong
  • But then I noticed that some (all?) of the hits weren’t actually getting purged, all of which were using regular expressions like:
    • MetaURI[\+\s]API\/[0-9]\.[0-9]
    • FDM(\s|\+)[0-9]
    • Goldfire(\s|\+)Server
    • ^Mozilla\/4\.0\+\(compatible;\)$
    • ^Mozilla\/4\.0\+\(compatible;\+ICS\)$
    • ^Mozilla\/4\.5\+\[en]\+\(Win98;\+I\)$
  • Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
  • Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is
    • I tried to do URL encoding of the +, double escaping, etc… but nothing worked
    • I’m going to ignore regular expressions that have pluses for now
  • I think I might also have to ignore patterns that have percent signs, like ^\%?default\%?$
  • After I added the ignores and did some more testing I finally ran the check-spider-hits.sh on all CGSpace Solr statistics cores and these are the number of hits purged from each core:
    • statistics-2010: 113
    • statistics-2011: 7235
    • statistics-2012: 0
    • statistics-2013: 0
    • statistics-2014: 316
    • statistics-2015: 16809
    • statistics-2016: 41732
    • statistics-2017: 39207
    • statistics-2018: 295546
    • statistics: 1043373
  • That’s 1.4 million hits in addition to the 2 million I purged earlier this week…
  • For posterity, the major contributors to the hits on the statistics core were:
    • Purging 812429 hits from curl\/ in statistics
    • Purging 48206 hits from facebookexternalhit\/ in statistics
    • Purging 72004 hits from PHP\/ in statistics
    • Purging 76072 hits from Yeti\/[0-9] in statistics
  • Most of the curl hits were from CIAT in mid-2019, where they were using GuzzleHttp from PHP, which uses something like this for its user agent:

    Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
    
  • Run system updates on DSpace Test and reboot the server

2019-11-17

  • Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might have not been in the last harvesting yet
  • I finally decided to revert cg.hasMetadata back to cg.identifier.dataurl in my CG Core v2 branch (see #10)
  • Regarding the WLE item that has a much lower score than its DOI…
    • I tweeted the item twice last week and the score never got linked
    • Then I noticed that I had already made a note about the same issue in 2019-04, when I also tweeted it several times…
    • I will ask Altmetric support for help with that
  • Finally deploy 5_x-cgcorev2 branch on DSpace Test

2019-11-18

  • I sent a mail to the CGSpace partners in Addis about the CG Core v2 changes on DSpace Test
  • Then I filed an issue on the CG Core GitHub to let the metadata people know about our progress
  • It seems like I will do a session about CG Core v2 implementation and limitations in DSpace for the data workshop in December in Nairobi (?)

2019-11-19

  • Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
    • I had previously sent them an export in 2019-04
  • Atmire merged my pull request regarding unnecessary escaping of dashes in regular expressions, as well as my suggestion of adding “User-Agent” to the list of patterns
  • I made another pull request to fix invalid escaping of one of their new patterns
  • I ran my check-spider-hits.sh script again with these new patterns and found a bunch more statistics requests that match, for example:
    • Found 39560 hits from ^Buck\/[0-9] in statistics
    • Found 5471 hits from ^User-Agent in statistics
    • Found 2994 hits from ^Buck\/[0-9] in statistics-2018
    • Found 14076 hits from ^User-Agent in statistics-2018
    • Found 16310 hits from ^User-Agent in statistics-2017
    • Found 4429 hits from ^User-Agent in statistics-2016
  • Buck is one I’ve never heard of before, its user agent is:

    Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
    
  • All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week