--- title: "November, 2019" date: 2019-11-04T12:20:30+02:00 author: "Alan Orth" categories: ["Notes"] --- ## 2019-11-04 - Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics - I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs: ``` # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694 ``` - So 4.6 million from XMLUI and another 1.2 million from API requests - Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): ``` # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 ``` - The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log) ``` # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n 1 PUT 8 PROPFIND 283 OPTIONS 30102 POST 46581 HEAD 4594967 GET ``` - Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October: ``` # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)' 365288 ``` - Their user agent is one I've never seen before: ``` Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) ``` - Most of them seem to be to community or collection discover and browse results pages like `/handle/10568/103/discover`: # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c 6566 GET /bitstream 351928 GET /handle # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover 214209 # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse 86874 ``` - As far as I can tell, none of their requests are counted in the Solr statistics: ``` $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true' ``` - Still, those requests are CPU intensive so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load - After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503): ``` $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1" ``` - On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list - Then I made some item and bitstream requests on DSpace Test using that user agent: ``` $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie" ``` - A bit later I checked Solr and found three requests from my IP with that user agent this month: ``` $ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0' 01ip:73.178.9.24 AND userAgent:iskaniedateYearMonth:2019-110 ``` - Now I want to make similar requests with a user agent that is included in DSpace's current user agent list: ``` $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial" ``` - After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list... - What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist... ``` spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt ``` - Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file... - I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr - Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr - I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for awhile, but I still don't see any items in Solr with `isBot:true` - According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong - Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF: ``` else if (line.hasOption('m')) { SolrLogger.markRobotsByIP(); } ``` - WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere! - It appears to be unimplemented... - I sent a message to the dspace-tech mailing list to ask if I should file an issue ## 2019-11-05 - I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test: ``` $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2" ``` - After committing the changes in Solr I saw one request for "alanfuu1" and no requests for "alanfuu2": ``` $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true' $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound ``` - So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list - Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents - I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't: ``` $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com" ``` - Then commit changes to Solr so we don't have to wait: ``` $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true' $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound ``` - So the blocking seems to be working because "www\.gnip\.com" is one of the new patterns added to the spiders file... ## 2019-11-07 - CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate - They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty - I have now merged the changes in to the `5_x-prod` branch ([#432](https://github.com/ilri/DSpace/pull/432)) - I am reconsidering the move of `cg.identifier.dataurl` to `cg.hasMetadata` in CG Core v2 - The values of this field are mostly links to data sets on Dataverse and partner sites - I opened an [issue on GitHub](https://github.com/AgriculturalSemantics/cg-core/issues/10) to ask Marie-Angelique for clarification - Looking into CGSpace statistics again - I searched for hits in Solr from the BUbiNG bot and found 63,000 in the `statistics-2018` core: ``` $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound ``` - Similar for com.plumanalytics, Grammarly, and ltx71! ``` $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent: *com.plumanalytics*' | xmllint --format - | grep numFound $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound ``` - Deleting these seems to work, for example the 105,000 ltx71 records from 2018: ``` $ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=userAgent:*ltx71*type:0&commit=true' $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound ``` - I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores - For years 2010 until 2019 there are 1.6 million hits from these spider user agents - For 2019 alone there are 740,000, over half of which come from Unpaywall! - Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10: ``` $ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit= 12&q=userAgent:*Unpaywall*' | xmllint --format - | less ... 198624 88422 79911 67065 39026 36889 36512 760 ``` - That answers Peter's question about why the stats jumped in October... ## 2019-11-08 - I saw a bunch of user agents that have the literal string `User-Agent` in their user agent HTTP header, for example: - `User-Agent: Drupal (+http://drupal.org/)` - `User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31` - `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;` - `User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)` - `User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;` - `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;` - I filed [an issue](https://github.com/atmire/COUNTER-Robots/issues/27) on the COUNTER-Robots project to see if they agree to add `User-Agent:` to the list of robot user agents ## 2019-11-09 - Deploy the latest `5_x-prod` branch on CGSpace (linode19) - This includes the updated CCAFS phase II project tags and the updated spider user agents - Run all system updates on CGSpace and reboot the server - After rebooting it seems that all Solr statistics cores came back up fine... - I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace - The script is called `check-spider-hits.sh` - After a bunch of tests and checks I ran it for each statistics shard like so: ``` $ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done ``` - Open a [pull request](https://github.com/atmire/COUNTER-Robots/pull/28) against COUNTER-Robots to remove unnecessary escaping of dashes ## 2019-11-12 - Udana and Chandima emailed me to ask why [one of their WLE items](https://hdl.handle.net/10568/81236) that is mapped from IWMI only shows up in the IWMI "department" on the Altmetric dashboard - A [search in the IWMI department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management) - A [search in the WLE department shows no results](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management) - I emailed Altmetric support to ask for help - Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding `dc.rights`, fixing DOIs, adding ISSNs, etc) and noticed one issue with [an item](https://hdl.handle.net/10568/97087) that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score) - I tweeted the Handle to see if the score would get linked once Altmetric noticed it ## 2019-11-13 - The [item with a low Altmetric score for its Handle](https://hdl.handle.net/10568/97087) that I tweeted yesterday still hasn't linked with the DOI's score - I tweeted it again with the Handle and the DOI - Testing modifying some of the COUNTER-Robots patterns to use `[0-9]` instead of `\d` digit character type, as Solr's regex search can't use those ``` $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1" $ http "http://localhost:8081/solr/statistics/update?commit=true" $ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound ``` - Nice, so searching with regex in Solr with `//` syntax works for those digits! - I realized that it's easier to search Solr from curl via POST using this syntax: ``` $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0") ``` - If the parameters include something like "[0-9]" then curl interprets it as a range and will make ten requests - You can disable this using the `-g` option, but there are other benefits to searching with POST, for example it seems that I have less issues with escaping special parameters when using Solr's regex search: ``` $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2' ``` - I updated the `check-spider-hits.sh` script to use the POST syntax, and I'm evaluating the feasability of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling ## 2019-11-14 - IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary - I will merge them with our existing list and then resolve their names using my `resolve-orcids.py` script: ``` $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents) $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml ``` - I created a [pull request](https://github.com/ilri/DSpace/pull/437) and merged them into the `5_x-prod` branch - I will deploy them to CGSpace in the next few days - Greatly improve my `check-spider-hits.sh` script to handle regular expressions in the spider agents patterns file - This allows me to detect and purge many more hits from the Solr statistics core - I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores