November, 2019
2019-11-04
Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781
The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n 1 PUT 8 PROPFIND 283 OPTIONS 30102 POST 46581 HEAD 4594967 GET
Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)' 365288
Their user agent is one I’ve never seen before:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
Most of them seem to be to community or collection discover and browse results pages like
/handle/10568/103/discover
:# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c 6566 GET /bitstream 351928 GET /handle # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover 214209 # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse 86874
As far as I can tell, none of their requests are counted in the Solr statistics:
$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load
After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in
config/spiders/agents
, perhaps by dropping a new list in from Atmire’s COUNTER-Robots project- First I checked for a user agent that is in COUNTER-Robots, but NOT in the current
dspace/config/spiders/example
list Then I made some item and bitstream requests on DSpace Test using that user agent:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
- First I checked for a user agent that is in COUNTER-Robots, but NOT in the current
A bit later I checked Solr and found three requests from my IP with that user agent this month:
$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0' <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result> </response>
Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…
What’s strange is that the Solr spider agent configuration in
dspace/config/modules/solr-statistics.cfg
points to a file that doesn’t exist…spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…
I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr
- Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in
dspace/config/spiders/example
and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr - I restarted Tomcat and ran
dspace stats-util -m
and it did some stuff for awhile, but I still don’t see any items in Solr withisBot:true
- According to
dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java
the patterns for user agents are loaded from any file in theconfig/spiders/agents
directory - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran
dspace stats-util -m
and still there were no new items marked as being bots in Solr, so I think there is still something wrong Jesus, the code in
./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java
says thatstats-util -m
marks spider requests by their IPs, not by their user agents… WTF:else if (line.hasOption('m')) { SolrLogger.markRobotsByIP(); }
- Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in
WTF again, there is actually a function called
markRobotByUserAgent()
that is never called anywhere!- It appears to be unimplemented…
- I sent a message to the dspace-tech mailing list to ask if I should file an issue
2019-11-05
I added “alanfuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
After committing the changes in Solr I saw one request for “alanfuu1” and no requests for “alanfuu2”:
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true' $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="1" start="0"> $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="0" start="0"/>
So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
- Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
I’m curious how the special character matching is in Solr, so I will test two requests: one with “www.gnip.com” which is in the spider list, and one with “www.gnyp.com” which isn’t:
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
Then commit changes to Solr so we don’t have to wait:
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true' $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="0" start="0"/> $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="1" start="0">
So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…
2019-11-07
- CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
- They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty
- I have now merged the changes in to the
5_x-prod
branch (#432)
- I am reconsidering the move of
cg.identifier.dataurl
tocg.hasMetadata
in CG Core v2- The values of this field are mostly links to data sets on Dataverse and partner sites
- I opened an issue on GitHub to ask Marie-Angelique for clarification
Looking into CGSpace statistics again
I searched for hits in Solr from the BUbiNG bot and found 63,000 in the
statistics-2018
core:$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound <result name="response" numFound="62944" start="0">
Similar for com.plumanalytics, Grammarly, and ltx71!
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent: *com.plumanalytics*' | xmllint --format - | grep numFound <result name="response" numFound="28256" start="0"> $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound <result name="response" numFound="6288" start="0"> $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound <result name="response" numFound="105663" start="0">
Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true' $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound <result name="response" numFound="0" start="0"/>
I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
- For years 2010 until 2019 there are 1.6 million hits from these spider user agents
- For 2019 alone there are 740,000, over half of which come from Unpaywall!
Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:
$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit= 12&q=userAgent:*Unpaywall*' | xmllint --format - | less ... <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="dateYearMonth"> <int name="2019-10">198624</int> <int name="2019-05">88422</int> <int name="2019-06">79911</int> <int name="2019-09">67065</int> <int name="2019-07">39026</int> <int name="2019-08">36889</int> <int name="2019-04">36512</int> <int name="2019-11">760</int> </lst> </lst>
That answers Peter’s question about why the stats jumped in October…
2019-11-08
- I saw a bunch of user agents that have the literal string
User-Agent
in their user agent HTTP header, for example:User-Agent: Drupal (+http://drupal.org/)
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;
User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;
- I filed an issue on the COUNTER-Robots project to see if they agree to add
User-Agent:
to the list of robot user agents
2019-11-09
- Deploy the latest
5_x-prod
branch on CGSpace (linode19)- This includes the updated CCAFS phase II project tags and the updated spider user agents
- Run all system updates on CGSpace and reboot the server
- After rebooting it seems that all Solr statistics cores came back up fine…
I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace
- The script is called
check-spider-hits.sh
After a bunch of tests and checks I ran it for each statistics shard like so:
$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
- The script is called
Open a pull request against COUNTER-Robots to remove unnecessary escaping of dashes
2019-11-12
- Udana and Chandima emailed me to ask why one of their WLE items that is mapped from IWMI only shows up in the IWMI “department” on the Altmetric dashboard
- A search in the IWMI department shows the item
- A search in the WLE department shows no results
- I emailed Altmetric support to ask for help
- Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding
dc.rights
, fixing DOIs, adding ISSNs, etc) and noticed one issue with an item that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)- I tweeted the Handle to see if the score would get linked once Altmetric noticed it
2019-11-13
- The item with a low Altmetric score for its Handle that I tweeted yesterday still hasn’t linked with the DOI’s score
- I tweeted it again with the Handle and the DOI
Testing modifying some of the COUNTER-Robots patterns to use
[0-9]
instead of\d
digit character type, as Solr’s regex search can’t use those$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1" $ http "http://localhost:8081/solr/statistics/update?commit=true" $ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound <result name="response" numFound="1" start="0"> $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound <result name="response" numFound="1" start="0">
Nice, so searching with regex in Solr with
//
syntax works for those digits!I realized that it’s easier to search Solr from curl via POST using this syntax:
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
You can disable this using the
-g
option, but there are other benefits to searching with POST, for example it seems that I have less issues with escaping special parameters when using Solr’s regex search:$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
I updated the
check-spider-hits.sh
script to use the POST syntax, and I’m evaluating the feasability of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
2019-11-14
- IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
I will merge them with our existing list and then resolve their names using my
resolve-orcids.py
script:$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents) $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
I created a pull request and merged them into the
5_x-prod
branch- I will deploy them to CGSpace in the next few days
Greatly improve my
check-spider-hits.sh
script to handle regular expressions in the spider agents patterns file- This allows me to detect and purge many more hits from the Solr statistics core
- I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores
2019-11-15
- Run the new version of
check-spider-hits.sh
on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong - But then I noticed that some (all?) of the hits weren’t actually getting purged, all of which were using regular expressions like:
MetaURI[\+\s]API\/[0-9]\.[0-9]
FDM(\s|\+)[0-9]
Goldfire(\s|\+)Server
^Mozilla\/4\.0\+\(compatible;\)$
^Mozilla\/4\.0\+\(compatible;\+ICS\)$
^Mozilla\/4\.5\+\[en]\+\(Win98;\+I\)$
- Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
- Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is
- I tried to do URL encoding of the +, double escaping, etc… but nothing worked
- I’m going to ignore regular expressions that have pluses for now
- I think I might also have to ignore patterns that have percent signs, like
^\%?default\%?$
- After I added the ignores and did some more testing I finally ran the
check-spider-hits.sh
on all CGSpace Solr statistics cores and these are the number of hits purged from each core:- statistics-2010: 113
- statistics-2011: 7235
- statistics-2012: 0
- statistics-2013: 0
- statistics-2014: 316
- statistics-2015: 16809
- statistics-2016: 41732
- statistics-2017: 39207
- statistics-2018: 295546
- statistics: 1043373
- That’s 1.4 million hits in addition to the 2 million I purged earlier this week…
- For posterity, the major contributors to the hits on the statistics core were:
- Purging 812429 hits from curl\/ in statistics
- Purging 48206 hits from facebookexternalhit\/ in statistics
- Purging 72004 hits from PHP\/ in statistics
- Purging 76072 hits from Yeti\/[0-9] in statistics
Most of the curl hits were from CIAT in mid-2019, where they were using GuzzleHttp from PHP, which uses something like this for its user agent:
Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
Run system updates on DSpace Test and reboot the server
2019-11-17
- Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might have not been in the last harvesting yet
- I told her no, that the department is several years old, and the item was added in 2017
- Then I looked again at the dashboard for each department and I see the item in both departments now… shit.
- A search in the IWMI department shows the item
- A search in the WLE department shows the item
- I finally decided to revert
cg.hasMetadata
back tocg.identifier.dataurl
in my CG Core v2 branch (see #10) - Regarding the WLE item that has a much lower score than its DOI…
- I tweeted the item twice last week and the score never got linked
- Then I noticed that I had already made a note about the same issue in 2019-04, when I also tweeted it several times…
- I will ask Altmetric support for help with that
- Finally deploy
5_x-cgcorev2
branch on DSpace Test
2019-11-18
- I sent a mail to the CGSpace partners in Addis about the CG Core v2 changes on DSpace Test
- Then I filed an issue on the CG Core GitHub to let the metadata people know about our progress
- It seems like I will do a session about CG Core v2 implementation and limitations in DSpace for the data workshop in December in Nairobi (?)
2019-11-19
- Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
- I had previously sent them an export in 2019-04
- Atmire merged my pull request regarding unnecessary escaping of dashes in regular expressions, as well as my suggestion of adding “User-Agent” to the list of patterns
- I made another pull request to fix invalid escaping of one of their new patterns
- I ran my
check-spider-hits.sh
script again with these new patterns and found a bunch more statistics requests that match, for example:- Found 39560 hits from ^Buck\/[0-9] in statistics
- Found 5471 hits from ^User-Agent in statistics
- Found 2994 hits from ^Buck\/[0-9] in statistics-2018
- Found 14076 hits from ^User-Agent in statistics-2018
- Found 16310 hits from ^User-Agent in statistics-2017
- Found 4429 hits from ^User-Agent in statistics-2016
Buck is one I’ve never heard of before, its user agent is:
Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week