```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
```

- Most of them seem to be to community or collection discover and browse results pages like `/handle/10568/103/discover`:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
   6566 GET /bitstream
 351928 GET /handle
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
214209
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
86874
```

- As far as I can tell, none of their requests are counted in the Solr statistics:

```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
```

- Still, those requests are CPU intensive so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load (a sketch of the idea is below)
- After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
```

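The general shape of that rate limiting in nginx is a `map` from the user agent to a `limit_req` zone key. This is a minimal sketch with hypothetical map and zone names, not our actual configuration:

```
map $http_user_agent $limit_badbots {
    default        "";
    ~*Amazonbot    $binary_remote_addr;
}

# requests with an empty key are not counted against the zone
limit_req_zone $limit_badbots zone=badbots:10m rate=1r/s;

# ...and inside the relevant location block:
# limit_req zone=badbots burst=5;
```
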
- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
  - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
  - Then I made some item and bitstream requests on DSpace Test using that user agent:

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
```
    
- A bit later I checked Solr and found three requests from my IP with that user agent this month:

```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
</response>
```

- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
```

- After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
  - What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist...

```
spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
```
    
- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
- I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
  - Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
  - I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for awhile, but I still don't see any items in Solr with `isBot:true`
  - According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
  - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
  - Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:

```
else if (line.hasOption('m'))
{
    SolrLogger.markRobotsByIP();
}
```
    
- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
  - It appears to be unimplemented...
  - I sent a message to the dspace-tech mailing list to ask if I should file an issue

## 2019-11-05

- I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
```
    
- After committing the changes in Solr I saw one request for "alanfuu1" and no requests for "alanfuu2":

```
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
```

- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
  - Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
- I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't:

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
```

- Then commit changes to Solr so we don't have to wait:

```
$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
```

- So the blocking seems to be working because "www\.gnip\.com" is one of the new patterns added to the spiders file...
    -
## 2019-11-07

- CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
  - They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty
  - I have now merged the changes in to the `5_x-prod` branch ([#432](https://github.com/ilri/DSpace/pull/432))
- I am reconsidering the move of `cg.identifier.dataurl` to `cg.hasMetadata` in CG Core v2
  - The values of this field are mostly links to data sets on Dataverse and partner sites
  - I opened an [issue on GitHub](https://github.com/AgriculturalSemantics/cg-core/issues/10) to ask Marie-Angelique for clarification
- Looking into CGSpace statistics again
  - I searched for hits in Solr from the BUbiNG bot and found 63,000 in the `statistics-2018` core:

```
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
```

  - Similar for com.plumanalytics, Grammarly, and ltx71!

```
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
*com.plumanalytics' | xmllint --format - | grep numFound
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
```

- Deleting these seems to work, for example the 105,000 ltx71 records from 2018:

```
$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
```
    
- I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores (a sketch of the approach is below)
  - For years 2010 until 2019 there are 1.6 million hits from these spider user agents
  - For 2019 alone there are 740,000, over half of which come from Unpaywall!
  - Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:

```
$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=12&q=userAgent:Unpaywall' | xmllint --format - | less
...
198624
88422
79911
67065
39026
36889
36512
760
...
```

- That answers Peter's question about why the stats jumped in October...

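A rough sketch of what that script does (hypothetical code, not the real `check-spider-hits.sh` mentioned below, which has more sanity checks and handles regex patterns separately): for each user agent pattern, count matching hits in a statistics core and optionally purge them with a delete query:

```
#!/usr/bin/env bash
solr_url="http://localhost:8081/solr"
patterns_file="/path/to/spider-agent-patterns.txt"   # placeholder path

while getopts "s:p:" opt; do
    case "$opt" in
        s) core="$OPTARG" ;;    # eg statistics-2018
        p) purge="$OPTARG" ;;   # "yes" to actually delete the hits
    esac
done

while read -r agent; do
    # skip blank lines and comments in the patterns file
    [[ -z "$agent" || "$agent" == \#* ]] && continue

    numfound=$(curl -s "$solr_url/$core/select" -d "q=userAgent:*${agent}*&rows=0" \
        | xmllint --xpath 'string(//result/@numFound)' -)

    if [[ "$numfound" -gt 0 ]]; then
        echo "Found $numfound hits from $agent in $core"
        if [[ "$purge" == "yes" ]]; then
            curl -s "$solr_url/$core/update?commit=true" -H 'Content-Type: text/xml' \
                --data-binary "<delete><query>userAgent:*${agent}*</query><query>type:0</query></delete>" > /dev/null
        fi
    fi
done < "$patterns_file"
```
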
## 2019-11-08

- I saw a bunch of user agents that have the literal string `User-Agent` in their user agent HTTP header, for example (one way to count these in the logs is sketched after this list):
  - `User-Agent: Drupal (+http://drupal.org/)`
  - `User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31`
  - `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;`
  - `User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)`
  - `User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;`
  - `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;`
- I filed [an issue](https://github.com/atmire/COUNTER-Robots/issues/27) on the COUNTER-Robots project to see if they agree to add `User-Agent:` to the list of robot user agents

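One way to spot these (my own example, assuming nginx's combined log format where the user agent is the sixth double-quote-delimited field):

```
# zcat --force /var/log/nginx/*access.log.*.gz | awk -F'"' '{print $6}' | grep -c '^User-Agent:'
```
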
## 2019-11-09

- Deploy the latest `5_x-prod` branch on CGSpace (linode19)
  - This includes the updated CCAFS phase II project tags and the updated spider user agents
- Run all system updates on CGSpace and reboot the server
  - After rebooting it seems that all Solr statistics cores came back up fine...
- I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace
  - The script is called `check-spider-hits.sh`
  - After a bunch of tests and checks I ran it for each statistics shard like so:

```
$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 statistics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
```
    
- Open a [pull request](https://github.com/atmire/COUNTER-Robots/pull/28) against COUNTER-Robots to remove unnecessary escaping of dashes

## 2019-11-12

- Udana and Chandima emailed me to ask why [one of their WLE items](https://hdl.handle.net/10568/81236) that is mapped from IWMI only shows up in the IWMI "department" on the Altmetric dashboard
  - A [search in the IWMI department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management)
  - A [search in the WLE department shows no results](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management)
  - I emailed Altmetric support to ask for help
- Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding `dc.rights`, fixing DOIs, adding ISSNs, etc) and noticed one issue with [an item](https://hdl.handle.net/10568/97087) that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)
  - I tweeted the Handle to see if the score would get linked once Altmetric noticed it

## 2019-11-13

- The [item with a low Altmetric score for its Handle](https://hdl.handle.net/10568/97087) that I tweeted yesterday still hasn't linked with the DOI's score
  - I tweeted it again with the Handle and the DOI
- Testing modifying some of the COUNTER-Robots patterns to use `[0-9]` instead of the `\d` digit character type, as Solr's regex search can't use those

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
$ http "http://localhost:8081/solr/statistics/update?commit=true"
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
```

- Nice, so searching with regex in Solr with `//` syntax works for those digits!
- I realized that it's easier to search Solr from curl via POST using this syntax:

```
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:Scrapoo&rows=0"
```

- If the parameters include something like "[0-9]" then curl interprets it as a range and will make ten requests (a quick illustration is below)
  - You can disable this using the `-g` option, but there are other benefits to searching with POST, for example it seems that I have less issues with escaping special parameters when using Solr's regex search:

```
$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|+)v2/&rows=2'
```

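A quick illustration of that globbing behaviour (my own example, not from the original notes): the first command makes ten requests because curl expands `[0-9]` as a range, while `-g` (aka `--globoff`) sends the brackets to Solr literally:

```
$ curl -s 'http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&rows=0'
$ curl -g -s 'http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&rows=0'
```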
    
- I updated the `check-spider-hits.sh` script to use the POST syntax, and I'm evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling

## 2019-11-14

- IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
- I will merge them with our existing list and then resolve their names using my `resolve-orcids.py` script (a rough sketch of the lookup is below):

```
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
```

- Then sort the names, copy them to `cg-creator-id.xml`, add XML formatting, and format the file with tidy (preserving accents):

```
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
```

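For reference, the core of that lookup is roughly the following (a sketch, not the actual `resolve-orcids.py`; the ORCID public API `v3.0/{id}/person` endpoint and its JSON field names are my assumptions):

```
$ while read -r orcid; do
    # fetch the public name record for each ORCID iD and print "Name: iD"
    name=$(curl -s -H 'Accept: application/json' "https://pub.orcid.org/v3.0/$orcid/person" \
      | jq -r '"\(.name."given-names".value) \(.name."family-name".value)"')
    echo "$name: $orcid"
  done < /tmp/2019-11-14-combined-orcids.txt
```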
    
- I created a [pull request](https://github.com/ilri/DSpace/pull/437) and merged them into the `5_x-prod` branch
  - I will deploy them to CGSpace in the next few days
- Greatly improve my `check-spider-hits.sh` script to handle regular expressions in the spider agents patterns file
  - This allows me to detect and purge many more hits from the Solr statistics core
  - I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores

## 2019-11-15

- Run the new version of `check-spider-hits.sh` on CGSpace's Solr statistics cores one by one, starting from the oldest just in case something goes wrong
- But then I noticed that some (all?) of the hits weren't actually getting purged, all of which were using regular expressions like:
  - `MetaURI[\+\s]API\/[0-9]\.[0-9]`
  - `FDM(\s|\+)[0-9]`
  - `Goldfire(\s|\+)Server`
  - `^Mozilla\/4\.0\+\(compatible;\)$`
  - `^Mozilla\/4\.0\+\(compatible;\+ICS\)$`
  - `^Mozilla\/4\.5\+\[en]\+\(Win98;\+I\)$`
- Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
- Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is
  - I tried to do URL encoding of the +, double escaping, etc... but nothing worked
  - I'm going to ignore regular expressions that have pluses for now (a sketch of this guard is after this list)
- I think I might also have to ignore patterns that have percent signs, like `^\%?default\%?$`
- After I added the ignores and did some more testing I finally ran `check-spider-hits.sh` on all CGSpace Solr statistics cores and these are the number of hits purged from each core:
  - statistics-2010: 113
  - statistics-2011: 7235
  - statistics-2012: 0
  - statistics-2013: 0
  - statistics-2014: 316
  - statistics-2015: 16809
  - statistics-2016: 41732
  - statistics-2017: 39207
  - statistics-2018: 295546
  - statistics: 1043373
- That's 1.4 million hits in addition to the 2 million I purged earlier this week...
- For posterity, the major contributors to the hits on the statistics core were:
  - Purging 812429 hits from curl\/ in statistics
  - Purging 48206 hits from facebookexternalhit\/ in statistics
  - Purging 72004 hits from PHP\/ in statistics
  - Purging 76072 hits from Yeti\/[0-9] in statistics
- Most of the curl hits were from CIAT in mid-2019, where they were using [GuzzleHttp](https://guzzle3.readthedocs.io/http-client/client.html) from PHP, which uses something like this for its user agent:

```
Guzzle/<version> curl/<version> PHP/<version>
```

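The guard for those problematic patterns can be as simple as this hypothetical fragment (it would sit inside the script's per-pattern loop; this is not the actual `check-spider-hits.sh` code):

```
# skip patterns that contain plus or percent signs, since they get mangled
# somewhere between the shell, URL encoding, and Solr's query parser
case "$agent" in
    *+*|*%*)
        echo "Skipping $agent"
        continue
        ;;
esac
```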
    
- Run system updates on DSpace Test and reboot the server

## 2019-11-17

- Altmetric support responded about our dashboard question, asking if the second "department" (aka WLE's collection) was added recently and might not have been in the last harvesting yet
  - I told her no, that the department is several years old, and the item was added in 2017
  - Then I looked again at the dashboard for each department and I see the item in both departments now... shit.
  - A [search in the IWMI department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management)
  - A [search in the WLE department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management)
- I finally decided to revert `cg.hasMetadata` back to `cg.identifier.dataurl` in my CG Core v2 branch (see [#10](https://github.com/AgriculturalSemantics/cg-core/issues/10))
- Regarding the [WLE item](https://hdl.handle.net/10568/97087) that has a much lower score than its DOI...
  - I tweeted the item twice last week and the score never got linked
  - Then I noticed that I had already made a note about the same issue in 2019-04, when I also tweeted it several times...
  - I will ask Altmetric support for help with that
- Finally deploy the `5_x-cgcorev2` branch on DSpace Test

## 2019-11-18

- I sent a mail to the CGSpace partners in Addis about the CG Core v2 changes on DSpace Test
- Then I filed an [issue on the CG Core GitHub](https://github.com/AgriculturalSemantics/cg-core/issues/11) to let the metadata people know about our progress
- It seems like I will do a session about CG Core v2 implementation and limitations in DSpace for the data workshop in December in Nairobi (?)

## 2019-11-19

- Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
  - I had previously sent them an export in 2019-04
- Atmire merged my [pull request regarding unnecessary escaping of dashes](https://github.com/atmire/COUNTER-Robots/pull/28) in regular expressions, as well as [my suggestion of adding "User-Agent" to the list of patterns](https://github.com/atmire/COUNTER-Robots/issues/27)
- I made another [pull request to fix invalid escaping of one of their new patterns](https://github.com/atmire/COUNTER-Robots/pull/29)
- I ran my `check-spider-hits.sh` script again with these new patterns and found a bunch more statistics requests that match, for example:
  - Found 39560 hits from ^Buck\/[0-9] in statistics
  - Found 5471 hits from ^User-Agent in statistics
  - Found 2994 hits from ^Buck\/[0-9] in statistics-2018
  - Found 14076 hits from ^User-Agent in statistics-2018
  - Found 16310 hits from ^User-Agent in statistics-2017
  - Found 4429 hits from ^User-Agent in statistics-2016
- Buck is one I've never heard of before, its user agent is:

```
Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
```
