From ab0e83bfcc25bc1f98fe7c51f61396e0950052ff Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Tue, 5 Nov 2019 10:37:16 +0200
Subject: [PATCH] Add notes for 2019-11-05

---
 content/posts/2019-11.md | 76 +++++++++++++++++++++++++++++++
 docs/2019-11/index.html  | 96 ++++++++++++++++++++++++++++++++++++++--
 docs/sitemap.xml         | 10 ++---
 3 files changed, 173 insertions(+), 9 deletions(-)

diff --git a/content/posts/2019-11.md b/content/posts/2019-11.md
index 9cb2555e5..5aa26a41e 100644
--- a/content/posts/2019-11.md
+++ b/content/posts/2019-11.md
@@ -78,4 +78,80 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.
 $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
 ```
+- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
+  - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
+  - Then I made some item and bitstream requests on DSpace Test using that user agent:
+
+```
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
+$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
+```
+
+- A bit later I checked Solr and found three requests from my IP with that user agent this month:
+
+```
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
+
+
+01ip:73.178.9.24 AND userAgent:iskaniedateYearMonth:2019-110
+
+```
+
+- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
+
+```
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
+$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
+```
+
+- After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
+  - What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist...
+
+```
+spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
+```
+
+- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
+- I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
+  - Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
+  - I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for a while, but I still don't see any items in Solr with `isBot:true`
+  - According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
+  - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
+  - Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:
+
+```
+else if (line.hasOption('m'))
+{
+    SolrLogger.markRobotsByIP();
+}
+```
+
+- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
+  - It appears to be unimplemented...
+  - I sent a message to the dspace-tech mailing list to ask if I should file an issue
+
+## 2019-11-05
+
+- I added "alanfuuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
+
+```
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
+```
+
+- After committing the changes in Solr I saw one request for "alanfuuu1" and no requests for "alanfuuu2":
+
+```
+$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+
+```
+
+- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
+  - Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
+
diff --git a/docs/2019-11/index.html b/docs/2019-11/index.html
index d0ed70d95..abb83d001 100644
--- a/docs/2019-11/index.html
+++ b/docs/2019-11/index.html
@@ -34,7 +34,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
-
+
@@ -73,9 +73,9 @@ Let’s see how many of the REST API requests were for bitstreams (because t
     "@type": "BlogPosting",
     "headline": "November, 2019",
     "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-11\/",
-    "wordCount": "385",
+    "wordCount": "931",
     "datePublished": "2019-11-04T12:20:30+02:00",
-    "dateModified": "2019-11-04T12:20:30+02:00",
+    "dateModified": "2019-11-04T16:41:19+02:00",
     "author": {
       "@type": "Person",
       "name": "Alan Orth"
@@ -222,9 +222,97 @@ Let’s see how many of the REST API requests were for bitstreams (because t
-

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1" +

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"

+ +

+- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
+  - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
+  - Then I made some item and bitstream requests on DSpace Test using that user agent:
+
+
+ +

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
+$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"

+ +

+- A bit later I checked Solr and found three requests from my IP with that user agent this month:
+
+
+ +

$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
+<?xml version="1.0" encoding="UTF-8"?>
+
+01ip:73.178.9.24 AND userAgent:iskaniedateYearMonth:2019-110
+

+ +

+- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
+
+
+ +

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
+$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"

+ +

+- After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
+  - What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist...
+
+
+ +

spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt

+ +

+- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
+- I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
+  - Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
+  - I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for a while, but I still don't see any items in Solr with `isBot:true`
+  - According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
+  - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
+  - Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:
+
+
+ +

else if (line.hasOption('m'))
+{
+    SolrLogger.markRobotsByIP();
+}

+ +

+- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
+  - It appears to be unimplemented...
+  - I sent a message to the dspace-tech mailing list to ask if I should file an issue
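
The user agent filtering discussed above can be sketched roughly as follows. This is only an illustration, assuming one regular expression per line in a patterns file and case-insensitive matching anywhere in the user agent string. The helper names `load_patterns` and `is_spider` are hypothetical, and this is not DSpace's actual `SpiderDetector` code.

```python
import re


def load_patterns(lines):
    """Compile one regex per non-empty, non-comment line (assumed file format)."""
    patterns = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        # Case-insensitive matching is an assumption made for this sketch.
        patterns.append(re.compile(text, re.IGNORECASE))
    return patterns


def is_spider(user_agent, patterns):
    """Return True if any pattern matches somewhere in the user agent."""
    return any(p.search(user_agent) for p in patterns)


# Hypothetical patterns standing in for a file under config/spiders/agents.
patterns = load_patterns(["# example list", "iskanie", "celestial", ""])
print(is_spider("iskanie", patterns))
print(is_spider("fuuuualan", patterns))
```

Under these assumptions a request whose user agent matches any listed pattern would be treated as a spider, while an unlisted one such as "fuuuualan" would not.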
+
+## 2019-11-05
+
+- I added "alanfuuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
+
+
+ +

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"

+ +

+- After committing the changes in Solr I saw one request for "alanfuuu1" and no requests for "alanfuuu2":
+
+
+ +

$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+ ```

+
    +
  • So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list + +
      +
    • Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
    • +
  • +
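
The `numFound` checks above could also be scripted. The following is a minimal sketch, assuming the Solr statistics core returns its usual XML response with a `<result numFound="...">` element. The function names `build_count_url` and `parse_num_found` and the sample response string are made up for illustration and are not output captured from DSpace Test.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_count_url(base_url, user_agent, year_month):
    """Build a select URL that counts hits for one user agent in one month."""
    params = {
        "q": f"userAgent:{user_agent}",
        "fq": f"dateYearMonth:{year_month}",
        "rows": 0,  # only the count is needed, not the documents
    }
    return f"{base_url}/select?{urlencode(params)}"


def parse_num_found(xml_text):
    """Extract numFound from a Solr XML response body."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    if result is None:
        raise ValueError("no <result> element in Solr response")
    return int(result.attrib["numFound"])


url = build_count_url("http://localhost:8081/solr/statistics", "alanfuuu1", "2019-11")
print(url)

# Hypothetical response body, not real output from the server.
sample = '<response><result name="response" numFound="1" start="0"/></response>'
print(parse_num_found(sample))
```

Fetching the URL is left out on purpose so the sketch runs without a Solr instance; any HTTP client could supply the response text.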
+ diff --git a/docs/sitemap.xml b/docs/sitemap.xml index a6d569109..7a4348348 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,27 +4,27 @@ https://alanorth.github.io/cgspace-notes/categories/ - 2019-11-04T12:20:30+02:00 + 2019-11-04T16:41:19+02:00 https://alanorth.github.io/cgspace-notes/ - 2019-11-04T12:20:30+02:00 + 2019-11-04T16:41:19+02:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2019-11-04T12:20:30+02:00 + 2019-11-04T16:41:19+02:00 https://alanorth.github.io/cgspace-notes/2019-11/ - 2019-11-04T12:20:30+02:00 + 2019-11-04T16:41:19+02:00 https://alanorth.github.io/cgspace-notes/posts/ - 2019-11-04T12:20:30+02:00 + 2019-11-04T16:41:19+02:00