$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
+
+- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
+ - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/agents/example` list
+ - Then I made some item and bitstream requests on DSpace Test using that user agent:
+
+
+
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
+$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
+
+- A bit later I checked Solr and found three requests from my IP with that user agent this month:
+
+
+
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
+<?xml version="1.0" encoding="UTF-8"?>
+
+- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
+
+
+
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
+$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
+
+- After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
+ - What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist...
+
+
+
+spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
+
+- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
+- I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
+ - Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/agents/example` and then use DSpace's "mark spiders" feature to change them to `isBot:true` in Solr
+ - I restarted Tomcat and ran `dspace stats-util -m`, which ran for a while, but I still don't see any items in Solr with `isBot:true`
+ - According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
+ - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
+ - Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:
+
+
+
+else if (line.hasOption('m'))
+{
+    SolrLogger.markRobotsByIP();
+}
+
+- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
+ - It appears to be unimplemented...
+ - I sent a message to the dspace-tech mailing list to ask if I should file an issue
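
To make the user agent matching discussed above more concrete, here is a minimal Python sketch of loading one regex per line from every file in a directory and testing a user agent against them. This is not DSpace's code: the helper names `load_agent_patterns` and `is_spider` are hypothetical, and the case-insensitive substring matching and the skipping of `#` comment lines are assumptions about how such a list might be interpreted.

```python
import re
import tempfile
from pathlib import Path


def load_agent_patterns(agents_dir):
    """Compile one regex per non-empty, non-comment line of every file in agents_dir.

    Hypothetical helper: case-insensitive matching and '#' comments are assumptions.
    """
    patterns = []
    for path in sorted(Path(agents_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_spider(user_agent, patterns):
    """Return True if any pattern matches anywhere in the user agent string."""
    return any(pattern.search(user_agent) for pattern in patterns)


if __name__ == "__main__":
    # Build a throwaway directory standing in for config/spiders/agents
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "example").write_text(
            "# test patterns\niskanie\nalanfuuu2\n", encoding="utf-8"
        )
        patterns = load_agent_patterns(tmp)
        for agent in ("iskanie", "alanfuuu1", "alanfuuu2"):
            print(agent, is_spider(agent, patterns))
```

A check like this only answers whether a user agent matches the list. Whether a matching request is dropped at logging time or flagged afterwards is a separate question, which is exactly what `stats-util -m` left unclear.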
+
+## 2019-11-05
+
+- I added "alanfuuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
+
+
+
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
+
+- After committing the changes in Solr I saw one request for "alanfuuu1" and no requests for "alanfuuu2":
+
+
+
+$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+
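
The `xmllint | grep numFound` check above could also be scripted. The following is a rough Python sketch, not part of the original workflow: `build_select_url` and `num_found` are hypothetical helpers, `rows=0` is added on the assumption that only the count matters, and `SAMPLE_RESPONSE` is a made-up response in the general shape of Solr's XML output rather than anything captured from the server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Local Solr statistics core, as used in the commands above
SOLR_SELECT = "http://localhost:8081/solr/statistics/select"

# Illustrative response only (assumed shape, not real server output)
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader"><int name="status">0</int></lst>
  <result name="response" numFound="1" start="0"></result>
</response>
"""


def build_select_url(user_agent, year_month, base=SOLR_SELECT):
    """Build a select URL that counts hits for one user agent in one month."""
    params = {
        "q": f"userAgent:{user_agent}",
        "fq": f"dateYearMonth:{year_month}",
        "rows": 0,  # assumption: we only need numFound, not the documents
    }
    return f"{base}?{urlencode(params)}"


def num_found(xml_text):
    """Extract the numFound attribute from a Solr XML response."""
    result = ET.fromstring(xml_text).find("result")
    if result is None or result.get("numFound") is None:
        raise ValueError("no <result numFound=...> element in response")
    return int(result.get("numFound"))


if __name__ == "__main__":
    print(build_select_url("alanfuuu1", "2019-11"))
    print(num_found(SAMPLE_RESPONSE))
```

Fetching the URL is left out on purpose so the sketch runs without a Solr instance. In practice the response body from the `http` command would be passed to `num_found`.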