Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +
- Most of them seem to be to community or collection discover and browse results pages like `/handle/10568/103/discover`:
- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from Atmire's COUNTER-Robots project
- First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
- Then I made some item and bitstream requests on DSpace Test using that user agent:
- A bit later I checked Solr and found three requests from my IP with that user agent this month:
$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:'
<?xml version="1.0" encoding="UTF-8"?>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip: AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
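- A minimal sketch of reading the hit count out of a reply like the one above, assuming the complete Solr XML reply is wrapped in a `<response>` root element; `num_found`, `SAMPLE`, and the `192.0.2.1` documentation IP are hypothetical placeholders, not values from my logs:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample reply; the IP is a placeholder from the documentation range.
SAMPLE = (
    '<response>'
    '<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int>'
    '<lst name="params"><str name="q">ip:192.0.2.1 AND userAgent:iskanie</str>'
    '<str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst>'
    '<result name="response" numFound="3" start="0"></result>'
    '</response>'
)


def num_found(solr_xml: str) -> int:
    """Return the numFound attribute of the <result name="response"> element."""
    root = ET.fromstring(solr_xml)
    for result in root.iter("result"):
        if result.get("name") == "response":
            return int(result.get("numFound"))
    raise ValueError("no <result name=\"response\"> element in reply")


if __name__ == "__main__":
    print(num_found(SAMPLE))
```

- With `rows=0` Solr returns no documents, so `numFound` is the only part of the reply that matters for this kind of check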
- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
- I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
- Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
- I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for a while, but I still don't see any items in Solr with `isBot:true`
- According to `dspace-api/src/main/java/org/dspace/statistics/util/` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
- I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
- Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:
else if (line.hasOption('m'))
- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
- It appears to be unimplemented...
- I sent a message to the dspace-tech mailing list to ask if I should file an issue
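- For reference, a rough sketch of what marking by user agent would need to do with a patterns file. This is not DSpace's code: `load_patterns` and `is_spider` are hypothetical helpers, and skipping `#` comment lines and matching case-insensitively are my assumptions about how such a file would be read:

```python
import re
from typing import Iterable, List, Pattern


def load_patterns(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile one regex per non-empty, non-comment line (assumed file format)."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Case-insensitive matching is an assumption, not confirmed DSpace behavior.
        patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_spider(user_agent: str, patterns: List[Pattern[str]]) -> bool:
    """Return True if any pattern matches anywhere in the user agent string."""
    return any(p.search(user_agent) for p in patterns)


if __name__ == "__main__":
    patterns = load_patterns(["# test agents", "iskanie", "fuuuualan"])
    for agent in ["iskanie", "fuuuualan", "Mozilla/5.0 (X11; Linux x86_64)"]:
        print(agent, is_spider(agent, patterns))
```

- A maintenance task built this way would still have to update the matching Solr documents, which is the part that `stats-util -m` only seems to do for IPs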
## 2019-11-05
- I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
- Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list), DSpace will still not log Solr events from these user agents
- I'm curious how the special character matching is in Solr, so I will test two requests: one with "" which is in the spider list, and one with "" which isn't:
- Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
- I saw a bunch of user agents that have the literal string `User-Agent` in their user agent HTTP header, for example:
  - `User-Agent: Drupal (+`
  - `User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31`
  - `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/;IKUCID/IKU;`
  - `User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)`
  - `User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/;IKUCID/IKU;IKU/;IKUCID/IKU;`
  - `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/;IKUCID/IKU;`
- I filed an issue on the COUNTER-Robots project to see if they agree to add `User-Agent:` to the list of robot user agents
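
- A small sketch of how these could be flagged, assuming the rule is simply "the value starts with the header name itself"; `HEADER_PREFIX` and `has_literal_header_name` are hypothetical names, and anchoring at the start of the string is my choice rather than how COUNTER-Robots would necessarily match it:

```python
import re

# Matches a user agent value that begins with the literal header name.
HEADER_PREFIX = re.compile(r"^\s*User-Agent:", re.IGNORECASE)


def has_literal_header_name(user_agent: str) -> bool:
    """Return True if the user agent value starts with 'User-Agent:'."""
    return bool(HEADER_PREFIX.match(user_agent))


if __name__ == "__main__":
    agents = [
        "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    ]
    for agent in agents:
        print(has_literal_header_name(agent), agent)
```

- A real browser never sends the header name inside the header value, so a match here most likely means a script that built its request by hand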