Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
```
- Most of them seem to be to community or collection discover and browse results pages like `/handle/10568/103/discover`:
- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
- First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
- Then I made some item and bitstream requests on DSpace Test using that user agent:
- A bit later I checked Solr and found three requests from my IP with that user agent this month:
```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
</response>
```
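- A minimal sketch of the same count-only query built programmatically, plus reading `numFound` from the XML that comes back. The function names are hypothetical and the sample response is a shortened copy of the one above; nothing here contacts Solr:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Core URL taken from the query above; adjust for other cores.
SOLR_SELECT = "http://localhost:8081/solr/statistics/select"


def build_count_url(ip, user_agent, year_month):
    """Build a select URL that only asks for the match count (rows=0)."""
    params = {
        "q": f"ip:{ip} AND userAgent:{user_agent}",
        "fq": f"dateYearMonth:{year_month}",
        "rows": 0,
    }
    # urlencode() takes care of the colons and spaces in the query
    return f"{SOLR_SELECT}?{urlencode(params)}"


def num_found(xml_text):
    """Return the numFound attribute from a Solr XML response body."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return int(root.find("result").get("numFound"))


# Shortened copy of the response shown above
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<response><lst name="responseHeader"><int name="status">0</int></lst>'
    '<result name="response" numFound="3" start="0"></result></response>'
)

url = build_count_url("73.178.9.24", "iskanie", "2019-11")
print(url)
print(num_found(sample))
```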
- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:
- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
- I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr
- Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr
- I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for a while, but I still don't see any items in Solr with `isBot:true`
- According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
- I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
- Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:
```
else if (line.hasOption('m'))
{
SolrLogger.markRobotsByIP();
}
```
- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
- It appears to be unimplemented...
- I sent a message to the dspace-tech mailing list to ask if I should file an issue
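- For illustration only, this is roughly what matching by user agent could look like. It is not DSpace's code: the function names are hypothetical, and skipping `#` comment lines and matching case-insensitively are both assumptions:

```python
import re


def load_agent_patterns(lines):
    """Compile one regex per non-empty line (skipping '#' lines is an assumption)."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Case-insensitive matching is an assumption, not confirmed DSpace behaviour
        patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_spider(user_agent, patterns):
    """True if any pattern matches anywhere in the user agent string."""
    return any(p.search(user_agent) for p in patterns)


# The two test agents from the notes above, as they might appear in a pattern file
patterns = load_agent_patterns(["# test entries", "iskanie", "fuuuualan", ""])
print(is_spider("iskanie", patterns))
print(is_spider("Mozilla/5.0 (X11; Linux x86_64)", patterns))
```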
## 2019-11-05
- I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
- Even though the "mark by user agent" function is not working (see my email to the dspace-tech mailing list), DSpace still does not log Solr events for these user agents
- I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't:
- Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
```
$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
```
- I saw a bunch of user agents that have the literal string `User-Agent` in their user agent HTTP header, for example:
- `User-Agent: Drupal (+http://drupal.org/)`
- `User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31`
- `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;`
- `User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)`
- `User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;`
- `User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;`
- I filed [an issue](https://github.com/atmire/COUNTER-Robots/issues/27) on the COUNTER-Robots project to see if they agree to add `User-Agent:` to the list of robot user agents
- Open a [pull request](https://github.com/atmire/COUNTER-Robots/pull/28) against COUNTER-Robots to remove unnecessary escaping of dashes
## 2019-11-12
- Udana and Chandima emailed me to ask why [one of their WLE items](https://hdl.handle.net/10568/81236) that is mapped from IWMI only shows up in the IWMI "department" on the Altmetric dashboard
- A [search in the IWMI department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management)
- A [search in the WLE department shows no results](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management)
- Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding `dc.rights`, fixing DOIs, adding ISSNs, etc) and noticed one issue with [an item](https://hdl.handle.net/10568/97087) that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)
- The [item with a low Altmetric score for its Handle](https://hdl.handle.net/10568/97087) that I tweeted yesterday still hasn't linked with the DOI's score
- I tweeted it again with the Handle and the DOI
- Testing modifying some of the COUNTER-Robots patterns to use `[0-9]` instead of `\d` digit character type, as Solr's regex search can't use those
- If the parameters include something like "[0-9]" then curl interprets it as a range and will make ten requests
- You can disable this using the `-g` option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special parameters when using Solr's regex search:
- I updated the `check-spider-hits.sh` script to use the POST syntax, and I'm evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
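- A rough sketch of the `\d` to `[0-9]` rewrite and of form-encoding the query for a POST body, so the square brackets never appear raw in a URL. The function names are hypothetical, and wrapping the pattern in `/.../` (Lucene's regex query syntax) is an assumption about how the search is issued, not a copy of `check-spider-hits.sh`:

```python
import re
from urllib.parse import urlencode


def to_solr_regex(pattern):
    r"""Rewrite the \d shorthand as [0-9], leaving an escaped backslash (\\d) alone."""
    # Group 1 captures any run of escaped backslashes directly before \d so
    # they are preserved. Limitation: a \d inside a character class such as
    # [\d\s] is not handled and would need manual editing.
    return re.sub(
        r"(?<!\\)((?:\\\\)*)\\d",
        lambda m: m.group(1) + "[0-9]",
        pattern,
    )


def build_post_body(pattern):
    """Form-encode a count-only query; the /.../ regex wrapper is an assumption."""
    return urlencode({"q": f"userAgent:/{pattern}/", "rows": 0})


converted = to_solr_regex(r"Yeti\/\d")
body = build_post_body(converted)
print(converted)
print(body)
```

- In the encoded body the brackets become `%5B` and `%5D`, so there is nothing left for curl to interpret as a range.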
- Run the new version of `check-spider-hits.sh` on CGSpace's Solr statistics cores one by one, starting from the oldest just in case something goes wrong
- But then I noticed that some (all?) of the hits weren't actually getting purged, all of which were using regular expressions like:
- `MetaURI[\+\s]API\/[0-9]\.[0-9]`
- `FDM(\s|\+)[0-9]`
- `Goldfire(\s|\+)Server`
- `^Mozilla\/4\.0\+\(compatible;\)$`
- `^Mozilla\/4\.0\+\(compatible;\+ICS\)$`
- `^Mozilla\/4\.5\+\[en]\+\(Win98;\+I\)$`
- Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!
- Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is
- I tried to do URL encoding of the +, double escaping, etc... but nothing worked
- I'm going to ignore regular expressions that have pluses for now
- I think I might also have to ignore patterns that have percent signs, like `^\%?default\%?$`
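- One layer where a plus can change meaning is form/URL encoding, where a bare `+` decodes to a space. That is not necessarily the culprit here, but it is easy to demonstrate, along with a hypothetical filter (`is_risky` is an illustrative name) that skips patterns containing `+` or `%`:

```python
from urllib.parse import parse_qs, quote

# In form-encoded data a bare "+" decodes to a space; "%2B" is a literal plus.
print(parse_qs("q=a+b"))
print(parse_qs("q=a%2Bb"))
print(quote("+", safe=""))


def is_risky(pattern):
    """True for patterns containing the characters being skipped for now."""
    return "+" in pattern or "%" in pattern


# A few patterns quoted in these notes
patterns = [
    r"FDM(\s|\+)[0-9]",
    r"^\%?default\%?$",
    r"curl\/",
    r"Yeti\/[0-9]",
]
usable = [p for p in patterns if not is_risky(p)]
print(usable)
```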
- After I added the ignores and did some more testing I finally ran `check-spider-hits.sh` on all CGSpace Solr statistics cores and these are the numbers of hits purged from each core:
- statistics-2010: 113
- statistics-2011: 7235
- statistics-2012: 0
- statistics-2013: 0
- statistics-2014: 316
- statistics-2015: 16809
- statistics-2016: 41732
- statistics-2017: 39207
- statistics-2018: 295546
- statistics: 1043373
- That's 1.4 million hits in addition to the 2 million I purged earlier this week...
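- A quick check of that total, using the per-core numbers listed above:

```python
# Hits purged per core, copied from the list above
purged = {
    "statistics-2010": 113,
    "statistics-2011": 7235,
    "statistics-2012": 0,
    "statistics-2013": 0,
    "statistics-2014": 316,
    "statistics-2015": 16809,
    "statistics-2016": 41732,
    "statistics-2017": 39207,
    "statistics-2018": 295546,
    "statistics": 1043373,
}

total = sum(purged.values())
share_current = purged["statistics"] / total

print(f"{total:,}")  # 1,444,331
print(f"{share_current:.0%} of the purged hits were in the current statistics core")
```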
- For posterity, the major contributors to the hits on the statistics core were:
- Purging 812429 hits from curl\/ in statistics
- Purging 48206 hits from facebookexternalhit\/ in statistics
- Purging 72004 hits from PHP\/ in statistics
- Purging 76072 hits from Yeti\/[0-9] in statistics
- Most of the curl hits were from CIAT in mid-2019, where they were using [GuzzleHttp](https://guzzle3.readthedocs.io/http-client/client.html) from PHP, which uses something like this for its user agent:
- Altmetric support responded about our dashboard question, asking if the second "department" (aka WLE's collection) was added recently and might have not been in the last harvesting yet
- I told her no, that the department is several years old, and the item was added in 2017
- Then I looked again at the dashboard for each department and I see the item in both departments now... shit.
- A [search in the IWMI department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management)
- A [search in the WLE department shows the item](https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management)
- I finally decided to revert `cg.hasMetadata` back to `cg.identifier.dataurl` in my CG Core v2 branch (see [#10](https://github.com/AgriculturalSemantics/cg-core/issues/10))
- Finally deploy `5_x-cgcorev2` branch on DSpace Test
## 2019-11-18
- I sent a mail to the CGSpace partners in Addis about the CG Core v2 changes on DSpace Test
- Then I filed an [issue on the CG Core GitHub](https://github.com/AgriculturalSemantics/cg-core/issues/11) to let the metadata people know about our progress
- It seems like I will do a session about CG Core v2 implementation and limitations in DSpace for the data workshop in December in Nairobi (?)
## 2019-11-19
- Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
- I had previously sent them an export in 2019-04
- Atmire merged my [pull request regarding unnecessary escaping of dashes](https://github.com/atmire/COUNTER-Robots/pull/28) in regular expressions, as well as [my suggestion of adding "User-Agent" to the list of patterns](https://github.com/atmire/COUNTER-Robots/issues/27)
- I made another [pull request to fix invalid escaping of one of their new patterns](https://github.com/atmire/COUNTER-Robots/pull/29)
- I ran my `check-spider-hits.sh` script again with these new patterns and found a bunch more statistics requests that match, for example:
- Found 39560 hits from ^Buck\/[0-9] in statistics
- Found 5471 hits from ^User-Agent in statistics
- Found 2994 hits from ^Buck\/[0-9] in statistics-2018
- Found 14076 hits from ^User-Agent in statistics-2018
- Found 16310 hits from ^User-Agent in statistics-2017
- Found 4429 hits from ^User-Agent in statistics-2016
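- To add up the example lines above per pattern, a small parsing sketch. It assumes the `Found N hits from PATTERN in CORE` line format shown here, and since these are only the example lines, the sums are partial totals:

```python
import re
from collections import Counter

# Example output lines copied from the list above
LOG = r"""
Found 39560 hits from ^Buck\/[0-9] in statistics
Found 5471 hits from ^User-Agent in statistics
Found 2994 hits from ^Buck\/[0-9] in statistics-2018
Found 14076 hits from ^User-Agent in statistics-2018
Found 16310 hits from ^User-Agent in statistics-2017
Found 4429 hits from ^User-Agent in statistics-2016
"""

LINE_RE = re.compile(r"Found (\d+) hits from (\S+) in (\S+)")

totals = Counter()
for line in LOG.splitlines():
    match = LINE_RE.match(line.strip())
    if match:
        hits, pattern, _core = match.groups()
        totals[pattern] += int(hits)

for pattern, hits in totals.most_common():
    print(f"{pattern}: {hits}")
```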
- Buck is one I've never heard of before, its user agent is:
- Discuss bugs and issues with AReS v2 that are limiting its adoption
- BUG: If you search for items between year 2012 and 2019, then remove some years from the "info product analysis", they are still present in the search results and export
- FEATURE: Ability to add month to date filter?
- FEATURE: Add "review status", "series", and "usage rights" to search filters
- FEATURE: Downloads and views are not included in exports
- FEATURE: Add more fields to exports (Abenet will clarify)
- As for the larger features to focus on in the future ToRs:
- FEATURE: Unique, linkable URL for a set of search results (discussed with Moayad, he has a plan for this)
- FEATURE: Reporting that we talked about in Amman in January, 2019.
- We have a meeting about AReS future developments with Jane, Abenet, Peter, and Enrico tomorrow
- Minor updates on the [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api)
- Introduce isort for import sorting
- Introduce black for code formatting according to PEP8
- Fix some minor issues raised by flake8
- Release [version 1.1.1](https://github.com/ilri/dspace-statistics-api/releases/tag/v1.1.1) and deploy to DSpace Test (linode19)
- I realize that I never deployed version 1.1.0 (with falcon 2.0.0) on CGSpace (linode18) so I did that as well
- File a ticket (242418) with Altmetric about DCTERMS migration to see if there is anything we need to be careful about
- Make a pull request against cg-core schema to fix inconsistent references to `cg.embargoDate` ([#13](https://github.com/AgriculturalSemantics/cg-core/pull/13))
- Review the AReS feedback again after Peter made some comments
- I standardized the GitHub issue labels in both OpenRXV and AReS issue trackers, using labels like "P-low" for priority
- I filed another handful of issues in both trackers and added them to the spreadsheet
- I need to ask Marie-Angelique about the `cg.peer-reviewed` field
- We currently use `dc.description.version` with values like "Internal Review" and "Peer Review", and CG Core v2 currently recommends using "True" if the item is peer reviewed
- File an issue with CG Core v2 project to ask Marie-Angelique about expanding the scope of `cg.peer-reviewed` to include other types of review, and possibly to change the field name to something more generic like `cg.review-status` ([#14](https://github.com/AgriculturalSemantics/cg-core/issues/14))
- More review of AReS feedback
- I clarified some of the feedback
- I added status of "Issue Filed", "Duplicate" and "No Action Required" to several items
- I filed a handful more GitHub issues in AReS and OpenRXV GitHub trackers