<li><p>Their user agent is one I’ve never seen before:</p>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
</code></pre></li>
<li><p>As far as I can tell, none of their requests are counted in the Solr statistics:</p>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
</code></pre></li>
<li><p>Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load</p></li>
<li><p>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</p>
<li><p>On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire’s COUNTER-Robots</a> project</p>
<ul>
<li>First I checked for a user agent that is in COUNTER-Robots, but NOT in the current <code>dspace/config/spiders/example</code> list</li>
<li><p>Then I made some item and bitstream requests on DSpace Test using that user agent:</p>
<li><p>A bit later I checked Solr and found three requests from my IP with that user agent this month:</p>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
</code></pre></li>
<li><p>After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…</p>
<ul>
<li><p>What’s strange is that the Solr spider agent configuration in <code>dspace/config/modules/solr-statistics.cfg</code> points to a file that doesn’t exist…</p>
<li><p>Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…</p></li>
<li><p>I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr</p>
<ul>
<li>Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in <code>dspace/config/spiders/example</code> and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr</li>
<li>I restarted Tomcat and ran <code>dspace stats-util -m</code> and it did some stuff for a while, but I still don’t see any items in Solr with <code>isBot:true</code></li>
<li>According to <code>dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java</code> the patterns for user agents are loaded from any file in the <code>config/spiders/agents</code> directory</li>
<li>I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran <code>dspace stats-util -m</code> and still there were no new items marked as being bots in Solr, so I think there is still something wrong</li>
<li><p>Jesus, the code in <code>./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java</code> says that <code>stats-util -m</code> marks spider requests by their IPs, not by their user agents… WTF:</p>
<pre><code>else if (line.hasOption('m'))
{
SolrLogger.markRobotsByIP();
}
</code></pre></li>
</ul></li>
<li><p>WTF again, there is actually a function called <code>markRobotByUserAgent()</code> that is never called anywhere!</p>
<ul>
<li>It appears to be unimplemented…</li>
<li>I sent a message to the dspace-tech mailing list to ask if I should file an issue</li>
<li>Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents</li>
<li><p>I’m curious how Solr matches special characters, so I will test two requests: one with “www.gnip.com”, which is in the spider list, and one with “www.gnyp.com”, which isn’t:</p>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:
</code></pre>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
</code></pre>
<li><code>User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31</code></li>
<li><code>User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;</code></li>
<li><code>User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)</code></li>
<li><code>User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;</code></li>
<li><code>User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;</code></li>
</ul></li>
<li>I filed <a href="https://github.com/atmire/COUNTER-Robots/issues/27">an issue</a> on the COUNTER-Robots project to see if they agree to add <code>User-Agent:</code> to the list of robot user agents</li>
<li><p>Open a <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request</a> against COUNTER-Robots to remove unnecessary escaping of dashes</p></li>
<li>Udana and Chandima emailed me to ask why <a href="https://hdl.handle.net/10568/81236">one of their WLE items</a> that is mapped from IWMI only shows up in the IWMI “department” on the Altmetric dashboard
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management">search in the IWMI department shows the item</a></li>
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management">search in the WLE department shows no results</a></li>
<li>I emailed Altmetric support to ask for help</li>
</ul></li>
<li>Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding <code>dc.rights</code>, fixing DOIs, adding ISSNs, etc.) and noticed one issue with <a href="https://hdl.handle.net/10568/97087">an item</a> that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)
<li>The <a href="https://hdl.handle.net/10568/97087">item with a low Altmetric score for its Handle</a> that I tweeted yesterday still hasn’t linked with the DOI’s score
<li><p>I tested modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of the <code>\d</code> digit character type, as Solr’s regex search doesn’t support the latter</p>
<li><p>You can disable this using the <code>-g</code> option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special parameters when using Solr’s regex search:</p>
<li><p>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</p></li>
</ul>
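<p>As a rough illustration of the <code>\d</code> to <code>[0-9]</code> rewrite mentioned above, here is a minimal Python sketch. The function name <code>expand_digit_shorthand</code> is hypothetical and is not part of <code>check-spider-hits.sh</code> or COUNTER-Robots. It assumes the patterns never use <code>\d</code> inside an existing bracket class, which this sketch would not handle correctly.</p>

```python
def expand_digit_shorthand(pattern):
    """Replace the PCRE shorthand \\d with the explicit class [0-9].

    Hypothetical helper for illustration only. Escaped backslashes are
    skipped as a pair, so a literal backslash followed by "d" is left alone.
    """
    out = []
    i = 0
    while i != len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 != len(pattern):
            # A backslash escapes exactly one following character
            nxt = pattern[i + 1]
            out.append("[0-9]" if nxt == "d" else ch + nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(expand_digit_shorthand(r"Yeti\/\d"))  # prints Yeti\/[0-9]
```

<p>Walking the pattern one escape pair at a time, rather than doing a plain string replace, avoids rewriting an escaped backslash that merely happens to be followed by the letter “d”.</p>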
<h2 id="2019-11-14">2019-11-14</h2>
<ul>
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li><p>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</p>
<li>This allows me to detect and purge many more hits from the Solr statistics core</li>
<li>I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores</li>
</ul></li>
</ul>
<h2 id="2019-11-15">2019-11-15</h2>
<ul>
<li>Run the new version of <code>check-spider-hits.sh</code> on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong</li>
<li>But then I noticed that some (all?) of the hits matched by certain regular expressions weren’t actually getting purged, for example patterns like:
<li>Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!</li>
<li>Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is
<ul>
<li>I tried to do URL encoding of the +, double escaping, etc… but nothing worked</li>
<li>I’m going to ignore regular expressions that have pluses for now</li>
</ul></li>
<li>I think I might also have to ignore patterns that have percent signs, like <code>^\%?default\%?$</code></li>
<li>After I added the ignores and did some more testing I finally ran <code>check-spider-hits.sh</code> on all CGSpace Solr statistics cores, and these are the numbers of hits purged from each core:
<ul>
<li>statistics-2010: 113</li>
<li>statistics-2011: 7235</li>
<li>statistics-2012: 0</li>
<li>statistics-2013: 0</li>
<li>statistics-2014: 316</li>
<li>statistics-2015: 16809</li>
<li>statistics-2016: 41732</li>
<li>statistics-2017: 39207</li>
<li>statistics-2018: 295546</li>
<li>statistics: 1043373</li>
</ul></li>
<li>That’s 1.4 million hits in addition to the 2 million I purged earlier this week…</li>
<li>For posterity, the major contributors to the hits on the statistics core were:
<ul>
<li>Purging 812429 hits from curl\/ in statistics</li>
<li>Purging 48206 hits from facebookexternalhit\/ in statistics</li>
<li>Purging 72004 hits from PHP\/ in statistics</li>
<li>Purging 76072 hits from Yeti\/[0-9] in statistics</li>
</ul></li>
<li><p>Most of the curl hits were from CIAT in mid-2019, where they were using <a href="https://guzzle3.readthedocs.io/http-client/client.html">GuzzleHttp</a> from PHP, which uses something like this for its user agent:</p>
<li><p>Run system updates on DSpace Test and reboot the server</p></li>
</ul>
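<p>The notes above list three layers where a plus sign is special: regular expressions, URLs, and Solr’s Lucene query parser. The following Python sketch illustrates only the URL layer, using the standard <code>urllib.parse</code> module. The pattern <code>examplebot\/[0-9]+</code> and the helper <code>encode_query_value</code> are hypothetical, and this does not explain why the delete behaved differently from the select; as noted above, URL encoding alone did not resolve the problem.</p>

```python
from urllib.parse import quote, unquote_plus


def encode_query_value(value):
    """Percent-encode a value so no character keeps a special URL meaning."""
    # safe="" means even "/" and "+" are encoded (as %2F and %2B)
    return quote(value, safe="")


# Hypothetical pattern with a trailing plus sign, for illustration only
pattern = r"examplebot\/[0-9]+"

# A form-style decoder turns a literal "+" into a space
decoded_naively = unquote_plus(pattern)

# Percent-encoding first lets the plus sign survive the same decoding step
encoded = encode_query_value(pattern)
decoded_after_encoding = unquote_plus(encoded)

print(repr(decoded_naively))
print(encoded)
print(decoded_after_encoding == pattern)
```

<p>If a plus sign reaches the server unencoded in a query string, the regex quantifier is replaced by a space before the query parser ever sees it, which is one possible way a pattern could match in one request and not in another.</p>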
<h2 id="2019-11-17">2019-11-17</h2>
<ul>
<li>Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might have not been in the last harvesting yet
<ul>
<li>I told her no, that the department is several years old, and the item was added in 2017</li>
<li>Then I looked again at the dashboard for each department and I see the item in both departments now… shit.</li>
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management">search in the IWMI department shows the item</a></li>
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management">search in the WLE department shows the item</a></li>
</ul></li>
<li>I finally decided to revert <code>cg.hasMetadata</code> back to <code>cg.identifier.dataurl</code> in my CG Core v2 branch (see <a href="https://github.com/AgriculturalSemantics/cg-core/issues/10">#10</a>)</li>
<li>Regarding the <a href="https://hdl.handle.net/10568/97087">WLE item</a> that has a much lower score than its DOI…
<ul>
<li>I tweeted the item twice last week and the score never got linked</li>
<li>Then I noticed that I had already made a note about the same issue in 2019-04, when I also tweeted it several times…</li>
<li>I will ask Altmetric support for help with that</li>
</ul></li>
<li>Finally deploy <code>5_x-cgcorev2</code> branch on DSpace Test</li>
</ul>
<h2 id="2019-11-18">2019-11-18</h2>
<ul>
<li>I sent a mail to the CGSpace partners in Addis about the CG Core v2 changes on DSpace Test</li>
<li>Then I filed an <a href="https://github.com/AgriculturalSemantics/cg-core/issues/11">issue on the CG Core GitHub</a> to let the metadata people know about our progress</li>
<li>It seems like I will do a session about CG Core v2 implementation and limitations in DSpace for the data workshop in December in Nairobi (?)</li>
</ul>
<h2 id="2019-11-19">2019-11-19</h2>
<ul>
<li>Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
<ul>
<li>I had previously sent them an export in 2019-04</li>
</ul></li>
<li>Atmire merged my <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request regarding unnecessary escaping of dashes</a> in regular expressions, as well as <a href="https://github.com/atmire/COUNTER-Robots/issues/27">my suggestion of adding “User-Agent” to the list of patterns</a></li>
<li>I made another <a href="https://github.com/atmire/COUNTER-Robots/pull/29">pull request to fix invalid escaping of one of their new patterns</a></li>
<li>I ran my <code>check-spider-hits.sh</code> script again with these new patterns and found a bunch more statistics requests that match, for example:
<ul>
<li>Found 39560 hits from ^Buck\/[0-9] in statistics</li>
<li>Found 5471 hits from ^User-Agent in statistics</li>
<li>Found 2994 hits from ^Buck\/[0-9] in statistics-2018</li>
<li>Found 14076 hits from ^User-Agent in statistics-2018</li>
<li>Found 16310 hits from ^User-Agent in statistics-2017</li>
<li>Found 4429 hits from ^User-Agent in statistics-2016</li>
</ul></li>
<li><p>Buck is one I’ve never heard of before; its user agent is:</p>