Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

@ -15,17 +15,17 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781
" />
<meta property="og:type" content="article" />
@ -45,20 +45,20 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -152,22 +152,22 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
</code></pre><ul>
<li>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | awk '{print $6}' | sed 's/&quot;//' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | awk &#39;{print $6}&#39; | sed &#39;s/&#34;//&#39; | sort | uniq -c | sort -n
1 PUT
8 PROPFIND
283 OPTIONS
@ -177,7 +177,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</code></pre><ul>
<li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#39;(34\.224\.4\.16|34\.234\.204\.152)&#39;
365288
</code></pre><ul>
<li>Their user agent is one I&rsquo;ve never seen before:</li>
@ -186,22 +186,22 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</code></pre><ul>
<li>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -o -E &quot;GET /(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -o -E &#34;GET /(bitstream|discover|handle)&#34; | sort | uniq -c
6566 GET /bitstream
351928 GET /handle
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -E &quot;GET /(bitstream|discover|handle)&quot; | grep -c discover
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -E &#34;GET /(bitstream|discover|handle)&#34; | grep -c discover
214209
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -E &quot;GET /(bitstream|discover|handle)&quot; | grep -c browse
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -E &#34;GET /(bitstream|discover|handle)&#34; | grep -c browse
86874
</code></pre><ul>
<li>As far as I can tell, none of their requests are counted in the Solr statistics:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true&#39;
</code></pre><ul>
<li>Still, those requests are CPU intensive so I will add their user agent to the &ldquo;badbots&rdquo; rate limiting in nginx to reduce the impact on server load</li>
<li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:&quot;Amazonbot/0.1&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/1/discover&#39; User-Agent:&#34;Amazonbot/0.1&#34;
</code></pre><ul>
<li>On the topic of spiders, I have been wanting to update DSpace&rsquo;s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire&rsquo;s COUNTER-Robots</a> project
<ul>
@ -210,23 +210,23 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;iskanie&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;iskanie&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;iskanie&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y&#39; User-Agent:&#34;iskanie&#34;
</code></pre><ul>
<li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0'
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0&#39;
&lt;?xml version=&#34;1.0&#34; encoding=&#34;UTF-8&#34;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;1&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&quot;fq&quot;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;3&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
&lt;lst name=&#34;responseHeader&#34;&gt;&lt;int name=&#34;status&#34;&gt;0&lt;/int&gt;&lt;int name=&#34;QTime&#34;&gt;1&lt;/int&gt;&lt;lst name=&#34;params&#34;&gt;&lt;str name=&#34;q&#34;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&#34;fq&#34;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&#34;rows&#34;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&#34;response&#34; numFound=&#34;3&#34; start=&#34;0&#34;&gt;&lt;/result&gt;
&lt;/response&gt;
</code></pre><ul>
<li>Now I want to make similar requests with a user agent that is included in DSpace&rsquo;s current user agent list:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;celestial&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;celestial&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;celestial&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y&#39; User-Agent:&#34;celestial&#34;
</code></pre><ul>
<li>After twenty minutes I didn&rsquo;t see any requests in Solr, so I assume they did not get logged because they matched a bot list&hellip;
<ul>
@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
</ul>
</li>
</ul>
<pre tabindex="0"><code>else if (line.hasOption('m'))
<pre tabindex="0"><code>else if (line.hasOption(&#39;m&#39;))
{
SolrLogger.markRobotsByIP();
}
@ -263,16 +263,16 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
<ul>
<li>I added &ldquo;alanfuuu2&rdquo; to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu1&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu2&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;alanfuuu1&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;alanfuuu2&#34;
</code></pre><ul>
<li>After committing the changes in Solr I saw one request for &ldquo;alanfuuu1&rdquo; and no requests for &ldquo;alanfuuu2&rdquo;:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/update?commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
</code></pre><ul>
<li>So basically it seems like a win to update the example file with the latest one from Atmire&rsquo;s COUNTER-Robots list
<ul>
@ -281,16 +281,16 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
</li>
<li>I&rsquo;m curious how the special character matching is in Solr, so I will test two requests: one with &ldquo;<a href="http://www.gnip.com">www.gnip.com</a>&rdquo; which is in the spider list, and one with &ldquo;<a href="http://www.gnyp.com">www.gnyp.com</a>&rdquo; which isn&rsquo;t:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnip.com&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnyp.com&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;www.gnip.com&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;www.gnyp.com&#34;
</code></pre><ul>
<li>Then commit changes to Solr so we don&rsquo;t have to wait:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/update?commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>So the blocking seems to be working because &ldquo;www.gnip.com&rdquo; is one of the new patterns added to the spiders file&hellip;</li>
</ul>
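The hit-count checks above all read the `numFound` attribute out of Solr's XML response by eye. A minimal sketch of how that check could be scripted follows; the function names `num_found` and `agent_was_filtered` are hypothetical and not part of DSpace or Solr, and the sketch assumes the response arrives on stdin in the same XML form shown in the output above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the numFound value from a Solr XML response on stdin.
num_found() {
    grep -oE 'numFound="[0-9]+"' | head -n 1 | grep -oE '[0-9]+'
}

# Hypothetical helper: succeed only if the response on stdin reports zero hits,
# which here is taken to mean the user agent matched the spider list and was not logged.
agent_was_filtered() {
    [ "$(num_found)" = "0" ]
}

# Example use (requires the local Solr from the notes, so it is left commented out):
# http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' \
#     | agent_was_filtered && echo "filtered"
```

A zero count only shows that nothing was logged for that user agent in the queried month; it does not by itself prove which pattern in the spiders file matched.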
@ -314,24 +314,24 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;62944&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;62944&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Similar for com.plumanalytics, Grammarly, and ltx71!</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*com.plumanalytics*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;28256&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;6288&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;105663&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*com.plumanalytics*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;28256&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*Grammarly*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;6288&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;105663&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
</code></pre><ul>
<li>I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
<ul>
@ -341,21 +341,21 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=12&amp;q=userAgent:*Unpaywall*' | xmllint --format - | less
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=12&amp;q=userAgent:*Unpaywall*&#39; | xmllint --format - | less
...
&lt;lst name=&quot;facet_counts&quot;&gt;
&lt;lst name=&quot;facet_queries&quot;/&gt;
&lt;lst name=&quot;facet_fields&quot;&gt;
&lt;lst name=&quot;dateYearMonth&quot;&gt;
&lt;int name=&quot;2019-10&quot;&gt;198624&lt;/int&gt;
&lt;int name=&quot;2019-05&quot;&gt;88422&lt;/int&gt;
&lt;int name=&quot;2019-06&quot;&gt;79911&lt;/int&gt;
&lt;int name=&quot;2019-09&quot;&gt;67065&lt;/int&gt;
&lt;int name=&quot;2019-07&quot;&gt;39026&lt;/int&gt;
&lt;int name=&quot;2019-08&quot;&gt;36889&lt;/int&gt;
&lt;int name=&quot;2019-04&quot;&gt;36512&lt;/int&gt;
&lt;int name=&quot;2019-11&quot;&gt;760&lt;/int&gt;
&lt;lst name=&#34;facet_counts&#34;&gt;
&lt;lst name=&#34;facet_queries&#34;/&gt;
&lt;lst name=&#34;facet_fields&#34;&gt;
&lt;lst name=&#34;dateYearMonth&#34;&gt;
&lt;int name=&#34;2019-10&#34;&gt;198624&lt;/int&gt;
&lt;int name=&#34;2019-05&#34;&gt;88422&lt;/int&gt;
&lt;int name=&#34;2019-06&#34;&gt;79911&lt;/int&gt;
&lt;int name=&#34;2019-09&#34;&gt;67065&lt;/int&gt;
&lt;int name=&#34;2019-07&#34;&gt;39026&lt;/int&gt;
&lt;int name=&#34;2019-08&#34;&gt;36889&lt;/int&gt;
&lt;int name=&#34;2019-04&#34;&gt;36512&lt;/int&gt;
&lt;int name=&#34;2019-11&#34;&gt;760&lt;/int&gt;
&lt;/lst&gt;
&lt;/lst&gt;
</code></pre><ul>
@ -423,17 +423,17 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
</li>
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr&rsquo;s regex search can&rsquo;t use those</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;Scrapoo/1&quot;
$ http &quot;http://localhost:8081/solr/statistics/update?commit=true&quot;
$ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*&quot; | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
$ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&quot; | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;Scrapoo/1&#34;
$ http &#34;http://localhost:8081/solr/statistics/update?commit=true&#34;
$ http &#34;http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*&#34; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
$ http &#34;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&#34; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
<li>I realized that it&rsquo;s easier to search Solr from curl via POST using this syntax:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:*Scrapoo*&amp;rows=0&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=userAgent:*Scrapoo*&amp;rows=0&#34;
</code></pre><ul>
<li>If the parameters include something like &ldquo;[0-9]&rdquo; then curl interprets it as a range and will make ten requests
<ul>
@ -441,7 +441,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2'
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select&#39; -d &#39;q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2&#39;
</code></pre><ul>
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I&rsquo;m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences between PCRE and Solr regex syntax and issues with shell handling</li>
</ul>
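The two ideas above (rewriting `\d` as `[0-9]` and querying Solr via POST) can be combined in a few lines of shell. This is only a sketch under stated assumptions, not the actual contents of `check-spider-hits.sh`: the function names `to_solr_regex` and `solr_regex_hits` and the `SOLR_URL` variable are hypothetical, and `--data-urlencode` is swapped in for the plain `-d` used above so that characters like `+` and `&` in a pattern survive the request.

```shell
#!/usr/bin/env bash

# Assumed base URL of the statistics core, matching the examples in these notes.
SOLR_URL="${SOLR_URL:-http://localhost:8081/solr/statistics}"

# Hypothetical helper: rewrite the PCRE digit class \d as [0-9], which Solr's
# regex search understands. Limitation: an escaped backslash followed by a
# literal "d" would be rewritten too.
to_solr_regex() {
    printf '%s\n' "$1" | sed 's/\\d/[0-9]/g'
}

# Hypothetical helper: POST a regex query for a user agent pattern and return
# the raw Solr response. Because the pattern travels in the POST body rather
# than the URL, curl does not treat [0-9] as a URL range.
solr_regex_hits() {
    curl -s "$SOLR_URL/select" \
        --data-urlencode "q=userAgent:/$(to_solr_regex "$1")/" \
        -d 'rows=0'
}

# Example use (needs a running Solr, so it is left commented out):
# solr_regex_hits 'Scrapoo\/\d' | xmllint --format - | grep numFound
```

Other PCRE constructs from the spider agent file, such as `\s` or lookarounds, are not handled here and would need their own rewrites or continued filtering.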
@ -450,7 +450,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml