Add notes for 2021-09-13
@@ -58,7 +58,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
 106781
 "/>
-<meta name="generator" content="Hugo 0.87.0" />
+<meta name="generator" content="Hugo 0.88.1" />
@@ -152,7 +152,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
 </ul>
 </li>
 </ul>
-<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
+<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
 4671942
 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
 1277694
@@ -160,14 +160,14 @@ Let’s see how many of the REST API requests were for bitstreams (because t
 <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
 <li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
 </ul>
-<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
+<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
 1183456
 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
 106781
 </code></pre><ul>
 <li>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</li>
 </ul>
-<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
+<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
 1 PUT
 8 PROPFIND
 283 OPTIONS
@@ -177,16 +177,16 @@ Let’s see how many of the REST API requests were for bitstreams (because t
 </code></pre><ul>
 <li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li>
 </ul>
-<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
+<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
 365288
 </code></pre><ul>
 <li>Their user agent is one I’ve never seen before:</li>
 </ul>
-<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
+<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
 </code></pre><ul>
 <li>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li>
 </ul>
-<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
+<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
 6566 GET /bitstream
 351928 GET /handle
 # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
@@ -196,12 +196,12 @@ Let’s see how many of the REST API requests were for bitstreams (because t
 </code></pre><ul>
 <li>As far as I can tell, none of their requests are counted in the Solr statistics:</li>
 </ul>
-<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
+<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
 </code></pre><ul>
 <li>Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load</li>
 <li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li>
 </ul>
-<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
+<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
 </code></pre><ul>
 <li>On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire’s COUNTER-Robots</a> project
 <ul>
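A minimal nginx sketch of the “badbots” rate limiting idea mentioned above, keying a limit_req zone on the user agent (the zone name, rate, hostname, and backend port are illustrative assumptions, not the actual CGSpace configuration):

# inside the http { } block
map $http_user_agent $badbot_ua {
    default        '';
    ~*Amazonbot    'bot';    # add other abusive agents here
}

# requests with an empty key are not limited, so only matching agents are throttled
limit_req_zone $badbot_ua zone=badbots:10m rate=1r/m;

server {
    listen 80;
    server_name dspacetest.example.org;    # hypothetical host

    location / {
        limit_req zone=badbots;
        limit_req_status 503;              # matches the 503s observed above
        proxy_pass http://127.0.0.1:8080;  # assumed Tomcat backend
    }
}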
@@ -210,13 +210,13 @@ Let’s see how many of the REST API requests were for bitstreams (because t
 </ul>
 </li>
 </ul>
-<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
+<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
 $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
 $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
 </code></pre><ul>
 <li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li>
 </ul>
-<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
+<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
@@ -224,7 +224,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
 </code></pre><ul>
 <li>Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:</li>
 </ul>
-<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
+<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
 $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
 $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
 </code></pre><ul>
@@ -234,7 +234,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
 </ul>
 </li>
 </ul>
-<pre><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
+<pre tabindex="0"><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
 </code></pre><ul>
 <li>Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…</li>
 <li>I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr
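For the spider-list testing above, a quick way to see whether a given user agent would match any of the local patterns is to run it against each line of the agent files; a sketch, with a hypothetical DSpace install path:

$ ua='Amazonbot/0.1'
$ cat /path/to/dspace/config/spiders/agents/* | while read -r pattern; do
    # skip blank lines so an empty pattern does not match everything
    [ -z "$pattern" ] && continue
    # patterns using PCRE-only syntax such as \d need grep -P instead of -E
    echo "$ua" | grep -qiE "$pattern" && echo "matched: $pattern"
  done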
@@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
 </ul>
 </li>
 </ul>
-<pre><code>else if (line.hasOption('m'))
+<pre tabindex="0"><code>else if (line.hasOption('m'))
 {
 SolrLogger.markRobotsByIP();
 }
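That Java fragment appears to be the option handling behind DSpace’s stats-util client; if so, the corresponding invocation (install path assumed) would be roughly:

$ /path/to/dspace/bin/dspace stats-util -m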
@@ -263,12 +263,12 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
 <ul>
 <li>I added “alanfuuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li>
 </ul>
-<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
+<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
 $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
 </code></pre><ul>
 <li>After committing the changes in Solr I saw one request for “alanfuuu1” and no requests for “alanfuuu2”:</li>
 </ul>
-<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
+<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
 $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
 <result name="response" numFound="1" start="0">
 $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
@@ -281,12 +281,12 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
 </li>
 <li>I’m curious how special character matching works in Solr, so I will test two requests: one with “<a href="http://www.gnip.com">www.gnip.com</a>” which is in the spider list, and one with “<a href="http://www.gnyp.com">www.gnyp.com</a>” which isn’t:</li>
 </ul>
-<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
+<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
 $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
 </code></pre><ul>
 <li>Then commit changes to Solr so we don’t have to wait:</li>
 </ul>
-<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
+<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
 $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
 <result name="response" numFound="0" start="0"/>
 $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
@@ -314,12 +314,12 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
 </ul>
 </li>
 </ul>
-<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
+<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
 <result name="response" numFound="62944" start="0">
 </code></pre><ul>
 <li>Similar for com.plumanalytics, Grammarly, and ltx71!</li>
 </ul>
-<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
+<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
 *com.plumanalytics*' | xmllint --format - | grep numFound
 <result name="response" numFound="28256" start="0">
 $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
@@ -329,7 +329,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
 </code></pre><ul>
 <li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li>
 </ul>
-<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
+<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
 $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
 <result name="response" numFound="0" start="0"/>
 </code></pre><ul>
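If several bot patterns need the same treatment, the check-and-delete pair above can be looped; a sketch (the shard name and pattern list are illustrative), using a single combined query so that only type 0 hits matching the user agent are removed:

$ shard=statistics-2018
$ for ua in ltx71 com.plumanalytics Grammarly BUbiNG; do
    echo "== $ua =="
    # count the hits for this pattern first
    curl -s "http://localhost:8081/solr/$shard/select" -d "q=userAgent:*$ua*&fq=type:0&rows=0" \
      | xmllint --format - | grep numFound
    # then delete them and commit
    curl -s "http://localhost:8081/solr/$shard/update?commit=true" \
      -H 'Content-Type: text/xml' \
      --data-binary "<delete><query>userAgent:*$ua* AND type:0</query></delete>"
  done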
@@ -341,7 +341,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
 </ul>
 </li>
 </ul>
-<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
+<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
 12&q=userAgent:*Unpaywall*' | xmllint --format - | less
 ...
 <lst name="facet_counts">
@@ -394,7 +394,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
 </ul>
 </li>
 </ul>
-<pre><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
+<pre tabindex="0"><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
 istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
 </code></pre><ul>
 <li>Open a <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request</a> against COUNTER-Robots to remove unnecessary escaping of dashes</li>
@@ -423,7 +423,7 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
 </li>
 <li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of the <code>\d</code> digit character type, as Solr’s regex search can’t use those</li>
 </ul>
-<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
+<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
 $ http "http://localhost:8081/solr/statistics/update?commit=true"
 $ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
 <result name="response" numFound="1" start="0">
@@ -433,7 +433,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
 <li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
 <li>I realized that it’s easier to search Solr from curl via POST using this syntax:</li>
 </ul>
-<pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
+<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
 </code></pre><ul>
 <li>If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
 <ul>
@@ -441,7 +441,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
 </ul>
 </li>
 </ul>
-<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
+<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
 </code></pre><ul>
 <li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
 </ul>
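As an aside to the bracket-globbing problem above: besides moving the query into POST data, curl’s URL globbing can be switched off with -g/--globoff so that a character class in the URL is sent literally instead of being expanded into ten requests. A minimal illustration reusing the regex from these notes (rows=0 added here only to suppress the result docs):

$ curl -g -s 'http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&rows=0'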
@@ -450,7 +450,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
 <li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
 <li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
 </ul>
-<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
+<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
 $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
 # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
 $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@@ -513,7 +513,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
 </li>
 <li>Most of the curl hits were from CIAT in mid-2019, where they were using <a href="https://guzzle3.readthedocs.io/http-client/client.html">GuzzleHttp</a> from PHP, which uses something like this for its user agent:</li>
 </ul>
-<pre><code>Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
+<pre tabindex="0"><code>Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
 </code></pre><ul>
 <li>Run system updates on DSpace Test and reboot the server</li>
 </ul>
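To gauge how much of that traffic identifies itself as Guzzle before purging, one could count the matching hits with the POST query style used earlier in these notes (a sketch; the count returned is not from the actual server):

$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:*Guzzle*&rows=0' | xmllint --format - | grep numFound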
@@ -564,7 +564,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
 </li>
 <li>Buck is one I’ve never heard of before; its user agent is:</li>
 </ul>
-<pre><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
+<pre tabindex="0"><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
 </code></pre><ul>
 <li>All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
 </ul>