mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-03-04
@ -15,17 +15,17 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
|
||||
|
||||
|
||||
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
4671942
|
||||
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
1277694
|
||||
|
||||
So 4.6 million from XMLUI and another 1.2 million from API requests
|
||||
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
||||
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
1183456
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
106781
|
||||
" />
|
||||
<meta property="og:type" content="article" />
|
||||
@ -45,20 +45,20 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
|
||||
|
||||
|
||||
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
4671942
|
||||
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
1277694
|
||||
|
||||
So 4.6 million from XMLUI and another 1.2 million from API requests
|
||||
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
||||
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
1183456
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
106781
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.92.2" />
|
||||
<meta name="generator" content="Hugo 0.93.1" />
|
||||
|
||||
|
||||
|
||||
@ -152,22 +152,22 @@ Let’s see how many of the REST API requests were for bitstreams (because t
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
4671942
|
||||
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
||||
1277694
|
||||
</code></pre><ul>
|
||||
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
|
||||
<li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
1183456
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
106781
|
||||
</code></pre><ul>
|
||||
<li>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</li>
|
||||
</ul>
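A minimal sketch of the same field extraction, run against a few made-up log lines rather than the real logs. It assumes nginx's default combined log format, where the sixth whitespace-separated field is the quoted request method. The sample lines, the addresses, and the `count_methods` helper name are hypothetical.

```shell
# Hypothetical sample of nginx combined-format log lines (made up for illustration)
sample_log='192.0.2.1 - - [01/Oct/2019:00:00:01 +0200] "GET /handle/10568/1 HTTP/1.1" 200 512 "-" "curl/7.58.0"
192.0.2.2 - - [02/Oct/2019:10:11:12 +0200] "POST /rest/items HTTP/1.1" 200 128 "-" "curl/7.58.0"
192.0.2.1 - - [03/Nov/2019:00:00:01 +0200] "GET /handle/10568/2 HTTP/1.1" 200 512 "-" "curl/7.58.0"
192.0.2.3 - - [15/Oct/2019:08:00:00 +0200] "GET /discover HTTP/1.1" 200 256 "-" "curl/7.58.0"'

# Keep only October 2019 lines, take field six ("GET, "POST, ...),
# strip the leading double quote, then count occurrences per method.
count_methods() {
    grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
}

method_counts=$(printf '%s\n' "$sample_log" | count_methods)
printf '%s\n' "$method_counts"
```

The November line is dropped by the date filter, so only three of the four sample requests are counted. The approach is "lazy" in the sense that a malformed request line would put something other than a method in field six.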
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
|
||||
1 PUT
|
||||
8 PROPFIND
|
||||
283 OPTIONS
|
||||
@ -177,7 +177,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
|
||||
</code></pre><ul>
|
||||
<li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
|
||||
365288
|
||||
</code></pre><ul>
|
||||
<li>Their user agent is one I’ve never seen before:</li>
|
||||
@ -186,22 +186,22 @@ Let’s see how many of the REST API requests were for bitstreams (because t
|
||||
</code></pre><ul>
|
||||
<li>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
|
||||
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
|
||||
6566 GET /bitstream
|
||||
351928 GET /handle
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
|
||||
214209
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
|
||||
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
|
||||
86874
|
||||
</code></pre><ul>
|
||||
<li>As far as I can tell, none of their requests are counted in the Solr statistics:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
|
||||
</code></pre><ul>
|
||||
<li>Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load</li>
|
||||
<li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
|
||||
</code></pre><ul>
|
||||
<li>On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire’s COUNTER-Robots</a> project
|
||||
<ul>
|
||||
@ -210,23 +210,23 @@ Let’s see how many of the REST API requests were for bitstreams (because t
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
|
||||
</code></pre><ul>
|
||||
<li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<response>
|
||||
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
|
||||
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
|
||||
</response>
|
||||
</code></pre><ul>
|
||||
<li>Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
|
||||
</code></pre><ul>
|
||||
<li>After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…
|
||||
<ul>
|
||||
@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>else if (line.hasOption('m'))
|
||||
<pre tabindex="0"><code>else if (line.hasOption('m'))
|
||||
{
|
||||
SolrLogger.markRobotsByIP();
|
||||
}
|
||||
@ -263,16 +263,16 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
|
||||
<ul>
|
||||
<li>I added “alanfuuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
|
||||
</code></pre><ul>
|
||||
<li>After committing the changes in Solr I saw one request for “alanfuuu1” and no requests for “alanfuuu2”:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="0" start="0"/>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="0" start="0"/>
|
||||
</code></pre><ul>
|
||||
<li>So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
|
||||
<ul>
|
||||
@ -281,16 +281,16 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
|
||||
</li>
|
||||
<li>I’m curious how Solr handles special characters when matching, so I will test two requests: one with “<a href="http://www.gnip.com">www.gnip.com</a>” which is in the spider list, and one with “<a href="http://www.gnyp.com">www.gnyp.com</a>” which isn’t:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
|
||||
</code></pre><ul>
|
||||
<li>Then commit changes to Solr so we don’t have to wait:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="0" start="0"/>
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="0" start="0"/>
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
</code></pre><ul>
|
||||
<li>So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…</li>
|
||||
</ul>
|
||||
@ -314,24 +314,24 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="62944" start="0">
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="62944" start="0">
|
||||
</code></pre><ul>
|
||||
<li>Similar for com.plumanalytics, Grammarly, and ltx71!</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
|
||||
*com.plumanalytics*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="28256" start="0">
|
||||
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="6288" start="0">
|
||||
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="105663" start="0">
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
|
||||
*com.plumanalytics*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="28256" start="0">
|
||||
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="6288" start="0">
|
||||
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="105663" start="0">
|
||||
</code></pre><ul>
|
||||
<li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
|
||||
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="0" start="0"/>
|
||||
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
|
||||
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="0" start="0"/>
|
||||
</code></pre><ul>
|
||||
<li>I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
|
||||
<ul>
|
||||
@ -341,21 +341,21 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
|
||||
12&q=userAgent:*Unpaywall*' | xmllint --format - | less
|
||||
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
|
||||
12&q=userAgent:*Unpaywall*' | xmllint --format - | less
|
||||
...
|
||||
<lst name="facet_counts">
|
||||
<lst name="facet_queries"/>
|
||||
<lst name="facet_fields">
|
||||
<lst name="dateYearMonth">
|
||||
<int name="2019-10">198624</int>
|
||||
<int name="2019-05">88422</int>
|
||||
<int name="2019-06">79911</int>
|
||||
<int name="2019-09">67065</int>
|
||||
<int name="2019-07">39026</int>
|
||||
<int name="2019-08">36889</int>
|
||||
<int name="2019-04">36512</int>
|
||||
<int name="2019-11">760</int>
|
||||
<lst name="facet_counts">
|
||||
<lst name="facet_queries"/>
|
||||
<lst name="facet_fields">
|
||||
<lst name="dateYearMonth">
|
||||
<int name="2019-10">198624</int>
|
||||
<int name="2019-05">88422</int>
|
||||
<int name="2019-06">79911</int>
|
||||
<int name="2019-09">67065</int>
|
||||
<int name="2019-07">39026</int>
|
||||
<int name="2019-08">36889</int>
|
||||
<int name="2019-04">36512</int>
|
||||
<int name="2019-11">760</int>
|
||||
</lst>
|
||||
</lst>
|
||||
</code></pre><ul>
|
||||
@ -423,17 +423,17 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
|
||||
</li>
|
||||
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr’s regex search can’t use those</li>
|
||||
</ul>
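A rough sketch of that rewrite, assuming the patterns sit one per line and that `\d` never appears inside a bracket expression (where a plain substitution would produce `[[0-9]]`). The three sample patterns are only stand-ins for the COUNTER-Robots list, and the variable names are hypothetical.

```shell
# Stand-in patterns, one per line (not the full COUNTER-Robots list)
patterns='Scrapoo\/\d
Postgenomic(\s|\+)v2
ltx71'

# Replace every literal "\d" with the explicit character class "[0-9]".
# In the sed expression, "\\d" matches a backslash followed by "d".
converted=$(printf '%s\n' "$patterns" | sed 's/\\d/[0-9]/g')
printf '%s\n' "$converted"
```

Only the first pattern changes. The `\s` in the second one is left alone, so it would still need separate handling before being used in a Solr regex query.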
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
|
||||
$ http "http://localhost:8081/solr/statistics/update?commit=true"
|
||||
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
|
||||
$ http "http://localhost:8081/solr/statistics/update?commit=true"
|
||||
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="1" start="0">
|
||||
</code></pre><ul>
|
||||
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
|
||||
<li>I realized that it’s easier to search Solr from curl via POST using this syntax:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
|
||||
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
|
||||
</code></pre><ul>
|
||||
<li>If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
|
||||
<ul>
|
||||
@ -441,7 +441,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
|
||||
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
|
||||
</code></pre><ul>
|
||||
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
|
||||
</ul>
|
||||
@ -450,7 +450,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
|
||||
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
|
||||
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
|
||||
</ul>
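A small sketch of the merge step on made-up data, to show why duplicates between the two sources collapse. The XML fragment, the identifiers, and the variable names are hypothetical stand-ins for <code>cg-creator-id.xml</code> and the IWMI list; only the <code>grep -oE</code> pattern is taken from the command used here.

```shell
# Made-up stand-ins for cg-creator-id.xml and the IWMI list (illustration only)
existing_xml='<node id="Jane Doe: 0000-0001-2345-6789" label="Jane Doe: 0000-0001-2345-6789"/>'
new_ids='0000-0002-1825-0097
0000-0001-2345-6789
0000-0003-1111-222X'

# Extract every ORCID-shaped token from both sources, then sort and de-duplicate.
combined=$(printf '%s\n%s\n' "$existing_xml" "$new_ids" \
    | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \
    | sort | uniq)
printf '%s\n' "$combined"
```

The identifier that appears twice in the XML and once in the new list ends up as a single line, so the combined file can be passed to the resolver without repeated lookups.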
|
||||
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
|
||||
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
|
||||
$ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
|
||||
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
|
||||
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
|
||||
|