Add notes for 2020-02-25

Alan Orth 2020-02-25 21:05:22 +02:00
parent e0c862d5ea
commit b990e4da33
3 changed files with 199 additions and 9 deletions

@ -928,6 +928,8 @@ EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
```
## 2020-02-25
- And what about the 950,000 hits from Online.net IPs with the following user agent:
```
@ -935,5 +937,97 @@ Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3
```
- Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I'm purging them all
- I looked deeper in the Solr statistics and found a bunch more weird user agents:
```
LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
EventMachine HttpClient
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
Biosphere EcoSearch http://search.ecointernet.org/
Typhoeus - https://github.com/typhoeus/typhoeus
Citoid (Wikimedia tool; learn more at https://www.mediawiki.org/wiki/Citoid)
node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
7Siters/1.08 (+https://7ooo.ru/siters/)
sqlmap/1.0-dev-nongit-20190527 (http://sqlmap.org)
sqlmap/1.3.4.14#dev (http://sqlmap.org)
lua-resty-http/0.10 (Lua) ngx_lua/10000
omgili/0.5 +http://omgili.com
IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)
Twurly v1.1 (https://twurly.org)
okhttp/3.11.0
okhttp/3.10.0
Pattern/2.6 +http://www.clips.ua.ac.be/pattern
Link Check; EPrints 3.3.x;
CyotekWebCopy/1.7 CyotekHTTP/2.0
Adestra Link Checker: http://www.adestra.co.uk
HTTPie/1.0.2
```
- I notice that some of these would be matched by the COUNTER-Robots list when DSpace uses it in Java, because the matching there is more robust (and case-insensitive)
  - I created a temporary file with some of the patterns and changed their capitalization to match the user agents above so I could run them through `check-spider-hits.sh`
```
Link.?Check
Http.?Client
ecointernet
```
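- A minimal sketch of why the capitalization matters, using Python's `re` module as a stand-in for both matchers (the lowercase `http.?client` and `link.?check` forms and the `matching_agents` helper are hypothetical, for illustration only; this is not how DSpace or `check-spider-hits.sh` is implemented):

```python
import re

# A few user agents copied from the list above.
AGENTS = [
    "EventMachine HttpClient",
    "EcoInternet http://www.ecointernet.org/",
    "Link Check; EPrints 3.3.x;",
    "Adestra Link Checker: http://www.adestra.co.uk",
    "okhttp/3.11.0",
]


def matching_agents(pattern, flags=0):
    """Return the agents that contain a match for the given regex."""
    regex = re.compile(pattern, flags)
    return [agent for agent in AGENTS if regex.search(agent)]


# Hypothetical lowercase pattern: it only matches when case is ignored.
print(matching_agents("http.?client"))
print(matching_agents("http.?client", re.IGNORECASE))

# The capitalized patterns from the temporary file match case-sensitively.
for pattern in ["Link.?Check", "Http.?Client", "ecointernet"]:
    print(pattern, "->", matching_agents(pattern))
```

- Note that `ecointernet` works in lowercase only because the lowercase domain appears in the user agent's URL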
- That removes another 820,000 or so across all the statistics cores:
```
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
Purging 253 hits from Jersey\/[0-9] in statistics
Purging 7302 hits from Link.?Check in statistics
Purging 85574 hits from Http.?Client in statistics
Purging 495 hits from HTTPie\/[0-9] in statistics
Purging 56726 hits from ecointernet in statistics
Total number of bot hits purged: 150350
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2018 -p
Purging 3442 hits from Link.?Check in statistics-2018
Purging 21922 hits from Http.?Client in statistics-2018
Purging 2120 hits from HTTPie\/[0-9] in statistics-2018
Purging 10 hits from ecointernet in statistics-2018
Total number of bot hits purged: 27494
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2017 -p
Purging 6416 hits from Link.?Check in statistics-2017
Purging 403402 hits from Http.?Client in statistics-2017
Purging 12 hits from HTTPie\/[0-9] in statistics-2017
Purging 6 hits from ecointernet in statistics-2017
Total number of bot hits purged: 409836
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2016 -p
Purging 2348 hits from Link.?Check in statistics-2016
Purging 225664 hits from Http.?Client in statistics-2016
Purging 15 hits from HTTPie\/[0-9] in statistics-2016
Total number of bot hits purged: 228027
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2015 -p
Purging 3459 hits from Link.?Check in statistics-2015
Purging 263 hits from Http.?Client in statistics-2015
Purging 15 hits from HTTPie\/[0-9] in statistics-2015
Total number of bot hits purged: 3737
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2014 -p
Purging 5 hits from Link.?Check in statistics-2014
Purging 8 hits from Http.?Client in statistics-2014
Purging 4 hits from HTTPie\/[0-9] in statistics-2014
Total number of bot hits purged: 17
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2011 -p
Purging 159 hits from Http.?Client in statistics-2011
Total number of bot hits purged: 159
```
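- As a quick check of the arithmetic, the per-core totals from the runs above can be added up with a short sketch like this (the `OUTPUT` string holds only the "Total" lines copied from the output; the variable names are my own):

```python
import re

# "Total" lines copied from the check-spider-hits.sh runs above,
# in order: statistics, 2018, 2017, 2016, 2015, 2014, 2011.
OUTPUT = """\
Total number of bot hits purged: 150350
Total number of bot hits purged: 27494
Total number of bot hits purged: 409836
Total number of bot hits purged: 228027
Total number of bot hits purged: 3737
Total number of bot hits purged: 17
Total number of bot hits purged: 159
"""

# Extract each per-core total and sum them.
totals = [int(n) for n in re.findall(r"Total number of bot hits purged: (\d+)", OUTPUT)]
grand_total = sum(totals)

print(f"{len(totals)} cores, {grand_total} hits purged in total")
```

- That works out to 819,620 hits purged across the seven cores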
- Make pull requests for issues with user agents in the COUNTER-Robots repository:
  - [Fix okhttp](https://github.com/atmire/COUNTER-Robots/pull/33)
  - [Add new bots](https://github.com/atmire/COUNTER-Robots/pull/34)
- One benefit of all this is that the statistics Solr core has shrunk by 6GB since yesterday, though I can't remember how big it was before that
  - I wonder if the sharding process would work now...
<!-- vim: set sw=2 ts=2: -->

@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
<meta property="article:modified_time" content="2020-02-24T19:15:05+02:00" />
<meta property="article:modified_time" content="2020-02-25T09:14:30+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2020"/>
@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
"@type": "BlogPosting",
"headline": "February, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
"wordCount": "6024",
"wordCount": "6499",
"datePublished": "2020-02-02T11:56:30+02:00",
"dateModified": "2020-02-24T19:15:05+02:00",
"dateModified": "2020-02-25T09:14:30+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -1055,12 +1055,108 @@ Total number of bot hits purged: 14110
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
</code></pre><ul>
</code></pre><h2 id="2020-02-25">2020-02-25</h2>
<ul>
<li>And what about the 950,000 hits from Online.net IPs with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I&rsquo;m purging them all</li>
<li>I looked deeper in the Solr statistics and found a bunch more weird user agents:</li>
</ul>
<pre><code>LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
EventMachine HttpClient
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
Biosphere EcoSearch http://search.ecointernet.org/
Typhoeus - https://github.com/typhoeus/typhoeus
Citoid (Wikimedia tool; learn more at https://www.mediawiki.org/wiki/Citoid)
node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
7Siters/1.08 (+https://7ooo.ru/siters/)
sqlmap/1.0-dev-nongit-20190527 (http://sqlmap.org)
sqlmap/1.3.4.14#dev (http://sqlmap.org)
lua-resty-http/0.10 (Lua) ngx_lua/10000
omgili/0.5 +http://omgili.com
IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)
Twurly v1.1 (https://twurly.org)
okhttp/3.11.0
okhttp/3.10.0
Pattern/2.6 +http://www.clips.ua.ac.be/pattern
Link Check; EPrints 3.3.x;
CyotekWebCopy/1.7 CyotekHTTP/2.0
Adestra Link Checker: http://www.adestra.co.uk
HTTPie/1.0.2
</code></pre><ul>
<li>I notice that some of these would be matched by the COUNTER-Robots list when DSpace uses it in Java, because the matching there is more robust (and case-insensitive)
<ul>
<li>I created a temporary file with some of the patterns and changed their capitalization to match the user agents above so I could run them through <code>check-spider-hits.sh</code></li>
</ul>
</li>
</ul>
<pre><code>Link.?Check
Http.?Client
ecointernet
</code></pre><ul>
<li>That removes another 820,000 or so across all the statistics cores:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
Purging 253 hits from Jersey\/[0-9] in statistics
Purging 7302 hits from Link.?Check in statistics
Purging 85574 hits from Http.?Client in statistics
Purging 495 hits from HTTPie\/[0-9] in statistics
Purging 56726 hits from ecointernet in statistics
Total number of bot hits purged: 150350
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2018 -p
Purging 3442 hits from Link.?Check in statistics-2018
Purging 21922 hits from Http.?Client in statistics-2018
Purging 2120 hits from HTTPie\/[0-9] in statistics-2018
Purging 10 hits from ecointernet in statistics-2018
Total number of bot hits purged: 27494
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2017 -p
Purging 6416 hits from Link.?Check in statistics-2017
Purging 403402 hits from Http.?Client in statistics-2017
Purging 12 hits from HTTPie\/[0-9] in statistics-2017
Purging 6 hits from ecointernet in statistics-2017
Total number of bot hits purged: 409836
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2016 -p
Purging 2348 hits from Link.?Check in statistics-2016
Purging 225664 hits from Http.?Client in statistics-2016
Purging 15 hits from HTTPie\/[0-9] in statistics-2016
Total number of bot hits purged: 228027
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2015 -p
Purging 3459 hits from Link.?Check in statistics-2015
Purging 263 hits from Http.?Client in statistics-2015
Purging 15 hits from HTTPie\/[0-9] in statistics-2015
Total number of bot hits purged: 3737
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2014 -p
Purging 5 hits from Link.?Check in statistics-2014
Purging 8 hits from Http.?Client in statistics-2014
Purging 4 hits from HTTPie\/[0-9] in statistics-2014
Total number of bot hits purged: 17
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2011 -p
Purging 159 hits from Http.?Client in statistics-2011
Total number of bot hits purged: 159
</code></pre><ul>
<li>Make pull requests for issues with user agents in the COUNTER-Robots repository:
<ul>
<li><a href="https://github.com/atmire/COUNTER-Robots/pull/33">Fix okhttp</a></li>
<li><a href="https://github.com/atmire/COUNTER-Robots/pull/34">Add new bots</a></li>
</ul>
</li>
<li>One benefit of all this is that the statistics Solr core has shrunk by 6GB since yesterday, though I can&rsquo;t remember how big it was before that
<ul>
<li>I wonder if the sharding process would work now&hellip;</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-02/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>
<url>