mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-26 00:18:21 +01:00
Add notes for 2020-02-25
This commit is contained in:
parent
e0c862d5ea
commit
b990e4da33
@ -928,6 +928,8 @@ EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
```

## 2020-02-25

- And what are the 950,000 hits from Online.net IPs with the following user agent:

```
@ -935,5 +937,97 @@ Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3
```

- Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they came within seconds of each other, so I'm purging them all
- I looked deeper in the Solr statistics and found a bunch more weird user agents:

```
LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
EventMachine HttpClient
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
Biosphere EcoSearch http://search.ecointernet.org/
Typhoeus - https://github.com/typhoeus/typhoeus
Citoid (Wikimedia tool; learn more at https://www.mediawiki.org/wiki/Citoid)
node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
7Siters/1.08 (+https://7ooo.ru/siters/)
sqlmap/1.0-dev-nongit-20190527 (http://sqlmap.org)
sqlmap/1.3.4.14#dev (http://sqlmap.org)
lua-resty-http/0.10 (Lua) ngx_lua/10000
omgili/0.5 +http://omgili.com
IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)
Twurly v1.1 (https://twurly.org)
okhttp/3.11.0
okhttp/3.10.0
Pattern/2.6 +http://www.clips.ua.ac.be/pattern
Link Check; EPrints 3.3.x;
CyotekWebCopy/1.7 CyotekHTTP/2.0
Adestra Link Checker: http://www.adestra.co.uk
HTTPie/1.0.2
```
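
One way a list like this could be pulled out of Solr is a facet query on the user agent field. The following is a minimal sketch that only builds the query URL and does not contact Solr. It assumes the core lives under the same `http://localhost:8081/solr` base URL used elsewhere in these notes and that the field is named `userAgent`; the `facet_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def facet_url(solr_url, core, field="userAgent", limit=50):
    """Build a Solr select URL that facets on one field and returns no documents.

    The field name "userAgent" is an assumption about the statistics schema.
    """
    params = {
        "q": "*:*",            # match every document in the core
        "rows": 0,             # we only want the facet counts
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,  # top N values by hit count
        "facet.mincount": 1,   # skip values with no hits
        "wt": "json",
    }
    return f"{solr_url}/{core}/select?{urlencode(params)}"


url = facet_url("http://localhost:8081/solr", "statistics")
print(url)
```

The printed URL can be pasted into `curl` or a browser on the server to eyeball the most frequent user agents before deciding which ones to purge.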

- I notice that some of these would be matched by the COUNTER-Robots list when DSpace uses it in Java, because the matching there is more robust (and case-insensitive)
  - I created a temporary file with some of the patterns and capitalized them so I could run them through `check-spider-hits.sh`

```
Link.?Check
Http.?Client
ecointernet
```
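
To illustrate why the capitalization matters, here is a small Python sketch (not part of `check-spider-hits.sh`) that tests these three patterns against a few of the user agents listed earlier, with and without case-insensitive matching. It assumes the shell script matches case-sensitively, as the note above suggests; the `matching_agents` helper and the lowercase `http.?client` variant are hypothetical.

```python
import re

patterns = ["Link.?Check", "Http.?Client", "ecointernet"]

# A few user agents copied from the list earlier in these notes
agents = [
    "Link Check; EPrints 3.3.x;",
    "Adestra Link Checker: http://www.adestra.co.uk",
    "EventMachine HttpClient",
    "EcoInternet http://www.ecointernet.org/",
    "okhttp/3.11.0",
]


def matching_agents(pattern, agents, ignore_case=False):
    """Return the agents in which the pattern is found anywhere in the string."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [agent for agent in agents if regex.search(agent)]


for pattern in patterns:
    print(pattern, "->", matching_agents(pattern, agents))

# A lowercase variant of the same pattern misses "HttpClient" unless case is ignored
print(matching_agents("http.?client", agents))
print(matching_agents("http.?client", agents, ignore_case=True))
```

Note that `ecointernet` matches the EcoInternet agents only because the lowercase domain appears in their URL, not because of the capitalized name.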

- That removes another 820,000 or so across all the statistics cores:

```
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
Purging 253 hits from Jersey\/[0-9] in statistics
Purging 7302 hits from Link.?Check in statistics
Purging 85574 hits from Http.?Client in statistics
Purging 495 hits from HTTPie\/[0-9] in statistics
Purging 56726 hits from ecointernet in statistics

Total number of bot hits purged: 150350
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2018 -p
Purging 3442 hits from Link.?Check in statistics-2018
Purging 21922 hits from Http.?Client in statistics-2018
Purging 2120 hits from HTTPie\/[0-9] in statistics-2018
Purging 10 hits from ecointernet in statistics-2018

Total number of bot hits purged: 27494
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2017 -p
Purging 6416 hits from Link.?Check in statistics-2017
Purging 403402 hits from Http.?Client in statistics-2017
Purging 12 hits from HTTPie\/[0-9] in statistics-2017
Purging 6 hits from ecointernet in statistics-2017

Total number of bot hits purged: 409836
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2016 -p
Purging 2348 hits from Link.?Check in statistics-2016
Purging 225664 hits from Http.?Client in statistics-2016
Purging 15 hits from HTTPie\/[0-9] in statistics-2016

Total number of bot hits purged: 228027
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2015 -p
Purging 3459 hits from Link.?Check in statistics-2015
Purging 263 hits from Http.?Client in statistics-2015
Purging 15 hits from HTTPie\/[0-9] in statistics-2015

Total number of bot hits purged: 3737
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2014 -p
Purging 5 hits from Link.?Check in statistics-2014
Purging 8 hits from Http.?Client in statistics-2014
Purging 4 hits from HTTPie\/[0-9] in statistics-2014

Total number of bot hits purged: 17
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2011 -p
Purging 159 hits from Http.?Client in statistics-2011

Total number of bot hits purged: 159
```
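
As a quick check of the numbers, the per-core totals printed above can be tallied with a short script. This is only a sketch: the `total_purged` helper is hypothetical, and the `output` string holds just the `Total number of bot hits purged` lines copied by hand from the runs above.

```python
import re


def total_purged(output):
    """Sum every 'Total number of bot hits purged: N' line in the given text."""
    counts = re.findall(
        r"^Total number of bot hits purged: (\d+)$", output, flags=re.MULTILINE
    )
    return sum(int(count) for count in counts)


# Totals copied from the seven runs above (statistics, then 2018 down to 2011)
output = """\
Total number of bot hits purged: 150350
Total number of bot hits purged: 27494
Total number of bot hits purged: 409836
Total number of bot hits purged: 228027
Total number of bot hits purged: 3737
Total number of bot hits purged: 17
Total number of bot hits purged: 159
"""

grand_total = total_purged(output)
print(grand_total)  # prints 819620
```

Most of that comes from the `Http.?Client` pattern in the 2016 and 2017 cores.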

- Make pull requests for issues with user agents in the COUNTER-Robots repository:
  - [Fix okhttp](https://github.com/atmire/COUNTER-Robots/pull/33)
  - [Add new bots](https://github.com/atmire/COUNTER-Robots/pull/34)
- One benefit of all this is that the statistics Solr core has shrunk by 6GB since yesterday, though I can't remember how big it was before that
  - I wonder if the sharding process would work now...

<!-- vim: set sw=2 ts=2: -->

@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
<meta property="article:modified_time" content="2020-02-24T19:15:05+02:00" />
<meta property="article:modified_time" content="2020-02-25T09:14:30+02:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2020"/>
@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
"@type": "BlogPosting",
"headline": "February, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
"wordCount": "6024",
"wordCount": "6499",
"datePublished": "2020-02-02T11:56:30+02:00",
"dateModified": "2020-02-24T19:15:05+02:00",
"dateModified": "2020-02-25T09:14:30+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -1055,12 +1055,108 @@ Total number of bot hits purged: 14110
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
</code></pre><ul>
</code></pre><h2 id="2020-02-25">2020-02-25</h2>
<ul>
<li>And what are the 950,000 hits from Online.net IPs with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they came within seconds of each other, so I’m purging them all</li>
<li>I looked deeper in the Solr statistics and found a bunch more weird user agents:</li>
</ul>
<pre><code>LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
EventMachine HttpClient
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
Biosphere EcoSearch http://search.ecointernet.org/
Typhoeus - https://github.com/typhoeus/typhoeus
Citoid (Wikimedia tool; learn more at https://www.mediawiki.org/wiki/Citoid)
node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
7Siters/1.08 (+https://7ooo.ru/siters/)
sqlmap/1.0-dev-nongit-20190527 (http://sqlmap.org)
sqlmap/1.3.4.14#dev (http://sqlmap.org)
lua-resty-http/0.10 (Lua) ngx_lua/10000
omgili/0.5 +http://omgili.com
IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)
Twurly v1.1 (https://twurly.org)
okhttp/3.11.0
okhttp/3.10.0
Pattern/2.6 +http://www.clips.ua.ac.be/pattern
Link Check; EPrints 3.3.x;
CyotekWebCopy/1.7 CyotekHTTP/2.0
Adestra Link Checker: http://www.adestra.co.uk
HTTPie/1.0.2
</code></pre><ul>
<li>I notice that some of these would be matched by the COUNTER-Robots list when DSpace uses it in Java, because the matching there is more robust (and case-insensitive)
<ul>
<li>I created a temporary file with some of the patterns and capitalized them so I could run them through <code>check-spider-hits.sh</code></li>
</ul>
</li>
</ul>
<pre><code>Link.?Check
Http.?Client
ecointernet
</code></pre><ul>
<li>That removes another 820,000 or so across all the statistics cores:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
Purging 253 hits from Jersey\/[0-9] in statistics
Purging 7302 hits from Link.?Check in statistics
Purging 85574 hits from Http.?Client in statistics
Purging 495 hits from HTTPie\/[0-9] in statistics
Purging 56726 hits from ecointernet in statistics

Total number of bot hits purged: 150350
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2018 -p
Purging 3442 hits from Link.?Check in statistics-2018
Purging 21922 hits from Http.?Client in statistics-2018
Purging 2120 hits from HTTPie\/[0-9] in statistics-2018
Purging 10 hits from ecointernet in statistics-2018

Total number of bot hits purged: 27494
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2017 -p
Purging 6416 hits from Link.?Check in statistics-2017
Purging 403402 hits from Http.?Client in statistics-2017
Purging 12 hits from HTTPie\/[0-9] in statistics-2017
Purging 6 hits from ecointernet in statistics-2017

Total number of bot hits purged: 409836
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2016 -p
Purging 2348 hits from Link.?Check in statistics-2016
Purging 225664 hits from Http.?Client in statistics-2016
Purging 15 hits from HTTPie\/[0-9] in statistics-2016

Total number of bot hits purged: 228027
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2015 -p
Purging 3459 hits from Link.?Check in statistics-2015
Purging 263 hits from Http.?Client in statistics-2015
Purging 15 hits from HTTPie\/[0-9] in statistics-2015

Total number of bot hits purged: 3737
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2014 -p
Purging 5 hits from Link.?Check in statistics-2014
Purging 8 hits from Http.?Client in statistics-2014
Purging 4 hits from HTTPie\/[0-9] in statistics-2014

Total number of bot hits purged: 17
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2011 -p
Purging 159 hits from Http.?Client in statistics-2011

Total number of bot hits purged: 159
</code></pre><ul>
<li>Make pull requests for issues with user agents in the COUNTER-Robots repository:
<ul>
<li><a href="https://github.com/atmire/COUNTER-Robots/pull/33">Fix okhttp</a></li>
<li><a href="https://github.com/atmire/COUNTER-Robots/pull/34">Add new bots</a></li>
</ul>
</li>
<li>One benefit of all this is that the statistics Solr core has shrunk by 6GB since yesterday, though I can’t remember how big it was before that
<ul>
<li>I wonder if the sharding process would work now…</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

@ -4,27 +4,27 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-02/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
</url>

<url>