Update notes for 2020-02-24

Alan Orth 2020-02-25 09:14:30 +02:00
parent 08adaae29b
commit e0c862d5ea
3 changed files with 340 additions and 9 deletions

@ -769,4 +769,171 @@ Purging 1752548 hits from 104.196.152.243 in statistics-2017
Total number of bot hits purged: 1752548
```
- I looked in the REST API logs for the past month and found a few more IPs:
- 95.110.154.135 (BioversityBot)
- 34.209.213.122 (IITA? bot)
- The client at 3.225.28.105 is using the following user agent:
```
Apache-HttpClient/4.3.4 (java 1.5)
```
- But I don't see any hits for it in the statistics core for some reason
- Looking more into the 2015 statistics I see some questionable IPs:
- 50.115.121.196 has a DNS of saltlakecity2tr.monitis.com
- 70.32.99.142 has userAgent Drupal
- 104.130.164.111 was some scraper on Rackspace.com that made ~30,000 requests per month
- 45.56.65.158 was some scraper on Linode that made ~30,000 requests per month
- 23.97.198.40 was some scraper with an IP owned by Microsoft that made ~4,000 requests per month and had no user agent
- 180.76.15.6 and *dozens* of other IPs with DNS like baiduspider-180-76-15-6.crawl.baidu.com. (and they were using a Mozilla/5.0 user agent!)
- I purged the hits from these IPs using `check-spider-ip-hits.sh`:
```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 11478 hits from 95.110.154.135 in statistics
Purging 1208 hits from 34.209.213.122 in statistics
Purging 10 hits from 54.184.39.242 in statistics
Total number of bot hits purged: 12696
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 12572 hits from 95.110.154.135 in statistics-2018
Purging 233 hits from 34.209.213.122 in statistics-2018
Total number of bot hits purged: 12805
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 37503 hits from 95.110.154.135 in statistics-2017
Purging 25 hits from 34.209.213.122 in statistics-2017
Purging 8621 hits from 23.97.198.40 in statistics-2017
Total number of bot hits purged: 46149
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 1476 hits from 95.110.154.135 in statistics-2016
Purging 10490 hits from 70.32.99.142 in statistics-2016
Purging 29519 hits from 50.115.121.196 in statistics-2016
Purging 175758 hits from 45.56.65.158 in statistics-2016
Purging 26279 hits from 23.97.198.40 in statistics-2016
Total number of bot hits purged: 243522
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 49351 hits from 70.32.99.142 in statistics-2015
Purging 30278 hits from 50.115.121.196 in statistics-2015
Purging 172292 hits from 104.130.164.111 in statistics-2015
Purging 78571 hits from 45.56.65.158 in statistics-2015
Purging 16069 hits from 23.97.198.40 in statistics-2015
Total number of bot hits purged: 346561
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2014 -p
Purging 462 hits from 70.32.99.142 in statistics-2014
Purging 1766 hits from 50.115.121.196 in statistics-2014
Total number of bot hits purged: 2228
```
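- A quick sanity check on the totals above, as a sketch only: `sum_purged` is a hypothetical helper (not part of `check-spider-ip-hits.sh`) and it assumes the hit count is always the second field of each `Purging` line, as in the output shown:

```shell
#!/usr/bin/env bash
# Sum the hit counts from "Purging N hits from IP in CORE" lines on stdin.
# Assumption: the count is the second whitespace-separated field.
sum_purged() {
  awk '$1 == "Purging" && $3 == "hits" { total += $2 } END { print total + 0 }'
}

# Example using the statistics-2014 lines from the output above
printf '%s\n' \
  'Purging 462 hits from 70.32.99.142 in statistics-2014' \
  'Purging 1766 hits from 50.115.121.196 in statistics-2014' \
  | sum_purged
```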
- Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn't have a proper user agent and the only way to identify them was via DNS:
```
$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
```
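- The same delete query could be repeated for several cores with a small loop; this is a dry-run sketch that only prints the commands, where `baidu_delete_cmds` is a hypothetical helper and the core names passed to it are assumptions that have to match the cores that actually exist:

```shell
#!/usr/bin/env bash
# Print (but do not run) one Solr delete command per core.
# Assumptions: the base URL and query mirror the manual command above;
# the core names are supplied by the caller.
baidu_delete_cmds() {
  local solr_url=$1
  shift
  local query='<delete><query>dns:*crawl.baidu.com.</query></delete>'
  local core
  for core in "$@"; do
    printf 'curl -s "%s/%s/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "%s"\n' \
      "$solr_url" "$core" "$query"
  done
}

# Review the printed commands before running any of them
baidu_delete_cmds http://localhost:8081/solr statistics-2015 statistics-2016
```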
- Jesus, the more I keep looking, the more I see ridiculous stuff...
- In 2019 there were a few hundred thousand requests from CodeObia on the Orange Jordan network...
- 79.173.222.114
- 149.200.141.57
- 86.108.89.91
- And others...
- Also I see a CIAT IP 45.5.186.2 that was making hundreds of thousands of requests (and 100/sec at one point in 2019)
- Also I see some IP on Hetzner making 10,000 requests per month: 2a01:4f8:210:51ef::2
- Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130
- I purged a bunch more from all cores:
```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 109965 hits from 45.5.186.2 in statistics
Purging 78648 hits from 79.173.222.114 in statistics
Purging 49032 hits from 149.200.141.57 in statistics
Purging 26897 hits from 86.108.89.91 in statistics
Purging 80898 hits from 2a01:4f8:210:51ef::2 in statistics
Purging 130831 hits from 143.233.242.130 in statistics
Purging 46489 hits from 83.103.94.48 in statistics
Total number of bot hits purged: 522760
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 41574 hits from 45.5.186.2 in statistics-2018
Purging 39620 hits from 2a01:4f8:210:51ef::2 in statistics-2018
Purging 19325 hits from 83.103.94.48 in statistics-2018
Total number of bot hits purged: 100519
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017
Found 296 hits from 45.5.186.2 in statistics-2017
Found 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Found 16086 hits from 83.103.94.48 in statistics-2017
Total number of hits from bots: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 296 hits from 45.5.186.2 in statistics-2017
Purging 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Purging 16086 hits from 83.103.94.48 in statistics-2017
Total number of bot hits purged: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 394 hits from 2a01:4f8:210:51ef::2 in statistics-2016
Purging 26519 hits from 83.103.94.48 in statistics-2016
Total number of bot hits purged: 26913
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 1 hits from 143.233.242.130 in statistics-2015
Purging 14109 hits from 83.103.94.48 in statistics-2015
Total number of bot hits purged: 14110
```
- Though looking in my REST logs for the last month I am second-guessing my judgement on 45.5.186.2 because I see user agents like "Microsoft Office Word 2014"
- Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:
```
# zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
1 Microsoft Office Word 2014
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
24 GuzzleHttp/6.3.3 curl/7.59.0 PHP/7.0.31
34 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
98 Apache-HttpClient/4.3.4 (java 1.5)
54850 -
```
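- The `awk -F\" '{print $6}'` step works because splitting a log line on double quotes puts the user agent in the sixth field; a minimal sketch, assuming nginx's default `combined` log format and using a made-up sample line (`extract_user_agent` is a hypothetical name):

```shell
#!/usr/bin/env bash
# Extract the user agent from log lines on stdin.
# Assumption: nginx "combined" format, where splitting on double quotes
# gives the request in field 2, the referrer in field 4, and the user
# agent in field 6.
extract_user_agent() {
  awk -F'"' '{ print $6 }'
}

# Made-up sample line for illustration (not taken from the real logs)
sample='192.0.2.1 - - [24/Feb/2020:10:00:00 +0100] "GET /rest/items HTTP/1.1" 200 512 "-" "Apache-HttpClient/4.3.4 (java 1.5)"'
printf '%s\n' "$sample" | extract_user_agent
```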
- I see lots of requests coming from the following user agents:
```
"Apache-HttpClient/4.5.7 (Java/11.0.3)"
"Apache-HttpClient/4.5.7 (Java/11.0.2)"
"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
"EventMachine HttpClient"
```
- I should definitely add HttpClient to the bot user agents...
- Also, while `bot`, `spider`, and `crawl` are already in the pattern list and DSpace matches them case-insensitively in Java, I can't do case-insensitive matching in Solr with `check-spider-hits.sh`
- I need to add `Bot`, `Spider`, and `Crawl` to my local user agent file to purge them
- Also, I see lots of hits from "Indy Library", which we've been blocking for a long time, but somehow these got through (I think it's the Greek guys using Delphi)
- Somehow my regex conversion isn't working in `check-spider-hits.sh`, but `*Indy*` will work for now
- Purging just these case-sensitive patterns removed ~1 million more hits from 2011 to 2020
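- The capitalized variants for a local user agent file could be generated rather than typed by hand; a sketch under the assumption that one pattern per line is the expected file format (`with_capitalized` is a hypothetical helper):

```shell
#!/usr/bin/env bash
# Print each pattern followed by a copy with its first letter upper-cased.
# Assumption: the user agent file takes one pattern per line.
with_capitalized() {
  local pattern
  for pattern in "$@"; do
    printf '%s\n' "$pattern"
    # Upper-case only the first character and leave the rest untouched
    printf '%s\n' "$pattern" | awk '{ print toupper(substr($0, 1, 1)) substr($0, 2) }'
  done
}

with_capitalized bot spider crawl
```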
- More weird user agents in 2019:
```
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
```
- And what are the 950,000 hits from Online.net IPs with the following user agent:
```
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
```
- Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I'm purging them all
<!-- vim: set sw=2 ts=2: -->

@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
<meta property="article:modified_time" content="2020-02-24T14:11:56+02:00" />
<meta property="article:modified_time" content="2020-02-24T19:15:05+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2020"/>
@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
"@type": "BlogPosting",
"headline": "February, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
"wordCount": "4888",
"wordCount": "6024",
"datePublished": "2020-02-02T11:56:30+02:00",
"dateModified": "2020-02-24T14:11:56+02:00",
"dateModified": "2020-02-24T19:15:05+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -898,7 +898,171 @@ $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s sta
Purging 1752548 hits from 104.196.152.243 in statistics-2017
Total number of bot hits purged: 1752548
</code></pre><!-- raw HTML omitted -->
</code></pre><ul>
<li>I looked in the REST API logs for the past month and found a few more IPs:
<ul>
<li>95.110.154.135 (BioversityBot)</li>
<li>34.209.213.122 (IITA? bot)</li>
</ul>
</li>
<li>The client at 3.225.28.105 is using the following user agent:</li>
</ul>
<pre><code>Apache-HttpClient/4.3.4 (java 1.5)
</code></pre><ul>
<li>But I don&rsquo;t see any hits for it in the statistics core for some reason</li>
<li>Looking more into the 2015 statistics I see some questionable IPs:
<ul>
<li>50.115.121.196 has a DNS of saltlakecity2tr.monitis.com</li>
<li>70.32.99.142 has userAgent Drupal</li>
<li>104.130.164.111 was some scraper on Rackspace.com that made ~30,000 requests per month</li>
<li>45.56.65.158 was some scraper on Linode that made ~30,000 requests per month</li>
<li>23.97.198.40 was some scraper with an IP owned by Microsoft that made ~4,000 requests per month and had no user agent</li>
<li>180.76.15.6 and <em>dozens</em> of other IPs with DNS like baiduspider-180-76-15-6.crawl.baidu.com. (and they were using a Mozilla/5.0 user agent!)</li>
</ul>
</li>
<li>I purged the hits from these IPs using <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 11478 hits from 95.110.154.135 in statistics
Purging 1208 hits from 34.209.213.122 in statistics
Purging 10 hits from 54.184.39.242 in statistics
Total number of bot hits purged: 12696
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 12572 hits from 95.110.154.135 in statistics-2018
Purging 233 hits from 34.209.213.122 in statistics-2018
Total number of bot hits purged: 12805
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 37503 hits from 95.110.154.135 in statistics-2017
Purging 25 hits from 34.209.213.122 in statistics-2017
Purging 8621 hits from 23.97.198.40 in statistics-2017
Total number of bot hits purged: 46149
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 1476 hits from 95.110.154.135 in statistics-2016
Purging 10490 hits from 70.32.99.142 in statistics-2016
Purging 29519 hits from 50.115.121.196 in statistics-2016
Purging 175758 hits from 45.56.65.158 in statistics-2016
Purging 26279 hits from 23.97.198.40 in statistics-2016
Total number of bot hits purged: 243522
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 49351 hits from 70.32.99.142 in statistics-2015
Purging 30278 hits from 50.115.121.196 in statistics-2015
Purging 172292 hits from 104.130.164.111 in statistics-2015
Purging 78571 hits from 45.56.65.158 in statistics-2015
Purging 16069 hits from 23.97.198.40 in statistics-2015
Total number of bot hits purged: 346561
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2014 -p
Purging 462 hits from 70.32.99.142 in statistics-2014
Purging 1766 hits from 50.115.121.196 in statistics-2014
Total number of bot hits purged: 2228
</code></pre><ul>
<li>Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn&rsquo;t have a proper user agent and the only way to identify them was via DNS:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:*crawl.baidu.com.&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Jesus, the more I keep looking, the more I see ridiculous stuff&hellip;</li>
<li>In 2019 there were a few hundred thousand requests from CodeObia on the Orange Jordan network&hellip;
<ul>
<li>79.173.222.114</li>
<li>149.200.141.57</li>
<li>86.108.89.91</li>
<li>And others&hellip;</li>
</ul>
</li>
<li>Also I see a CIAT IP 45.5.186.2 that was making hundreds of thousands of requests (and 100/sec at one point in 2019)</li>
<li>Also I see some IP on Hetzner making 10,000 requests per month: 2a01:4f8:210:51ef::2</li>
<li>Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130</li>
<li>I purged a bunch more from all cores:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 109965 hits from 45.5.186.2 in statistics
Purging 78648 hits from 79.173.222.114 in statistics
Purging 49032 hits from 149.200.141.57 in statistics
Purging 26897 hits from 86.108.89.91 in statistics
Purging 80898 hits from 2a01:4f8:210:51ef::2 in statistics
Purging 130831 hits from 143.233.242.130 in statistics
Purging 46489 hits from 83.103.94.48 in statistics
Total number of bot hits purged: 522760
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 41574 hits from 45.5.186.2 in statistics-2018
Purging 39620 hits from 2a01:4f8:210:51ef::2 in statistics-2018
Purging 19325 hits from 83.103.94.48 in statistics-2018
Total number of bot hits purged: 100519
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017
Found 296 hits from 45.5.186.2 in statistics-2017
Found 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Found 16086 hits from 83.103.94.48 in statistics-2017
Total number of hits from bots: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 296 hits from 45.5.186.2 in statistics-2017
Purging 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Purging 16086 hits from 83.103.94.48 in statistics-2017
Total number of bot hits purged: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 394 hits from 2a01:4f8:210:51ef::2 in statistics-2016
Purging 26519 hits from 83.103.94.48 in statistics-2016
Total number of bot hits purged: 26913
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 1 hits from 143.233.242.130 in statistics-2015
Purging 14109 hits from 83.103.94.48 in statistics-2015
Total number of bot hits purged: 14110
</code></pre><ul>
<li>Though looking in my REST logs for the last month I am second-guessing my judgement on 45.5.186.2 because I see user agents like &ldquo;Microsoft Office Word 2014&rdquo;</li>
<li>Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:</li>
</ul>
<pre><code># zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\&quot; '{print $6}' | sort | uniq -c | sort -h
1 Microsoft Office Word 2014
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
24 GuzzleHttp/6.3.3 curl/7.59.0 PHP/7.0.31
34 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
98 Apache-HttpClient/4.3.4 (java 1.5)
54850 -
</code></pre><ul>
<li>I see lots of requests coming from the following user agents:</li>
</ul>
<pre><code>&quot;Apache-HttpClient/4.5.7 (Java/11.0.3)&quot;
&quot;Apache-HttpClient/4.5.7 (Java/11.0.2)&quot;
&quot;LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)&quot;
&quot;EventMachine HttpClient&quot;
</code></pre><ul>
<li>I should definitely add HttpClient to the bot user agents&hellip;</li>
<li>Also, while <code>bot</code>, <code>spider</code>, and <code>crawl</code> are already in the pattern list and DSpace matches them case-insensitively in Java, I can&rsquo;t do case-insensitive matching in Solr with <code>check-spider-hits.sh</code>
<ul>
<li>I need to add <code>Bot</code>, <code>Spider</code>, and <code>Crawl</code> to my local user agent file to purge them</li>
<li>Also, I see lots of hits from &ldquo;Indy Library&rdquo;, which we&rsquo;ve been blocking for a long time, but somehow these got through (I think it&rsquo;s the Greek guys using Delphi)</li>
<li>Somehow my regex conversion isn&rsquo;t working in <code>check-spider-hits.sh</code>, but <code>*Indy*</code> will work for now</li>
<li>Purging just these case-sensitive patterns removed ~1 million more hits from 2011 to 2020</li>
</ul>
</li>
<li>More weird user agents in 2019:</li>
</ul>
<pre><code>ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
</code></pre><ul>
<li>And what are the 950,000 hits from Online.net IPs with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I&rsquo;m purging them all</li>
</ul>
<!-- raw HTML omitted -->

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-02/</loc>
<lastmod>2020-02-24T14:11:56+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>
<url>