mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-29 09:58:22 +01:00
Update notes for 2020-02-24
parent 08adaae29b
commit e0c862d5ea
@@ -769,4 +769,171 @@ Purging 1752548 hits from 104.196.152.243 in statistics-2017
Total number of bot hits purged: 1752548
```

- I looked in the REST API logs for the past month and found a few more IPs:
  - 95.110.154.135 (BioversityBot)
  - 34.209.213.122 (IITA? bot)
- The client at 3.225.28.105 is using the following user agent:

```
Apache-HttpClient/4.3.4 (java 1.5)
```

- But I don't see any hits for it in the statistics core for some reason
- Looking more into the 2015 statistics I see some questionable IPs:
  - 50.115.121.196 has a DNS of saltlakecity2tr.monitis.com
  - 70.32.99.142 has userAgent Drupal
  - 104.130.164.111 was some scraper on Rackspace.com that made ~30,000 requests per month
  - 45.56.65.158 was some scraper on Linode that made ~30,000 requests per month
  - 23.97.198.40 was some scraper with an IP owned by Microsoft that made ~4,000 requests per month and had no user agent
  - 180.76.15.6 and *dozens* of other IPs with DNS like baiduspider-180-76-15-6.crawl.baidu.com. (and they were using a Mozilla/5.0 user agent!)
- I purged the hits from these IPs using `check-spider-ip-hits.sh`:

```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 11478 hits from 95.110.154.135 in statistics
Purging 1208 hits from 34.209.213.122 in statistics
Purging 10 hits from 54.184.39.242 in statistics

Total number of bot hits purged: 12696
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 12572 hits from 95.110.154.135 in statistics-2018
Purging 233 hits from 34.209.213.122 in statistics-2018

Total number of bot hits purged: 12805
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 37503 hits from 95.110.154.135 in statistics-2017
Purging 25 hits from 34.209.213.122 in statistics-2017
Purging 8621 hits from 23.97.198.40 in statistics-2017

Total number of bot hits purged: 46149
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 1476 hits from 95.110.154.135 in statistics-2016
Purging 10490 hits from 70.32.99.142 in statistics-2016
Purging 29519 hits from 50.115.121.196 in statistics-2016
Purging 175758 hits from 45.56.65.158 in statistics-2016
Purging 26279 hits from 23.97.198.40 in statistics-2016

Total number of bot hits purged: 243522
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 49351 hits from 70.32.99.142 in statistics-2015
Purging 30278 hits from 50.115.121.196 in statistics-2015
Purging 172292 hits from 104.130.164.111 in statistics-2015
Purging 78571 hits from 45.56.65.158 in statistics-2015
Purging 16069 hits from 23.97.198.40 in statistics-2015

Total number of bot hits purged: 346561
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2014 -p
Purging 462 hits from 70.32.99.142 in statistics-2014
Purging 1766 hits from 50.115.121.196 in statistics-2014

Total number of bot hits purged: 2228
```
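
As a sanity check on the totals above, the per-IP `Purging N hits` lines can be summed with awk. This is a minimal sketch that assumes the script output is piped in or saved to a file; the `sum_purged` function name is hypothetical and is not part of `check-spider-ip-hits.sh`.

```shell
# Sum the per-IP counts from "Purging N hits from IP in CORE" lines on stdin.
# sum_purged is a hypothetical helper, not part of check-spider-ip-hits.sh.
sum_purged() {
    awk '$1 == "Purging" && $3 == "hits" { total += $2 } END { print total + 0 }'
}

# Example using the statistics-2014 output shown above (prints 2228)
printf '%s\n' \
    'Purging 462 hits from 70.32.99.142 in statistics-2014' \
    'Purging 1766 hits from 50.115.121.196 in statistics-2014' \
    | sum_purged
```

Lines that do not start with `Purging`, such as the script's own `Total number of bot hits purged` summary, are ignored, so the whole output can be fed in unchanged.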

- Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn't have a proper user agent and the only way to identify them was via DNS:

```
$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
```
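
The same delete query can be repeated per core with a small loop. The sketch below only prints the commands by default; the core names in the example call are assumptions (the notes say 2015 to 2019, but which core holds the 2019 hits is not shown here), and `purge_baidu` is a hypothetical name.

```shell
# Print (or run) one delete-by-query request per Solr core.
# The core names passed in are assumptions; check which cores actually exist.
SOLR_URL="${SOLR_URL:-http://localhost:8081/solr}"
DELETE_XML='<delete><query>dns:*crawl.baidu.com.</query></delete>'

purge_baidu() {
    for core in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: only show the command that would be executed
            printf 'curl -s "%s/%s/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "%s"\n' \
                "$SOLR_URL" "$core" "$DELETE_XML"
        else
            curl -s "$SOLR_URL/$core/update?softCommit=true" \
                -H "Content-Type: text/xml" \
                --data-binary "$DELETE_XML"
        fi
    done
}

# Dry run by default; set DRY_RUN=0 to actually send the requests
purge_baidu statistics-2015 statistics-2016 statistics-2017 statistics-2018
```

Defaulting to a dry run makes it harder to delete from the wrong core by accident, since the generated commands can be reviewed before anything is sent.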

- Jesus, the more I keep looking, the more I see ridiculous stuff...
- In 2019 there were a few hundred thousand requests from CodeObia on the Orange Jordan network...
  - 79.173.222.114
  - 149.200.141.57
  - 86.108.89.91
  - And others...
- Also I see a CIAT IP 45.5.186.2 that was making hundreds of thousands of requests (and 100/sec at one point in 2019)
- Also I see some IP on Hetzner making 10,000 requests per month: 2a01:4f8:210:51ef::2
- Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130
- I purged a bunch more from all cores:

```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 109965 hits from 45.5.186.2 in statistics
Purging 78648 hits from 79.173.222.114 in statistics
Purging 49032 hits from 149.200.141.57 in statistics
Purging 26897 hits from 86.108.89.91 in statistics
Purging 80898 hits from 2a01:4f8:210:51ef::2 in statistics
Purging 130831 hits from 143.233.242.130 in statistics
Purging 46489 hits from 83.103.94.48 in statistics

Total number of bot hits purged: 522760
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 41574 hits from 45.5.186.2 in statistics-2018
Purging 39620 hits from 2a01:4f8:210:51ef::2 in statistics-2018
Purging 19325 hits from 83.103.94.48 in statistics-2018

Total number of bot hits purged: 100519
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017
Found 296 hits from 45.5.186.2 in statistics-2017
Found 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Found 16086 hits from 83.103.94.48 in statistics-2017

Total number of hits from bots: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 296 hits from 45.5.186.2 in statistics-2017
Purging 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Purging 16086 hits from 83.103.94.48 in statistics-2017

Total number of bot hits purged: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 394 hits from 2a01:4f8:210:51ef::2 in statistics-2016
Purging 26519 hits from 83.103.94.48 in statistics-2016

Total number of bot hits purged: 26913
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 1 hits from 143.233.242.130 in statistics-2015
Purging 14109 hits from 83.103.94.48 in statistics-2015

Total number of bot hits purged: 14110
```
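
One detail matters when matching these IPs in Solr by hand: the colon is a special character in Solr's query syntax, so an IPv6 address such as 2a01:4f8:210:51ef::2 has to be escaped or quoted. The following is a sketch only, assuming the hits store the address in a field named `ip`; `solr_ip_query` is a hypothetical helper and is not taken from `check-spider-ip-hits.sh`.

```shell
# Build a Solr query string matching one client IP.
# Assumes the address is stored in a field named "ip" (an assumption here);
# solr_ip_query is a hypothetical helper, not part of check-spider-ip-hits.sh.
solr_ip_query() {
    # Backslash-escape colons so IPv6 addresses are not parsed as field:value
    printf 'ip:%s\n' "$(printf '%s' "$1" | sed 's/:/\\:/g')"
}

solr_ip_query 45.5.186.2            # prints ip:45.5.186.2
solr_ip_query 2a01:4f8:210:51ef::2  # prints ip:2a01\:4f8\:210\:51ef\:\:2
```

The resulting string could be used as the `q` parameter of a select request to count hits first, mirroring the run without `-p` above, before wrapping it in a `<delete><query>` body.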

- Though looking in my REST logs for the last month I am second-guessing my judgement on 45.5.186.2 because I see user agents like "Microsoft Office Word 2014"
- Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:

```
# zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
      1 Microsoft Office Word 2014
      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
      1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
      2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
      3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
     24 GuzzleHttp/6.3.3 curl/7.59.0 PHP/7.0.31
     34 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
     98 Apache-HttpClient/4.3.4 (java 1.5)
  54850 -
```
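
The `awk -F\" '{print $6}'` step in the pipeline above works because, when a log line in nginx's "combined" format is split on double quotes, the request is field 2, the referer is field 4, and the user agent is field 6. A small sketch, assuming `rest.log` uses that format; the sample line is made up for illustration and `extract_user_agent` is a hypothetical name.

```shell
# Extract the user agent from a log line in nginx's "combined" format by
# splitting on double quotes: $2 is the request, $4 the referer, $6 the agent.
extract_user_agent() {
    awk -F'"' '{ print $6 }'
}

# Made-up sample line for illustration only (not taken from the real logs)
sample='45.5.186.2 - - [24/Feb/2020:10:00:00 +0100] "GET /rest/items HTTP/1.1" 200 1234 "-" "Apache-HttpClient/4.3.4 (java 1.5)"'
printf '%s\n' "$sample" | extract_user_agent
```

A client that sends no user agent is logged as `-`, which is why the 54850 requests in the output above are grouped under a single dash.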

- I see lots of requests coming from the following user agents:

```
"Apache-HttpClient/4.5.7 (Java/11.0.3)"
"Apache-HttpClient/4.5.7 (Java/11.0.2)"
"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
"EventMachine HttpClient"
```

- I should definitely add HttpClient to the bot user agents...
- Also, while `bot`, `spider`, and `crawl` are in the pattern list already and can be used for case-insensitive matching when used by DSpace in Java, I can't do case-insensitive matching in Solr with `check-spider-hits.sh`
  - I need to add `Bot`, `Spider`, and `Crawl` to my local user agent file to purge them
  - Also, I see lots of hits from "Indy Library", which we've been blocking for a long time, but somehow these got through (I think it's the Greek guys using Delphi)
  - Somehow my regex conversion isn't working in `check-spider-hits.sh`, but `*Indy*` will work for now
  - Purging just these case-sensitive patterns removed ~1 million more hits from 2011 to 2020
- More weird user agents in 2019:

```
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
```
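
Returning to the case-sensitivity problem noted above: one way to avoid adding `Bot`, `Spider`, and `Crawl` by hand is to generate the capitalized variant of each pattern. This is a sketch under the assumption that the local user agent file holds one plain pattern per line; `add_case_variants` is a hypothetical helper, not part of `check-spider-hits.sh`.

```shell
# Print each pattern followed by its capitalized variant (bot -> Bot).
# Patterns that already start with an uppercase letter are printed once.
# add_case_variants is a hypothetical helper, not part of check-spider-hits.sh.
add_case_variants() {
    awk 'NF {
        print
        capitalized = toupper(substr($0, 1, 1)) substr($0, 2)
        if (capitalized != $0) print capitalized
    }'
}

printf '%s\n' bot spider crawl | add_case_variants
```

This only covers the two spellings mentioned in the notes; fully uppercase or mixed-case agents would still need their own patterns.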

- And what are the 950,000 hits from Online.net IPs with the following user agent:

```
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
```

- Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I'm purging them all

<!-- vim: set sw=2 ts=2: -->
@@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
<meta property="article:modified_time" content="2020-02-24T14:11:56+02:00" />
<meta property="article:modified_time" content="2020-02-24T19:15:05+02:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2020"/>
@@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
"@type": "BlogPosting",
"headline": "February, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
"wordCount": "4888",
"wordCount": "6024",
"datePublished": "2020-02-02T11:56:30+02:00",
"dateModified": "2020-02-24T14:11:56+02:00",
"dateModified": "2020-02-24T19:15:05+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -898,7 +898,171 @@ $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s sta
Purging 1752548 hits from 104.196.152.243 in statistics-2017

Total number of bot hits purged: 1752548
</code></pre><!-- raw HTML omitted -->
</code></pre><ul>
<li>I looked in the REST API logs for the past month and found a few more IPs:
<ul>
<li>95.110.154.135 (BioversityBot)</li>
<li>34.209.213.122 (IITA? bot)</li>
</ul>
</li>
<li>The client at 3.225.28.105 is using the following user agent:</li>
</ul>
<pre><code>Apache-HttpClient/4.3.4 (java 1.5)
</code></pre><ul>
<li>But I don’t see any hits for it in the statistics core for some reason</li>
<li>Looking more into the 2015 statistics I see some questionable IPs:
<ul>
<li>50.115.121.196 has a DNS of saltlakecity2tr.monitis.com</li>
<li>70.32.99.142 has userAgent Drupal</li>
<li>104.130.164.111 was some scraper on Rackspace.com that made ~30,000 requests per month</li>
<li>45.56.65.158 was some scraper on Linode that made ~30,000 requests per month</li>
<li>23.97.198.40 was some scraper with an IP owned by Microsoft that made ~4,000 requests per month and had no user agent</li>
<li>180.76.15.6 and <em>dozens</em> of other IPs with DNS like baiduspider-180-76-15-6.crawl.baidu.com. (and they were using a Mozilla/5.0 user agent!)</li>
</ul>
</li>
<li>I purged the hits from these IPs using <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 11478 hits from 95.110.154.135 in statistics
Purging 1208 hits from 34.209.213.122 in statistics
Purging 10 hits from 54.184.39.242 in statistics

Total number of bot hits purged: 12696
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 12572 hits from 95.110.154.135 in statistics-2018
Purging 233 hits from 34.209.213.122 in statistics-2018

Total number of bot hits purged: 12805
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 37503 hits from 95.110.154.135 in statistics-2017
Purging 25 hits from 34.209.213.122 in statistics-2017
Purging 8621 hits from 23.97.198.40 in statistics-2017

Total number of bot hits purged: 46149
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 1476 hits from 95.110.154.135 in statistics-2016
Purging 10490 hits from 70.32.99.142 in statistics-2016
Purging 29519 hits from 50.115.121.196 in statistics-2016
Purging 175758 hits from 45.56.65.158 in statistics-2016
Purging 26279 hits from 23.97.198.40 in statistics-2016

Total number of bot hits purged: 243522
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 49351 hits from 70.32.99.142 in statistics-2015
Purging 30278 hits from 50.115.121.196 in statistics-2015
Purging 172292 hits from 104.130.164.111 in statistics-2015
Purging 78571 hits from 45.56.65.158 in statistics-2015
Purging 16069 hits from 23.97.198.40 in statistics-2015

Total number of bot hits purged: 346561
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2014 -p
Purging 462 hits from 70.32.99.142 in statistics-2014
Purging 1766 hits from 50.115.121.196 in statistics-2014

Total number of bot hits purged: 2228
</code></pre><ul>
<li>Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn’t have a proper user agent and the only way to identify them was via DNS:</li>
</ul>
<pre><code>$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "&lt;delete&gt;&lt;query&gt;dns:*crawl.baidu.com.&lt;/query&gt;&lt;/delete&gt;"
</code></pre><ul>
<li>Jesus, the more I keep looking, the more I see ridiculous stuff…</li>
<li>In 2019 there were a few hundred thousand requests from CodeObia on the Orange Jordan network…
<ul>
<li>79.173.222.114</li>
<li>149.200.141.57</li>
<li>86.108.89.91</li>
<li>And others…</li>
</ul>
</li>
<li>Also I see a CIAT IP 45.5.186.2 that was making hundreds of thousands of requests (and 100/sec at one point in 2019)</li>
<li>Also I see some IP on Hetzner making 10,000 requests per month: 2a01:4f8:210:51ef::2</li>
<li>Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130</li>
<li>I purged a bunch more from all cores:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 109965 hits from 45.5.186.2 in statistics
Purging 78648 hits from 79.173.222.114 in statistics
Purging 49032 hits from 149.200.141.57 in statistics
Purging 26897 hits from 86.108.89.91 in statistics
Purging 80898 hits from 2a01:4f8:210:51ef::2 in statistics
Purging 130831 hits from 143.233.242.130 in statistics
Purging 46489 hits from 83.103.94.48 in statistics

Total number of bot hits purged: 522760
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 41574 hits from 45.5.186.2 in statistics-2018
Purging 39620 hits from 2a01:4f8:210:51ef::2 in statistics-2018
Purging 19325 hits from 83.103.94.48 in statistics-2018

Total number of bot hits purged: 100519
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017
Found 296 hits from 45.5.186.2 in statistics-2017
Found 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Found 16086 hits from 83.103.94.48 in statistics-2017

Total number of hits from bots: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 296 hits from 45.5.186.2 in statistics-2017
Purging 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Purging 16086 hits from 83.103.94.48 in statistics-2017

Total number of bot hits purged: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 394 hits from 2a01:4f8:210:51ef::2 in statistics-2016
Purging 26519 hits from 83.103.94.48 in statistics-2016

Total number of bot hits purged: 26913
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 1 hits from 143.233.242.130 in statistics-2015
Purging 14109 hits from 83.103.94.48 in statistics-2015

Total number of bot hits purged: 14110
</code></pre><ul>
<li>Though looking in my REST logs for the last month I am second-guessing my judgement on 45.5.186.2 because I see user agents like “Microsoft Office Word 2014”</li>
<li>Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:</li>
</ul>
<pre><code># zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
      1 Microsoft Office Word 2014
      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
      1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
      2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
      3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
     24 GuzzleHttp/6.3.3 curl/7.59.0 PHP/7.0.31
     34 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
     98 Apache-HttpClient/4.3.4 (java 1.5)
  54850 -
</code></pre><ul>
<li>I see lots of requests coming from the following user agents:</li>
</ul>
<pre><code>"Apache-HttpClient/4.5.7 (Java/11.0.3)"
"Apache-HttpClient/4.5.7 (Java/11.0.2)"
"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
"EventMachine HttpClient"
</code></pre><ul>
<li>I should definitely add HttpClient to the bot user agents…</li>
<li>Also, while <code>bot</code>, <code>spider</code>, and <code>crawl</code> are in the pattern list already and can be used for case-insensitive matching when used by DSpace in Java, I can’t do case-insensitive matching in Solr with <code>check-spider-hits.sh</code>
<ul>
<li>I need to add <code>Bot</code>, <code>Spider</code>, and <code>Crawl</code> to my local user agent file to purge them</li>
<li>Also, I see lots of hits from “Indy Library”, which we’ve been blocking for a long time, but somehow these got through (I think it’s the Greek guys using Delphi)</li>
<li>Somehow my regex conversion isn’t working in <code>check-spider-hits.sh</code>, but <code>*Indy*</code> will work for now</li>
<li>Purging just these case-sensitive patterns removed ~1 million more hits from 2011 to 2020</li>
</ul>
</li>
<li>More weird user agents in 2019:</li>
</ul>
<pre><code>ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
</code></pre><ul>
<li>And what are the 950,000 hits from Online.net IPs with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I’m purging them all</li>
</ul>
<!-- raw HTML omitted -->

@@ -4,27 +4,27 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-02/</loc>
<lastmod>2020-02-24T14:11:56+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
<lastmod>2020-02-24T19:15:05+02:00</lastmod>
</url>

<url>