From e0c862d5ead52b820a0a7f7c80b6962abdc752f3 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Tue, 25 Feb 2020 09:14:30 +0200
Subject: [PATCH] Update notes for 2020-02-24

---
 content/posts/2020-02.md | 167 +++++++++++++++++++++++++++++++++++++
 docs/2020-02/index.html  | 172 ++++++++++++++++++++++++++++++++++++++-
 docs/sitemap.xml         |  10 +--
 3 files changed, 340 insertions(+), 9 deletions(-)

diff --git a/content/posts/2020-02.md b/content/posts/2020-02.md
index 8b446eb43..af46bd4a8 100644
--- a/content/posts/2020-02.md
+++ b/content/posts/2020-02.md
@@ -769,4 +769,171 @@ Purging 1752548 hits from 104.196.152.243 in statistics-2017
 Total number of bot hits purged: 1752548
 ```
 
+- I looked in the REST API logs for the past month and found a few more IPs:
+  - 95.110.154.135 (BioversityBot)
+  - 34.209.213.122 (IITA? bot)
+- The client at 3.225.28.105 is using the following user agent:
+
+```
+Apache-HttpClient/4.3.4 (java 1.5)
+```
+
+- But I don't see any hits for it in the statistics core for some reason
+- Looking more into the 2015 statistics I see some questionable IPs:
+  - 50.115.121.196 has a DNS of saltlakecity2tr.monitis.com
+  - 70.32.99.142 has userAgent Drupal
+  - 104.130.164.111 was some scraper on Rackspace.com that made ~30,000 requests per month
+  - 45.56.65.158 was some scraper on Linode that made ~30,000 requests per month
+  - 23.97.198.40 was some scraper with an IP owned by Microsoft that made ~4,000 requests per month and had no user agent
+  - 180.76.15.6 and *dozens* of other IPs with DNS like baiduspider-180-76-15-6.crawl.baidu.com. (and they were using a Mozilla/5.0 user agent!)
+- For the IPs I purged them using `check-spider-ip-hits.sh`:
+
+```
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
+Purging 11478 hits from 95.110.154.135 in statistics
+Purging 1208 hits from 34.209.213.122 in statistics
+Purging 10 hits from 54.184.39.242 in statistics
+
+Total number of bot hits purged: 12696
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
+Purging 12572 hits from 95.110.154.135 in statistics-2018
+Purging 233 hits from 34.209.213.122 in statistics-2018
+
+Total number of bot hits purged: 12805
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
+Purging 37503 hits from 95.110.154.135 in statistics-2017
+Purging 25 hits from 34.209.213.122 in statistics-2017
+Purging 8621 hits from 23.97.198.40 in statistics-2017
+
+Total number of bot hits purged: 46149
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
+Purging 1476 hits from 95.110.154.135 in statistics-2016
+Purging 10490 hits from 70.32.99.142 in statistics-2016
+Purging 29519 hits from 50.115.121.196 in statistics-2016
+Purging 175758 hits from 45.56.65.158 in statistics-2016
+Purging 26279 hits from 23.97.198.40 in statistics-2016
+
+Total number of bot hits purged: 243522
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
+Purging 49351 hits from 70.32.99.142 in statistics-2015
+Purging 30278 hits from 50.115.121.196 in statistics-2015
+Purging 172292 hits from 104.130.164.111 in statistics-2015
+Purging 78571 hits from 45.56.65.158 in statistics-2015
+Purging 16069 hits from 23.97.198.40 in statistics-2015
+
+Total number of bot hits purged: 346561
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2014 -p
+Purging 462 hits from 70.32.99.142 in statistics-2014
+Purging 1766 hits from 50.115.121.196 in statistics-2014
+
+Total number of bot hits purged: 2228
+```
+
+- Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn't have a proper user agent and the only way to identify them was via DNS:
+
+```
+$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
+```
+
+- Jesus, the more I keep looking, the more I see ridiculous stuff...
+- In 2019 there were a few hundred thousand requests from CodeObia on Orange Jordan network...
+  - 79.173.222.114
+  - 149.200.141.57
+  - 86.108.89.91
+  - And others...
+- Also I see a CIAT IP 45.5.186.2 that was making hundreds of thousands of requests (and 100/sec at one point in 2019)
+- Also I see some IP on Hetzner making 10,000 requests per month: 2a01:4f8:210:51ef::2
+- Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130
+- I purged a bunch more from all cores:
+
+```
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
+Purging 109965 hits from 45.5.186.2 in statistics
+Purging 78648 hits from 79.173.222.114 in statistics
+Purging 49032 hits from 149.200.141.57 in statistics
+Purging 26897 hits from 86.108.89.91 in statistics
+Purging 80898 hits from 2a01:4f8:210:51ef::2 in statistics
+Purging 130831 hits from 143.233.242.130 in statistics
+Purging 46489 hits from 83.103.94.48 in statistics
+
+Total number of bot hits purged: 522760
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
+Purging 41574 hits from 45.5.186.2 in statistics-2018
+Purging 39620 hits from 2a01:4f8:210:51ef::2 in statistics-2018
+Purging 19325 hits from 83.103.94.48 in statistics-2018
+
+Total number of bot hits purged: 100519
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017
+Found 296 hits from 45.5.186.2 in statistics-2017
+Found 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
+Found 16086 hits from 83.103.94.48 in statistics-2017
+
+Total number of hits from bots: 16772
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
+Purging 296 hits from 45.5.186.2 in statistics-2017
+Purging 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
+Purging 16086 hits from 83.103.94.48 in statistics-2017
+
+Total number of bot hits purged: 16772
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
+Purging 394 hits from 2a01:4f8:210:51ef::2 in statistics-2016
+Purging 26519 hits from 83.103.94.48 in statistics-2016
+
+Total number of bot hits purged: 26913
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
+Purging 1 hits from 143.233.242.130 in statistics-2015
+Purging 14109 hits from 83.103.94.48 in statistics-2015
+
+Total number of bot hits purged: 14110
+```
+
+- Though looking in my REST logs for the last month I am second guessing my judgement on 45.5.186.2 because I see user agents like "Microsoft Office Word 2014"
+- Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:
+
+```
+# zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
+      1 Microsoft Office Word 2014
+      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
+      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
+      1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
+      2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
+      3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
+     24 GuzzleHttp/6.3.3 curl/7.59.0 PHP/7.0.31
+     34 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
+     98 Apache-HttpClient/4.3.4 (java 1.5)
+  54850 -
+```
+
+- I see lots of requests coming from the following user agents:
+
+```
+"Apache-HttpClient/4.5.7 (Java/11.0.3)"
+"Apache-HttpClient/4.5.7 (Java/11.0.2)"
+"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
+"EventMachine HttpClient"
+```
+
+- I should definitely add HttpClient to the bot user agents...
+- Also, while `bot`, `spider`, and `crawl` are in the pattern list already and can be used for case-insensitive matching when used by DSpace in Java, I can't do case-insensitive matching in Solr with `check-spider-hits.sh`
+  - I need to add `Bot`, `Spider`, and `Crawl` to my local user agent file to purge them
+  - Also, I see lots of hits from "Indy Library", which we've been blocking for a long time, but somehow these got through (I think it's the Greek guys using Delphi)
+  - Somehow my regex conversion isn't working in check-spider-hits.sh, but "*Indy*" will work for now
+  - Purging just these case-sensitive patterns removed ~1 million more hits from 2011 to 2020
+- More weird user agents in 2019:
+
+```
+ecolink (+https://search.ecointernet.org/)
+ecoweb (+https://search.ecointernet.org/)
+EcoInternet http://www.ecointernet.org/
+EcoInternet http://ecointernet.org/
+```
+
+- And what's the 950,000 hits from Online.net IPs with the following user agent:
+
+```
+Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
+```
+
+- Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I'm purging them all
+
diff --git a/docs/2020-02/index.html b/docs/2020-02/index.html
index 9064a6f1c..063de7803 100644
--- a/docs/2020-02/index.html
+++ b/docs/2020-02/index.html
@@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
-
+
@@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
   "@type": "BlogPosting",
   "headline": "February, 2020",
   "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
-  "wordCount": "4888",
+  "wordCount": "6024",
   "datePublished": "2020-02-02T11:56:30+02:00",
-  "dateModified": "2020-02-24T14:11:56+02:00",
+  "dateModified": "2020-02-24T19:15:05+02:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -898,7 +898,171 @@ $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s sta
 Purging 1752548 hits from 104.196.152.243 in statistics-2017
 
 Total number of bot hits purged: 1752548
-
+
Apache-HttpClient/4.3.4 (java 1.5)
+
+
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
+Purging 11478 hits from 95.110.154.135 in statistics
+Purging 1208 hits from 34.209.213.122 in statistics
+Purging 10 hits from 54.184.39.242 in statistics
+
+Total number of bot hits purged: 12696
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
+Purging 12572 hits from 95.110.154.135 in statistics-2018
+Purging 233 hits from 34.209.213.122 in statistics-2018
+
+Total number of bot hits purged: 12805
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
+Purging 37503 hits from 95.110.154.135 in statistics-2017
+Purging 25 hits from 34.209.213.122 in statistics-2017
+Purging 8621 hits from 23.97.198.40 in statistics-2017
+
+Total number of bot hits purged: 46149
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
+Purging 1476 hits from 95.110.154.135 in statistics-2016
+Purging 10490 hits from 70.32.99.142 in statistics-2016
+Purging 29519 hits from 50.115.121.196 in statistics-2016
+Purging 175758 hits from 45.56.65.158 in statistics-2016
+Purging 26279 hits from 23.97.198.40 in statistics-2016
+
+Total number of bot hits purged: 243522
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
+Purging 49351 hits from 70.32.99.142 in statistics-2015
+Purging 30278 hits from 50.115.121.196 in statistics-2015
+Purging 172292 hits from 104.130.164.111 in statistics-2015
+Purging 78571 hits from 45.56.65.158 in statistics-2015
+Purging 16069 hits from 23.97.198.40 in statistics-2015
+
+Total number of bot hits purged: 346561
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2014 -p 
+Purging 462 hits from 70.32.99.142 in statistics-2014
+Purging 1766 hits from 50.115.121.196 in statistics-2014
+
+Total number of bot hits purged: 2228
+
+
$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
+
+
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p     
+Purging 109965 hits from 45.5.186.2 in statistics
+Purging 78648 hits from 79.173.222.114 in statistics
+Purging 49032 hits from 149.200.141.57 in statistics
+Purging 26897 hits from 86.108.89.91 in statistics
+Purging 80898 hits from 2a01:4f8:210:51ef::2 in statistics
+Purging 130831 hits from 143.233.242.130 in statistics
+Purging 46489 hits from 83.103.94.48 in statistics
+
+Total number of bot hits purged: 522760
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
+Purging 41574 hits from 45.5.186.2 in statistics-2018
+Purging 39620 hits from 2a01:4f8:210:51ef::2 in statistics-2018
+Purging 19325 hits from 83.103.94.48 in statistics-2018
+
+Total number of bot hits purged: 100519
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017   
+Found 296 hits from 45.5.186.2 in statistics-2017
+Found 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
+Found 16086 hits from 83.103.94.48 in statistics-2017
+
+Total number of hits from bots: 16772
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
+Purging 296 hits from 45.5.186.2 in statistics-2017
+Purging 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
+Purging 16086 hits from 83.103.94.48 in statistics-2017
+
+Total number of bot hits purged: 16772
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
+Purging 394 hits from 2a01:4f8:210:51ef::2 in statistics-2016
+Purging 26519 hits from 83.103.94.48 in statistics-2016
+
+Total number of bot hits purged: 26913
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
+Purging 1 hits from 143.233.242.130 in statistics-2015
+Purging 14109 hits from 83.103.94.48 in statistics-2015
+
+Total number of bot hits purged: 14110
+
+
# zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
+      1 Microsoft Office Word 2014
+      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
+      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
+      1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
+      2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
+      3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
+     24 GuzzleHttp/6.3.3 curl/7.59.0 PHP/7.0.31
+     34 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
+     98 Apache-HttpClient/4.3.4 (java 1.5)
+  54850 -
+
+
"Apache-HttpClient/4.5.7 (Java/11.0.3)"
+"Apache-HttpClient/4.5.7 (Java/11.0.2)"
+"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
+"EventMachine HttpClient"
+
+
ecolink (+https://search.ecointernet.org/)
+ecoweb (+https://search.ecointernet.org/)
+EcoInternet http://www.ecointernet.org/
+EcoInternet http://ecointernet.org/
+
+
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
+
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index a09ebc430..5a2a1ec17 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/categories/
- 2020-02-24T18:07:35+02:00
+ 2020-02-24T19:15:05+02:00
 https://alanorth.github.io/cgspace-notes/
- 2020-02-24T18:07:35+02:00
+ 2020-02-24T19:15:05+02:00
 https://alanorth.github.io/cgspace-notes/2020-02/
- 2020-02-24T14:11:56+02:00
+ 2020-02-24T19:15:05+02:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-02-24T18:07:35+02:00
+ 2020-02-24T19:15:05+02:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2020-02-24T18:07:35+02:00
+ 2020-02-24T19:15:05+02:00