diff --git a/content/posts/2020-02.md b/content/posts/2020-02.md
index 2b2162ad9..8b446eb43 100644
--- a/content/posts/2020-02.md
+++ b/content/posts/2020-02.md
@@ -717,6 +717,56 @@ Total number of bot hits purged: 5535399
 
 ![CGSpace stats by year since 2011 after the purge](/cgspace-notes/2020/02/cgspace-stats-years.png)
 
-- I'm a little suspicious of the 2012, 2013, and 2014 numbers though, I should facet those years by IP and see if there were any large ones
+- I'm a little suspicious of the 2012, 2013, and 2014 numbers, though
+  - I should facet those years by IP and see if any stand out...
+- The next thing I need to do is figure out why the nginx IP to bot mapping isn't working...
+  - Actually, and I've probably learned this before, but the bot mapping is working, but nginx only logs the real user agent (of course!), as I'm only using the mapped one in the proxy pass...
+  - This trick for adding a header with the mapped "ua" variable is nice:
+
+```
+add_header X-debug-message "ua is $ua" always;
+```
+
+- Then in the HTTP response you see:
+
+```
+X-debug-message: ua is bot
+```
+
+- So the IP to bot mapping is working, phew.
+- More bad news, I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
+  - For example, ciat.cgiar.org was previously 104.196.152.243, but it is now 35.237.175.180, which I had noticed as a "mystery" client on Google Cloud in 2018-09
+  - Others I should probably add to the nginx bot map list are:
+    - wle.cgiar.org (70.32.90.172)
+    - ccafs.cgiar.org (205.186.128.185)
+    - another CIAT scraper using the PHP GuzzleHttp library (45.5.184.72)
+    - macaronilab.com ([63.32.242.35](https://viewdns.info/reverseip/?host=63.32.242.35&t=1))
+    - africa-rising.net ([162.243.171.159](https://viewdns.info/reverseip/?host=162.243.171.159&t=1))
+- These IPs are all active in the REST API logs over the last few months and they account for *thirty-four million* more hits in the statistics!
+- I purged them from CGSpace:
+
+```
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
+Purging 15 hits from 104.196.152.243 in statistics
+Purging 61064 hits from 35.237.175.180 in statistics
+Purging 1378 hits from 70.32.90.172 in statistics
+Purging 28880 hits from 205.186.128.185 in statistics
+Purging 464613 hits from 63.32.242.35 in statistics
+Purging 131 hits from 162.243.171.159 in statistics
+
+Total number of bot hits purged: 556081
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
+Purging 684888 hits from 104.196.152.243 in statistics-2018
+Purging 323737 hits from 35.227.26.162 in statistics-2018
+Purging 221091 hits from 35.237.175.180 in statistics-2018
+Purging 3834 hits from 205.186.128.185 in statistics-2018
+Purging 20337 hits from 63.32.242.35 in statistics-2018
+
+Total number of bot hits purged: 1253887
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
+Purging 1752548 hits from 104.196.152.243 in statistics-2017
+
+Total number of bot hits purged: 1752548
+```
diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html
index c8d9f2bef..a657c630d 100644
--- a/docs/2018-10/index.html
+++ b/docs/2018-10/index.html
@@ -14,7 +14,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
-
+
@@ -35,7 +35,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
   "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2018-10\/",
   "wordCount": "4518",
   "datePublished": "2018-10-01T22:31:54+03:00",
-  "dateModified": "2019-10-28T13:39:25+02:00",
+  "dateModified": "2020-02-23T20:10:47+02:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
diff --git a/docs/2019-05/index.html b/docs/2019-05/index.html
index bd260ecb7..24613748c 100644
--- a/docs/2019-05/index.html
+++ b/docs/2019-05/index.html
@@ -25,7 +25,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
-
+
@@ -57,7 +57,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
   "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-05\/",
   "wordCount": "3190",
   "datePublished": "2019-05-01T07:37:43+03:00",
-  "dateModified": "2019-10-28T13:39:25+02:00",
+  "dateModified": "2020-02-24T18:07:35+02:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -349,7 +349,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
   3 2a01:7e00::f03c:91ff:fe0a:d645
 113 63.32.242.35
-
+
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
+  "response":{"numFound":12,"start":0,"docs":[
+$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
+$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
+$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
+  "response":{"numFound":62,"start":0,"docs":[
+
+
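The before-and-after check above comes down to asking Solr for `numFound` with a time, type, and IP filter. A minimal Python sketch of the same query, assuming the local Solr URL and `statistics` core from the curl commands; the helper names `build_select_url` and `num_found` are made up for illustration, it asks for `rows=0` because only the count matters, and the sample body is a trimmed stand-in for a real response:

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url: str, core: str, query: str, ip: str) -> str:
    """Build a Solr select URL that only counts matches (rows=0)."""
    params = {
        "q": query,
        "fq": f"ip:{ip}",
        "rows": 0,
        "wt": "json",
    }
    # Keep ':' and '*' literal so the URL reads like the curl commands
    return f"{base_url}/{core}/select?" + urlencode(params, safe=":*")


def num_found(response_body: str) -> int:
    """Pull the hit count out of a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


url = build_select_url(
    "http://localhost:8081/solr",
    "statistics",
    "time:2020-02-23* AND statistics_type:view",
    "78.128.99.24",
)
# Trimmed stand-in for what Solr returned before the REST API request
sample_body = '{"response": {"numFound": 12, "start": 0, "docs": []}}'
print(url)
print(num_found(sample_body))
```

Running the count once before and once after the REST request (with a soft commit in between) gives the same 12 versus 62 comparison without grepping the raw JSON.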

2020-02-24

+ +
$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
+  "response":{"numFound":42395486,"start":0,"docs":[]
+
+
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
+Purging 22809216 hits from 34.218.226.147 in statistics
+Purging 19586270 hits from 172.104.229.92 in statistics
+Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
+Purging 271668 hits from 2a01:7e00::f03c:91ff:fe18:7396 in statistics
+
+Total number of bot hits purged: 42778291
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics-2018 -p
+Purging 5535399 hits from 34.218.226.147 in statistics-2018
+
+Total number of bot hits purged: 5535399
+
+
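With this many hits being deleted it is worth checking that the per-IP lines add up to the reported total. A small Python sketch that parses output in the format printed above; it assumes the two line formats shown in these notes ("Purging N hits from IP in CORE" and "Total number of bot hits purged: N") and nothing else about `check-spider-ip-hits.sh`, and the function name `check_purge_output` is hypothetical:

```python
import re
from typing import Dict, Tuple

PURGE_LINE = re.compile(r"^Purging (\d+) hits from (\S+) in (\S+)$")
TOTAL_LINE = re.compile(r"^Total number of bot hits purged: (\d+)$")


def check_purge_output(output: str) -> Tuple[Dict[str, int], int]:
    """Return per-IP hit counts and the reported total, verifying they agree."""
    hits: Dict[str, int] = {}
    reported = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        purge = PURGE_LINE.match(line)
        total = TOTAL_LINE.match(line)
        if purge:
            count, ip = int(purge.group(1)), purge.group(2)
            hits[ip] = hits.get(ip, 0) + count
        elif total:
            reported = int(total.group(1))
    if reported is None:
        raise ValueError("no total line found in output")
    if sum(hits.values()) != reported:
        raise ValueError("per-IP counts do not match the reported total")
    return hits, reported


# Output copied from the purge of the statistics core above
sample_output = """\
Purging 22809216 hits from 34.218.226.147 in statistics
Purging 19586270 hits from 172.104.229.92 in statistics
Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
Purging 271668 hits from 2a01:7e00::f03c:91ff:fe18:7396 in statistics

Total number of bot hits purged: 42778291
"""

hits, reported = check_purge_output(sample_output)
print(len(hits), reported)
```

This handles one core's output at a time; for several runs in a row, split on the `$ ./check-spider-ip-hits.sh` prompt lines first.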

CGSpace stats for 2019 and 2020 before the purge

+

CGSpace stats for 2019 and 2020 after the purge

+ +

CGSpace stats by year since 2011 after the purge

+ +
add_header X-debug-message "ua is $ua" always;
+
+
X-debug-message: ua is bot
+
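The debug header makes the mapped value easy to check from a script as well as by eye. A minimal Python sketch, assuming the header name and the "ua is ..." format from the `add_header` line above; the function name `mapped_ua` is hypothetical, and the header dictionaries here are hand-written stand-ins for a real response's headers:

```python
import re
from typing import Mapping, Optional

DEBUG_HEADER = "X-debug-message"


def mapped_ua(headers: Mapping[str, str]) -> Optional[str]:
    """Return the value nginx mapped into $ua, or None if the header is absent."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive
        if name.lower() == DEBUG_HEADER.lower():
            match = re.fullmatch(r"ua is (.*)", value.strip())
            return match.group(1) if match else None
    return None


print(mapped_ua({"Server": "nginx", "X-debug-message": "ua is bot"}))
print(mapped_ua({"Server": "nginx"}))
```

Any HTTP client that exposes response headers as a mapping can feed this; a result of `bot` for a request from a listed IP confirms the map is being applied even though the access log still shows the real user agent.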
+
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
+Purging 15 hits from 104.196.152.243 in statistics
+Purging 61064 hits from 35.237.175.180 in statistics
+Purging 1378 hits from 70.32.90.172 in statistics
+Purging 28880 hits from 205.186.128.185 in statistics
+Purging 464613 hits from 63.32.242.35 in statistics
+Purging 131 hits from 162.243.171.159 in statistics
+
+Total number of bot hits purged: 556081
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
+Purging 684888 hits from 104.196.152.243 in statistics-2018
+Purging 323737 hits from 35.227.26.162 in statistics-2018
+Purging 221091 hits from 35.237.175.180 in statistics-2018
+Purging 3834 hits from 205.186.128.185 in statistics-2018
+Purging 20337 hits from 63.32.242.35 in statistics-2018
+
+Total number of bot hits purged: 1253887
+$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
+Purging 1752548 hits from 104.196.152.243 in statistics-2017
+
+Total number of bot hits purged: 1752548
+
diff --git a/docs/2020/02/cgspace-stats-after.png b/docs/2020/02/cgspace-stats-after.png
new file mode 100644
index 000000000..850527067
Binary files /dev/null and b/docs/2020/02/cgspace-stats-after.png differ
diff --git a/docs/2020/02/cgspace-stats-before.png b/docs/2020/02/cgspace-stats-before.png
new file mode 100644
index 000000000..dc0caac85
Binary files /dev/null and b/docs/2020/02/cgspace-stats-before.png differ
diff --git a/docs/2020/02/cgspace-stats-years.png b/docs/2020/02/cgspace-stats-years.png
new file mode 100644
index 000000000..3216d4a4e
Binary files /dev/null and b/docs/2020/02/cgspace-stats-years.png differ
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 773560b7f..a09ebc430 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/categories/
-2020-02-23T09:16:50+02:00
+2020-02-24T18:07:35+02:00
 https://alanorth.github.io/cgspace-notes/
-2020-02-23T09:16:50+02:00
+2020-02-24T18:07:35+02:00
 https://alanorth.github.io/cgspace-notes/2020-02/
-2020-02-23T09:16:50+02:00
+2020-02-24T14:11:56+02:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2020-02-23T09:16:50+02:00
+2020-02-24T18:07:35+02:00
 https://alanorth.github.io/cgspace-notes/posts/
-2020-02-23T09:16:50+02:00
+2020-02-24T18:07:35+02:00
@@ -84,7 +84,7 @@
 https://alanorth.github.io/cgspace-notes/2019-05/
-2019-10-28T13:39:25+02:00
+2020-02-24T18:07:35+02:00
@@ -119,7 +119,7 @@
 https://alanorth.github.io/cgspace-notes/2018-10/
-2019-10-28T13:39:25+02:00
+2020-02-23T20:10:47+02:00