Update notes for 2020-02-24

2020-02-24 19:15:05 +02:00
parent 8e066fa46c
commit 08adaae29b
8 changed files with 176 additions and 17 deletions

@@ -717,6 +717,56 @@ Total number of bot hits purged: 5535399
![CGSpace stats by year since 2011 after the purge](/cgspace-notes/2020/02/cgspace-stats-years.png)
- I'm a little suspicious of the 2012, 2013, and 2014 numbers, though
- I should facet those years by IP and see if any stand out...
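- A minimal sketch of how that faceting could look, assuming the yearly cores follow the `statistics-YYYY` naming used in the purge commands in these notes and that the client address is stored in a Solr field named `ip` (both are assumptions; `facet_by_ip_url` is a hypothetical helper):
```shell
# Sketch: build a Solr facet query listing the busiest IPs in one yearly core.
# Assumptions: cores are named statistics-YYYY and the client address lives
# in a field called "ip"; adjust both to match the real schema.
solr_base="http://localhost:8081/solr"

facet_by_ip_url() {
  year="$1"          # e.g. 2012
  limit="${2:-10}"   # how many top IPs to return (default 10)
  printf '%s/statistics-%s/select?q=*:*&rows=0&wt=json&facet=true&facet.field=ip&facet.limit=%s&facet.mincount=1\n' \
    "$solr_base" "$year" "$limit"
}

# Example use against a live Solr (not run here):
#   curl -s "$(facet_by_ip_url 2012 20)"
```
- Setting `rows=0` keeps the response down to the facet counts, so any unusually large IPs should be visible at the top of the list.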
- The next thing I need to do is figure out why the nginx IP to bot mapping isn't working...
- Actually, and I've probably learned this before, the bot mapping is working: nginx only logs the real user agent (of course!), because I only use the mapped one in the proxy pass...
- This trick for adding a header with the mapped "ua" variable is nice:
```
add_header X-debug-message "ua is $ua" always;
```
- Then in the HTTP response you see:
```
X-debug-message: ua is bot
```
- So the IP to bot mapping is working, phew.
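- A rough sketch for checking that debug header from the command line; `debug_message` is a hypothetical helper and the hostname in the usage comment is a placeholder, not something from these notes:
```shell
# Sketch: print the value of the X-debug-message header from a block of
# HTTP response headers read on standard input.
debug_message() {
  # Strip carriage returns first, since HTTP header lines end in CRLF,
  # then match the header name case-insensitively.
  tr -d '\r' | awk -F': ' 'tolower($1) == "x-debug-message" { print $2 }'
}

# Example use (placeholder host, not run here); the request has to come
# from an IP that is in the nginx map for the value to be "bot":
#   curl -s -I https://example.org/ | debug_message
```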
- More bad news: I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
- For example, ciat.cgiar.org was previously 104.196.152.243, but it is now 35.237.175.180, which I had noticed as a "mystery" client on Google Cloud in 2018-09
- Others I should probably add to the nginx bot map list are:
- wle.cgiar.org (70.32.90.172)
- ccafs.cgiar.org (205.186.128.185)
- another CIAT scraper using the PHP GuzzleHttp library (45.5.184.72)
- macaronilab.com ([63.32.242.35](https://viewdns.info/reverseip/?host=63.32.242.35&t=1))
- africa-rising.net ([162.243.171.159](https://viewdns.info/reverseip/?host=162.243.171.159&t=1))
- These IPs are all active in the REST API logs over the last few months and they account for *thirty-four million* more hits in the statistics!
- I purged them from CGSpace:
```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 15 hits from 104.196.152.243 in statistics
Purging 61064 hits from 35.237.175.180 in statistics
Purging 1378 hits from 70.32.90.172 in statistics
Purging 28880 hits from 205.186.128.185 in statistics
Purging 464613 hits from 63.32.242.35 in statistics
Purging 131 hits from 162.243.171.159 in statistics
Total number of bot hits purged: 556081
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 684888 hits from 104.196.152.243 in statistics-2018
Purging 323737 hits from 35.227.26.162 in statistics-2018
Purging 221091 hits from 35.237.175.180 in statistics-2018
Purging 3834 hits from 205.186.128.185 in statistics-2018
Purging 20337 hits from 63.32.242.35 in statistics-2018
Total number of bot hits purged: 1253887
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 1752548 hits from 104.196.152.243 in statistics-2017
Total number of bot hits purged: 1752548
```
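- For a quick tally, the per-core totals can be summed straight from the script output. This is only a sketch, with `sum_purged` as a hypothetical helper that relies on the "Total number of bot hits purged:" line format shown above:
```shell
# Sketch: add up the "Total number of bot hits purged" lines from one or
# more runs of the purge script, read on standard input.
sum_purged() {
  awk '/^Total number of bot hits purged:/ { sum += $NF } END { print sum + 0 }'
}

# The three totals from the runs shown in these notes:
sum_purged <<'EOF'
Total number of bot hits purged: 556081
Total number of bot hits purged: 1253887
Total number of bot hits purged: 1752548
EOF
# prints 3562516
```
- That figure covers only the `statistics`, `statistics-2018`, and `statistics-2017` runs shown here; hits in any other cores are not counted.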
<!-- vim: set sw=2 ts=2: -->