mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes for 2020-02-24
@ -717,6 +717,56 @@ Total number of bot hits purged: 5535399
- I'm a little suspicious of the 2012, 2013, and 2014 numbers, though
- I should facet those years by IP and see if any stand out...
- The next thing I need to do is figure out why the nginx IP to bot mapping isn't working...
- Actually (and I've probably learned this before), the bot mapping is working, but nginx only logs the real user agent (of course!) because I'm only using the mapped one in the proxy pass...
- This trick for adding a header with the mapped "ua" variable is nice:

```
add_header X-debug-message "ua is $ua" always;
```

- Then in the HTTP response you see:

```
X-debug-message: ua is bot
```

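To check the debug header from the command line, the response headers can be filtered for `X-debug-message`. This is a minimal sketch: `extract_debug_ua` is a hypothetical helper name (not part of the original notes), and the canned response stands in for a real request.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read raw HTTP response headers on stdin and print
# the value of the X-debug-message header (matched case-insensitively).
extract_debug_ua() {
  # HTTP header lines end with CRLF, so strip carriage returns first
  tr -d '\r' | awk 'tolower($0) ~ /^x-debug-message:/ { sub(/^[^:]*:[ \t]*/, ""); print }'
}

# Canned example response; prints "ua is bot"
printf 'HTTP/1.1 200 OK\r\nX-debug-message: ua is bot\r\n\r\n' | extract_debug_ua
```

Against a live server, the same function could be fed from something like `curl -s -I <url>`, where the URL is whatever host carries the `add_header` directive.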
- So the IP to bot mapping is working, phew.
- More bad news: I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
- For example, ciat.cgiar.org was previously 104.196.152.243, but it is now 35.237.175.180, which I had noticed as a "mystery" client on Google Cloud in 2018-09
- Others I should probably add to the nginx bot map list are:
  - wle.cgiar.org (70.32.90.172)
  - ccafs.cgiar.org (205.186.128.185)
  - another CIAT scraper using the PHP GuzzleHttp library (45.5.184.72)
  - macaronilab.com ([63.32.242.35](https://viewdns.info/reverseip/?host=63.32.242.35&t=1))
  - africa-rising.net ([162.243.171.159](https://viewdns.info/reverseip/?host=162.243.171.159&t=1))
- These IPs are all active in the REST API logs over the last few months and they account for *thirty-four million* more hits in the statistics!
- I purged them from CGSpace:

```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 15 hits from 104.196.152.243 in statistics
Purging 61064 hits from 35.237.175.180 in statistics
Purging 1378 hits from 70.32.90.172 in statistics
Purging 28880 hits from 205.186.128.185 in statistics
Purging 464613 hits from 63.32.242.35 in statistics
Purging 131 hits from 162.243.171.159 in statistics

Total number of bot hits purged: 556081
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 684888 hits from 104.196.152.243 in statistics-2018
Purging 323737 hits from 35.227.26.162 in statistics-2018
Purging 221091 hits from 35.237.175.180 in statistics-2018
Purging 3834 hits from 205.186.128.185 in statistics-2018
Purging 20337 hits from 63.32.242.35 in statistics-2018

Total number of bot hits purged: 1253887
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 1752548 hits from 104.196.152.243 in statistics-2017

Total number of bot hits purged: 1752548
```

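Since the same IPs show up in several statistics cores, it can help to add up the purged hits per IP across runs. A rough sketch, assuming the output format shown above (`Purging N hits from IP in CORE`); `sum_purged_by_ip` is a hypothetical helper, not part of `check-spider-ip-hits.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read check-spider-ip-hits.sh output on stdin and
# print the combined number of purged hits per IP, largest first.
sum_purged_by_ip() {
  # Fields: $1="Purging" $2=count $3="hits" $4="from" $5=IP $6="in" $7=core
  awk '$1 == "Purging" && $3 == "hits" { hits[$5] += $2 }
       END { for (ip in hits) print hits[ip], ip }' | sort -rn
}

# Example with a few lines copied from the output above
sum_purged_by_ip <<'EOF'
Purging 61064 hits from 35.237.175.180 in statistics
Purging 221091 hits from 35.237.175.180 in statistics-2018
Purging 1752548 hits from 104.196.152.243 in statistics-2017
EOF
```

Lines that do not match the `Purging ... hits` pattern (the commands themselves, blank lines, and the `Total number of bot hits purged` summaries) are ignored, so the full terminal output can be piped in unchanged.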
<!-- vim: set sw=2 ts=2: -->