Update notes for 2020-02-24

This commit is contained in:
Alan Orth 2020-02-24 14:11:56 +02:00
parent c88af71838
commit fe7c11b7ce
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
4 changed files with 55 additions and 0 deletions

@ -663,5 +663,60 @@ $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+i
- `/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450`
- `/rest/handle/10568/28702?expand=all`
- Those are the requests AReS and ILRI servers are making... nearly 150,000 per day!
- Well that settles it!
```
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
"response":{"numFound":12,"start":0,"docs":[
$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
"response":{"numFound":62,"start":0,"docs":[
```
- A REST request with `limit=50` creates exactly fifty `statistics_type:view` records in the Solr core (`numFound` went from 12 to 62 above)... fuck.
- So not only do I need to purge all these millions of hits, but I also need to add these IPs to the list of spider IPs so they don't get recorded
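- For repeating that before/after check, here is a minimal sketch of two helpers. The names `num_found` and `hits_added` are hypothetical, not part of any existing script, and this assumes the `indent=true` JSON output shown above:

```shell
#!/usr/bin/env bash
# Sketch only: helper names are made up for illustration.

# Read a Solr JSON response on stdin and print the first numFound value.
num_found() {
  grep -o '"numFound":[0-9]*' | head -n 1 | cut -d: -f2
}

# Print how many hits were added between two counts.
hits_added() {
  local before="$1" after="$2"
  echo $(( after - before ))
}

# Usage (not run here, assumes the local Solr from the commands above):
#   before=$(curl -s "$SOLR_QUERY_URL" | num_found)
#   ... make the REST request, then soft commit ...
#   after=$(curl -s "$SOLR_QUERY_URL" | num_found)
#   hits_added "$before" "$after"
```

- `SOLR_QUERY_URL` stands in for the full `select` URL used above; it is an assumption, not an existing variable.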
## 2020-02-24
- I tried to add some IPs to the DSpace spider list so they would not get recorded in Solr statistics, but it doesn't support IPv6
- A better method is actually to just use the nginx mapping logic we already have to reset the user agent for these requests to "bot"
- That, or to really insist that users harvesting us specify some kind of user agent
- I tried to add the IPs to our nginx IP bot mapping but it doesn't seem to work... WTF, why is everything broken?!
- Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far:
```
$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
"response":{"numFound":42395486,"start":0,"docs":[]
```
- I modified my `check-spider-hits.sh` script to create a version that works with IPs (`check-spider-ip-hits.sh`) and purged over 48 million stats from Solr on CGSpace:
```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
Purging 22809216 hits from 34.218.226.147 in statistics
Purging 19586270 hits from 172.104.229.92 in statistics
Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
Purging 271668 hits from 2a01:7e00::f03c:91ff:fe18:7396 in statistics
Total number of bot hits purged: 42778291
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics-2018 -p
Purging 5535399 hits from 34.218.226.147 in statistics-2018
Total number of bot hits purged: 5535399
```
- (The `statistics` core holds 2019 and 2020 stats, because the yearly sharding process failed this year)
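- For reference, a rough sketch of how a purge-by-IP loop could look. This is NOT the actual `check-spider-ip-hits.sh`; the function names are hypothetical. It assumes Solr's XML delete-by-query on the `update` handler, and that colons in IPv6 addresses need backslash-escaping in the query:

```shell
#!/usr/bin/env bash
# Sketch only: not the real check-spider-ip-hits.sh, function names are made up.

# Escape colons so an IPv6 address is safe inside a Solr field query.
escape_ip() {
  printf '%s' "$1" | sed 's/:/\\:/g'
}

# Build the XML delete-by-query body for a single IP.
delete_body() {
  printf '<delete><query>ip:%s</query></delete>' "$(escape_ip "$1")"
}

# Purge all hits for every IP listed in a file (one IP per line) from one core.
purge_ips() {
  local solr_url="$1" core="$2" ip_file="$3" ip
  while IFS= read -r ip; do
    [ -z "$ip" ] && continue   # skip blank lines
    curl -s "$solr_url/$core/update?commit=true" \
      -H 'Content-Type: text/xml' \
      --data-binary "$(delete_body "$ip")" > /dev/null
  done < "$ip_file"
}

# Usage (destructive, not run here):
#   purge_ips http://localhost:8081/solr statistics 2020-02-24-bot-ips.txt
```

- Committing after every IP keeps the sketch simple; for millions of documents a single commit at the end would probably be kinder to Solr.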
- Attached is a before and after of the period from 2019-01 to 2020-02:
![CGSpace stats for 2019 and 2020 before the purge](/cgspace-notes/2020/02/cgspace-stats-before.png)
![CGSpace stats for 2019 and 2020 after the purge](/cgspace-notes/2020/02/cgspace-stats-after.png)
- And here is a graph of the stats by year since 2011:
![CGSpace stats by year since 2011 after the purge](/cgspace-notes/2020/02/cgspace-stats-years.png)
- I'm a little suspicious of the 2012, 2013, and 2014 numbers though, so I should facet those years by IP and see if any single IPs account for a large share of the hits
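- A possible starting point for that faceting, as a sketch. The helper name `facet_by_ip_url` is hypothetical, and I am assuming the older years live in cores named like `statistics-2012`, following the `statistics-2018` pattern above:

```shell
#!/usr/bin/env bash
# Sketch only: builds a Solr select URL that facets a core by the ip field.
# Facets are sorted by count by default, so the top IPs come first.
facet_by_ip_url() {
  local solr_url="$1" core="$2" limit="${3:-10}"
  printf '%s/%s/select?q=*:*&rows=0&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=%s&facet.mincount=1' \
    "$solr_url" "$core" "$limit"
}

# Usage (not run here, core name is an assumption):
#   curl -s "$(facet_by_ip_url http://localhost:8081/solr statistics-2012 20)"
```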
<!-- vim: set sw=2 ts=2: -->
