mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 06:35:03 +01:00
Update notes for 2020-02-24
This commit is contained in:
parent
c88af71838
commit
fe7c11b7ce
@ -663,5 +663,60 @@ $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+i
- `/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450`
- `/rest/handle/10568/28702?expand=all`
- Those are the requests AReS and ILRI servers are making... nearly 150,000 per day!
- Well that settles it!

```
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
"response":{"numFound":12,"start":0,"docs":[
$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
"response":{"numFound":62,"start":0,"docs":[
```

- A REST request with `limit=50` creates exactly fifty `statistics_type:view` records in the Solr core (`numFound` went from 12 to 62)... fuck.
- So not only do I need to purge all these millions of hits, I also need to add these IPs to the list of spider IPs so they don't get recorded

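- The before-and-after check above could be scripted roughly like this. It is only a sketch: `num_found` and `hits_created` are hypothetical helper names, and the commented-out usage simply reuses the Solr and REST URLs from the test above:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of DSpace or Solr) for repeating the test above.

# Read a Solr JSON response on stdin and print the first numFound value.
num_found() {
    sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print how many records were added between two counts.
hits_created() {
    local before=$1 after=$2
    echo $(( after - before ))
}

# Example usage against a live server (left commented out on purpose):
# q='http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true'
# before=$(curl -s "$q" | num_found)
# curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450' > /dev/null
# curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true' > /dev/null
# after=$(curl -s "$q" | num_found)
# hits_created "$before" "$after"
```

- With the counts from the test above, `hits_created 12 62` prints `50`, matching the `limit=50` of the REST request.
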
## 2020-02-24

- I tried to add some IPs to the DSpace spider list so they would not get recorded in Solr statistics, but the spider list doesn't support IPv6 addresses
- A better method is actually to just use the nginx mapping logic we already have to reset the user agent for these requests to "bot"
- That, or to really insist that users harvesting us specify some kind of user agent
- I tried to add the IPs to our nginx IP bot mapping but it doesn't seem to work... WTF, why is everything broken?!
- Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far:

```
$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
"response":{"numFound":42395486,"start":0,"docs":[]
```

- I modified my `check-spider-hits.sh` script to create a version that works with IPs (`check-spider-ip-hits.sh`) and purged about 48 million stats from Solr on CGSpace:

```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
Purging 22809216 hits from 34.218.226.147 in statistics
Purging 19586270 hits from 172.104.229.92 in statistics
Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
Purging 271668 hits from 2a01:7e00::f03c:91ff:fe18:7396 in statistics

Total number of bot hits purged: 42778291
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics-2018 -p
Purging 5535399 hits from 34.218.226.147 in statistics-2018

Total number of bot hits purged: 5535399
```

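- For reference, a rough sketch of how a per-IP count and purge like the one above could work. This is NOT the actual `check-spider-ip-hits.sh`; every function name is hypothetical, and escaping the colons in IPv6 addresses is just one way to keep them from being parsed as Solr field separators:

```shell
#!/usr/bin/env bash
# Sketch only: NOT the actual check-spider-ip-hits.sh.
# All function names here are hypothetical.

# Build a Solr query matching one IP. Colons are escaped because they are
# special characters in the Solr query syntax (this matters for IPv6).
solr_ip_query() {
    local escaped
    escaped=$(printf '%s' "$1" | sed 's/:/\\:/g')
    printf 'ip:%s\n' "$escaped"
}

# Build the XML body for a delete-by-query request.
solr_delete_body() {
    printf '<delete><query>%s</query></delete>\n' "$(solr_ip_query "$1")"
}

# Count hits for one IP in one core (needs a running Solr).
count_hits() {
    local solr_url=$1 core=$2 ip=$3
    curl -s -G "$solr_url/$core/select" \
        --data-urlencode "q=$(solr_ip_query "$ip")" \
        --data-urlencode 'rows=0' \
        --data-urlencode 'wt=json' |
        sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Delete all hits for one IP in one core (destructive, needs a running Solr).
purge_hits() {
    local solr_url=$1 core=$2 ip=$3
    curl -s "$solr_url/$core/update?softCommit=true" \
        -H 'Content-Type: text/xml' \
        --data-binary "$(solr_delete_body "$ip")"
}

# Sum the counts from "Purging N hits from ..." lines on stdin.
total_purged() {
    awk '$1 == "Purging" { sum += $2 } END { printf "%d\n", sum }'
}
```

- Nothing in the sketch runs on its own; something like `count_hits http://localhost:8081/solr statistics 34.218.226.147` would query a live core, and `purge_hits` should only be called after checking the counts.
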
- (The `statistics` core holds 2019 and 2020 stats, because the yearly sharding process failed this year)
- Attached is a before and after of the period from 2019-01 to 2020-02:

![CGSpace stats for 2019 and 2020 before the purge](/cgspace-notes/2020/02/cgspace-stats-before.png)

![CGSpace stats for 2019 and 2020 after the purge](/cgspace-notes/2020/02/cgspace-stats-after.png)

- And here is a graph of the stats by year since 2011:

![CGSpace stats by year since 2011 after the purge](/cgspace-notes/2020/02/cgspace-stats-years.png)

- I'm a little suspicious of the 2012, 2013, and 2014 numbers though; I should facet those years by IP and see if any single IPs account for a large share of the hits

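- A possible starting point for that faceting, as a sketch only: the core names `statistics-2012` to `statistics-2014` are assumed from the `statistics-2018` naming above, and `facet_by_ip_url` is a hypothetical helper that just builds the query URL using Solr's standard facet parameters:

```shell
#!/usr/bin/env bash
# Sketch: build a Solr facet query listing the busiest IPs in a yearly core.
# The core name pattern "statistics-YEAR" is an assumption based on
# statistics-2018 above; facet_by_ip_url is a hypothetical helper name.
facet_by_ip_url() {
    local solr_url=$1 year=$2 limit=${3:-10}
    printf '%s/statistics-%s/select?q=*:*&rows=0&facet=true&facet.field=ip&facet.limit=%s&facet.mincount=1&wt=json&indent=true\n' \
        "$solr_url" "$year" "$limit"
}

# Print the URLs for the three suspicious years.
for year in 2012 2013 2014; do
    facet_by_ip_url 'http://localhost:8081/solr' "$year"
done
```

- Each printed URL can be passed to `curl -s` in quotes, so the shell does not expand the `*:*` or the `&` characters.
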
<!-- vim: set sw=2 ts=2: -->
BIN
static/2020/02/cgspace-stats-after.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 3.8 KiB |
BIN
static/2020/02/cgspace-stats-before.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 3.5 KiB |
BIN
static/2020/02/cgspace-stats-years.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 4.0 KiB |