- Peter told Sisay to test this controlled vocabulary
- Discuss meeting in Nairobi in October
## 2018-07-12
- Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM
- Here are the top ten IPs from last night and this morning:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     48 66.249.64.91
     50 35.227.26.162
     57 157.55.39.234
     59 157.55.39.71
     62 147.99.27.190
     82 95.108.181.88
     92 40.77.167.90
     97 183.128.40.185
     97 240e:f0:44:fa53:745a:8afe:d221:1232
   3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     25 216.244.66.198
     38 40.77.167.185
     46 66.249.64.93
     56 157.55.39.71
     60 35.227.26.162
     65 157.55.39.234
     83 95.108.181.88
     87 66.249.64.91
     96 40.77.167.90
   7075 208.110.72.10
```
- We have never seen `208.110.72.10` before... so that's interesting!
- The user agent for these requests is `Pcore-HTTP/v0.44.0`
- A brief Google search doesn't turn up any information about what this bot is, but there are lots of users complaining about it
- This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
  17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
1885
```
- I think the problem is that, despite the bot requesting `robots.txt`, it almost exclusively requests dynamic pages from `/discover`:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
  13364 GET /discover
    993 GET /search-filter
    804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
```
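For context, DSpace's stock `robots.txt` disallows exactly these dynamic search paths, so a compliant crawler that fetched it should never be hammering `/discover`. A quick sanity check, using an illustrative fragment rather than the live CGSpace file:

```shell
# Illustrative fragment of a DSpace-style robots.txt (NOT the live CGSpace
# file); the point is that the dynamic search pages are disallowed
cat > /tmp/robots-sample.txt <<'EOF'
User-agent: *
Disallow: /discover
Disallow: /search-filter
Disallow: /browse
EOF

# Print the paths a well-behaved crawler should be skipping
grep '^Disallow' /tmp/robots-sample.txt | awk '{print $2}'
```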
- So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting configuration
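Something along these lines would fold Pcore-HTTP into the same treatment as Baiduspider — a sketch only, since the map variable, zone name, and `1r/s` rate here are hypothetical and not copied from the actual CGSpace configuration:

```nginx
# Sketch: classify abusive user agents into one variable; the names and
# the 1r/s rate are assumptions, not the real CGSpace settings
map $http_user_agent $bot_ua {
    default        '';
    ~Baiduspider   'bot';
    ~Pcore-HTTP    'bot';
}

# Requests with an empty key are not limited, so normal users pass through
limit_req_zone $bot_ua zone=bots:10m rate=1r/s;

server {
    location / {
        limit_req zone=bots burst=5;
    }
}
```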
- I'll also add it to Tomcat's Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case
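The valve change would look roughly like this in Tomcat's `server.xml` — the pattern below is Tomcat's default `crawlerUserAgents` regex with Pcore-HTTP appended, a sketch rather than the actual line from the server:

```xml
<!-- Sketch for server.xml: default crawlerUserAgents pattern plus
     Pcore-HTTP, so all matching crawlers share one session per IP -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Pcore-HTTP.*" />
```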
<!-- vim: set sw=2 ts=2: -->