- Peter told Sisay to test this controlled vocabulary
- Discuss meeting in Nairobi in October
## 2018-07-12
- Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM
- Here are the top ten IPs from last night and this morning:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
48 66.249.64.91
50 35.227.26.162
57 157.55.39.234
59 157.55.39.71
62 147.99.27.190
82 95.108.181.88
92 40.77.167.90
97 183.128.40.185
97 240e:f0:44:fa53:745a:8afe:d221:1232
3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
25 216.244.66.198
38 40.77.167.185
46 66.249.64.93
56 157.55.39.71
60 35.227.26.162
65 157.55.39.234
83 95.108.181.88
87 66.249.64.91
96 40.77.167.90
7075 208.110.72.10
```
- We have never seen `208.110.72.10` before... so that's interesting!
- The user agent for these requests is `Pcore-HTTP/v0.44.0`
- A brief Google search doesn't turn up any information about what this bot is, but there are lots of users complaining about it
- This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
1885
```
- I think the problem is that, despite the bot requesting `robots.txt`, it almost exclusively requests dynamic pages from `/discover`:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
13364 GET /discover
993 GET /search-filter
804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
```
- So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting
- I'll also add it to Tomcat's Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers, just in case
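
The nginx change could look something like the sketch below: map the bot's user agent to a non-empty key and rate limit on that key, so normal clients (empty key) are unaffected. This is only an illustration of the general `limit_req` pattern, not the actual CGSpace configuration; the zone name `badbots`, the rate, and the burst value are assumptions:

```
# Hypothetical sketch of nginx rate limiting keyed on user agent.
# Requests with an empty $ua_is_bot key are NOT rate limited.
map $http_user_agent $ua_is_bot {
    default         '';
    ~Baiduspider    'bot';
    ~Pcore-HTTP     'bot';
}

# one shared bucket for all matched bots, ~1 request/second
limit_req_zone $ua_is_bot zone=badbots:10m rate=1r/s;

server {
    location / {
        limit_req zone=badbots burst=5;
        # ... proxy_pass to Tomcat, etc ...
    }
}
```

Excess requests beyond the burst get a 503 by default (tunable with `limit_req_status`).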
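
For the Tomcat side, the Crawler Session Manager Valve matches user agents with a regex and makes all requests from a matched crawler share one session instead of creating a new one per request. A minimal sketch for `server.xml`, assuming the regex values shown (the actual CGSpace pattern may differ):

```
<!-- Hypothetical sketch: extend the Crawler Session Manager Valve's
     user-agent regex so Pcore-HTTP re-uses a single Tomcat session -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Baiduspider.*|.*Pcore-HTTP.*" />
```

This should keep a misbehaving crawler from exhausting memory with thousands of abandoned sessions, even if it ignores the rate limiting upstream.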
<!-- vim: set sw=2 ts=2: -->