mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 14:45:03 +01:00
Add notes for 2018-07-12
parent 96d677e0cf
commit 323e968c2f
@@ -257,4 +257,62 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-
- Peter told Sisay to test this controlled vocabulary
- Discuss meeting in Nairobi in October

## 2018-07-12

- Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM
- Here are the top ten IPs from last night and this morning:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
48 66.249.64.91
50 35.227.26.162
57 157.55.39.234
59 157.55.39.71
62 147.99.27.190
82 95.108.181.88
92 40.77.167.90
97 183.128.40.185
97 240e:f0:44:fa53:745a:8afe:d221:1232
3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
25 216.244.66.198
38 40.77.167.185
46 66.249.64.93
56 157.55.39.71
60 35.227.26.162
65 157.55.39.234
83 95.108.181.88
87 66.249.64.91
96 40.77.167.90
7075 208.110.72.10
```
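
- The pipeline above could be wrapped in a small helper so that only the hour changes between runs. This is just a sketch: `top_ips` and the `TOP_N` variable are hypothetical names that are not in the original notes, the pattern is matched as a fixed string rather than a regex, and `gzip -cdf` stands in for `zcat --force` (it passes uncompressed files through unchanged):

```shell
# Hypothetical helper: count requests per client IP for log lines matching
# a timestamp fragment, and print the busiest IPs last.
# Usage: top_ips "11/Jul/2018:22" /var/log/nginx/*.log /var/log/nginx/*.log.1
top_ips() {
  pattern="$1"
  shift
  # gzip -cdf decompresses gzip files and copies plain files unchanged
  gzip -cdf -- "$@" \
    | grep -F -- "$pattern" \
    | awk '{print $1}' \
    | sort | uniq -c | sort -n | tail -n "${TOP_N:-10}"
}
```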

- We have never seen `208.110.72.10` before... so that's interesting!
- The user agent for these requests is: Pcore-HTTP/v0.44.0
- A brief Google search doesn't turn up any information about what this bot is, but lots of users complaining about it
- This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
1885
```
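
- Note that `grep -c` counts matching log lines rather than distinct sessions. To check session re-use more directly, one could count the unique session IDs instead. A rough sketch, where `unique_sessions` is a hypothetical helper name and the `session_id=...:ip_addr=...` pattern is taken from the commands above:

```shell
# Hypothetical helper: count distinct Tomcat session IDs logged for one IP.
# Usage: unique_sessions 208.110.72.10 dspace.log.2018-07-12
unique_sessions() {
  ip="$1"
  shift
  # Escape the dots so they match literally in the extended regex
  ip_re=$(printf '%s' "$ip" | sed 's/\./\\./g')
  # -o prints only the matched text, -h drops file name prefixes,
  # sort -u collapses repeated session IDs before counting
  grep -h -o -E "session_id=[A-Z0-9]{32}:ip_addr=${ip_re}" -- "$@" \
    | sort -u | wc -l
}
```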

- I think the problem is that, despite the bot requesting `robots.txt`, it almost exclusively requests dynamic pages from `/discover`:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
13364 GET /discover
993 GET /search-filter
804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0
```

- So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting
- I'll also add it to Tomcat's Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case

<!-- vim: set sw=2 ts=2: -->
@@ -30,7 +30,7 @@ There is insufficient memory for the Java Runtime Environment to continue.

<meta property="article:published_time" content="2018-07-01T12:56:54+03:00"/>

<meta property="article:modified_time" content="2018-07-10T17:34:30+03:00"/>
<meta property="article:modified_time" content="2018-07-11T16:55:30+03:00"/>



@@ -71,9 +71,9 @@ There is insufficient memory for the Java Runtime Environment to continue.
"@type": "BlogPosting",
"headline": "July, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-07/",
"wordCount": "1740",
"wordCount": "2079",
"datePublished": "2018-07-01T12:56:54+03:00",
"dateModified": "2018-07-10T17:34:30+03:00",
"dateModified": "2018-07-11T16:55:30+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -432,6 +432,69 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
</ul></li>
</ul>

<h2 id="2018-07-12">2018-07-12</h2>

<ul>
<li>Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM</li>
<li>Here are the top ten IPs from last night and this morning:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
48 66.249.64.91
50 35.227.26.162
57 157.55.39.234
59 157.55.39.71
62 147.99.27.190
82 95.108.181.88
92 40.77.167.90
97 183.128.40.185
97 240e:f0:44:fa53:745a:8afe:d221:1232
3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
25 216.244.66.198
38 40.77.167.185
46 66.249.64.93
56 157.55.39.71
60 35.227.26.162
65 157.55.39.234
83 95.108.181.88
87 66.249.64.91
96 40.77.167.90
7075 208.110.72.10
</code></pre>

<ul>
<li>We have never seen <code>208.110.72.10</code> before… so that’s interesting!</li>
<li>The user agent for these requests is: Pcore-HTTP/v0.44.0</li>
<li>A brief Google search doesn’t turn up any information about what this bot is, but lots of users complaining about it</li>
<li>This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
1885
</code></pre>

<ul>
<li>I think the problem is that, despite the bot requesting <code>robots.txt</code>, it almost exclusively requests dynamic pages from <code>/discover</code>:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
13364 GET /discover
993 GET /search-filter
804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0
</code></pre>

<ul>
<li>So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting</li>
<li>I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case</li>
</ul>

<!-- vim: set sw=2 ts=2: -->


@@ -4,7 +4,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-07/</loc>
<lastmod>2018-07-10T17:34:30+03:00</lastmod>
<lastmod>2018-07-11T16:55:30+03:00</lastmod>
</url>

<url>
@@ -174,7 +174,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-07-10T17:34:30+03:00</lastmod>
<lastmod>2018-07-11T16:55:30+03:00</lastmod>
<priority>0</priority>
</url>

@@ -185,7 +185,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-07-10T17:34:30+03:00</lastmod>
<lastmod>2018-07-11T16:55:30+03:00</lastmod>
<priority>0</priority>
</url>

@@ -197,13 +197,13 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2018-07-10T17:34:30+03:00</lastmod>
<lastmod>2018-07-11T16:55:30+03:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-07-10T17:34:30+03:00</lastmod>
<lastmod>2018-07-11T16:55:30+03:00</lastmod>
<priority>0</priority>
</url>
