- I will check the logs in a few days to see if they are harvesting us regularly, then add their bot's user agent to the Tomcat Crawler Session Manager Valve
- After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org redirects to us now
- For now I will just contact them to have them update their contact info in the bot's user agent, but eventually I think I'll tell them to swap out the CGIAR Library entry for CGSpace

## 2017-10-30

- Like clockwork, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)
- Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:

```
dspace=# SELECT * FROM pg_stat_activity;
...
(93 rows)
```
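
- A quicker way to see where the connections are coming from is to let PostgreSQL do the counting (a sketch using the standard `pg_stat_activity` view):

```
dspace=# SELECT datname, usename, count(*) FROM pg_stat_activity GROUP BY datname, usename ORDER BY count(*) DESC;
```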

- Surprise surprise, the CORE bot is likely responsible for the recent load issues, making over 160,000 requests yesterday and today:

```
# grep -c "CORE/0.6" /var/log/nginx/access.log
26475
# grep -c "CORE/0.6" /var/log/nginx/access.log.1
135083
```
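
- To tally the older rotated logs as well, a small loop works (a sketch; `zcat -f` passes uncompressed files through unchanged):

```
# for log in /var/log/nginx/access.log*; do echo -n "$log: "; zcat -f "$log" | grep -c "CORE/0.6"; done
```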

- IP addresses for this bot currently seem to be:

```
# grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
137.108.70.6
137.108.70.7
```
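
- To check who owns that netblock, a whois lookup should do (a sketch; the fields shown vary by regional registry):

```
# whois 137.108.70.6 | grep -iE 'orgname|netname|descr'
```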

- I will add their user agent to the Tomcat Crawler Session Manager Valve, but it won't help much because they are only using two sessions:

```
# grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
```
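
- For reference, adding their agent to the valve in `server.xml` might look something like this (a sketch; the first three patterns are the valve's defaults from the Tomcat 7 docs, with a CORE pattern appended):

```
<!-- Inside the <Host> element of $CATALINA_BASE/conf/server.xml -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*CORE.*"
       sessionInactiveInterval="60" />
```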

- ... and most of their requests (about 90% of them) are for dynamic Discovery (`/discover`) pages:

```
# grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
24055
```
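
- If I want to see exactly which Discovery URLs they are hitting, something like this should work (a sketch assuming nginx's default combined log format, where the request path is the seventh field):

```
# grep 137.108.70 /var/log/nginx/access.log | awk '{print $7}' | grep '^/discover' | sort | uniq -c | sort -h | tail
```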

- Just because I'm curious who the top IPs are:

```
# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    496 62.210.247.93
    571 46.4.94.226
    651 40.77.167.39
    763 157.55.39.231
    782 207.46.13.90
    998 66.249.66.90
   1948 104.196.152.243
   4247 190.19.92.5
  31602 137.108.70.6
  31636 137.108.70.7
```

- At least we know the top two are CORE, but who are the others?
- 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Compute Engine
- Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don't reuse their sessions, creating thousands of new ones that consume memory until they expire!

```
# grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2811
```
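
- The same check as a quick loop over both scrapers (a sketch; `sort -u` replaces the `sort | uniq` pair):

```
# for ip in 190.19.92.5 104.196.152.243; do echo -n "$ip: "; grep "$ip" dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -u | wc -l; done
```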

- From looking at the requests, it appears these are from CIAT and CCAFS
- I wonder if I could somehow instruct them to use a proper user agent so that we could apply the crawler session manager valve to them
- Actually, according to the Tomcat docs, we could match them by IP with `crawlerIps` (see the sketch below): https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
- For now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:

```
# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
    410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
    574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
   1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
```
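
- Matching those two scrapers by IP in the valve might look like this (a sketch, untested; `crawlerIps` takes a regular expression, hence the escaped dots):

```
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerIps="190\.19\.92\.5|104\.196\.152\.243"
       sessionInactiveInterval="60" />
```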

- I will check again tomorrow