Add notes for 2017-10-30

- I will check the logs in a few days to see if they are harvesting us regularly, then add their bot's user agent to the Tomcat Crawler Session Manager Valve
- After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now
- For now I will just contact them to have them update their contact info in the bot's user agent, but eventually I think I'll tell them to swap out the CGIAR Library entry for CGSpace

## 2017-10-30
- Like clockwork, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)
- Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:
```
dspace=# SELECT * FROM pg_stat_activity;
...
(93 rows)
```
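Listing every row is noisy when all I want is the total; a minimal sketch of the same check against the `pg_stat_activity` view, counting instead of listing:

```
dspace=# SELECT count(*) FROM pg_stat_activity;
```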
- Surprise, surprise: the CORE bot is likely responsible for the recent load issues, having made hundreds of thousands of requests yesterday and today:
```
# grep -c "CORE/0.6" /var/log/nginx/access.log
26475
# grep -c "CORE/0.6" /var/log/nginx/access.log.1
135083
```
- IP addresses for this bot currently seem to be:
```
# grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
137.108.70.6
137.108.70.7
```
- I will add their user agent to the Tomcat Crawler Session Manager Valve but it won't help much because they are only using two sessions:
```
# grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
```
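For reference, adding their agent would look something like this sketch in Tomcat's `server.xml` (the `crawlerUserAgents` regex here is my assumption, not a tested value; the valve class name is from the Tomcat docs):

```
<!-- Inside the <Host> element; the valve's default regex already
     matches many bots, this sketch adds the CORE user agent -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*CORE.*" />
```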
- ... and most of their requests are for dynamic Discovery (`/discover`) pages:
```
# grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
24055
```
- Just because I'm curious, let's see who the top IPs are:
```
# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
763 157.55.39.231
782 207.46.13.90
998 66.249.66.90
1948 104.196.152.243
4247 190.19.92.5
31602 137.108.70.6
31636 137.108.70.7
```
- At least we know the top two are CORE, but who are the others?
- 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Compute Engine
- Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don't reuse their sessions, creating thousands of new ones!
```
# grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2811
```
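The session-counting pipeline above can be sketched on made-up input (the log lines and session IDs below are hypothetical, standing in for `dspace.log` entries):

```shell
# Count distinct DSpace sessions for one IP; sample lines stand in
# for real dspace.log entries, and the session IDs are invented
printf '%s\n' \
  'ip_addr=190.19.92.5:session_id=AAAA1111AAAA1111AAAA1111AAAA1111' \
  'ip_addr=190.19.92.5:session_id=BBBB2222BBBB2222BBBB2222BBBB2222' \
  'ip_addr=190.19.92.5:session_id=AAAA1111AAAA1111AAAA1111AAAA1111' \
  | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
```

Two distinct sessions across three requests; on the real logs the same pipeline shows thousands of sessions for these two IPs.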
- From looking at the requests, it appears these are from CIAT and CCAFS
- I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them
- Actually, according to the Tomcat docs, we could match them by IP with the valve's `crawlerIps` attribute: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
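- A sketch of what that might look like in `server.xml`, assuming we target the two scraper IPs (the exact regex is my guess, and `crawlerIps` only exists in newer Tomcat 7 releases):
```
<!-- Sketch: treat the two scraper IPs as crawlers by address
     instead of by user agent -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerIps="190\.19\.92\.5|104\.196\.152\.243" />
```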
- For now, it seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:
```
# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
```
- I will check again tomorrow