diff --git a/content/posts/2019-02.md b/content/posts/2019-02.md
index c059daf43..3a12be7ed 100644
--- a/content/posts/2019-02.md
+++ b/content/posts/2019-02.md
@@ -884,4 +884,53 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
 - I merged the changes to the `5_x-prod` branch and they will go live the next time we re-deploy CGSpace ([#412](https://github.com/ilri/DSpace/pull/412))
+## 2019-02-19
+
+- Linode sent another alert about CPU usage on CGSpace (linode18) averaging 417% this morning
+- Unfortunately, I don't see any strange activity in the web server API or XMLUI logs at that time in particular
+- So far today the top ten IPs in the XMLUI logs are:
+
+```
+# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 11541 18.212.208.240
+ 11560 3.81.136.184
+ 11562 3.88.237.84
+ 11569 34.230.15.139
+ 11572 3.80.128.247
+ 11573 3.91.17.126
+ 11586 54.82.89.217
+ 11610 54.209.39.13
+ 11657 54.175.90.13
+ 14686 143.233.242.130
+```
+
+- 143.233.242.130 is in Greece and using the user agent "Indy Library", like the top IP yesterday (94.71.244.172)
+- That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don't know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this
+- The user is requesting only things like `/handle/10568/56199?show=full` so it's nothing malicious, only annoying
+- Otherwise there are still shit loads of IPs from Amazon still hammering the server, though I see HTTP 503 errors now after yesterday's nginx rate limiting updates
+    - I should really try to script something around [ipapi.co](https://ipapi.co/api/) to get these quickly and easily
+- The top requests in the API logs today are:
+
+```
+# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 42 66.249.66.221
+ 44 156.156.81.215
+ 55 3.85.54.129
+ 76 66.249.66.219
+ 87 34.209.213.122
+ 1550 34.218.226.147
+ 2127 50.116.102.77
+ 4684 205.186.128.185
+ 11429 45.5.186.2
+ 12360 2a01:7e00::f03c:91ff:fe0a:d645
+```
+
+- `2a01:7e00::f03c:91ff:fe0a:d645` is on Linode, and I can see from the XMLUI access logs that it is Drupal, so I assume it is part of the new ILRI website harvester...
+- Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I'm so fucking sick of this
+- Our usage stats have exploded the last few months:
+
+![Usage stats](/cgspace-notes/2019/02/usage-stats.png)
+
+- I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate
+
diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html
index 2face16dd..75b32090c 100644
--- a/docs/2019-02/index.html
+++ b/docs/2019-02/index.html
@@ -42,7 +42,7 @@ sys 0m1.979s
-
+
@@ -89,9 +89,9 @@ sys 0m1.979s
     "@type": "BlogPosting",
     "headline": "February, 2019",
     "url": "https://alanorth.github.io/cgspace-notes/2019-02/",
-    "wordCount": "4900",
+    "wordCount": "5236",
     "datePublished": "2019-02-01T21:37:30+02:00",
-    "dateModified": "2019-02-18T15:00:47-08:00",
+    "dateModified": "2019-02-18T16:30:34-08:00",
     "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -1158,6 +1158,60 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
 5_x-prod branch and they will go live the next time we re-deploy CGSpace (#412)
+# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 11541 18.212.208.240
+ 11560 3.81.136.184
+ 11562 3.88.237.84
+ 11569 34.230.15.139
+ 11572 3.80.128.247
+ 11573 3.91.17.126
+ 11586 54.82.89.217
+ 11610 54.209.39.13
+ 11657 54.175.90.13
+ 14686 143.233.242.130
+
+
+/handle/10568/56199?show=full so it’s nothing malicious, only annoying
+# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 42 66.249.66.221
+ 44 156.156.81.215
+ 55 3.85.54.129
+ 76 66.249.66.219
+ 87 34.209.213.122
+ 1550 34.218.226.147
+ 2127 50.116.102.77
+ 4684 205.186.128.185
+ 11429 45.5.186.2
+ 12360 2a01:7e00::f03c:91ff:fe0a:d645
+
+
+2a01:7e00::f03c:91ff:fe0a:d645 is on Linode, and I can see from the XMLUI access logs that it is Drupal, so I assume it is part of the new ILRI website harvester…
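
The notes mention wanting to script something around ipapi.co to identify the busy IPs quickly. A minimal sketch of how that might look is below. The function names `top_ips` and `ipapi_url` are hypothetical, the `https://ipapi.co/<ip>/json/` URL shape is an assumption based on ipapi.co's public API, and the counting step assumes the client IP is the first field of each log line, as in the `awk '{print $1}'` pipelines above.

```shell
# Hypothetical helper: read log lines on stdin, keep those matching a day
# string such as "19/Feb/2019:", and print the N busiest client IPs.
# Assumes the client IP is the first whitespace-separated field.
top_ips() {
    grep -F "$1" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n "${2:-10}"
}

# Hypothetical helper: build the lookup URL for one IP.
# The /json/ endpoint shape is an assumption about ipapi.co's API.
ipapi_url() {
    printf 'https://ipapi.co/%s/json/\n' "$1"
}
```

Something like `zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | top_ips "19/Feb/2019:" 10 | awk '{print $2}' | while read -r ip; do curl -s "$(ipapi_url "$ip")"; done` would then look up each of the top IPs in turn, though any rate limits on the lookup service would need checking first.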