From bf2221181b465104166f3440c47fda988cd29b91 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Thu, 27 Sep 2018 10:52:23 +0300
Subject: [PATCH] Update notes for 2018-09-27

---
 content/posts/2018-09.md | 32 ++++++++++++++++++++++++++++++
 docs/2018-09/index.html  | 42 +++++++++++++++++++++++++++++++++++++---
 docs/sitemap.xml         | 10 +++++-----
 3 files changed, 76 insertions(+), 8 deletions(-)

diff --git a/content/posts/2018-09.md b/content/posts/2018-09.md
index a67913c67..675012610 100644
--- a/content/posts/2018-09.md
+++ b/content/posts/2018-09.md
@@ -557,4 +557,36 @@ sys 2m18.485s
 - I updated the dspace-statistiscs-api to use psycopg2's `execute_values()` to insert batches of 100 values into PostgreSQL instead of doing every insert individually
 - On CGSpace this reduces the total run time of `indexer.py` from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)
+## 2018-09-27
+
+- Linode emailed to say that CGSpace's (linode19) CPU load was high for a few hours last night
+- Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:
+
+```
+# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+    295 34.218.226.147
+    296 66.249.64.95
+    350 157.55.39.185
+    359 207.46.13.28
+    371 157.55.39.85
+    388 40.77.167.148
+    444 66.249.64.93
+    544 68.6.87.12
+    834 66.249.64.91
+    902 35.237.175.180
+```
+
+- `35.237.175.180` is on Google Cloud
+- `68.6.87.12` is on Cox Communications in the US (?)
+- These hosts are not using proper user agents and are not re-using their Tomcat sessions:
+
+```
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
+5423
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
+758
+```
+
+- I will add their IPs to the list of bad bots in nginx so we can add a "bot" user agent to them and let Tomcat's Crawler Session Manager Valve handle them
+
diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html
index 4c7f54b50..18bf2e519 100644
--- a/docs/2018-09/index.html
+++ b/docs/2018-09/index.html
@@ -18,7 +18,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
 " />
-
+
 On CGSpace this reduces the total run time of indexer.py from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)
+

2018-09-27

+ + + +
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+    295 34.218.226.147
+    296 66.249.64.95
+    350 157.55.39.185
+    359 207.46.13.28
+    371 157.55.39.85
+    388 40.77.167.148
+    444 66.249.64.93
+    544 68.6.87.12
+    834 66.249.64.91
+    902 35.237.175.180
+
+ + + +
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
+5423
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
+758
+
+
+
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index ce8aa3e5e..f85ae0f27 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2018-09/
-2018-09-26T20:13:29+03:00
+2018-09-27T09:30:28+03:00
@@ -184,7 +184,7 @@
 https://alanorth.github.io/cgspace-notes/
-2018-09-26T20:13:29+03:00
+2018-09-27T09:30:28+03:00
 0
@@ -195,7 +195,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
-2018-09-26T20:13:29+03:00
+2018-09-27T09:30:28+03:00
 0
@@ -207,13 +207,13 @@
 https://alanorth.github.io/cgspace-notes/posts/
-2018-09-26T20:13:29+03:00
+2018-09-27T09:30:28+03:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
-2018-09-26T20:13:29+03:00
+2018-09-27T09:30:28+03:00
 0
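
A note on the session counts in the 2018-09-27 entry: `grep -c` prints the number of matching lines, so the trailing `| sort | uniq` has nothing to deduplicate and the figures (5423 and 758) are log lines rather than distinct sessions. If the number of distinct Tomcat sessions per IP is wanted, a minimal Python sketch along the following lines might help. It assumes the DSpace log lines contain `session_id=<32 chars>:ip_addr=<IPv4>` exactly as in the grep patterns above; the function name `sessions_per_ip` is hypothetical and not part of DSpace or the dspace-statistics-api.

```python
import re
from collections import defaultdict

# Pattern adapted from the grep commands in the notes: a 32-character
# session ID immediately followed by the client's IPv4 address.
# Assumption: the log format matches what those grep patterns imply.
SESSION_RE = re.compile(
    r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})"
)


def sessions_per_ip(lines):
    """Return {ip: (matching_line_count, distinct_session_count)}."""
    line_counts = defaultdict(int)
    sessions = defaultdict(set)
    for line in lines:
        match = SESSION_RE.search(line)
        if match is None:
            continue  # line carries no session/IP pair
        session_id, ip = match.groups()
        line_counts[ip] += 1
        sessions[ip].add(session_id)
    return {ip: (line_counts[ip], len(sessions[ip])) for ip in line_counts}
```

To use it on a real log, pass an open file handle, for example `sessions_per_ip(open("dspace.log.2018-09-26", errors="replace"))`. A client whose distinct session count is close to its line count is creating a new session on nearly every request, which is the behaviour the notes describe.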