diff --git a/content/post/2017-11.md b/content/post/2017-11.md
index 9b8ebeae3..5d1032168 100644
--- a/content/post/2017-11.md
+++ b/content/post/2017-11.md
@@ -407,11 +407,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
 
 - Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM
 - Linode sent another alert about CPU usage in the morning at 6:12AM
-- Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:
+- Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:
 
 ```
-# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
-22510
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
+24981
 ```
 
 - This is about 20,000 Tomcat sessions:
@@ -424,3 +424,29 @@ $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z
 - I'm getting really sick of this
 - Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
 - I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
+- Run system updates on DSpace Test and reboot the server
+- Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them ([#346](https://github.com/ilri/DSpace/pull/346))
+- I figured out a way to use nginx's map function to assign a "bot" user agent to misbehaving clients who don't define a user agent
+- Most bots are automatically lumped into one generic session by [Tomcat's Crawler Session Manager Valve](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve) but this only works if their user agent matches a pre-defined regular expression like `.*[bB]ot.*`
+- Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
+- Basically, we modify the nginx config to add a mapping with a modified user agent `$ua`:
+
+```
+map $remote_addr $ua {
+    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
+    124.17.34.59     'ChineseBot';
+    default          $http_user_agent;
+}
+```
+
+- If the client's address matches then the user agent is set, otherwise the default `$http_user_agent` variable is used
+- Then, in the server's `/` block we pass this header to Tomcat:
+
+```
+proxy_pass http://tomcat_http;
+proxy_set_header User-Agent $ua;
+```
+
+- Note to self: the `$ua` variable won't show up in nginx access logs because the default `combined` log format doesn't show it, so don't run around pulling your hair out wondering why the modified user agents aren't showing in the logs!
+- If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
+- You can verify by cross referencing nginx's `access.log` and DSpace's `dspace.log.2017-11-08`, for example
diff --git a/public/2017-11/index.html b/public/2017-11/index.html
index 6e4c59135..d22a08fa8 100644
--- a/public/2017-11/index.html
+++ b/public/2017-11/index.html
@@ -38,7 +38,7 @@ COPY 54701
-
+
@@ -86,9 +86,9 @@ COPY 54701
 "@type": "BlogPosting",
 "headline": "November, 2017",
 "url": "https://alanorth.github.io/cgspace-notes/2017-11/",
- "wordCount": "2241",
+ "wordCount": "2505",
 "datePublished": "2017-11-02T09:37:54+02:00",
- "dateModified": "2017-11-07T18:23:10+02:00",
+ "dateModified": "2017-11-08T09:08:32+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -601,11 +601,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
-
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
-22510
+
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
+24981
 
+ +
map $remote_addr $ua {
+    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
+    124.17.34.59     'ChineseBot';
+    default          $http_user_agent;
+}
+
+ + + +
proxy_pass http://tomcat_http;
+proxy_set_header User-Agent $ua;
+
+ +
diff --git a/public/sitemap.xml b/public/sitemap.xml
index 4808ca492..e5caff6d2 100644
--- a/public/sitemap.xml
+++ b/public/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2017-11/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
@@ -134,7 +134,7 @@
 https://alanorth.github.io/cgspace-notes/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
@@ -145,7 +145,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
@@ -157,13 +157,13 @@
 https://alanorth.github.io/cgspace-notes/post/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
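
The `map $remote_addr $ua` logic added in the `2017-11.md` hunk can be sketched outside nginx to see why the override helps. The Python below is an illustration only and not part of this commit: `UA_OVERRIDES`, `effective_user_agent`, and `looks_like_crawler` are hypothetical names, `203.0.113.7` is a documentation-range address, and the pattern is the `.*[bB]ot.*` example quoted in the notes, assumed here to be matched against the whole user agent string.

```python
import re

# Hypothetical mirror of the nginx map block: client address -> forced user agent.
UA_OVERRIDES = {
    "124.17.34.59": "ChineseBot",
}

# Example pattern quoted in the notes; assumed to be applied to the full string.
CRAWLER_PATTERN = re.compile(r".*[bB]ot.*")


def effective_user_agent(remote_addr: str, http_user_agent: str) -> str:
    """Return the override for a listed address, else the client's own user agent."""
    return UA_OVERRIDES.get(remote_addr, http_user_agent)


def looks_like_crawler(user_agent: str) -> bool:
    """True if the whole user agent string matches the crawler pattern."""
    return CRAWLER_PATTERN.fullmatch(user_agent) is not None


if __name__ == "__main__":
    # One listed client with no user agent, one ordinary browser.
    for addr, ua in [("124.17.34.59", ""), ("203.0.113.7", "Mozilla/5.0")]:
        final_ua = effective_user_agent(addr, ua)
        print(addr, repr(final_ua), looks_like_crawler(final_ua))
```

Under these assumptions, a listed client that sends no user agent is forwarded as `ChineseBot`, which matches the pattern, while an unlisted client keeps whatever user agent it sent.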