From f775c480d86909ebd0d509013f66bc9751ee93d7 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Wed, 8 Nov 2017 14:17:04 +0200
Subject: [PATCH] Update notes for 2017-11-08

---
 content/post/2017-11.md   | 32 ++++++++++++++++++++++++++++---
 public/2017-11/index.html | 40 +++++++++++++++++++++++++++++++++------
 public/sitemap.xml        | 10 +++++-----
 3 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/content/post/2017-11.md b/content/post/2017-11.md
index 9b8ebeae3..5d1032168 100644
--- a/content/post/2017-11.md
+++ b/content/post/2017-11.md
@@ -407,11 +407,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-

 - Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM
 - Linode sent another alert about CPU usage in the morning at 6:12AM
-- Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:
+- Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:

 ```
-# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
-22510
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
+24981
 ```

 - This is about 20,000 Tomcat sessions:
@@ -424,3 +424,29 @@ $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z
 - I'm getting really sick of this
 - Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
 - I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
+- Run system updates on DSpace Test and reboot the server
+- Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them ([#346](https://github.com/ilri/DSpace/pull/346))
+- I figured out a way to use nginx's map function to assign a "bot" user agent to misbehaving clients who don't define a user agent
+- Most bots are automatically lumped into one generic session by [Tomcat's Crawler Session Manager Valve](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve) but this only works if their user agent matches a pre-defined regular expression like `.*[bB]ot.*`
+- Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
+- Basically, we modify the nginx config to add a mapping with a modified user agent `$ua`:
+
+```
+map $remote_addr $ua {
+    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
+    124.17.34.59     'ChineseBot';
+    default          $http_user_agent;
+}
+```
+
+- If the client's address matches then the user agent is set, otherwise the default `$http_user_agent` variable is used
+- Then, in the server's `/` block we pass this header to Tomcat:
+
+```
+proxy_pass http://tomcat_http;
+proxy_set_header User-Agent $ua;
+```
+
+- Note to self: the `$ua` variable won't show up in nginx access logs because the default `combined` log format doesn't show it, so don't run around pulling your hair out wondering why the modified user agents aren't showing in the logs!
+- If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
+- You can verify by cross-referencing nginx's `access.log` and DSpace's `dspace.log.2017-11-08`, for example
diff --git a/public/2017-11/index.html b/public/2017-11/index.html
index 6e4c59135..d22a08fa8 100644
--- a/public/2017-11/index.html
+++ b/public/2017-11/index.html
@@ -38,7 +38,7 @@ COPY 54701
-
+
@@ -86,9 +86,9 @@ COPY 54701
 "@type": "BlogPosting",
 "headline": "November, 2017",
 "url": "https://alanorth.github.io/cgspace-notes/2017-11/",
- "wordCount": "2241",
+ "wordCount": "2505",
 "datePublished": "2017-11-02T09:37:54+02:00",
- "dateModified": "2017-11-07T18:23:10+02:00",
+ "dateModified": "2017-11-08T09:08:32+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -601,11 +601,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
-
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
-22510
+
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
+24981
 
    @@ -620,6 +620,34 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
  • I’m getting really sick of this
  • Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
  • I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
  • +
  • Run system updates on DSpace Test and reboot the server
  • +
  • Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (#346)
  • +
  • I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent
  • +
  • Most bots are automatically lumped into one generic session by Tomcat’s Crawler Session Manager Valve but this only works if their user agent matches a pre-defined regular expression like .*[bB]ot.*
  • +
  • Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
  • +
  • Basically, we modify the nginx config to add a mapping with a modified user agent $ua:
  • +
+ +
map $remote_addr $ua {
+    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
+    124.17.34.59     'ChineseBot';
+    default          $http_user_agent;
+}
+
+ +
    +
  • If the client’s address matches then the user agent is set, otherwise the default $http_user_agent variable is used
  • +
  • Then, in the server’s / block we pass this header to Tomcat:
  • +
+ +
proxy_pass http://tomcat_http;
+proxy_set_header User-Agent $ua;
+
+ +
    +
  • Note to self: the $ua variable won’t show up in nginx access logs because the default combined log format doesn’t show it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs!
  • +
  • If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
  • +
  • You can verify by cross-referencing nginx’s access.log and DSpace’s dspace.log.2017-11-08, for example
diff --git a/public/sitemap.xml b/public/sitemap.xml
index 4808ca492..e5caff6d2 100644
--- a/public/sitemap.xml
+++ b/public/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2017-11/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
@@ -134,7 +134,7 @@
 https://alanorth.github.io/cgspace-notes/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
@@ -145,7 +145,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
@@ -157,13 +157,13 @@
 https://alanorth.github.io/cgspace-notes/post/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
- 2017-11-07T18:23:10+02:00
+ 2017-11-08T09:08:32+02:00
 0
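
The user agent rewriting described in the notes in this patch can be modelled in a few lines of Python. This is only a rough sketch of the logic, not how nginx or Tomcat implement it: the names `UA_OVERRIDES`, `resolve_ua`, and `is_crawler` are hypothetical, the pattern is the example `.*[bB]ot.*` quoted in the notes rather than the valve's actual configured value, and the full-string match is an assumption about how the valve applies it.

```python
import re

# Assumption: stand-in for the valve's crawler pattern, using the example
# regular expression quoted in the notes (not the real configured value).
CRAWLER_UA = re.compile(r".*[bB]ot.*")

# Hypothetical table mirroring the nginx `map $remote_addr $ua` block:
# client address -> replacement user agent.
UA_OVERRIDES = {"124.17.34.59": "ChineseBot"}


def resolve_ua(remote_addr, http_user_agent):
    """Return the override for a listed address, else the client's own user agent."""
    return UA_OVERRIDES.get(remote_addr, http_user_agent)


def is_crawler(user_agent):
    """True when the whole user agent string matches the crawler pattern."""
    return bool(CRAWLER_UA.fullmatch(user_agent or ""))


if __name__ == "__main__":
    # The listed address gets the "bot" user agent even though it sent none.
    ua = resolve_ua("124.17.34.59", "")
    print(ua, is_crawler(ua))
    # Any other client keeps the user agent it sent.
    ua = resolve_ua("10.0.0.1", "Mozilla/5.0")
    print(ua, is_crawler(ua))
```

Because the replacement string contains "Bot", it falls under the crawler pattern, which is what lets the valve lump all of that client's requests into one session.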