Update notes for 2017-11-08

Alan Orth 2017-11-08 14:17:04 +02:00
parent 4147b78b76
commit f775c480d8
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
3 changed files with 68 additions and 14 deletions

@ -407,11 +407,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
- Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM
- Linode sent another alert about CPU usage in the morning at 6:12AM
- Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:
- Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
22510
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
```
- This is about 20,000 Tomcat sessions:
@ -424,3 +424,29 @@ $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z
- I'm getting really sick of this
- Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
- I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
- Run system updates on DSpace Test and reboot the server
- Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them ([#346](https://github.com/ilri/DSpace/pull/346))
- I figured out a way to use nginx's map function to assign a "bot" user agent to misbehaving clients who don't define a user agent
- Most bots are automatically lumped into one generic session by [Tomcat's Crawler Session Manager Valve](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve) but this only works if their user agent matches a pre-defined regular expression like `.*[bB]ot.*`
- Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
- Basically, we modify the nginx config to add a mapping with a modified user agent `$ua`:
```
map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 'ChineseBot';
default $http_user_agent;
}
```
- If the client's address matches then the user agent is set, otherwise the default `$http_user_agent` variable is used
- Then, in the server's `location /` block we pass this header to Tomcat:
```
proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
```
- Note to self: the `$ua` variable won't show up in nginx access logs because the default `combined` log format doesn't include it, so don't run around pulling your hair out wondering why the modified user agents aren't showing in the logs!
- If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
- You can verify by cross referencing nginx's `access.log` and DSpace's `dspace.log.2017-11-08`, for example
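
One way to sketch the DSpace side of that cross-check is to count the distinct Tomcat sessions logged for the mapped IP. This is a rough sketch rather than the exact command used here: `count_sessions` is a hypothetical helper name, it assumes `dspace.log` lines contain `session_id=<32 characters>:ip_addr=<IP>` as in the grep patterns earlier in these notes, and it relies on GNU grep for `\b`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count distinct Tomcat session IDs logged for one IP.
# Assumes log lines contain "session_id=<32 chars>:ip_addr=<IP>".
count_sessions() {
    local ip="$1"
    shift
    # Escape the dots so they match literally in the extended regex
    local ip_re="${ip//./\\.}"
    # \b (GNU grep) keeps 124.17.34.5 from also matching 124.17.34.59
    grep -hIo -E "session_id=[A-Z0-9]{32}:ip_addr=${ip_re}\b" "$@" | sort -u | wc -l
}
```

For example, `count_sessions 124.17.34.59 dspace.log.2017-11-08` could be run before and after the nginx change. If the valve is grouping the client's requests, the count should stop growing even though `access.log` keeps showing new requests from that IP.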

@ -38,7 +38,7 @@ COPY 54701
<meta property="article:published_time" content="2017-11-02T09:37:54&#43;02:00"/>
<meta property="article:modified_time" content="2017-11-07T18:23:10&#43;02:00"/>
<meta property="article:modified_time" content="2017-11-08T09:08:32&#43;02:00"/>
@ -86,9 +86,9 @@ COPY 54701
"@type": "BlogPosting",
"headline": "November, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-11/",
"wordCount": "2241",
"wordCount": "2505",
"datePublished": "2017-11-02T09:37:54&#43;02:00",
"dateModified": "2017-11-07T18:23:10&#43;02:00",
"dateModified": "2017-11-08T09:08:32&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -601,11 +601,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<ul>
<li>Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM</li>
<li>Linode sent another alert about CPU usage in the morning at 6:12AM</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -c pdf
22510
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
</code></pre>
<ul>
@ -620,6 +620,34 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>I&rsquo;m getting really sick of this</li>
<li>Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections</li>
<li>I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test</li>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (<a href="https://github.com/ilri/DSpace/pull/346">#346</a>)</li>
<li>I figured out a way to use nginx&rsquo;s map function to assign a &ldquo;bot&rdquo; user agent to misbehaving clients who don&rsquo;t define a user agent</li>
<li>Most bots are automatically lumped into one generic session by <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Tomcat&rsquo;s Crawler Session Manager Valve</a> but this only works if their user agent matches a pre-defined regular expression like <code>.*[bB]ot.*</code></li>
<li>Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process</li>
<li>Basically, we modify the nginx config to add a mapping with a modified user agent <code>$ua</code>:</li>
</ul>
<pre><code>map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 'ChineseBot';
default $http_user_agent;
}
</code></pre>
<ul>
<li>If the client&rsquo;s address matches then the user agent is set, otherwise the default <code>$http_user_agent</code> variable is used</li>
<li>Then, in the server&rsquo;s <code>location /</code> block we pass this header to Tomcat:</li>
</ul>
<pre><code>proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
</code></pre>
<ul>
<li>Note to self: the <code>$ua</code> variable won&rsquo;t show up in nginx access logs because the default <code>combined</code> log format doesn&rsquo;t include it, so don&rsquo;t run around pulling your hair out wondering why the modified user agents aren&rsquo;t showing in the logs!</li>
<li>If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve</li>
<li>You can verify by cross referencing nginx&rsquo;s <code>access.log</code> and DSpace&rsquo;s <code>dspace.log.2017-11-08</code>, for example</li>
</ul>

@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-11/</loc>
<lastmod>2017-11-07T18:23:10+02:00</lastmod>
<lastmod>2017-11-08T09:08:32+02:00</lastmod>
</url>
<url>
@ -134,7 +134,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2017-11-07T18:23:10+02:00</lastmod>
<lastmod>2017-11-08T09:08:32+02:00</lastmod>
<priority>0</priority>
</url>
@ -145,7 +145,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2017-11-07T18:23:10+02:00</lastmod>
<lastmod>2017-11-08T09:08:32+02:00</lastmod>
<priority>0</priority>
</url>
@ -157,13 +157,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2017-11-07T18:23:10+02:00</lastmod>
<lastmod>2017-11-08T09:08:32+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2017-11-07T18:23:10+02:00</lastmod>
<lastmod>2017-11-08T09:08:32+02:00</lastmod>
<priority>0</priority>
</url>