Mirror of https://github.com/alanorth/cgspace-notes.git (synced 2024-12-23 13:34:32 +01:00)
Update notes for 2017-11-08
Parent: 4147b78b76
Commit: f775c480d8
@@ -407,11 +407,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-

 - Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM
 - Linode sent another alert about CPU usage in the morning at 6:12AM
-- Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:
+- Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:

 ```
-# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
-22510
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
+24981
 ```

 - This is about 20,000 Tomcat sessions:
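The updated command in the hunk above adds `grep -v pdf.jpg`, presumably so that thumbnail requests such as `report.pdf.jpg` are not counted as PDF downloads. A minimal sketch of the same count as a reusable shell function follows. The function name `count_pdf_downloads`, the sample log lines, and the documentation-range IP addresses are assumptions made for illustration, not data from the server. Unlike the original pipeline, it compares the client address exactly against the first field of each line.

```shell
#!/bin/sh
# Hypothetical helper: count PDF requests from one client in nginx
# access-log lines read from stdin.
count_pdf_downloads() {
    # Keep lines whose first field (the client address) equals the argument,
    # drop thumbnail requests, then count the remaining lines mentioning "pdf".
    awk -v ip="$1" '$1 == ip' | grep -v -F 'pdf.jpg' | grep -c -F 'pdf'
}

# Synthetic sample lines (invented for this sketch, not real log data).
sample_log='192.0.2.10 - - [08/Nov/2017:06:12:01 +0100] "GET /files/report.pdf HTTP/1.1" 200 1024 "-" "-"
192.0.2.10 - - [08/Nov/2017:06:12:02 +0100] "GET /files/report.pdf.jpg HTTP/1.1" 200 512 "-" "-"
192.0.2.10 - - [08/Nov/2017:06:12:03 +0100] "GET /files/brief.pdf HTTP/1.1" 200 2048 "-" "-"
198.51.100.7 - - [08/Nov/2017:06:12:04 +0100] "GET /files/brief.pdf HTTP/1.1" 200 2048 "-" "-"'

printf '%s\n' "$sample_log" | count_pdf_downloads 192.0.2.10
```

On the real server the input would be the concatenated access logs, filtered by date first as in the original command.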
@@ -424,3 +424,29 @@ $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z
 - I'm getting really sick of this
 - Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
 - I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
+- Run system updates on DSpace Test and reboot the server
+- Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them ([#346](https://github.com/ilri/DSpace/pull/346))
+- I figured out a way to use nginx's map function to assign a "bot" user agent to misbehaving clients who don't define a user agent
+- Most bots are automatically lumped into one generic session by [Tomcat's Crawler Session Manager Valve](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve) but this only works if their user agent matches a pre-defined regular expression like `.*[bB]ot.*`
+- Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
+- Basically, we modify the nginx config to add a mapping with a modified user agent `$ua`:
+
+```
+map $remote_addr $ua {
+    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
+    124.17.34.59 'ChineseBot';
+    default $http_user_agent;
+}
+```
+
+- If the client's address matches then the user agent is set, otherwise the default `$http_user_agent` variable is used
+- Then, in the server's `/` block we pass this header to Tomcat:
+
+```
+proxy_pass http://tomcat_http;
+proxy_set_header User-Agent $ua;
+```
+
+- Note to self: the `$ua` variable won't show up in nginx access logs because the default `combined` log format doesn't show it, so don't run around pulling your hair out wondering why the modified user agents aren't showing in the logs!
+- If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
+- You can verify by cross-referencing nginx's `access.log` and DSpace's `dspace.log.2017-11-08`, for example
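The last note says the change can be verified against the DSpace log. One way to do that, sketched below, is to count the distinct Tomcat session IDs recorded for a client, reusing the `session_id=…:ip_addr=…` pattern from the grep commands quoted in the hunk headers. The function name `count_sessions`, the sample lines, and the documentation-range IP addresses are assumptions for illustration; real DSpace log lines carry more fields than shown here.

```shell
#!/bin/sh
# Hypothetical helper: count distinct session IDs logged for one client IP
# in DSpace log lines read from stdin.
count_sessions() (
    # Escape the dots so they match literally in the extended regex.
    escaped_ip=$(printf '%s' "$1" | sed 's/\./\\./g')
    grep -o -E "session_id=[A-Z0-9]{32}:ip_addr=${escaped_ip}" | sort -u | wc -l | tr -d ' '
)

# Synthetic sample lines (invented for this sketch, not real log data).
sample_log='session_id=0123456789ABCDEF0123456789ABCDEF:ip_addr=192.0.2.10:view_item
session_id=0123456789ABCDEF0123456789ABCDEF:ip_addr=192.0.2.10:view_bitstream
session_id=FEDCBA9876543210FEDCBA9876543210:ip_addr=192.0.2.10:view_item
session_id=FEDCBA9876543210FEDCBA9876543210:ip_addr=198.51.100.7:view_item'

printf '%s\n' "$sample_log" | count_sessions 192.0.2.10
```

If the valve is pooling the client's requests as described, this count should stay small for log entries written after the nginx change, however many requests the client makes. The pattern is not anchored after the address, so an address that is a prefix of another (for example `192.0.2.1` and `192.0.2.10`) would be counted together.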
@@ -38,7 +38,7 @@ COPY 54701

 <meta property="article:published_time" content="2017-11-02T09:37:54+02:00"/>

-<meta property="article:modified_time" content="2017-11-07T18:23:10+02:00"/>
+<meta property="article:modified_time" content="2017-11-08T09:08:32+02:00"/>

@@ -86,9 +86,9 @@ COPY 54701
 "@type": "BlogPosting",
 "headline": "November, 2017",
 "url": "https://alanorth.github.io/cgspace-notes/2017-11/",
-"wordCount": "2241",
+"wordCount": "2505",
 "datePublished": "2017-11-02T09:37:54+02:00",
-"dateModified": "2017-11-07T18:23:10+02:00",
+"dateModified": "2017-11-08T09:08:32+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -601,11 +601,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
 <ul>
 <li>Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM</li>
 <li>Linode sent another alert about CPU usage in the morning at 6:12AM</li>
-<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:</li>
+<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:</li>
 </ul>

-<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
-22510
+<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
+24981
 </code></pre>

 <ul>
@@ -620,6 +620,34 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
 <li>I’m getting really sick of this</li>
 <li>Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections</li>
 <li>I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test</li>
+<li>Run system updates on DSpace Test and reboot the server</li>
+<li>Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (<a href="https://github.com/ilri/DSpace/pull/346">#346</a>)</li>
+<li>I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent</li>
+<li>Most bots are automatically lumped into one generic session by <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Tomcat’s Crawler Session Manager Valve</a> but this only works if their user agent matches a pre-defined regular expression like <code>.*[bB]ot.*</code></li>
+<li>Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process</li>
+<li>Basically, we modify the nginx config to add a mapping with a modified user agent <code>$ua</code>:</li>
+</ul>
+
+<pre><code>map $remote_addr $ua {
+    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
+    124.17.34.59 'ChineseBot';
+    default $http_user_agent;
+}
+</code></pre>
+
+<ul>
+<li>If the client’s address matches then the user agent is set, otherwise the default <code>$http_user_agent</code> variable is used</li>
+<li>Then, in the server’s <code>/</code> block we pass this header to Tomcat:</li>
+</ul>
+
+<pre><code>proxy_pass http://tomcat_http;
+proxy_set_header User-Agent $ua;
+</code></pre>
+
+<ul>
+<li>Note to self: the <code>$ua</code> variable won’t show up in nginx access logs because the default <code>combined</code> log format doesn’t show it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs!</li>
+<li>If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve</li>
+<li>You can verify by cross-referencing nginx’s <code>access.log</code> and DSpace’s <code>dspace.log.2017-11-08</code>, for example</li>
 </ul>

@@ -4,7 +4,7 @@

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/2017-11/</loc>
-<lastmod>2017-11-07T18:23:10+02:00</lastmod>
+<lastmod>2017-11-08T09:08:32+02:00</lastmod>
 </url>

 <url>
@@ -134,7 +134,7 @@

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/</loc>
-<lastmod>2017-11-07T18:23:10+02:00</lastmod>
+<lastmod>2017-11-08T09:08:32+02:00</lastmod>
 <priority>0</priority>
 </url>

@@ -145,7 +145,7 @@

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
-<lastmod>2017-11-07T18:23:10+02:00</lastmod>
+<lastmod>2017-11-08T09:08:32+02:00</lastmod>
 <priority>0</priority>
 </url>

@@ -157,13 +157,13 @@

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/post/</loc>
-<lastmod>2017-11-07T18:23:10+02:00</lastmod>
+<lastmod>2017-11-08T09:08:32+02:00</lastmod>
 <priority>0</priority>
 </url>

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
-<lastmod>2017-11-07T18:23:10+02:00</lastmod>
+<lastmod>2017-11-08T09:08:32+02:00</lastmod>
 <priority>0</priority>
 </url>
