Update notes for 2017-11-08

This commit is contained in:
2017-11-08 14:17:04 +02:00
parent 4147b78b76
commit f775c480d8
3 changed files with 68 additions and 14 deletions

View File

@ -38,7 +38,7 @@ COPY 54701
<meta property="article:published_time" content="2017-11-02T09:37:54&#43;02:00"/>
<meta property="article:modified_time" content="2017-11-07T18:23:10&#43;02:00"/>
<meta property="article:modified_time" content="2017-11-08T09:08:32&#43;02:00"/>
@ -86,9 +86,9 @@ COPY 54701
"@type": "BlogPosting",
"headline": "November, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-11/",
"wordCount": "2241",
"wordCount": "2505",
"datePublished": "2017-11-02T09:37:54&#43;02:00",
"dateModified": "2017-11-07T18:23:10&#43;02:00",
"dateModified": "2017-11-08T09:08:32&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -601,11 +601,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<ul>
<li>Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM</li>
<li>Linode sent another alert about CPU usage in the morning at 6:12AM</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -c pdf
22510
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
</code></pre>
<ul>
@ -620,6 +620,34 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>I&rsquo;m getting really sick of this</li>
<li>Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections</li>
<li>I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test</li>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (<a href="https://github.com/ilri/DSpace/pull/346">#346</a>)</li>
<li>I figured out a way to use nginx&rsquo;s map function to assign a &ldquo;bot&rdquo; user agent to misbehaving clients who don&rsquo;t define a user agent</li>
<li>Most bots are automatically lumped into one generic session by <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Tomcat&rsquo;s Crawler Session Manager Valve</a> but this only works if their user agent matches a pre-defined regular expression like <code>.*[bB]ot.*</code></li>
<li>Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process</li>
<li>Basically, we modify the nginx config to add a mapping with a modified user agent <code>$ua</code>:</li>
</ul>
<pre><code>map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 'ChineseBot';
default $http_user_agent;
}
</code></pre>
<ul>
<li>If the client&rsquo;s address matches then the user agent is set, otherwise the default <code>$http_user_agent</code> variable is used</li>
<li>Then, in the server&rsquo;s <code>/</code> block we pass this header to Tomcat:</li>
</ul>
<pre><code>proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
</code></pre>
<ul>
<li>Note to self: the <code>$ua</code> variable won&rsquo;t show up in nginx access logs because the default <code>combined</code> log format doesn&rsquo;t show it, so don&rsquo;t run around pulling your hair out wondering with the modified user agents aren&rsquo;t showing in the logs!</li>
<li>If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve</li>
<li>You can verify by cross referencing nginx&rsquo;s <code>access.log</code> and DSpace&rsquo;s <code>dspace.log.2017-11-08</code>, for example</li>
</ul>