<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="February, 2019" /> <meta property="og:description" content="2019-02-01 Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again! The top IPs before, during, and after this latest alert tonight were: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 245 207.46.13.5 332 54.70.40.11 385 5.143.231.38 405 207.46.13.173 405 207.46.13.75 1117 66.249.66.219 1121 35.237.175.180 1546 5.9.6.51 2474 45.5.186.2 5490 85.25.237.71 85.25.237.71 is the “Linguee Bot” that I first saw last month The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase There were just over 3 million accesses in the nginx logs last month: # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019" 3018243 real 0m19.873s user 0m22.203s sys 0m1.979s " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-02/" /> <meta property="article:published_time" content="2019-02-01T21:37:30+02:00"/> <meta property="article:modified_time" content="2019-02-03T11:33:02+02:00"/> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="February, 2019"/> <meta name="twitter:description" content="2019-02-01 Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again! 
The top IPs before, during, and after this latest alert tonight were: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 245 207.46.13.5 332 54.70.40.11 385 5.143.231.38 405 207.46.13.173 405 207.46.13.75 1117 66.249.66.219 1121 35.237.175.180 1546 5.9.6.51 2474 45.5.186.2 5490 85.25.237.71 85.25.237.71 is the “Linguee Bot” that I first saw last month The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase There were just over 3 million accesses in the nginx logs last month: # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019" 3018243 real 0m19.873s user 0m22.203s sys 0m1.979s "/> <meta name="generator" content="Hugo 0.54.0" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "February, 2019", "url": "https://alanorth.github.io/cgspace-notes/2019-02/", "wordCount": "659", "datePublished": "2019-02-01T21:37:30+02:00", "dateModified": "2019-02-03T11:33:02+02:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-02/"> <title>February, 2019 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-6+EGfPoOzk/n2DVJSlglKT8TV1TgIMvVcKI73IZgBswLasPBn94KommV6ilJqCXE" crossorigin="anonymous"> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description">Documenting day-to-day work on the <a 
href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-02/">February, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-02-01T21:37:30+02:00">Fri Feb 01, 2019</time> by Alan Orth in <i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a> </p>
</header>
<h2 id="2019-02-01">2019-02-01</h2>
<ul>
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%. I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    245 207.46.13.5
    332 54.70.40.11
    385 5.143.231.38
    405 207.46.13.173
    405 207.46.13.75
   1117 66.249.66.219
   1121 35.237.175.180
   1546 5.9.6.51
   2474 45.5.186.2
   5490 85.25.237.71
</code></pre>
<ul>
<li><code>85.25.237.71</code> is the “Linguee Bot” that I first saw last month</li>
<li>The Solr statistics for the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243

real    0m19.873s
user    0m22.203s
sys     0m1.979s
</code></pre>
<ul>
<li>Normally I’d say this was very high, but <a href="/cgspace-notes/2018-02/">about this time last year</a> I remember thinking the same thing when we had 3.1 million…</li>
<li>I will have to keep an eye on this to see if there is some error in Solr…</li>
<li>Atmire sent their <a 
href="https://github.com/ilri/DSpace/pull/407">pull request to re-enable the Metadata Quality Module (MQM) on our <code>5_x-dev</code> branch</a> today
<ul>
<li>I will test it next week and send them feedback</li>
</ul></li>
</ul>
<h2 id="2019-02-02">2019-02-02</h2>
<ul>
<li>Another alert from Linode about CGSpace (linode18) this morning; here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    284 18.195.78.144
    329 207.46.13.32
    417 35.237.175.180
    448 34.218.226.147
    694 2a01:4f8:13b:1296::2
    718 2a01:4f8:140:3192::2
    786 137.108.70.14
   1002 5.9.6.51
   6077 85.25.237.71
   8726 45.5.184.2
</code></pre>
<ul>
<li><code>45.5.184.2</code> is CIAT and <code>85.25.237.71</code> is the new Linguee bot that I first noticed a few days ago</li>
<li>I will increase the Linode alert threshold from 275% to 300% because this is becoming too much!</li>
<li>I tested the Atmire Metadata Quality Module (MQM)’s duplicate checker on some <a href="https://dspacetest.cgiar.org/handle/10568/81268">WLE items</a> that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!</li>
</ul>
<h2 id="2019-02-03">2019-02-03</h2>
<ul>
<li>This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!</li>
<li>Here are the top IPs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    325 85.25.237.71
    340 45.5.184.72
    431 5.143.231.8
    756 5.9.6.51
   1048 34.218.226.147
   1203 66.249.66.219
   1496 195.201.104.240
   4658 205.186.128.185
   4658 70.32.83.92
   4852 45.5.184.2
</code></pre>
<ul>
<li><code>45.5.184.2</code> is CIAT, <code>70.32.83.92</code> and 
<code>205.186.128.185</code> are Macaroni Bros harvesters for CCAFS I think</li>
<li><code>195.201.104.240</code> is a new IP address in Germany with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre>
<ul>
<li>This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
     19 03/Feb/2019:07:42
     20 03/Feb/2019:07:12
     21 03/Feb/2019:07:27
     21 03/Feb/2019:07:28
     25 03/Feb/2019:07:23
     25 03/Feb/2019:07:29
     26 03/Feb/2019:07:33
     28 03/Feb/2019:07:38
     30 03/Feb/2019:07:31
     33 03/Feb/2019:07:35
     33 03/Feb/2019:07:37
     38 03/Feb/2019:07:40
     43 03/Feb/2019:07:24
     43 03/Feb/2019:07:32
     46 03/Feb/2019:07:36
     47 03/Feb/2019:07:34
     47 03/Feb/2019:07:39
     47 03/Feb/2019:07:41
     51 03/Feb/2019:07:26
     59 03/Feb/2019:07:25
</code></pre>
<ul>
<li>At least they re-used their Tomcat session!</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
1
</code></pre>
<ul>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
<ul>
<li>I <a href="https://github.com/ilri/rmg-ansible-public/commit/36dfb072d6724fb5cdc81ef79cab08ed9ce427ad">extended the existing <code>dynamicpages</code> (12/m) rate limit to <code>/browse</code> and <code>/discover</code></a> with an allowance for bursting of up to five requests for “real” users</li>
</ul></li>
<li>Run all system updates on linode20 and reboot it
<ul>
<li>This will be the new AReS repository explorer server soon</li>
</ul></li>
</ul>
<!-- vim: set sw=2 ts=2: -->
</article>
</div> <!-- /.blog-main -->
<aside 
class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2019-02/">February, 2019</a></li> <li><a href="/cgspace-notes/2019-01/">January, 2019</a></li> <li><a href="/cgspace-notes/2018-12/">December, 2018</a></li> <li><a href="/cgspace-notes/2018-11/">November, 2018</a></li> <li><a href="/cgspace-notes/2018-10/">October, 2018</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>