Add notes for 2018-12-04 and regenerate

This commit is contained in:
2018-12-04 10:02:51 +02:00
parent 5f051ca9ee
commit 0d1f50665a
6 changed files with 195 additions and 30 deletions

View File

@ -21,7 +21,7 @@ Today these are the top 10 IPs:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-11/" /><meta property="article:published_time" content="2018-11-01T16:41:30&#43;02:00"/>
<meta property="article:modified_time" content="2018-11-28T09:32:04&#43;02:00"/>
<meta property="article:modified_time" content="2018-12-04T09:50:36&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2018"/>
@ -48,9 +48,9 @@ Today these are the top 10 IPs:
"@type": "BlogPosting",
"headline": "November, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-11/",
"wordCount": "2698",
"wordCount": "2823",
"datePublished": "2018-11-01T16:41:30&#43;02:00",
"dateModified": "2018-11-28T09:32:04&#43;02:00",
"dateModified": "2018-12-04T09:50:36&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -156,7 +156,7 @@ Today these are the top 10 IPs:
<li>They at least seem to be re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
342
</code></pre>
@ -173,7 +173,7 @@ Today these are the top 10 IPs:
<li>And it doesn&rsquo;t seem they are re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
1243
</code></pre>
@ -205,14 +205,17 @@ Today these are the top 10 IPs:
</code></pre>
<ul>
<li>It&rsquo;s making lots of requests and using quite a number of Tomcat sessions:</li>
<li>It&rsquo;s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' /home/cgspace.cgiar.org/log/dspace.log.2018-11-03 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
8449
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
1
</code></pre>
<ul>
<li><em>Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions</em></li>
<li>I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing</li>
<li>Perhaps I should think about adding rate limits to dynamic pages like <code>/discover</code> and <code>/browse</code></li>
<li>I think it&rsquo;s reasonable for a human to click one of those links five or ten times a minute&hellip;</li>
@ -270,14 +273,17 @@ Today these are the top 10 IPs:
</code></pre>
<ul>
<li><code>78.46.89.18</code> is back&hellip; and still making tons of Tomcat sessions:</li>
<li><code>78.46.89.18</code> is back&hellip; and it is still actually re-using its Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
8765
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
1
</code></pre>
<ul>
<li><em>Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly</em></li>
<li>Also, now we have a ton of Facebook crawlers:</li>
</ul>
@ -302,14 +308,15 @@ Today these are the top 10 IPs:
</code></pre>
<ul>
<li>They are really making shit tons of Tomcat sessions:</li>
<li>They are really making shit tons of requests:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
14368
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
37721
</code></pre>
<ul>
<li><em>Updated on 2018-12-04 to correct the grep command to accurately show the number of requests</em></li>
<li>Their user agent is:</li>
</ul>
@ -342,16 +349,17 @@ Today these are the top 10 IPs:
</code></pre>
<ul>
<li>And still making shit tons of Tomcat sessions:</li>
<li>Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
28470
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
37721
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
15206
</code></pre>
<ul>
<li>And that&rsquo;s even using the Tomcat Crawler Session Manager valve!</li>
<li>Maybe we need to limit more dynamic pages, like the &ldquo;most popular&rdquo; country, item, and author pages</li>
<li>I think we still need to limit more of the dynamic pages, like the &ldquo;most popular&rdquo; country, item, and author pages</li>
<li>It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!</li>
</ul>
@ -377,19 +385,22 @@ Today these are the top 10 IPs:
<li>The file <code>marlo.csv</code> was cleaned up and formatted in Open Refine</li>
<li>165 of the items in their 2017 data are from CGSpace!</li>
<li>I will add the data to CGSpace this week (done!)</li>
<li>Jesus, is Facebook <em>trying</em> to be annoying?</li>
<li>Jesus, is Facebook <em>trying</em> to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep -c &quot;2a03:2880:11ff:&quot;
29889
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq
29156
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
29763
# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
1057
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | grep -c -E &quot;(handle|bitstream)&quot;
29896
</code></pre>
<ul>
<li>29,000 requests from Facebook, 29,000 Tomcat sessions, and none of the requests are to the dynamic pages I rate limited yesterday!</li>
<li>29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!</li>
<li>At least the Tomcat Crawler Session Manager Valve is working now&hellip;</li>
</ul>
<h2 id="2018-11-06">2018-11-06</h2>