Update notes for 2020-02-24

Alan Orth 2020-02-24 19:15:05 +02:00
parent 8e066fa46c
commit 08adaae29b
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
8 changed files with 176 additions and 17 deletions

@ -717,6 +717,56 @@ Total number of bot hits purged: 5535399
![CGSpace stats by year since 2011 after the purge](/cgspace-notes/2020/02/cgspace-stats-years.png)
- I'm a little suspicious of the 2012, 2013, and 2014 numbers though, I should facet those years by IP and see if there were any large ones
- I'm a little suspicious of the 2012, 2013, and 2014 numbers, though
- I should facet those years by IP and see if any stand out...
- The next thing I need to do is figure out why the nginx IP to bot mapping isn't working...
- Actually, and I've probably learned this before, the bot mapping is working: nginx only logs the real user agent (of course!), because I'm only using the mapped one in the proxy pass...
- This trick for adding a header with the mapped "ua" variable is nice:
```
add_header X-debug-message "ua is $ua" always;
```
- Then in the HTTP response you see:
```
X-debug-message: ua is bot
```
- So the IP to bot mapping is working, phew.
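- To make the logic concrete, here is a minimal Python sketch of what the mapping does, not the nginx configuration itself. The `BOT_NETWORKS` list and the `mapped_ua` helper are hypothetical names for illustration; the addresses are ones mentioned elsewhere in these notes, and the real map file may differ:

```python
import ipaddress

# Hypothetical bot list for illustration only; the real nginx map may hold other entries.
BOT_NETWORKS = [
    ipaddress.ip_network("34.218.226.147/32"),
    ipaddress.ip_network("172.104.229.92/32"),
    ipaddress.ip_network("2a01:7e00::f03c:91ff:fe9a:3a37/128"),
]


def mapped_ua(client_ip: str, real_ua: str) -> str:
    """Return "bot" for listed addresses, otherwise the client's own user agent."""
    addr = ipaddress.ip_address(client_ip)
    for net in BOT_NETWORKS:
        # Membership tests across IP versions are simply False,
        # so IPv4 and IPv6 entries can share one list.
        if addr in net:
            return "bot"
    return real_ua


# The access log keeps the real user agent; only the proxied request gets the mapped one.
logged_ua = "Mozilla/5.0"
upstream_ua = mapped_ua("34.218.226.147", logged_ua)
```

- In this model `logged_ua` stays untouched while `upstream_ua` becomes "bot", which matches why the access log looked as if the mapping was not applied.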
- More bad news, I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
- For example, ciat.cgiar.org was previously 104.196.152.243, but it is now 35.237.175.180, which I had noticed as a "mystery" client on Google Cloud in 2018-09
- Others I should probably add to the nginx bot map list are:
- wle.cgiar.org (70.32.90.172)
- ccafs.cgiar.org (205.186.128.185)
- another CIAT scraper using the PHP GuzzleHttp library (45.5.184.72)
- macaronilab.com ([63.32.242.35](https://viewdns.info/reverseip/?host=63.32.242.35&t=1))
- africa-rising.net ([162.243.171.159](https://viewdns.info/reverseip/?host=162.243.171.159&t=1))
- These IPs are all active in the REST API logs over the last few months and they account for *thirty-four million* more hits in the statistics!
- I purged them from CGSpace:
```
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 15 hits from 104.196.152.243 in statistics
Purging 61064 hits from 35.237.175.180 in statistics
Purging 1378 hits from 70.32.90.172 in statistics
Purging 28880 hits from 205.186.128.185 in statistics
Purging 464613 hits from 63.32.242.35 in statistics
Purging 131 hits from 162.243.171.159 in statistics
Total number of bot hits purged: 556081
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 684888 hits from 104.196.152.243 in statistics-2018
Purging 323737 hits from 35.227.26.162 in statistics-2018
Purging 221091 hits from 35.237.175.180 in statistics-2018
Purging 3834 hits from 205.186.128.185 in statistics-2018
Purging 20337 hits from 63.32.242.35 in statistics-2018
Total number of bot hits purged: 1253887
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 1752548 hits from 104.196.152.243 in statistics-2017
Total number of bot hits purged: 1752548
```
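- A quick way to sanity check the purge output is to sum the per-IP counts and compare them with the reported total. This is a minimal Python sketch that assumes the output format shown above and one run per core; `check_purge_log` and `SAMPLE_LOG` are hypothetical names, and the sample is copied from the statistics-2018 run:

```python
import re

PURGE_LINE = re.compile(r"^Purging (\d+) hits from (\S+) in (\S+)$")
TOTAL_LINE = re.compile(r"^Total number of bot hits purged: (\d+)$")

# Copied from the statistics-2018 run above.
SAMPLE_LOG = """\
Purging 684888 hits from 104.196.152.243 in statistics-2018
Purging 323737 hits from 35.227.26.162 in statistics-2018
Purging 221091 hits from 35.237.175.180 in statistics-2018
Purging 3834 hits from 205.186.128.185 in statistics-2018
Purging 20337 hits from 63.32.242.35 in statistics-2018
Total number of bot hits purged: 1253887
"""


def check_purge_log(text: str) -> dict:
    """Sum per-IP counts for each core and compare them with the reported totals."""
    sums = {}
    pending = None  # core whose "Total" line has not been seen yet
    for raw in text.splitlines():
        line = raw.strip()
        purge = PURGE_LINE.match(line)
        if purge:
            count, core = int(purge.group(1)), purge.group(3)
            sums[core] = sums.get(core, 0) + count
            pending = core
            continue
        total = TOTAL_LINE.match(line)
        if total and pending is not None:
            reported = int(total.group(1))
            if reported != sums[pending]:
                raise ValueError(
                    f"{pending}: reported {reported}, summed {sums[pending]}"
                )
            pending = None
    return sums


result = check_purge_log(SAMPLE_LOG)
```

- For the sample, the summed counts agree with the reported total of 1253887; a mismatch raises a `ValueError` naming the core.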
<!-- vim: set sw=2 ts=2: -->

@ -14,7 +14,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-10/" />
<meta property="article:published_time" content="2018-10-01T22:31:54+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta property="article:modified_time" content="2020-02-23T20:10:47+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2018"/>
@ -35,7 +35,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2018-10\/",
"wordCount": "4518",
"datePublished": "2018-10-01T22:31:54+03:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"dateModified": "2020-02-23T20:10:47+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

@ -25,7 +25,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-05/" />
<meta property="article:published_time" content="2019-05-01T07:37:43+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta property="article:modified_time" content="2020-02-24T18:07:35+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2019"/>
@ -57,7 +57,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-05\/",
"wordCount": "3190",
"datePublished": "2019-05-01T07:37:43+03:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"dateModified": "2020-02-24T18:07:35+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -349,7 +349,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
3 2a01:7e00::f03c:91ff:fe0a:d645
113 63.32.242.35
</code></pre><ul>
<li>According to <a href="https://viewdns.info/reverseip/?host=63.32.242.35&amp;t=1">viewdns.info</a> that server belongs to Macaroni Brothers&rsquo;
<li>According to <a href="https://viewdns.info/reverseip/?host=63.32.242.35&amp;t=1">viewdns.info</a> that server belongs to Macaroni Brothers
<ul>
<li>The user agent of their non-REST API requests from the same IP is Drupal</li>
<li>This is one very good reason to limit REST API requests, and perhaps to enable caching via nginx</li>

@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
<meta property="article:modified_time" content="2020-02-23T09:16:50+02:00" />
<meta property="article:modified_time" content="2020-02-24T14:11:56+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2020"/>
@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
"@type": "BlogPosting",
"headline": "February, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
"wordCount": "4210",
"wordCount": "4888",
"datePublished": "2020-02-02T11:56:30+02:00",
"dateModified": "2020-02-23T09:16:50+02:00",
"dateModified": "2020-02-24T14:11:56+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -788,8 +788,117 @@ dspace.log.2020-01-21:4
</ul>
</li>
<li>Those are the requests AReS and ILRI servers are making&hellip; nearly 150,000 per day!</li>
<li>Well that settles it!</li>
</ul>
<!-- raw HTML omitted -->
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:12,&quot;start&quot;:0,&quot;docs&quot;:[
$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=82450'
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:62,&quot;start&quot;:0,&quot;docs&quot;:[
</code></pre><ul>
<li>A REST request with <code>limit=50</code> will create exactly fifty <code>statistics_type=view</code> records in the Solr core&hellip; fuck.
<ul>
<li>So not only do I need to purge all these millions of hits, we need to add these IPs to the list of spider IPs so they don&rsquo;t get recorded</li>
</ul>
</li>
</ul>
<h2 id="2020-02-24">2020-02-24</h2>
<ul>
<li>I tried to add some IPs to the DSpace spider list so they would not get recorded in Solr statistics, but it doesn&rsquo;t support IPv6
<ul>
<li>A better method is actually to just use the nginx mapping logic we already have to reset the user agent for these requests to &ldquo;bot&rdquo;</li>
<li>That, or to really insist that users harvesting us specify some kind of user agent</li>
</ul>
</li>
<li>I tried to add the IPs to our nginx IP bot mapping but it doesn&rsquo;t seem to work&hellip; WTF, why is everything broken?!</li>
<li>Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far:</li>
</ul>
<pre><code>$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:42395486,&quot;start&quot;:0,&quot;docs&quot;:[]
</code></pre><ul>
<li>I modified my <code>check-spider-hits.sh</code> script to create a version that works with IPs and purged about 48 million stats from Solr on CGSpace:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
Purging 22809216 hits from 34.218.226.147 in statistics
Purging 19586270 hits from 172.104.229.92 in statistics
Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
Purging 271668 hits from 2a01:7e00::f03c:91ff:fe18:7396 in statistics
Total number of bot hits purged: 42778291
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics-2018 -p
Purging 5535399 hits from 34.218.226.147 in statistics-2018
Total number of bot hits purged: 5535399
</code></pre><ul>
<li>(The <code>statistics</code> core holds 2019 and 2020 stats, because the yearly sharding process failed this year)</li>
<li>Attached is a before and after of the period from 2019-01 to 2020-02:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/cgspace-stats-before.png" alt="CGSpace stats for 2019 and 2020 before the purge"></p>
<p><img src="/cgspace-notes/2020/02/cgspace-stats-after.png" alt="CGSpace stats for 2019 and 2020 after the purge"></p>
<ul>
<li>And here is a graph of the stats by year since 2011:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/cgspace-stats-years.png" alt="CGSpace stats by year since 2011 after the purge"></p>
<ul>
<li>I&rsquo;m a little suspicious of the 2012, 2013, and 2014 numbers, though
<ul>
<li>I should facet those years by IP and see if any stand out&hellip;</li>
</ul>
</li>
<li>The next thing I need to do is figure out why the nginx IP to bot mapping isn&rsquo;t working&hellip;
<ul>
<li>Actually, and I&rsquo;ve probably learned this before, the bot mapping is working: nginx only logs the real user agent (of course!), because I&rsquo;m only using the mapped one in the proxy pass&hellip;</li>
<li>This trick for adding a header with the mapped &ldquo;ua&rdquo; variable is nice:</li>
</ul>
</li>
</ul>
<pre><code>add_header X-debug-message &quot;ua is $ua&quot; always;
</code></pre><ul>
<li>Then in the HTTP response you see:</li>
</ul>
<pre><code>X-debug-message: ua is bot
</code></pre><ul>
<li>So the IP to bot mapping is working, phew.</li>
<li>More bad news, I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
<ul>
<li>For example, ciat.cgiar.org was previously 104.196.152.243, but it is now 35.237.175.180, which I had noticed as a &ldquo;mystery&rdquo; client on Google Cloud in 2018-09</li>
<li>Others I should probably add to the nginx bot map list are:
<ul>
<li>wle.cgiar.org (70.32.90.172)</li>
<li>ccafs.cgiar.org (205.186.128.185)</li>
<li>another CIAT scraper using the PHP GuzzleHttp library (45.5.184.72)</li>
<li>macaronilab.com (<a href="https://viewdns.info/reverseip/?host=63.32.242.35&amp;t=1">63.32.242.35</a>)</li>
<li>africa-rising.net (<a href="https://viewdns.info/reverseip/?host=162.243.171.159&amp;t=1">162.243.171.159</a>)</li>
</ul>
</li>
</ul>
</li>
<li>These IPs are all active in the REST API logs over the last few months and they account for <em>thirty-four million</em> more hits in the statistics!</li>
<li>I purged them from CGSpace:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 15 hits from 104.196.152.243 in statistics
Purging 61064 hits from 35.237.175.180 in statistics
Purging 1378 hits from 70.32.90.172 in statistics
Purging 28880 hits from 205.186.128.185 in statistics
Purging 464613 hits from 63.32.242.35 in statistics
Purging 131 hits from 162.243.171.159 in statistics
Total number of bot hits purged: 556081
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 684888 hits from 104.196.152.243 in statistics-2018
Purging 323737 hits from 35.227.26.162 in statistics-2018
Purging 221091 hits from 35.237.175.180 in statistics-2018
Purging 3834 hits from 205.186.128.185 in statistics-2018
Purging 20337 hits from 63.32.242.35 in statistics-2018
Total number of bot hits purged: 1253887
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 1752548 hits from 104.196.152.243 in statistics-2017
Total number of bot hits purged: 1752548
</code></pre><!-- raw HTML omitted -->

Binary file not shown (3.8 KiB).

Binary file not shown (3.5 KiB).

Binary file not shown (4.0 KiB).

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-02-23T09:16:50+02:00</lastmod>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-02-23T09:16:50+02:00</lastmod>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-02/</loc>
<lastmod>2020-02-23T09:16:50+02:00</lastmod>
<lastmod>2020-02-24T14:11:56+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-02-23T09:16:50+02:00</lastmod>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-02-23T09:16:50+02:00</lastmod>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
</url>
<url>
@ -84,7 +84,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-05/</loc>
<lastmod>2019-10-28T13:39:25+02:00</lastmod>
<lastmod>2020-02-24T18:07:35+02:00</lastmod>
</url>
<url>
@ -119,7 +119,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-10/</loc>
<lastmod>2019-10-28T13:39:25+02:00</lastmod>
<lastmod>2020-02-23T20:10:47+02:00</lastmod>
</url>
<url>