Update notes for 2020-07-24

Alan Orth 2020-07-24 23:23:15 +03:00
parent 6b75032413
commit 9e6ff5d999
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
21 changed files with 223 additions and 30 deletions

@ -670,5 +670,93 @@ $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H
- I closed all issues in the [OpenRXV](https://github.com/ilri/OpenRXV/issues) and [AReS](https://github.com/ilri/AReS/issues) GitHub repositories with screenshots so that Moayad can use them for his invoice
- The statistics-2018 core always crashes with the same error even after I deleted the "id:10" records...
- I started the statistics-2017 core and it finished in 3:44:15
- I started the statistics-2016 core and it finished in 2:27:08
- I started the statistics-2015 core and it finished in 1:07:38
## 2020-07-24
- Looking at the statistics-2019 Solr stats, I see some interesting user agents and IPs
- For example, I see 568,000 requests from 66.109.27.x in 2019-10, all with the same exact user agent:
```
Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
```
- Also, in the same month with the same *exact* user agent, I see 300,000 from 192.157.89.x
- The 66.109.27.x IPs belong to galaxyvisions.com
- The 192.157.89.x IPs belong to cologuard.com
- All these hosts were reported in late 2019 on abuseipdb.com
- Then I see another one, 163.172.71.23, that made 215,000 requests in 2019-09 and 2019-08
- It belongs to poneytelecom.eu and is also in abuseipdb.com for PHP injection and directory traversal
- It uses this user agent:
```
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
```
- In statistics-2018 I see more weird IPs
- 54.214.112.202 made 839,000 requests with no user agent...
- It is on Amazon Web Services (AWS) and 100% of its hits were `statistics_type:view`, so I guess it was harvesting via the REST API
- A few IPs owned by perfectip.net made 400,000 requests in 2018-01
- They are 2607:fa98:40:9:26b6:fdff:feff:195d and 2607:fa98:40:9:26b6:fdff:feff:1888 and 2607:fa98:40:9:26b6:fdff:feff:1c96
- All the requests used this user agent:
```
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
```
- Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it's definitely CodeObia / ICARDA and I will purge them
- Jesus, and then there are 100,000 requests from the ILRI harvester on Linode on 2a01:7e00::f03c:91ff:fe0a:d645
- Jesus fuck there is 46.101.86.248 making 15,000 requests per month in 2018 with no user agent...
- I will purge the hits from all the following IPs:
```
192.157.89.4
192.157.89.5
192.157.89.6
192.157.89.7
66.109.27.142
66.109.27.139
66.109.27.138
66.109.27.140
66.109.27.141
2607:fa98:40:9:26b6:fdff:feff:1888
2607:fa98:40:9:26b6:fdff:feff:195d
2607:fa98:40:9:26b6:fdff:feff:1c96
213.139.53.62
2a01:7e00::f03c:91ff:fe0a:d645
46.101.86.248
```
- In total these accounted for the following number of requests in each year:
- 2020: 1436
- 2019: 933148
- 2018: 613936
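
A minimal sketch of how the hits for one of these IPs might be purged, not necessarily the method actually used here. It assumes Solr is listening on localhost:8081 as in the curl command earlier in these notes, that the statistics cores store the client address in an `ip` field, and that Solr's XML delete-by-query is acceptable; the helper names `build_delete_xml` and `purge_ip` are made up for illustration. The IP is quoted so that the colons in IPv6 addresses are not read as query syntax.

```shell
#!/bin/sh
# Hypothetical helper: print a Solr delete-by-query XML body for one IP.
# The value is wrapped in double quotes so IPv6 colons are treated literally.
build_delete_xml() {
    printf '<delete><query>ip:"%s"</query></delete>\n' "$1"
}

# Hypothetical helper: post the delete to one statistics core.
# $1 is the core name (for example statistics-2018), $2 is the IP.
# Assumption: Solr on localhost:8081 and an "ip" field in the core.
purge_ip() {
    curl -s "http://localhost:8081/solr/$1/update?softCommit=true" \
        -H "Content-Type: text/xml" \
        --data-binary "$(build_delete_xml "$2")"
}

# Dry run: only print the request body, without contacting Solr.
build_delete_xml "2a01:7e00::f03c:91ff:fe0a:d645"
```

Printing the body first makes it easy to check the query before running something like `purge_ip statistics-2018 46.101.86.248` against a real core.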
- I noticed a few other user agents that should be purged too:
```
^Java\/\d{1,2}.\d
FlipboardProxy\/\d
API scraper
RebelMouse\/\d
Iframely\/\d
Python\/\d
Ruby
NING\/\d
ubermetrics-technologies\.com
Jetty\/\d
scalaj-http\/\d
mailto\:team@impactstory\.org
```
- I purged them from the stats too:
- 2020: 18153
- 2019: 29745
- 2018: 18083
- 2017: 19399
- 2016: 16283
- 2015: 16659
- 2014: 713
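
Before purging by user agent, it can help to check that a pattern matches what is expected. The sketch below is only an illustration: it takes a few of the patterns above, rewrites them as POSIX extended regular expressions (`\d` becomes `[0-9]` and the escaped slashes are written plainly), and tests them against sample user agent strings that are made up. The function name `matches_bot_pattern` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the user agent in $1 matches any pattern.
# The patterns are a subset of the list above, translated to POSIX ERE.
matches_bot_pattern() {
    printf '%s\n' "$1" | grep -Eq \
        -e '^Java/[0-9]{1,2}.[0-9]' \
        -e 'FlipboardProxy/[0-9]' \
        -e 'Python/[0-9]' \
        -e 'scalaj-http/[0-9]'
}

# Classify a few made-up sample user agents.
for agent in "Java/1.8.0_66" "Python/3.8 aiohttp/3.6.2" "Mozilla/5.0 (X11; Linux x86_64)"; do
    if matches_bot_pattern "$agent"; then
        echo "purge: $agent"
    else
        echo "keep:  $agent"
    fi
done
```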
<!-- vim: set sw=2 ts=2: -->

@ -24,7 +24,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-03/" />
<meta property="article:published_time" content="2019-03-01T12:16:30+01:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta property="article:modified_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2019"/>
@ -55,7 +55,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
"url": "https://alanorth.github.io/cgspace-notes/2019-03/",
"wordCount": "7105",
"datePublished": "2019-03-01T12:16:30+01:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"dateModified": "2020-07-24T21:57:55+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -951,7 +951,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
712 35.174.184.209
784 2a01:4f8:13b:1296::2
</code></pre><ul>
<li>The two IPV6 addresses are something called BLEXBot, which seems to check the robots.txt file and the completely ignore it by making thousands of requests to dynamic pages like Browse and Discovery</li>
<li>The two IPV6 addresses are something called BLEXBot, which seems to check the robots.txt file and then completely ignore it by making thousands of requests to dynamic pages like Browse and Discovery</li>
<li>Then <code>35.174.184.209</code> is MauiBot, which does the same thing</li>
<li>Also <code>3.91.79.74</code> does, which appears to be CCBot</li>
<li>I will add these three to the &ldquo;bad bot&rdquo; rate limiting that I originally used for Baidu</li>

@ -20,7 +20,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-07/" />
<meta property="article:published_time" content="2020-07-01T10:53:54+03:00" />
<meta property="article:modified_time" content="2020-07-22T11:00:40+03:00" />
<meta property="article:modified_time" content="2020-07-23T12:32:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="July, 2020"/>
@ -45,9 +45,9 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
"@type": "BlogPosting",
"headline": "July, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-07/",
"wordCount": "4352",
"wordCount": "4728",
"datePublished": "2020-07-01T10:53:54+03:00",
"dateModified": "2020-07-22T11:00:40+03:00",
"dateModified": "2020-07-23T12:32:11+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -802,7 +802,112 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<h2 id="2020-07-23">2020-07-23</h2>
<ul>
<li>I closed all issues in the <a href="https://github.com/ilri/OpenRXV/issues">OpenRXV</a> and <a href="https://github.com/ilri/AReS/issues">AReS</a> GitHub repositories with screenshots so that Moayad can use them for his invoice</li>
<li>The statistics-2018 core always crashes with the same error even after I deleted the &ldquo;id:10&rdquo; records&hellip;</li>
<li>The statistics-2018 core always crashes with the same error even after I deleted the &ldquo;id:10&rdquo; records&hellip;
<ul>
<li>I started the statistics-2017 core and it finished in 3:44:15</li>
<li>I started the statistics-2016 core and it finished in 2:27:08</li>
<li>I started the statistics-2015 core and it finished in 1:07:38</li>
</ul>
</li>
</ul>
<h2 id="2020-07-24">2020-07-24</h2>
<ul>
<li>Looking at the statistics-2019 Solr stats, I see some interesting user agents and IPs
<ul>
<li>For example, I see 568,000 requests from 66.109.27.x in 2019-10, all with the same exact user agent:</li>
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
</code></pre><ul>
<li>Also, in the same month with the same <em>exact</em> user agent, I see 300,000 from 192.157.89.x
<ul>
<li>The 66.109.27.x IPs belong to galaxyvisions.com</li>
<li>The 192.157.89.x IPs belong to cologuard.com</li>
<li>All these hosts were reported in late 2019 on abuseipdb.com</li>
</ul>
</li>
<li>Then I see another one, 163.172.71.23, that made 215,000 requests in 2019-09 and 2019-08
<ul>
<li>It belongs to poneytelecom.eu and is also in abuseipdb.com for PHP injection and directory traversal</li>
<li>It uses this user agent:</li>
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>In statistics-2018 I see more weird IPs
<ul>
<li>54.214.112.202 made 839,000 requests with no user agent&hellip;
<ul>
<li>It is on Amazon Web Services (AWS) and 100% of its hits were <code>statistics_type:view</code>, so I guess it was harvesting via the REST API</li>
</ul>
</li>
<li>A few IPs owned by perfectip.net made 400,000 requests in 2018-01
<ul>
<li>They are 2607:fa98:40:9:26b6:fdff:feff:195d and 2607:fa98:40:9:26b6:fdff:feff:1888 and 2607:fa98:40:9:26b6:fdff:feff:1c96</li>
<li>All the requests used this user agent:</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
</code></pre><ul>
<li>Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it&rsquo;s definitely CodeObia / ICARDA and I will purge them</li>
<li>Jesus, and then there are 100,000 requests from the ILRI harvester on Linode on 2a01:7e00::f03c:91ff:fe0a:d645</li>
<li>Jesus fuck there is 46.101.86.248 making 15,000 requests per month in 2018 with no user agent&hellip;</li>
<li>I will purge the hits from all the following IPs:</li>
</ul>
<pre><code>192.157.89.4
192.157.89.5
192.157.89.6
192.157.89.7
66.109.27.142
66.109.27.139
66.109.27.138
66.109.27.140
66.109.27.141
2607:fa98:40:9:26b6:fdff:feff:1888
2607:fa98:40:9:26b6:fdff:feff:195d
2607:fa98:40:9:26b6:fdff:feff:1c96
213.139.53.62
2a01:7e00::f03c:91ff:fe0a:d645
46.101.86.248
</code></pre><ul>
<li>In total these accounted for the following number of requests in each year:
<ul>
<li>2020: 1436</li>
<li>2019: 933148</li>
<li>2018: 613936</li>
</ul>
</li>
<li>I noticed a few other user agents that should be purged too:</li>
</ul>
<pre><code>^Java\/\d{1,2}.\d
FlipboardProxy\/\d
API scraper
RebelMouse\/\d
Iframely\/\d
Python\/\d
Ruby
NING\/\d
ubermetrics-technologies\.com
Jetty\/\d
scalaj-http\/\d
mailto\:team@impactstory\.org
</code></pre><ul>
<li>I purged them from the stats too:
<ul>
<li>2020: 18153</li>
<li>2019: 29745</li>
<li>2018: 18083</li>
<li>2017: 19399</li>
<li>2016: 16283</li>
<li>2015: 16659</li>
<li>2014: 713</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-22T11:00:40+03:00" />
<meta property="og:updated_time" content="2020-07-24T21:57:55+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-07-22T11:00:40+03:00</lastmod>
<lastmod>2020-07-24T21:57:55+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-07-22T11:00:40+03:00</lastmod>
<lastmod>2020-07-24T21:57:55+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-07/</loc>
<lastmod>2020-07-22T11:00:40+03:00</lastmod>
<lastmod>2020-07-23T12:32:11+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-07-22T11:00:40+03:00</lastmod>
<lastmod>2020-07-24T21:57:55+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-07-22T11:00:40+03:00</lastmod>
<lastmod>2020-07-24T21:57:55+03:00</lastmod>
</url>
<url>
@ -119,7 +119,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-03/</loc>
<lastmod>2020-04-13T15:30:24+03:00</lastmod>
<lastmod>2020-07-24T21:57:55+03:00</lastmod>
</url>
<url>