mirror of https://github.com/alanorth/cgspace-notes.git (synced 2024-11-26 00:18:21 +01:00)
Update notes for 2020-04-30
parent a491c7f2bf
commit 10805f4272
@@ -395,5 +395,66 @@ $ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c "idle in
```

- I don't see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong... I think the solution to clear these idle connections is probably to just restart Tomcat
- I looked at the Solr stats for this month and see lots of suspicious IPs:

```
$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'

"88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
"104.154.216.0",11865, # Google cloud, scraping XMLUI with no user agent
"104.198.96.245",4925, # Google cloud, using REST API with no user agent
"52.34.238.26",2907, # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
```

- And a bunch more... ugh...
  - 70.32.90.172: scraping REST API for IWMI/WLE pages with no user agent
  - 2a01:7e00::f03c:91ff:fe16:fcb: Linode, REST API, no user agent
  - 2607:f298:5:101d:f816:3eff:fed9:a484: DreamHost, XMLUI and REST API, python-requests/2.18.4
  - 2a00:1768:2001:7a::20: Netherlands, XMLUI, trying SQL injections
- I need to start blocking requests without a user agent...
- I purged the hits from these IPs using my `check-spider-ip-hits.sh` script:

```
$ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
```
- Then I added a few of them to the bot mapping in the nginx config because they appear to have been harvesting regularly since 2018
- Looking through the Solr stats faceted by the `userAgent` field I see some interesting ones:

```
$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
...
"Delphi 2009",50725,
"OgScrper/1.0.0",12421,
```
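
- Picking the heavy hitters out of that facet output by eye gets tedious, so here is a minimal sketch of a filter for it. `facet_pairs` is a hypothetical helper name, and it assumes the indented JSON layout shown above, with one `"value",count` pair per line; values containing escaped quotes are not handled:

```shell
# Sketch only: facet_pairs is a hypothetical helper, not part of any existing script.
# It reads indented Solr JSON facet output on stdin, where each line holds one
# "value",count pair, and prints "count value" lines sorted by count, highest first.
# The optional first argument is a minimum count (default 0).
facet_pairs() {
  sed -n 's/^[[:space:]]*"\([^"]*\)",\([0-9][0-9]*\).*$/\2 \1/p' |
    awk -v min="${1:-0}" '$1 + 0 >= min + 0' |
    sort -rn
}

# Example, assuming the same Solr URL as in the notes:
# curl -s 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent' | facet_pairs 10000
```

- Putting the count first keeps values with spaces, such as `Delphi 2009`, intact at the end of each line.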

- Delphi is only used by IP addresses in Greece, so that's obviously the GARDIAN people harvesting us...
- I have no idea what OgScrper is, but it's not a user!
- Then there are 276,000 hits from `MEL-API` from Jordanian IPs in 2018, so that's obviously the CodeObia guys...
- Other user agents:
  - GigablastOpenSource/1
  - Owlin Domain Resolver V1
  - API scraper
  - MetaURI
- I don't know why, but my `check-spider-hits.sh` script doesn't seem to handle user agents with spaces properly, so I will delete those manually afterwards
- First delete the ones without spaces, after creating a temp file at `/tmp/agents` containing the patterns:

```
$ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
$ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
```
- That's about 300,000 hits purged...
- Then remove the ones with spaces manually, checking the query syntax first, then deleting in the yearly cores and the statistics core:

```
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
...
<lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
$ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'
```
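
- The "check the query first" step could be made a little less manual by pulling the hit count out of the response. This is only a sketch: `solr_num_found` is a hypothetical helper name, and it assumes the XML response format shown above:

```shell
# Sketch only: solr_num_found is a hypothetical helper name.
# It reads an XML Solr select response on stdin and prints the numFound value.
solr_num_found() {
  sed -n 's/.*numFound="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Example, assuming the same Solr URL as in the notes:
# curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0" | solr_num_found
```

- An empty result would suggest the query did not parse or the response was not XML, which seems worth knowing before any delete is sent.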

- Quoting the user agents works for now, until I can look into it and handle it properly in the script
- This was about 400,000 hits in total purged from the Solr statistics
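- For the spaces problem, one possible approach is sketched below. This is not the actual `check-spider-hits.sh`, and word splitting is only a guess at the cause. `SOLR_URL`, `delete_xml` and `purge_agents` are hypothetical names, and each pattern is treated as a literal phrase rather than a regex:

```shell
# Sketch only: SOLR_URL, delete_xml and purge_agents are hypothetical names,
# and this is not the actual check-spider-hits.sh.
SOLR_URL="${SOLR_URL:-http://localhost:8081/solr}"

# Build the delete payload; the double quotes keep "Delphi 2009" as one phrase.
# Agents containing ", & or < would need escaping, which this sketch skips.
delete_xml() {
  printf '<delete><query>userAgent:"%s"</query></delete>' "$1"
}

# purge_agents FILE [run]: FILE holds one user agent per line. Without "run"
# it only prints what it would send, so the quoting can be checked first.
purge_agents() {
  pa_cores="statistics"
  pa_year=2010
  while [ "$pa_year" -le 2019 ]; do
    pa_cores="$pa_cores statistics-$pa_year"
    pa_year=$((pa_year + 1))
  done
  # Reading line by line (rather than looping over $(cat FILE)) keeps an agent
  # with spaces as a single item; IFS= and -r preserve the line verbatim.
  while IFS= read -r pa_agent; do
    [ -n "$pa_agent" ] || continue
    for pa_core in $pa_cores; do
      if [ "${2:-}" = "run" ]; then
        curl -s "$SOLR_URL/$pa_core/update?softCommit=true" \
          -H "Content-Type: text/xml" --data-binary "$(delete_xml "$pa_agent")"
      else
        printf '%s %s\n' "$pa_core" "$(delete_xml "$pa_agent")"
      fi
    done
  done < "$1"
}
```

- Running `purge_agents /tmp/agents` prints one line per core and agent without touching Solr; adding `run` as the second argument sends the deletes.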

<!-- vim: set sw=2 ts=2: -->
@@ -25,7 +25,7 @@ On the same note, the one item Abenet pointed out last week now has a donut with
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-04/" />
 <meta property="article:published_time" content="2020-04-02T10:53:24+03:00" />
-<meta property="article:modified_time" content="2020-04-29T14:20:28+03:00" />
+<meta property="article:modified_time" content="2020-04-29T17:33:34+03:00" />

 <meta name="twitter:card" content="summary"/>
 <meta name="twitter:title" content="April, 2020"/>
@@ -55,9 +55,9 @@ On the same note, the one item Abenet pointed out last week now has a donut with
 "@type": "BlogPosting",
 "headline": "April, 2020",
 "url": "https://alanorth.github.io/cgspace-notes/2020-04/",
-"wordCount": "2787",
+"wordCount": "3186",
 "datePublished": "2020-04-02T10:53:24+03:00",
-"dateModified": "2020-04-29T14:20:28+03:00",
+"dateModified": "2020-04-29T17:33:34+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -532,6 +532,65 @@ Caused by: java.lang.NullPointerException
67
</code></pre><ul>
<li>I don’t see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong… I think the solution to clear these idle connections is probably to just restart Tomcat</li>
<li>I looked at the Solr stats for this month and see lots of suspicious IPs:</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'

"88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
"104.154.216.0",11865, # Google cloud, scraping XMLUI with no user agent
"104.198.96.245",4925, # Google cloud, using REST API with no user agent
"52.34.238.26",2907, # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
</code></pre><ul>
<li>And a bunch more… ugh…
<ul>
<li>70.32.90.172: scraping REST API for IWMI/WLE pages with no user agent</li>
<li>2a01:7e00::f03c:91ff:fe16:fcb: Linode, REST API, no user agent</li>
<li>2607:f298:5:101d:f816:3eff:fed9:a484: DreamHost, XMLUI and REST API, python-requests/2.18.4</li>
<li>2a00:1768:2001:7a::20: Netherlands, XMLUI, trying SQL injections</li>
</ul>
</li>
<li>I need to start blocking requests without a user agent…</li>
<li>I purged the hits from these IPs using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre><code>$ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
</code></pre><ul>
<li>Then I added a few of them to the bot mapping in the nginx config because they appear to have been harvesting regularly since 2018</li>
<li>Looking through the Solr stats faceted by the <code>userAgent</code> field I see some interesting ones:</li>
</ul>
<pre><code>$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
...
"Delphi 2009",50725,
"OgScrper/1.0.0",12421,
</code></pre><ul>
<li>Delphi is only used by IP addresses in Greece, so that’s obviously the GARDIAN people harvesting us…</li>
<li>I have no idea what OgScrper is, but it’s not a user!</li>
<li>Then there are 276,000 hits from <code>MEL-API</code> from Jordanian IPs in 2018, so that’s obviously the CodeObia guys…</li>
<li>Other user agents:
<ul>
<li>GigablastOpenSource/1</li>
<li>Owlin Domain Resolver V1</li>
<li>API scraper</li>
<li>MetaURI</li>
</ul>
</li>
<li>I don’t know why, but my <code>check-spider-hits.sh</code> script doesn’t seem to handle user agents with spaces properly, so I will delete those manually afterwards</li>
<li>First delete the ones without spaces, after creating a temp file at <code>/tmp/agents</code> containing the patterns:</li>
</ul>
<pre><code>$ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
$ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
</code></pre><ul>
<li>That’s about 300,000 hits purged…</li>
<li>Then remove the ones with spaces manually, checking the query syntax first, then deleting in the yearly cores and the statistics core:</li>
</ul>
<pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
...
<lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
$ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'
</code></pre><ul>
<li>Quoting the user agents works for now, until I can look into it and handle it properly in the script</li>
<li>This was about 400,000 hits in total purged from the Solr statistics</li>
</ul>
<!-- raw HTML omitted -->

@@ -4,27 +4,27 @@

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/2020-04/</loc>
-<lastmod>2020-04-29T14:20:28+03:00</lastmod>
+<lastmod>2020-04-29T17:33:34+03:00</lastmod>
 </url>

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
-<lastmod>2020-04-29T14:20:28+03:00</lastmod>
+<lastmod>2020-04-29T17:33:34+03:00</lastmod>
 </url>

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/</loc>
-<lastmod>2020-04-29T14:20:28+03:00</lastmod>
+<lastmod>2020-04-29T17:33:34+03:00</lastmod>
 </url>

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
-<lastmod>2020-04-29T14:20:28+03:00</lastmod>
+<lastmod>2020-04-29T17:33:34+03:00</lastmod>
 </url>

 <url>
 <loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
-<lastmod>2020-04-29T14:20:28+03:00</lastmod>
+<lastmod>2020-04-29T17:33:34+03:00</lastmod>
 </url>

 <url>