From 10805f427262c9c99fa457757d671fed0c34c017 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Thu, 30 Apr 2020 10:44:23 +0300
Subject: [PATCH] Update notes for 2020-04-30

---
 content/posts/2020-04.md | 61 +++++++++++++++++++++++++++++++++++++
 docs/2020-04/index.html  | 65 ++++++++++++++++++++++++++++++++++++++--
 docs/sitemap.xml         | 10 +++----
 3 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/content/posts/2020-04.md b/content/posts/2020-04.md
index 431ba2228..7af45c15c 100644
--- a/content/posts/2020-04.md
+++ b/content/posts/2020-04.md
@@ -395,5 +395,66 @@ $ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c "idle in
 ```
 - I don't see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong... I think the solution to clear these idle connections is probably to just restart Tomcat
+- I looked at the Solr stats for this month and see lots of suspicious IPs:
+
+```
+$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
+
+        "88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
+        "104.154.216.0",11865,# Google cloud, scraping XMLUI with no user agent
+        "104.198.96.245",4925,# Google cloud, using REST API with no user agent
+        "52.34.238.26",2907,  # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
+```
+
+- And a bunch more... ugh...
+  - 70.32.90.172: scraping REST API for IWMI/WLE pages with no user agent
+  - 2a01:7e00::f03c:91ff:fe16:fcb: Linode, REST API, no user agent
+  - 2607:f298:5:101d:f816:3eff:fed9:a484: DreamHost, XMLUI and REST API, python-requests/2.18.4
+  - 2a00:1768:2001:7a::20: Netherlands, XMLUI, trying SQL injections
+- I need to start blocking requests without a user agent...
+- I purged the hits from these IPs using my `check-spider-ip-hits.sh` script:
+
+```
+$ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
+$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
+```
+- Then I added a few of them to the bot mapping in the nginx config because it appears they have been regular harvesters since 2018
+- Looking through the Solr stats faceted by the `userAgent` field I see some interesting ones:
+
+```
+$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
+...
+"Delphi 2009",50725,
+"OgScrper/1.0.0",12421,
+```
+
+- Delphi is only used by IP addresses in Greece, so that's obviously the GARDIAN people harvesting us...
+- I have no idea what OgScrper is, but it's not a user!
+- Then there are 276,000 hits from `MEL-API` from Jordanian IPs in 2018, so that's obviously the CodeObia guys...
+- Other user agents:
+  - GigablastOpenSource/1
+  - Owlin Domain Resolver V1
+  - API scraper
+  - MetaURI
+- I don't know why, but my `check-spider-hits.sh` script doesn't seem to handle user agents with spaces properly, so I will delete those manually afterwards
+- First delete the ones without spaces, creating a temp file in `/tmp/agents` containing the patterns:
+
+```
+$ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
+$ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
+```
+- That's about 300,000 hits purged...
+- Then remove the ones with spaces manually, checking the query syntax first, then deleting in yearly cores and the statistics core:
+
+```
+$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
+...
+<lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
+$ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'
+```
+
+- Quoting them works for now until I can look into it and handle it properly in the script
+- This was about 400,000 hits in total purged from the Solr statistics
diff --git a/docs/2020-04/index.html b/docs/2020-04/index.html
index 49948db65..4fa093478 100644
--- a/docs/2020-04/index.html
+++ b/docs/2020-04/index.html
@@ -25,7 +25,7 @@ On the same note, the one item Abenet pointed out last week now has a donut with
-
+
@@ -55,9 +55,9 @@ On the same note, the one item Abenet pointed out last week now has a donut with
   "@type": "BlogPosting",
   "headline": "April, 2020",
   "url": "https://alanorth.github.io/cgspace-notes/2020-04/",
-  "wordCount": "2787",
+  "wordCount": "3186",
   "datePublished": "2020-04-02T10:53:24+03:00",
-  "dateModified": "2020-04-29T14:20:28+03:00",
+  "dateModified": "2020-04-29T17:33:34+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -532,6 +532,65 @@ Caused by: java.lang.NullPointerException
 67
+
$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
+
+        "88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
+        "104.154.216.0",11865,# Google cloud, scraping XMLUI with no user agent
+        "104.198.96.245",4925,# Google cloud, using REST API with no user agent
+        "52.34.238.26",2907,  # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
+
+
$ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
+$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
+
+
$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
+...
+"Delphi 2009",50725,
+"OgScrper/1.0.0",12421,
+
+
$ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
+$ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
+
+
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
+...
+<lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
+$ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 4a9b6258b..2eeb468f5 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/2020-04/
- 2020-04-29T14:20:28+03:00
+ 2020-04-29T17:33:34+03:00
 https://alanorth.github.io/cgspace-notes/categories/
- 2020-04-29T14:20:28+03:00
+ 2020-04-29T17:33:34+03:00
 https://alanorth.github.io/cgspace-notes/
- 2020-04-29T14:20:28+03:00
+ 2020-04-29T17:33:34+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-04-29T14:20:28+03:00
+ 2020-04-29T17:33:34+03:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2020-04-29T14:20:28+03:00
+ 2020-04-29T17:33:34+03:00
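
The notes in this patch purge user agents containing spaces by hand, sending a quoted delete-by-query to each yearly core and then to the main `statistics` core. A minimal sketch of how that manual step could be wrapped is below. The function names and the `DRY_RUN` switch are hypothetical and are not part of `check-spider-hits.sh`; only the Solr URL, the core names, and the delete body come from the notes.

```shell
#!/bin/sh
# Sketch only: function names and the DRY_RUN switch are hypothetical,
# not part of check-spider-hits.sh.

SOLR_URL="http://localhost:8081/solr"

# Build a delete-by-query body. The agent is wrapped in double quotes, as in
# the notes, so that a value with spaces stays one quoted value in the query.
# Agents containing double quotes or XML special characters are not handled.
build_delete_body() {
    printf '<delete><query>userAgent:"%s"</query></delete>' "$1"
}

# Print the yearly cores (2010 to 2019) plus the main statistics core.
list_cores() {
    year=2010
    while [ "$year" -le 2019 ]; do
        printf 'statistics-%s\n' "$year"
        year=$((year + 1))
    done
    printf 'statistics\n'
}

# Purge one user agent from every core. With DRY_RUN=1 (the default) the
# requests are only printed, so the bodies can be inspected first.
purge_agent() {
    body=$(build_delete_body "$1")
    for core in $(list_cores); do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf 'would POST to %s/%s/update?softCommit=true: %s\n' \
                "$SOLR_URL" "$core" "$body"
        else
            curl -s "$SOLR_URL/$core/update?softCommit=true" \
                -H "Content-Type: text/xml" --data-binary "$body"
        fi
    done
}
```

Calling `purge_agent "Delphi 2009"` prints one line per core without touching Solr; running it with `DRY_RUN=0` set sends the same requests as the commands in the notes. Passing the agent as a single quoted argument is what keeps the space intact.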