diff --git a/content/posts/2023-01.md b/content/posts/2023-01.md
index be7163313..198ab6335 100644
--- a/content/posts/2023-01.md
+++ b/content/posts/2023-01.md
@@ -153,4 +153,136 @@ $ curl -v "https://dspace7test.ilri.org/api/core/items" -H "Authorization: Beare
 - Start a harvest on AReS
+## 2023-01-16
+
+- Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems
+- Batch import another twenty-eight items for IFPRI across several Initiatives
+  - On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
+  - I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
+  - Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
+
+## 2023-01-17
+
+- Batch import another twenty-three items for IFPRI across several Initiatives
+  - I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
+  - I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
+  - Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669
+  - Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
+- I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality
+- I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace
+- There is a high load on CGSpace pretty regularly
+  - Looking at Munin it shows there is a marked increase in DSpace sessions the last few weeks:
+
+![DSpace sessions year](/cgspace-notes/2023/01/jmx_dspace_sessions-year.png)
+
+- Is this attributable to all the PRMS harvesting?
+- I also see some PostgreSQL locks starting earlier today:
+
+![PostgreSQL locks day](/cgspace-notes/2023/01/postgres_connections_ALL-day.png)
+
+- I'm curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:
+
+```console
+# zcat --force /var/log/nginx/{rest,access,library-access,oai}.log /var/log/nginx/{rest,access,library-access,oai}.log.1 /var/log/nginx/{rest,access,library-access,oai}.log.{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}.gz | awk '{print $1}' | sort | uniq > /tmp/2023-01-17-cgspace-ips.txt
+# wc -l /tmp/2023-01-17-cgspace-ips.txt
+129446 /tmp/2023-01-17-cgspace-ips.txt
+```
+
+- I ran the IPs through my `resolve-addresses-geoip2.py` script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):
+
+```console
+$ csvgrep -c asn -r '^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$' \
+    /tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
+    sed 1d | sort | uniq > /tmp/networks-to-block.txt
+$ wc -l /tmp/networks-to-block.txt
+776 /tmp/networks-to-block.txt
+```
+
+- I added the list of networks to nginx's `bot-networks.conf` so they will all be heavily rate limited
+- Looking at the Munin stats again I see the load has been extra high since yesterday morning:
+
+![CPU week](/cgspace-notes/2023/01/cpu-week.png)
+
+- But still, it's suspicious that there are so many PostgreSQL locks
+- Looking at the Solr stats to check the hits the last month (actually I skipped December because I was so busy)
+  - I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)
+  - I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request
+  - I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request
+  - I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request
+  - I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request
+  - I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamalytics, but their user agent is all lower case and it's a data center ISP so nope
+  - I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request
+  - I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request
+  - I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request
+  - I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request
+  - I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request
+  - I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request
+  - I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request
+  - I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request
+  - I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request
+  - I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, scammy on everything I checked
+  - I see 79.110.73.54 is on ALFA TELECOM / Serverel and is using a different, weird user agent with each request
+  - There are too many to count... so I will purge these and then move on to user agents
+- I purged hits from those IPs:
+
+```console
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 439185 hits from 31.148.223.10 in statistics
+Purging 2151 hits from 18.203.245.60 in statistics
+Purging 1990 hits from 3.249.192.212 in statistics
+Purging 1975 hits from 34.244.160.145 in statistics
+Purging 1969 hits from 52.213.59.101 in statistics
+Purging 2540 hits from 91.209.8.29 in statistics
+Purging 1624 hits from 54.78.176.127 in statistics
+Purging 1236 hits from 54.74.197.53 in statistics
+Purging 1327 hits from 54.246.128.111 in statistics
+Purging 1108 hits from 52.16.103.133 in statistics
+Purging 1045 hits from 63.32.99.252 in statistics
+Purging 999 hits from 176.34.141.181 in statistics
+Purging 997 hits from 34.243.17.80 in statistics
+Purging 985 hits from 34.240.206.16 in statistics
+Purging 862 hits from 18.203.81.120 in statistics
+Purging 1654 hits from 176.97.210.106 in statistics
+Purging 1628 hits from 51.81.193.200 in statistics
+Purging 1020 hits from 79.110.73.54 in statistics
+Purging 842 hits from 35.153.105.213 in statistics
+Purging 1689 hits from 54.164.237.125 in statistics
+
+Total number of bot hits purged: 466826
+```
+
+- Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
+  - `azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0`
+  - `Gov employment data scraper ([[your email]])`
+  - `Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)`
+  - `crownpeak`
+  - `Mozilla/5.0 (compatible)`
+- Also, a ton of them are lower case, which I've never seen before... it might be possible, but looks super fishy to me:
+  - `mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0`
+  - `mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0`
+  - `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36`
+  - `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
+- I purged some of those:
+
+```console
+$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
+Purging 1658 hits from azure-logic-apps\/1.0 in statistics
+Purging 948 hits from Gov employment data scraper in statistics
+Purging 786 hits from Microsoft\.Data\.Mashup in statistics
+Purging 303 hits from crownpeak in statistics
+Purging 332 hits from Mozilla\/5.0 (compatible) in statistics
+
+Total number of bot hits purged: 4027
+```
+
+- Then I ran all system updates on the server and rebooted it
+  - Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers
+  - I need to re-work how I'm doing this whitelisting and blacklisting... it's way too complicated now
+- Export entire CGSpace to check Initiative mappings, and add nineteen...
+- Start a harvest on AReS
+
diff --git a/docs/2023-01/index.html b/docs/2023-01/index.html
index 3e23f82ee..4a3345e97 100644
--- a/docs/2023-01/index.html
+++ b/docs/2023-01/index.html
@@ -19,7 +19,7 @@ I see we have some new ones that aren’t in our list if I combine with this
-
+
@@ -44,9 +44,9 @@ I see we have some new ones that aren’t in our list if I combine with this
 "@type": "BlogPosting",
 "headline": "January, 2023",
 "url": "https://alanorth.github.io/cgspace-notes/2023-01/",
- "wordCount": "1103",
+ "wordCount": "2250",
 "datePublished": "2023-01-01T08:44:36+03:00",
- "dateModified": "2023-01-12T23:11:42+03:00",
+ "dateModified": "2023-01-15T08:10:16+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -307,6 +307,151 @@ I see we have some new ones that aren’t in our list if I combine with this
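The notes collect unique client IPs from the nginx logs with `zcat --force ... | awk '{print $1}' | sort | uniq`. A minimal sketch of a variant that also counts requests per IP is shown here, which may help when deciding which addresses to look up first. The function name `count_ips` is hypothetical and not part of the `ilri` scripts, and the sketch assumes GNU `zcat` and that the client IP is the first whitespace-separated field of each log line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original notes): print "IP COUNT" pairs,
# busiest IP first, for any mix of plain and gzipped log files.
count_ips() {
    # zcat --force passes non-gzipped files through unchanged (GNU gzip)
    zcat --force "$@" \
        | awk '{print $1}' \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}' \
        | LC_ALL=C sort -k2,2nr -k1,1
}

# Example invocation (paths taken from the notes; adjust as needed):
#   count_ips /var/log/nginx/access.log /var/log/nginx/access.log.1 | head -n 20
```

Sorting by count first, with the IP as a tie-breaker under `LC_ALL=C`, keeps the output stable between runs, so two reports can be compared with `diff`.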