diff --git a/content/posts/2023-01.md b/content/posts/2023-01.md
index be7163313..198ab6335 100644
--- a/content/posts/2023-01.md
+++ b/content/posts/2023-01.md
@@ -153,4 +153,136 @@ $ curl -v "https://dspace7test.ilri.org/api/core/items" -H "Authorization: Beare
 - Start a harvest on AReS
+## 2023-01-16
+
+- Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems
+- Batch import another twenty-eight items for IFPRI across several Initiatives
+  - On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
+  - I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
+  - Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
+
+## 2023-01-17
+
+- Batch import another twenty-three items for IFPRI across several Initiatives
+  - I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc
+  - I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
+  - Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669
+  - Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
+- I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality
+- I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace
+- There is a high load on CGSpace pretty regularly
+  - Looking at Munin it shows there is a marked increase in DSpace sessions the last few weeks:
+
+![DSpace sessions year](/cgspace-notes/2023/01/jmx_dspace_sessions-year.png)
+
+- Is this attributable to all the PRMS harvesting?
+- I also see some PostgreSQL locks starting earlier today:
+
+![PostgreSQL locks day](/cgspace-notes/2023/01/postgres_connections_ALL-day.png)
+
+- I'm curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:
+
+```console
+# zcat --force /var/log/nginx/{rest,access,library-access,oai}.log /var/log/nginx/{rest,access,library-access,oai}.log.1 /var/log/nginx/{rest,access,library-access,oai}.log.{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}.gz | awk '{print $1}' | sort | uniq > /tmp/2023-01-17-cgspace-ips.txt
+# wc -l /tmp/2023-01-17-cgspace-ips.txt
+129446 /tmp/2023-01-17-cgspace-ips.txt
+```
+
+- I ran the IPs through my `resolve-addresses-geoip2.py` script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):
+
+```console
+$ csvgrep -c asn -r '^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$' \
+  /tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
+  sed 1d | sort | uniq > /tmp/networks-to-block.txt
+$ wc -l /tmp/networks-to-block.txt
+776 /tmp/networks-to-block.txt
+```
+
+- I added the list of networks to nginx's `bot-networks.conf` so they will all be heavily rate limited
+- Looking at the Munin stats again I see the load has been extra high since yesterday morning:
+
+![CPU week](/cgspace-notes/2023/01/cpu-week.png)
+
+- But still, it's suspicious that there are so many PostgreSQL locks
+- Looking at the Solr stats to check the hits the last month (actually I skipped December because I was so busy)
+  - I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)
+  - I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request
+  - I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request
+  - I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request
+  - I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request
+  - I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamalytics, but their user agent is all lower case and it's a data center ISP so nope
+  - I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request
+  - I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request
+  - I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request
+  - I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request
+  - I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request
+  - I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request
+  - I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request
+  - I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request
+  - I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request
+  - I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, scammy on everything I checked
+  - I see 79.110.73.54 is on ALFA TELECOM / Serverel and is using a different, weird user agent with each request
+  - There are too many to count... so I will purge these and then move on to user agents
+- I purged hits from those IPs:
+
+```console
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 439185 hits from 31.148.223.10 in statistics
+Purging 2151 hits from 18.203.245.60 in statistics
+Purging 1990 hits from 3.249.192.212 in statistics
+Purging 1975 hits from 34.244.160.145 in statistics
+Purging 1969 hits from 52.213.59.101 in statistics
+Purging 2540 hits from 91.209.8.29 in statistics
+Purging 1624 hits from 54.78.176.127 in statistics
+Purging 1236 hits from 54.74.197.53 in statistics
+Purging 1327 hits from 54.246.128.111 in statistics
+Purging 1108 hits from 52.16.103.133 in statistics
+Purging 1045 hits from 63.32.99.252 in statistics
+Purging 999 hits from 176.34.141.181 in statistics
+Purging 997 hits from 34.243.17.80 in statistics
+Purging 985 hits from 34.240.206.16 in statistics
+Purging 862 hits from 18.203.81.120 in statistics
+Purging 1654 hits from 176.97.210.106 in statistics
+Purging 1628 hits from 51.81.193.200 in statistics
+Purging 1020 hits from 79.110.73.54 in statistics
+Purging 842 hits from 35.153.105.213 in statistics
+Purging 1689 hits from 54.164.237.125 in statistics
+
+Total number of bot hits purged: 466826
+```
+
+- Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
+  - `azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0`
+  - `Gov employment data scraper ([[your email]])`
+  - `Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)`
+  - `crownpeak`
+  - `Mozilla/5.0 (compatible)`
+- Also, a ton of them are lower case, which I've never seen before... it might be possible, but looks super fishy to me:
+  - `mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0`
+  - `mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0`
+  - `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36`
+  - `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
+- I purged some of those:
+
+```console
+$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
+Purging 1658 hits from azure-logic-apps\/1.0 in statistics
+Purging 948 hits from Gov employment data scraper in statistics
+Purging 786 hits from Microsoft\.Data\.Mashup in statistics
+Purging 303 hits from crownpeak in statistics
+Purging 332 hits from Mozilla\/5.0 (compatible) in statistics
+
+Total number of bot hits purged: 4027
+```
+
+- Then I ran all system updates on the server and rebooted it
+  - Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers
+  - I need to re-work how I'm doing this whitelisting and blacklisting... it's way too complicated now
+- Export entire CGSpace to check Initiative mappings, and add nineteen...
+- Start a harvest on AReS
+
diff --git a/docs/2023-01/index.html b/docs/2023-01/index.html
index 3e23f82ee..4a3345e97 100644
--- a/docs/2023-01/index.html
+++ b/docs/2023-01/index.html
@@ -19,7 +19,7 @@ I see we have some new ones that aren’t in our list if I combine with this
-
+
@@ -44,9 +44,9 @@ I see we have some new ones that aren’t in our list if I combine with this
   "@type": "BlogPosting",
   "headline": "January, 2023",
   "url": "https://alanorth.github.io/cgspace-notes/2023-01/",
-  "wordCount": "1103",
+  "wordCount": "2250",
   "datePublished": "2023-01-01T08:44:36+03:00",
-  "dateModified": "2023-01-12T23:11:42+03:00",
+  "dateModified": "2023-01-15T08:10:16+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -307,6 +307,151 @@ I see we have some new ones that aren’t in our list if I combine with this
+

2023-01-16

+ +

2023-01-17

+ +

DSpace sessions year

+ +

PostgreSQL locks day

+ +
# zcat --force /var/log/nginx/{rest,access,library-access,oai}.log /var/log/nginx/{rest,access,library-access,oai}.log.1 /var/log/nginx/{rest,access,library-access,oai}.log.{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}.gz | awk '{print $1}' | sort | uniq > /tmp/2023-01-17-cgspace-ips.txt
+# wc -l /tmp/2023-01-17-cgspace-ips.txt 
+129446 /tmp/2023-01-17-cgspace-ips.txt
+
+
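The `zcat | awk | sort | uniq` pipeline above could also be written in Python, which may be handy when the result feeds another script. This is only a minimal sketch, not the command that was actually run: the function names are hypothetical, and it assumes the client IP is the first whitespace-separated field of each log line, as in the `awk '{print $1}'` step.

```python
import gzip
from pathlib import Path
from typing import Iterable, Iterator, List


def read_log_lines(path: Path) -> Iterator[str]:
    """Yield lines from a plain or gzip-compressed log file.

    Assumption: compression is detected from the ".gz" suffix, whereas
    `zcat --force` inspects the file contents instead.
    """
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle


def unique_client_ips(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated first field of each non-empty line."""
    ips = {line.split()[0] for line in lines if line.strip()}
    return sorted(ips)
```

To cover several rotated logs, chain the lines from each file (for example with `itertools.chain.from_iterable`) before passing them to `unique_client_ips`.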
$ csvgrep -c asn -r '^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$' \
+  /tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
+  sed 1d | sort | uniq > /tmp/networks-to-block.txt
+$ wc -l /tmp/networks-to-block.txt 
+776 /tmp/networks-to-block.txt
+
+
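For reference, the `csvgrep | csvcut | sort | uniq` step above can be approximated with Python's standard `csv` module. This is a rough sketch under stated assumptions: the function name is hypothetical, and the only things it relies on are the `asn` and `network` column names used in the command above.

```python
import csv
from typing import Iterable, List, Set


def networks_for_asns(rows: Iterable[str], asns: Set[str]) -> List[str]:
    """Return the sorted unique `network` values whose `asn` is in `asns`.

    `rows` is any iterable of CSV lines that includes the header row.
    The set membership test is an exact match, like the anchored
    `^(...)$` regex passed to csvgrep.
    """
    reader = csv.DictReader(rows)
    networks = {row["network"] for row in reader if row["asn"] in asns}
    return sorted(networks)
```

An open file handle can be passed directly as `rows`, for example `networks_for_asns(open("/tmp/2023-01-17-cgspace-ips.csv", newline=""), {"16509", "14618"})`.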

CPU week

+ +
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 439185 hits from 31.148.223.10 in statistics
+Purging 2151 hits from 18.203.245.60 in statistics
+Purging 1990 hits from 3.249.192.212 in statistics
+Purging 1975 hits from 34.244.160.145 in statistics
+Purging 1969 hits from 52.213.59.101 in statistics
+Purging 2540 hits from 91.209.8.29 in statistics
+Purging 1624 hits from 54.78.176.127 in statistics
+Purging 1236 hits from 54.74.197.53 in statistics
+Purging 1327 hits from 54.246.128.111 in statistics
+Purging 1108 hits from 52.16.103.133 in statistics
+Purging 1045 hits from 63.32.99.252 in statistics
+Purging 999 hits from 176.34.141.181 in statistics
+Purging 997 hits from 34.243.17.80 in statistics
+Purging 985 hits from 34.240.206.16 in statistics
+Purging 862 hits from 18.203.81.120 in statistics
+Purging 1654 hits from 176.97.210.106 in statistics
+Purging 1628 hits from 51.81.193.200 in statistics
+Purging 1020 hits from 79.110.73.54 in statistics
+Purging 842 hits from 35.153.105.213 in statistics
+Purging 1689 hits from 54.164.237.125 in statistics
+
+Total number of bot hits purged: 466826
+
+
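Many of the purged Amazon addresses were spotted by eye because they sent a different user agent with nearly every request. One way that check might be automated is sketched below. The function name, the `(ip, user_agent)` input shape, and both thresholds are assumptions made for illustration, not part of the `check-spider-ip-hits.sh` script.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rotating_agent_ips(
    requests: Iterable[Tuple[str, str]],
    min_requests: int = 100,
    min_ratio: float = 0.9,
) -> List[str]:
    """Return IPs whose requests carry almost all distinct user agents.

    An IP is flagged when it made at least `min_requests` requests and
    the share of distinct user agents among them is at least `min_ratio`.
    """
    counts: Dict[str, int] = defaultdict(int)
    agents: Dict[str, Set[str]] = defaultdict(set)
    for ip, agent in requests:
        counts[ip] += 1
        agents[ip].add(agent)
    return sorted(
        ip
        for ip, total in counts.items()
        if total >= min_requests and len(agents[ip]) / total >= min_ratio
    )
```

The thresholds would need tuning against real traffic, since a shared proxy or NAT gateway can legitimately show many different user agents from one address.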
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
+Purging 1658 hits from azure-logic-apps\/1.0 in statistics
+Purging 948 hits from Gov employment data scraper in statistics
+Purging 786 hits from Microsoft\.Data\.Mashup in statistics
+Purging 303 hits from crownpeak in statistics
+Purging 332 hits from Mozilla\/5.0 (compatible) in statistics
+
+Total number of bot hits purged: 4027
+
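The all-lower-case browser user agents noted in this entry could be flagged with a small heuristic like the one below. It is a sketch only: the function name is hypothetical, and it rests on the assumption that genuine browsers send a capitalized `Mozilla/5.0` prefix, so a browser-style string with no upper-case letters is probably synthetic.

```python
def looks_like_lowercased_browser(agent: str) -> bool:
    """Return True for browser-style user agents with no upper-case letters.

    Only strings starting with "mozilla/" are considered, so tools that
    are legitimately lower case (for example "curl/7.68.0") are ignored.
    """
    return agent.startswith("mozilla/") and agent == agent.lower()
```

Matches are best treated as candidates for manual review rather than purged automatically, because the heuristic says nothing about who is behind the request.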
diff --git a/docs/2023/01/cpu-week.png b/docs/2023/01/cpu-week.png
new file mode 100644
index 000000000..087d9e888
Binary files /dev/null and b/docs/2023/01/cpu-week.png differ
diff --git a/docs/2023/01/jmx_dspace_sessions-year.png b/docs/2023/01/jmx_dspace_sessions-year.png
new file mode 100644
index 000000000..5cbc61563
Binary files /dev/null and b/docs/2023/01/jmx_dspace_sessions-year.png differ
diff --git a/docs/2023/01/postgres_connections_ALL-day.png b/docs/2023/01/postgres_connections_ALL-day.png
new file mode 100644
index 000000000..b9c2fe966
Binary files /dev/null and b/docs/2023/01/postgres_connections_ALL-day.png differ
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 6b0c5fbc9..6a1a542ca 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 92f4c4b61..2a3f395fa 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 801865bd7..595b4f0a9 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 435225ec3..92e901562 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index ab3295797..156503523 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 59a3c50d3..9997c18a7 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index 6eedaee01..d74a5a7a7 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index adec653ff..105899e6f 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index e882634f2..7dc153ec1 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index d0b347cc7..8ba04298f 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 553a00687..d34d17573 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index d09fa6066..ea4cbf2f1 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 24a4bfe93..1f96b8e7a 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 1360c6073..e2930acfa 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 2cee0cbe2..6b851f0f1 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index b9b0e6a03..9e143c254 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index 27921f933..ac67f4219 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 4bf3192d8..82efe447e 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index bd7e34da8..9004ef403 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index e8921142c..bb24b4ad9 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index b43a0824d..9357c90eb 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 9c60341e8..862e0aede 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index adac05a3b..e5bace86f 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index af8615970..ad3ba1dc0 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index fb1df69e5..400311775 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index 0d0784c75..f630c6633 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 74357a8d3..4e9820d38 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
-2023-01-12T23:11:42+03:00
+2023-01-15T08:10:16+03:00
 https://alanorth.github.io/cgspace-notes/
-2023-01-12T23:11:42+03:00
+2023-01-15T08:10:16+03:00
 https://alanorth.github.io/cgspace-notes/2023-01/
-2023-01-12T23:11:42+03:00
+2023-01-15T08:10:16+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2023-01-12T23:11:42+03:00
+2023-01-15T08:10:16+03:00
 https://alanorth.github.io/cgspace-notes/posts/
-2023-01-12T23:11:42+03:00
+2023-01-15T08:10:16+03:00
 https://alanorth.github.io/cgspace-notes/2022-12/
 2023-01-01T10:12:13+02:00
diff --git a/static/2023/01/cpu-week.png b/static/2023/01/cpu-week.png
new file mode 100644
index 000000000..087d9e888
Binary files /dev/null and b/static/2023/01/cpu-week.png differ
diff --git a/static/2023/01/jmx_dspace_sessions-year.png b/static/2023/01/jmx_dspace_sessions-year.png
new file mode 100644
index 000000000..5cbc61563
Binary files /dev/null and b/static/2023/01/jmx_dspace_sessions-year.png differ
diff --git a/static/2023/01/postgres_connections_ALL-day.png b/static/2023/01/postgres_connections_ALL-day.png
new file mode 100644
index 000000000..b9c2fe966
Binary files /dev/null and b/static/2023/01/postgres_connections_ALL-day.png differ