diff --git a/content/posts/2022-07.md b/content/posts/2022-07.md index 534fa0466..f450c16e4 100644 --- a/content/posts/2022-07.md +++ b/content/posts/2022-07.md @@ -168,3 +168,114 @@ UPDATE 108 - cg.identifier.ccafsprojectpii - cg.subject.cifor - For now we will keep them in the search filters +- I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents) + - I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace + - I am curious to see how the similarity scores compare to those from trgm... perhaps we don't need them actually +- Deploy latest changes to submission form, Discovery, and browse on CGSpace + - Also run all system updates and reboot the host +- Fix 152 `dcterms.relation` that are using "cgspace.cgiar.org" links instead of handles: + +```console +UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$'; +``` + +## 2022-07-10 + +- UptimeRobot says that CGSpace is down + - I see high load around 22, high CPU around 800% + - Doesn't seem to be a lot of unique IPs: + +```console +# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l +2243 +``` + +Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions: + +```console +$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l +1613 +$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l +1696 +$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' 
| sort | uniq | wc -l +1708 +$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l +1830 +$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l +1811 +``` + +![DSpace sessions week](/cgspace-notes/2022/07/jmx_dspace_sessions-week.png) + +- These IPs are using normal-looking user agents: + - `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1` + - `Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0"` + - `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1` + - `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36` + +- I will add networks I'm seeing now to nginx's bot-networks.conf for now (not all of Hetzner) and purge the hits later: + - 65.108.0.0/16 + - 65.21.0.0/16 + - 95.216.0.0/16 + - 135.181.0.0/16 + - 138.201.0.0/16 + +- I think I'm going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need +- Sheesh, there are a bunch more IPv6 addresses also on Hetzner: + +```console +# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h + 1 2a01:4f9:6a:1c2b::2 + 2 2a01:4f9:2b:5a8::2 + 2 2a01:4f9:4b:4495::2 + 96 2a01:4f9:c010:518c::1 + 137 2a01:4f9:c010:a9bc::1 + 142 2a01:4f9:c010:58c9::1 + 142 2a01:4f9:c010:58ea::1 + 144 2a01:4f9:c010:58eb::1 + 145 2a01:4f9:c010:6ff8::1 + 148 2a01:4f9:c010:5190::1 + 149 2a01:4f9:c010:7d6d::1 + 153 2a01:4f9:c010:5226::1 + 156 2a01:4f9:c010:7f74::1 + 160 2a01:4f9:c010:5188::1 + 161 2a01:4f9:c010:58e5::1 + 168 2a01:4f9:c010:58ed::1 + 170 2a01:4f9:c010:548e::1 + 170 2a01:4f9:c010:8c97::1 + 175 2a01:4f9:c010:58c8::1 + 175 2a01:4f9:c010:aada::1 + 182 2a01:4f9:c010:58ec::1 + 182 2a01:4f9:c010:ae8c::1 + 502 2a01:4f9:c010:ee57::1 + 530 
2a01:4f9:c011:567a::1 + 535 2a01:4f9:c010:d04e::1 + 539 2a01:4f9:c010:3d9a::1 + 586 2a01:4f9:c010:93db::1 + 593 2a01:4f9:c010:a04a::1 + 601 2a01:4f9:c011:4166::1 + 607 2a01:4f9:c010:9881::1 + 640 2a01:4f9:c010:87fb::1 + 648 2a01:4f9:c010:e680::1 + 1141 2a01:4f9:3a:2696::2 + 1146 2a01:4f9:3a:2555::2 + 3207 2a01:4f9:3a:2c19::2 +``` + +- Maybe it's time I ban all of Hetzner... sheesh. +- I left for a few hours and the server was going up and down the whole time, still very high CPU and database when I got back + +![CPU day](/cgspace-notes/2022/07/cpu-day.png) + +- I am not sure what's going on + - I extracted all the IPs and used `resolve-addresses-geoip2.py` to analyze them and extract all the Hetzner networks and block them + - It's 181 IPs on Hetzner... +- I rebooted the server to see if it was just some stuck locks in PostgreSQL... +- The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through? Two more subnets to block +- Start a harvest on AReS + +## 2022-07-12 + +- Update an incorrect ORCID identifier for Alliance + + diff --git a/docs/2022-07/index.html b/docs/2022-07/index.html index 777a7b6b6..3f7918603 100644 --- a/docs/2022-07/index.html +++ b/docs/2022-07/index.html @@ -19,7 +19,7 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens - + @@ -44,9 +44,9 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens "@type": "BlogPosting", "headline": "July, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-07/", - "wordCount": "1095", + "wordCount": "1660", "datePublished": "2022-07-02T14:07:36+03:00", - "dateModified": "2022-07-07T10:02:04+03:00", + "dateModified": "2022-07-08T15:49:45+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -315,7 +315,123 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens
I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
+Fix 152 `dcterms.relation` that are using “cgspace.cgiar.org” links instead of handles:
UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
+
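To see what the `REGEXP_REPLACE` above does to individual values before running it against the database, the same pattern can be tried in Python. This is only a sketch: the sample values are hypothetical and not taken from CGSpace, and `to_handle_url` is a name made up for illustration.

```python
import re

# Same pattern as the REGEXP_REPLACE call: everything up to and including
# "cgspace.cgiar.org/handle/", then capture the handle ("prefix/suffix")
# anchored at the end of the value.
CGSPACE_LINK = re.compile(r".*cgspace\.cgiar\.org/handle/(\d+/\d+)$")


def to_handle_url(text_value: str) -> str:
    """Rewrite a CGSpace item link as a hdl.handle.net URL; leave other values alone."""
    return CGSPACE_LINK.sub(r"https://hdl.handle.net/\1", text_value)


# Hypothetical sample values, for illustration only
samples = [
    "https://cgspace.cgiar.org/handle/10568/12345",
    "http://cgspace.cgiar.org/handle/10568/12345",
    "https://hdl.handle.net/10568/99999",
    "https://cgspace.cgiar.org/handle/10568/12345/extra",
]
for value in samples:
    print(value, "->", to_handle_url(value))
```

Values that do not end in a handle, like the last sample, are left unchanged, which matches the `text_value ~ '...$'` condition in the `WHERE` clause.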
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
+2243
+
Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:
+$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1613
+$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1696
+$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1708
+$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1830
+$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+1811
+
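Running one `grep` pipeline per IP gets tedious, so the per-IP session counts above could also be collected in a single pass. The sketch below assumes that each DSpace log line contains `session_id=<32 characters>:ip_addr=<IPv4 address>`; only the first half of that is confirmed by the `grep` pattern, so the address part is an assumption, and `sessions_per_ip` and the sample lines are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Assumption: the client address follows "ip_addr=" directly and is IPv4.
SESSION_RE = re.compile(
    r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})"
)


def sessions_per_ip(lines: Iterable[str]) -> Dict[str, int]:
    """Count distinct session IDs seen for each client IP."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        for session_id, ip in SESSION_RE.findall(line):
            seen[ip].add(session_id)
    return {ip: len(ids) for ip, ids in seen.items()}


# Hypothetical log lines, shaped only to satisfy the pattern above
sample_lines = [
    f"INFO session_id={'A' * 32}:ip_addr=65.109.2.97:view_item",
    f"INFO session_id={'B' * 32}:ip_addr=65.109.2.97:view_item",
    f"INFO session_id={'B' * 32}:ip_addr=65.109.2.97:search",
    f"INFO session_id={'C' * 32}:ip_addr=95.216.174.97:view_item",
]
counts = sessions_per_ip(sample_lines)
for ip, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(count, ip)
```

For a real log, passing an open file handle for `dspace.log.2022-07-10` as `lines` should work the same way, since the function only iterates over lines.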
These IPs are using normal-looking user agents:
+Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36
I will add networks I’m seeing now to nginx’s bot-networks.conf for now (not all of Hetzner) and purge the hits later:
+I think I’m going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need
+Sheesh, there are a bunch more IPv6 addresses also on Hetzner:
+# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
+ 1 2a01:4f9:6a:1c2b::2
+ 2 2a01:4f9:2b:5a8::2
+ 2 2a01:4f9:4b:4495::2
+ 96 2a01:4f9:c010:518c::1
+ 137 2a01:4f9:c010:a9bc::1
+ 142 2a01:4f9:c010:58c9::1
+ 142 2a01:4f9:c010:58ea::1
+ 144 2a01:4f9:c010:58eb::1
+ 145 2a01:4f9:c010:6ff8::1
+ 148 2a01:4f9:c010:5190::1
+ 149 2a01:4f9:c010:7d6d::1
+ 153 2a01:4f9:c010:5226::1
+ 156 2a01:4f9:c010:7f74::1
+ 160 2a01:4f9:c010:5188::1
+ 161 2a01:4f9:c010:58e5::1
+ 168 2a01:4f9:c010:58ed::1
+ 170 2a01:4f9:c010:548e::1
+ 170 2a01:4f9:c010:8c97::1
+ 175 2a01:4f9:c010:58c8::1
+ 175 2a01:4f9:c010:aada::1
+ 182 2a01:4f9:c010:58ec::1
+ 182 2a01:4f9:c010:ae8c::1
+ 502 2a01:4f9:c010:ee57::1
+ 530 2a01:4f9:c011:567a::1
+ 535 2a01:4f9:c010:d04e::1
+ 539 2a01:4f9:c010:3d9a::1
+ 586 2a01:4f9:c010:93db::1
+ 593 2a01:4f9:c010:a04a::1
+ 601 2a01:4f9:c011:4166::1
+ 607 2a01:4f9:c010:9881::1
+ 640 2a01:4f9:c010:87fb::1
+ 648 2a01:4f9:c010:e680::1
+ 1141 2a01:4f9:3a:2696::2
+ 1146 2a01:4f9:3a:2555::2
+ 3207 2a01:4f9:3a:2c19::2
+
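With this many individual IPv6 addresses it may be easier to reason about networks than hosts. A rough sketch using Python's standard `ipaddress` module to sum the hit counts per enclosing network follows. The rows are a handful copied from the listing above, while the `/48` prefix length is an arbitrary choice for illustration and not a statement about how Hetzner allocates addresses; `hits_per_network` is a hypothetical helper.

```python
import ipaddress
from collections import Counter
from typing import Iterable, Tuple


def hits_per_network(rows: Iterable[Tuple[int, str]], prefixlen: int = 48) -> Counter:
    """Sum hit counts per network of the given prefix length."""
    totals: Counter = Counter()
    for hits, address in rows:
        # strict=False lets us pass a host address and get its enclosing network
        network = ipaddress.ip_network(f"{address}/{prefixlen}", strict=False)
        totals[str(network)] += hits
    return totals


# (hits, address) pairs taken from the `uniq -c` output above
rows = [
    (3207, "2a01:4f9:3a:2c19::2"),
    (1146, "2a01:4f9:3a:2555::2"),
    (1141, "2a01:4f9:3a:2696::2"),
    (601, "2a01:4f9:c011:4166::1"),
    (530, "2a01:4f9:c011:567a::1"),
    (648, "2a01:4f9:c010:e680::1"),
]
for network, hits in hits_per_network(rows).most_common():
    print(hits, network)
```

Calling `hits_per_network(rows, prefixlen=32)` collapses all of these into a single `2a01:4f9::/32` entry, which is the prefix the `grep` above matched on.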
I extracted all the IPs and used `resolve-addresses-geoip2.py` to analyze them and extract all the Hetzner networks and block them