Add notes for 2022-09-12

2025-01-27 05:49:12 +01:00 · 2022-09-12 11:35:57 +03:00
parent 69392070de
commit 147ad86375
120 changed files with 294 additions and 152 deletions
--- a/content/posts/2022-09.md
+++ b/content/posts/2022-09.md
@@ -137,4 +137,72 @@ COMMIT

 - Start a full Discovery index on CGSpace to catch these changes in the Discovery

+## 2022-09-11
+
+- Today is Sunday and I see the load on the server is high
+  - Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it's not from them!
+  - Looking at the top IPs this morning:
+
+```console
+# cat /var/log/nginx/{access,library-access,oai,rest}.log /var/log/nginx/{access,library-access,oai,rest}.log.1 | grep '11/Sep/2022' | awk '{print $1}' | sort | uniq -c | sort -h | tail -n 40
+...
+    165 64.233.172.79
+    166 87.250.224.34
+    200 69.162.124.231
+    202 216.244.66.198
+    385 207.46.13.149
+    398 207.46.13.147
+    421 66.249.64.185
+    422 157.55.39.81
+    442 2a01:4f8:1c17:5550::1
+    451 64.124.8.36
+    578 137.184.159.211
+    597 136.243.228.195
+   1185 66.249.64.183
+   1201 157.55.39.80
+   3135 80.248.237.167
+   4794 54.195.118.125
+   5486 45.5.186.2
+   6322 2a01:7e00::f03c:91ff:fe9a:3a37
+   9556 66.249.64.181
+```
+
+- The top IP is still Google's, but all of its requests return HTTP 503 because I have classified them as bots, at least for XMLUI
+- Then there's 80.248.237.167, which is using a normal user agent and scraping Discovery
+  - That IP belongs to Internet Vikings (aka Internetbolaget), and we are already marking that subnet as 'bot' for XMLUI, so most of these requests are HTTP 503
+- On another note, I'm curious to explore enabling caching of certain REST API responses
+  - For example, where the requests come from harvesters rather than actual clients fetching bitstreams or thumbnails, caching might speed up responses for subsequent requestors:
+
+```console
+# awk '{print $7}' /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n 10
+      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
+      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
+      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
+      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
+      5 /rest/handle/10568/110310?expand=all
+      5 /rest/handle/10568/89980?expand=all
+      5 /rest/handle/10568/97614?expand=all
+      6 /rest/handle/10568/107086?expand=all
+      6 /rest/handle/10568/108503?expand=all
+      6 /rest/handle/10568/98424?expand=all
+```
+
+- I specifically must not cache requests for bitstreams, because those come from actual users and each real request has to reach DSpace so that it is recorded as a statistics hit
+  - It will be interesting to check the results above as the day goes on (it is now 10 AM)
+  - To estimate the potential savings from caching I will compare the number of distinct non-bitstream request URLs with the number of those URLs requested more than once (updated the next morning using yesterday's log):
+
+```console
+# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
+33733
+# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
+5637
+```
+
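Editorial sketch (not part of the commit): the two counts above say that 5637 of 33733 distinct URLs, roughly 17%, were requested more than once. They do not give the total number of requests, so they cannot show what share of traffic a cache could actually absorb. The Python below is a minimal sketch of that extra step, assuming the request path is the 7th whitespace-separated field (as in the `awk '{print $7}'` calls) and that bitstream downloads can be excluded by the substring `retrieve`. The function names are hypothetical, and `max_hit_ratio` is an upper bound that ignores cache expiry and non-cacheable responses.

```python
from collections import Counter


def request_path(line):
    """Return the 7th whitespace-separated field (awk's $7), or None if absent."""
    fields = line.split()
    return fields[6] if len(fields) > 6 else None


def cache_estimate(lines, skip="retrieve"):
    """Summarize how often non-bitstream request URLs repeat in an access log."""
    counts = Counter()
    for line in lines:
        path = request_path(line)
        # Mirror `grep -v retrieve`: drop bitstream downloads and malformed lines
        if path is None or skip in path:
            continue
        counts[path] += 1

    total = sum(counts.values())
    distinct = len(counts)
    repeated = sum(1 for n in counts.values() if n > 1)
    return {
        "total": total,        # all non-bitstream requests
        "distinct": distinct,  # same as `sort -u | wc -l`
        "repeated": repeated,  # same as `uniq -c | awk '$1 > 1' | wc -l`
        # Every request after the first for a URL could, at best, be a cache hit
        "max_hit_ratio": (total - distinct) / total if total else 0.0,
    }
```

It could be run against the rotated log with something like `cache_estimate(open("/var/log/nginx/rest.log.1"))`, since a file object yields its lines one at a time.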
+- In the afternoon I started a harvest on AReS (which should also affect the numbers above)
+- I enabled an nginx proxy cache on DSpace Test for this location regex: `location ~ /rest/(handle|items|collections|communities)/.+`
+
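Editorial sketch (not part of the commit): a quick way to sanity-check which request paths the location regex above would capture. It assumes Python's `re.search` behaves like nginx's case-sensitive, unanchored `~` match for this simple pattern, and that nginx matches locations against the URI without its query string. The helper name and the sample paths in the usage note are illustrative only.

```python
import re

# Pattern copied verbatim from the nginx location regex in the notes
CACHED_LOCATION = re.compile(r"/rest/(handle|items|collections|communities)/.+")


def would_match_cache_location(uri):
    """Return True if the request URI would fall under the cached location."""
    # Location matching ignores the query string, so strip it first
    path = uri.split("?", 1)[0]
    # `~` is unanchored, so search anywhere in the path rather than from the start
    return CACHED_LOCATION.search(path) is not None
```

Under those assumptions, `/rest/handle/10568/110310?expand=all` and `/rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams` (a listing, not a download) would match, while a path such as `/rest/bitstreams/<uuid>/retrieve` would not, which is consistent with keeping real bitstream requests out of the cache.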
+## 2022-09-12
+
+- I am testing a harvest of DSpace Test via AReS with the nginx proxy cache enabled
+
 <!-- vim: set sw=2 ts=2: -->