Add notes for 2022-09-12

2022-09-12 11:35:57 +03:00
parent 69392070de
commit 147ad86375
120 changed files with 294 additions and 152 deletions

@@ -137,4 +137,72 @@ COMMIT
- Start a full Discovery index on CGSpace to catch these changes in the Discovery
## 2022-09-11
- Today is Sunday and I see the load on the server is high
- Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it's not from them!
- Looking at the top IPs this morning:
```console
# cat /var/log/nginx/{access,library-access,oai,rest}.log /var/log/nginx/{access,library-access,oai,rest}.log.1 | grep '11/Sep/2022' | awk '{print $1}' | sort | uniq -c | sort -h | tail -n 40
...
165 64.233.172.79
166 87.250.224.34
200 69.162.124.231
202 216.244.66.198
385 207.46.13.149
398 207.46.13.147
421 66.249.64.185
422 157.55.39.81
442 2a01:4f8:1c17:5550::1
451 64.124.8.36
578 137.184.159.211
597 136.243.228.195
1185 66.249.64.183
1201 157.55.39.80
3135 80.248.237.167
4794 54.195.118.125
5486 45.5.186.2
6322 2a01:7e00::f03c:91ff:fe9a:3a37
9556 66.249.64.181
```
- The top IP is still Google, but all of its requests are getting HTTP 503, at least on XMLUI, because I classified them as bots there
- Then there's 80.248.237.167, which is using a normal user agent and scraping Discovery
- That IP belongs to Internet Vikings (aka Internetbolaget), and we are already marking that subnet as 'bot' for XMLUI, so most of these requests get HTTP 503
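A minimal sketch of how the claim that "most of these requests are HTTP 503" could be checked per IP. It assumes the nginx logs use the default combined format (request in the first double-quoted field, status code right after it). The function name `status_by_ip` and the sample lines are hypothetical; the sample uses documentation addresses, not real log data.

```python
from collections import Counter, defaultdict


def status_by_ip(lines):
    """Map each client IP to a Counter of the HTTP status codes it received."""
    per_ip = defaultdict(Counter)
    for line in lines:
        # Combined format: IP ... "REQUEST" STATUS BYTES "REFERER" "UA"
        parts = line.split('"')
        if len(parts) < 3:
            continue  # not a combined-format line
        head = parts[0].split()  # IP is the first token before the request
        tail = parts[2].split()  # status is the first token after the request
        if not head or not tail:
            continue
        per_ip[head[0]][tail[0]] += 1
    return per_ip


# Illustrative lines only (hypothetical, not taken from the real logs)
sample = [
    '192.0.2.10 - - [11/Sep/2022:09:00:01 +0200] "GET /discover HTTP/1.1" 503 190 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [11/Sep/2022:09:00:02 +0200] "GET /discover?page=2 HTTP/1.1" 503 190 "-" "Mozilla/5.0"',
    '192.0.2.20 - - [11/Sep/2022:09:00:03 +0200] "GET /rest/items HTTP/1.1" 200 512 "-" "curl/7.68.0"',
]

for ip, statuses in status_by_ip(sample).items():
    print(ip, dict(statuses))
```

To run it on a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.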
- On another note, I'm curious to explore caching certain REST API responses
- For example, where the requests come from harvesters rather than actual clients getting bitstreams or thumbnails, there might be a benefit to speeding up responses for subsequent requestors:
```console
# awk '{print $7}' /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n 10
4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
5 /rest/handle/10568/110310?expand=all
5 /rest/handle/10568/89980?expand=all
5 /rest/handle/10568/97614?expand=all
6 /rest/handle/10568/107086?expand=all
6 /rest/handle/10568/108503?expand=all
6 /rest/handle/10568/98424?expand=all
```
- I specifically must not cache things like requests for bitstreams, because those come from actual users and the real requests need to reach the application so they register as statistics hits
- It will be interesting to check the results above as the day goes on (it's now 10AM)
- To estimate the potential savings from caching, I will check how many distinct non-bitstream URLs are requested versus how many of those are requested more than once (updated the next morning using yesterday's log):
```console
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
33733
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
5637
```
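The two counts above mean that roughly 16.7% of the distinct non-bitstream URLs (5637 of 33733) were requested more than once. A rough sketch of the same estimate in Python follows. It mirrors the `awk '{print $7}'` and `grep -v retrieve` steps, and the helper names `request_paths` and `repeat_stats` are hypothetical. It also reports `max_cacheable`, the number of requests beyond the first for each URL, which is only an upper bound on cache hits under the assumption that cached entries never expire.

```python
from collections import Counter


def request_paths(lines, exclude="retrieve"):
    """Yield the 7th whitespace-separated field (the request path),
    skipping paths that contain `exclude`, like `grep -v retrieve`."""
    for line in lines:
        fields = line.split()
        if len(fields) > 6 and exclude not in fields[6]:
            yield fields[6]


def repeat_stats(paths):
    """Summarize how often each path repeats."""
    counts = Counter(paths)
    total = sum(counts.values())
    distinct = len(counts)
    repeated = sum(1 for n in counts.values() if n > 1)
    return {
        "total": total,
        "distinct": distinct,        # like `sort -u | wc -l`
        "repeated": repeated,        # like `uniq -c | awk '$1 > 1' | wc -l`
        "max_cacheable": total - distinct,
    }


# Share of distinct URLs seen more than once, from the counts in the log above
print(f"{5637 / 33733:.1%}")
```

Usage would be along the lines of `repeat_stats(request_paths(open("/var/log/nginx/rest.log.1")))`, assuming the log is readable by the user running the script.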
- In the afternoon I started a harvest on AReS, which should also affect the numbers above
- I enabled an nginx proxy cache on DSpace Test for this location regex: `location ~ /rest/(handle|items|collections|communities)/.+`
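As a quick sanity check of which requests that location regex would capture, here is a small sketch that applies the same pattern in Python. It assumes nginx semantics where a `~` location is a case-sensitive, unanchored regex matched against the URI path without the query string; it ignores nginx's URI normalization (percent-decoding, `..` resolution). The function name `would_match` and the bitstream UUID are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Same pattern as the nginx location block, unanchored like `location ~`
CACHED_LOCATION = re.compile(r"/rest/(handle|items|collections|communities)/.+")


def would_match(request_uri):
    """Return True if the path part of request_uri matches the location regex."""
    path = urlsplit(request_uri).path  # drop the query string before matching
    return CACHED_LOCATION.search(path) is not None


examples = [
    "/rest/handle/10568/110310?expand=all",
    "/rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata",
    "/rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams",
    # Hypothetical bitstream download path; it must stay uncached
    "/rest/bitstreams/00000000-0000-0000-0000-000000000000/retrieve",
    "/rest/items",
]

for uri in examples:
    print(would_match(uri), uri)
```

Note that `/rest/items/<uuid>/bitstreams` does match the pattern, while a `/rest/bitstreams/.../retrieve` path does not, because `bitstreams` is not one of the alternatives directly after `/rest/`.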
## 2022-09-12
- I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled
<!-- vim: set sw=2 ts=2: -->