mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-09-12
@ -137,4 +137,72 @@ COMMIT
- Start a full Discovery index on CGSpace to catch these changes in the Discovery

## 2022-09-11

- Today is Sunday and I see the load on the server is high
- Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it's not from them!
- Looking at the top IPs this morning:

```console
# cat /var/log/nginx/{access,library-access,oai,rest}.log /var/log/nginx/{access,library-access,oai,rest}.log.1 | grep '11/Sep/2022' | awk '{print $1}' | sort | uniq -c | sort -h | tail -n 40
...
    165 64.233.172.79
    166 87.250.224.34
    200 69.162.124.231
    202 216.244.66.198
    385 207.46.13.149
    398 207.46.13.147
    421 66.249.64.185
    422 157.55.39.81
    442 2a01:4f8:1c17:5550::1
    451 64.124.8.36
    578 137.184.159.211
    597 136.243.228.195
   1185 66.249.64.183
   1201 157.55.39.80
   3135 80.248.237.167
   4794 54.195.118.125
   5486 45.5.186.2
   6322 2a01:7e00::f03c:91ff:fe9a:3a37
   9556 66.249.64.181
```
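
To follow up on a listing like the one above, it can help to see which HTTP status codes a single client received. This is only a sketch: `status_by_ip` is a hypothetical helper, not something from these notes, and it assumes nginx's default `combined` log format, where the client address is field 1 and the status code is field 9.

```shell
# Hypothetical helper: count responses per HTTP status code for one client.
# Assumption: nginx "combined" log format (field 1 = client IP, field 9 = status).
# Reads log lines on stdin; prints "<count> <status>" with the largest count first.
status_by_ip() {
  awk -v ip="$1" '
    $1 == ip { count[$9]++ }                       # tally statuses for this client only
    END { for (s in count) print count[s], s }
  ' | sort -k1,1nr -k2,2
}
```

It could be used as `cat /var/log/nginx/access.log | status_by_ip 66.249.64.181`, which would show whether a client is mostly receiving HTTP 503 or is actually being served.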

- The top is still Google, but all the requests are HTTP 503 because I classified them as bots for XMLUI at least
- Then there's 80.248.237.167, which is using a normal user agent and scraping Discovery
  - That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as 'bot' for XMLUI so most of these requests are HTTP 503
- On another note, I'm curious to explore enabling caching of certain REST API responses
  - For example, where the use is for harvesting rather than actual clients getting bitstreams or thumbnails, it seems there might be a benefit to speeding these up for subsequent requestors:

```console
# awk '{print $7}' /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n 10
      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
      5 /rest/handle/10568/110310?expand=all
      5 /rest/handle/10568/89980?expand=all
      5 /rest/handle/10568/97614?expand=all
      6 /rest/handle/10568/107086?expand=all
      6 /rest/handle/10568/108503?expand=all
      6 /rest/handle/10568/98424?expand=all
```

- I specifically must not cache requests for bitstreams, because those come from actual users and the real requests need to reach DSpace so we get the statistics hit
- It will be interesting to check the results above as the day goes on (now 10AM)
- To estimate the potential savings from caching I will compare the number of distinct non-bitstream URLs requested with the number of those requested more than once (updated the next morning using yesterday's log):

```console
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
33733
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
5637
```
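
With the counts above, 5637 of the 33733 distinct URLs (roughly 17%) were requested more than once. That says how many URLs repeat, but not how many requests a cache could have answered. A rough sketch of a single pass that also counts total requests is below; `cache_savings` is a hypothetical helper, and it assumes the same log layout as the commands above (request path in field 7) and a cache that never expires within the log period, so the result is an upper bound.

```shell
# Hypothetical helper: estimate how many requests a cache could have served.
# Assumption: the request path is field 7, as in the awk commands above.
# Reads access log lines on stdin and skips bitstream "retrieve" requests.
cache_savings() {
  awk '{print $7}' | grep -v retrieve | sort | uniq -c | awk '
    { total += $1; unique++ }        # $1 is the hit count for one distinct URL
    $1 > 1 { repeated++ }            # URLs requested more than once
    END {
      # Every request after the first for a given URL is a potential cache hit
      printf "total=%d unique=%d repeated=%d cacheable_hits=%d\n",
             total, unique, repeated, total - unique
    }'
}
```

For example, `cache_savings < /var/log/nginx/rest.log.1` would print the distinct and repeated counts shown above alongside the total and the number of potentially cacheable requests.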

- In the afternoon I started a harvest on AReS (which should affect the numbers above also)
- I enabled an nginx proxy cache on DSpace Test for this location regex: `location ~ /rest/(handle|items|collections|communities)/.+`
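
As a quick sanity check on that regex, the sketch below tests which request paths it would match. `matches_cache_location` is a hypothetical helper, and `grep -E` only approximates nginx's PCRE matching, although the two should agree for a pattern this simple. The query string is stripped first because nginx matches `location` against the URI without its arguments. The bitstream download path used in the usage note (`/rest/bitstreams/<id>/retrieve`) is assumed from the `retrieve` filter in the commands above.

```shell
# Hypothetical helper: would this request path fall under the cached location?
# Assumption: grep -E behaves like nginx's regex engine for this simple pattern.
# Returns 0 (true) on a match and 1 (false) otherwise.
matches_cache_location() {
  path="${1%%\?*}"   # drop the query string, as nginx does before location matching
  printf '%s\n' "$path" | grep -Eq '/rest/(handle|items|collections|communities)/.+'
}
```

For instance, `matches_cache_location '/rest/handle/10568/110310?expand=all'` succeeds, while a path such as `/rest/bitstreams/<id>/retrieve` does not match, which is consistent with keeping bitstream downloads out of the cache. Note that `/rest/items/<uuid>/bitstreams` does match, since it returns a listing rather than the file itself.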

## 2022-09-12

- I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled

<!-- vim: set sw=2 ts=2: -->