mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2019-11-04
81
content/posts/2019-11.md
Normal file
@@ -0,0 +1,81 @@
---
title: "November, 2019"
date: 2019-11-04T12:20:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2019-11-04

- Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
- I looked in the nginx logs and saw 4.6 million hits in the access logs and 1.2 million in the API logs:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
1277694
```
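
The same count could be wrapped in a small function so the month is easy to change. This is only a sketch: `count_month_hits` is a hypothetical name that does not appear in these notes, and it reads log lines from standard input rather than opening the files itself.

```shell
# Hypothetical helper (not from the original notes): count the log lines on
# standard input whose timestamp falls in the given month, e.g. "Oct/2019".
count_month_hits() {
  # grep -c prints the number of matching lines; the pattern mirrors the
  # one used above (a one- or two-digit day, a slash, then month/year).
  grep -cE "[0-9]{1,2}/$1"
}

# Example usage, assuming the same log locations as in the notes:
#   zcat --force /var/log/nginx/*access.log.*.gz | count_month_hits 'Oct/2019'
```

Reading from standard input keeps the function independent of where the logs live and whether they are compressed.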

- So that is 4.6 million requests from XMLUI and another 1.2 million from API requests, about 5.9 million in total
- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

```
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781
```
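
To put those two numbers in proportion, here is a quick sketch that computes the share with `awk`. The helper name `percent_of` is made up for illustration; the two counts are the ones printed above.

```shell
# Hypothetical helper: print what percentage of a total a given count is,
# rounded to one decimal place.
percent_of() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Bitstream requests as a share of all REST API requests in October,
# using the counts from the two commands above.
percent_of 106781 1183456
```

With these counts it prints `9.0`, so roughly 9% of the REST API requests in October were for bitstreams.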

<!--more-->

- The request methods in the access logs break down as follows (lazily extracted from the sixth field of each nginx log line):

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
      1 PUT
      8 PROPFIND
    283 OPTIONS
  30102 POST
  46581 HEAD
4594967 GET
```
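
A slightly less lazy variant, as a sketch only: instead of taking the sixth whitespace-separated field, it splits each line on double quotes and takes the first word of the request line. It assumes the request is the first double-quoted field, as in nginx's default `combined` log format, and `count_methods` is a hypothetical name.

```shell
# Hypothetical helper: tally HTTP request methods from nginx access log
# lines on standard input. Assumes the request line ("GET /path HTTP/1.1")
# is the first double-quoted field of each log line.
count_methods() {
  # Split on double quotes, take the first word of the request line,
  # then count occurrences and sort by count, smallest first.
  awk -F'"' '{ split($2, req, " "); print req[1] }' | sort | uniq -c | sort -n
}

# Example usage, assuming the same log locations as in the notes:
#   zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | count_methods
```

Splitting on the quote character avoids having to strip the leading `"` with `sed` afterwards.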

- Two very active IPs are 34.224.4.16 and 34.234.204.152, which together made over 360,000 requests in October:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
```
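
The notes do not record how those two IPs were spotted. One common way to surface the busiest clients is sketched below; it assumes the client IP is the first whitespace-separated field of each log line (as in nginx's default `combined` format), and `top_ips` is a hypothetical name.

```shell
# Hypothetical helper: list the N most active client IPs (default 10) in the
# log lines on standard input. Assumes the client IP is the first field.
top_ips() {
  # Extract the first field, count occurrences of each IP,
  # sort by count (largest first) and keep the top N.
  awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example usage, assuming the same log locations as in the notes:
#   zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | top_ips 20
```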

- Their user agent is one I've never seen before:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
```

- Most of their requests seem to be for community or collection discover and browse results pages like `/handle/10568/103/discover`:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
   6566 GET /bitstream
 351928 GET /handle
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
214209
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
86874
```
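
The discover and browse counts above could also be collected in a single pass over the logs. The following is a rough sketch under the same matching rules (the words `discover` and `browse` anywhere on a matching line); `count_discover_browse` is a hypothetical name.

```shell
# Hypothetical helper: among log lines on standard input that contain the
# given user agent string, count GET requests to /bitstream, /discover or
# /handle whose line mentions "discover" and whose line mentions "browse".
count_discover_browse() {
  grep -F "$1" | awk '
    /GET \/(bitstream|discover|handle)/ {
      if ($0 ~ /discover/) discover++
      if ($0 ~ /browse/) browse++
    }
    END { printf "discover %d\nbrowse %d\n", discover, browse }'
}

# Example usage, assuming the same log locations as in the notes:
#   zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | count_discover_browse Amazonbot
```

This reads the compressed logs once instead of twice, which matters a little when a month of access logs is several million lines.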

- As far as I can tell, none of their requests are counted in the Solr statistics:

```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
```
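
The query string in that URL is URL-encoded: `%3A` is a colon and `+` is a space, so it decodes to `(ip:34.224.4.16 OR ip:34.234.204.152)`. As a sketch, a hypothetical helper named `solr_ip_query` could build the same URL for any list of IPs; the host, port and core are copied from the command above.

```shell
# Hypothetical helper: build the Solr statistics query URL used above for
# one or more IP addresses, joined with OR. "%3A" is a URL-encoded colon
# and "+" stands for a space.
solr_ip_query() {
  q=""
  for ip in "$@"; do
    # Put "+OR+" between terms, but not before the first one.
    if [ -n "$q" ]; then q="${q}+OR+"; fi
    q="${q}ip%3A${ip}"
  done
  # The encoded terms are passed as an argument so that printf does not
  # treat "%3A" as a format specifier.
  printf 'http://localhost:8081/solr/statistics/select?q=(%s)&rows=0&wt=json&indent=true\n' "$q"
}

# Example usage with HTTPie, as in the notes:
#   http --print b "$(solr_ip_query 34.224.4.16 34.234.204.152)"
```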

- Still, those requests are CPU-intensive, so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load
- After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
```

<!-- vim: set sw=2 ts=2: -->