---
title: "November, 2019"
date: 2019-11-04T12:20:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2019-11-04

- Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
- I looked in the nginx logs and found about 4.67 million hits in the access logs and 1.28 million in the API logs:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
1277694
```

- So about 4.67 million from XMLUI and another 1.28 million from API requests
- Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

```
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781
```
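
- A rough cross-check of the counts so far, as a minimal sketch: the numbers are copied by hand from the command output above rather than recomputed from the logs, and the variable names are only illustrative:

```shell
# Counts copied from the zcat/grep output above (not recomputed here)
xmlui_hits=4671942
api_hits=1277694
rest_hits=1183456
rest_bitstream_hits=106781

# Total requests seen by nginx in October across access and API logs
total_hits=$((xmlui_hits + api_hits))
echo "nginx total: ${total_hits}"

# Share of REST API requests that were for bitstreams, to one decimal place
bitstream_share=$(awk -v b="$rest_bitstream_hits" -v r="$rest_hits" \
    'BEGIN { printf "%.1f", 100 * b / r }')
echo "bitstream share of REST requests: ${bitstream_share}%"
```

- By this arithmetic nginx saw about 5.9 million requests in total, and roughly 9% of the REST API requests were for bitstreams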

<!--more-->

- The request methods in the access logs, found by lazily extracting the sixth field of each nginx log line:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
1 PUT
8 PROPFIND
283 OPTIONS
30102 POST
46581 HEAD
4594967 GET
```
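
- To illustrate why the sixth field works, here is a small sketch that runs the same pipeline on a few made-up lines. It assumes the logs use nginx's default `combined` format, which the notes do not state explicitly; the sample addresses, paths, and user agent are hypothetical:

```shell
# Hypothetical sample lines in nginx's "combined" log format;
# the addresses, paths, and user agent are made up for illustration.
sample_log='203.0.113.5 - - [04/Oct/2019:10:00:00 +0200] "GET /handle/10568/103/discover HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
203.0.113.5 - - [04/Oct/2019:10:00:01 +0200] "GET /handle/10568/103/browse HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
198.51.100.7 - - [04/Oct/2019:10:00:02 +0200] "HEAD /bitstream/handle/10568/1/file.pdf HTTP/1.1" 200 0 "-" "ExampleBot/1.0"'

# Field 6 is the opening quote plus the method (for example "GET), so
# strip the first double quote and tally the remaining values.
count_methods() {
    awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
}

method_counts=$(printf '%s\n' "$sample_log" | count_methods)
printf '%s\n' "$method_counts"
```

- The approach is lazy in that it trusts whitespace splitting: a malformed request line would put something other than an HTTP method in that field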

- Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
```
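
- The notes do not show how these two IPs were found. One common way to surface the busiest clients is to tally the first field of each log line; the sketch below does that on hypothetical sample data (the `top_ips` helper and the addresses are illustrative, and the client address is assumed to be the first field):

```shell
# Hypothetical sample lines (documentation-range addresses, not real clients).
sample_log='203.0.113.5 - - [04/Oct/2019:10:00:00 +0200] "GET /handle/10568/1 HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
203.0.113.5 - - [04/Oct/2019:10:00:01 +0200] "GET /handle/10568/2 HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
203.0.113.5 - - [04/Oct/2019:10:00:02 +0200] "GET /handle/10568/3 HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
198.51.100.7 - - [04/Oct/2019:10:00:03 +0200] "GET /handle/10568/4 HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
198.51.100.7 - - [04/Oct/2019:10:00:04 +0200] "GET /handle/10568/5 HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
192.0.2.9 - - [04/Oct/2019:10:00:05 +0200] "GET /handle/10568/6 HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"'

# Print the N busiest client addresses (first field), busiest first.
# N defaults to 10 when no argument is given.
top_ips() {
    awk '{print $1}' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

busiest=$(printf '%s\n' "$sample_log" | top_ips 2)
printf '%s\n' "$busiest"
```

- On the real logs the same helper could be fed from the pipeline used above, for example `zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | top_ips 10`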

- Their user agent is one I've never seen before:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
```

- Most of their requests seem to be for community or collection discover and browse result pages, like `/handle/10568/103/discover`:

```
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
6566 GET /bitstream
351928 GET /handle
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
214209
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
86874
```
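
- The `grep -c discover` and `grep -c browse` counts above match anywhere in the log line, so a referrer containing one of those words would be counted too. A stricter variant, sketched here on hypothetical sample lines, classifies on the request path only (field 7, assuming the `combined` log format); `classify_paths` is an illustrative helper, not something from the server:

```shell
# Hypothetical sample lines; the third has "discover" only in the referrer.
sample_log='203.0.113.5 - - [04/Oct/2019:10:00:00 +0200] "GET /handle/10568/103/discover HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
203.0.113.5 - - [04/Oct/2019:10:00:01 +0200] "GET /handle/10568/103/browse?type=subject HTTP/1.1" 200 1234 "-" "ExampleBot/1.0"
203.0.113.5 - - [04/Oct/2019:10:00:02 +0200] "GET /bitstream/handle/10568/1/file.pdf HTTP/1.1" 200 1234 "https://example.org/handle/10568/103/discover" "ExampleBot/1.0"'

# Classify each request by its path (field 7) in a single pass,
# ignoring the referrer and user agent fields.
classify_paths() {
    awk '
        $7 ~ /\/discover/ { discover++; next }
        $7 ~ /\/browse/   { browse++; next }
                          { other++ }
        END { printf "discover=%d browse=%d other=%d\n", discover, browse, other }
    '
}

path_counts=$(printf '%s\n' "$sample_log" | classify_paths)
printf '%s\n' "$path_counts"
```

- Whether referrers actually inflate the counts in these logs is not something the numbers above can tell; the sketch only shows how to rule it out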

- As far as I can tell, none of their requests are counted in the Solr statistics:

```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
```

- Still, those requests are CPU-intensive, so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load
- After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

```
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
```
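
- When making more than a handful of test requests, it can help to tally the response codes instead of reading headers one by one. A minimal sketch, with `tally_statuses` as an illustrative helper and made-up status codes standing in for real responses:

```shell
# Tally HTTP status codes read from standard input, most frequent first.
tally_statuses() {
    sort | uniq -c | sort -rn
}

# Hypothetical status codes standing in for real responses.
status_counts=$(printf '%s\n' 200 503 503 503 | tally_statuses)
printf '%s\n' "$status_counts"
```

- Against the test server the codes could come from curl rather than HTTPie, for example `for i in 1 2 3 4 5; do curl -s -o /dev/null -w '%{http_code}\n' -A 'Amazonbot/0.1' 'https://dspacetest.cgiar.org/handle/10568/1/discover'; done | tally_statuses`; how many requests pass before the 503s start depends on the rate limit settings, which are not shown here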

<!-- vim: set sw=2 ts=2: -->