cgspace-notes/content/posts/2019-02.md

146 lines
5.4 KiB
Markdown
Raw Normal View History

2019-02-01 20:46:09 +01:00
---
title: "February, 2019"
date: 2019-02-01T21:37:30+02:00
author: "Alan Orth"
tags: ["Notes"]
---
## 2019-02-01
- Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
2019-02-01 23:01:39 +01:00
- The top IPs before, during, and after this latest alert tonight were:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
405 207.46.13.173
405 207.46.13.75
1117 66.249.66.219
1121 35.237.175.180
1546 5.9.6.51
2474 45.5.186.2
5490 85.25.237.71
```
- `85.25.237.71` is the "Linguee Bot" that I first saw last month
2019-02-01 20:46:09 +01:00
- The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
- There were just over 3 million accesses in the nginx logs last month:
```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243
real 0m19.873s
user 0m22.203s
sys 0m1.979s
```
<!--more-->
- Normally I'd say this was very high, but [about this time last year]({{< relref "2018-02.md" >}}) I remember thinking the same thing when we had 3.1 million...
- I will have to keep an eye on this to see if there is some error in Solr...
2019-02-01 23:01:39 +01:00
- Atmire sent their [pull request to re-enable the Metadata Quality Module (MQM) on our `5_x-dev` branch](https://github.com/ilri/DSpace/pull/407) today
- I will test it next week and send them feedback
2019-02-01 20:46:09 +01:00
2019-02-02 10:36:24 +01:00
## 2019-02-02
- Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
284 18.195.78.144
329 207.46.13.32
417 35.237.175.180
448 34.218.226.147
694 2a01:4f8:13b:1296::2
718 2a01:4f8:140:3192::2
786 137.108.70.14
1002 5.9.6.51
6077 85.25.237.71
8726 45.5.184.2
```
- `45.5.184.2` is CIAT and `85.25.237.71` is the new Linguee bot that I first noticed a few days ago
- I will increase the Linode alert threshold from 275 to 300% because this is becoming too much!
- I tested the Atmire Metadata Quality Module (MQM)'s duplicate checked on the some [WLE items](https://dspacetest.cgiar.org/handle/10568/81268) that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!
2019-02-03 10:33:02 +01:00
## 2019-02-03
- This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!
- Here are the top IPs before, during, and after that time:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
325 85.25.237.71
340 45.5.184.72
431 5.143.231.8
756 5.9.6.51
1048 34.218.226.147
1203 66.249.66.219
1496 195.201.104.240
4658 205.186.128.185
4658 70.32.83.92
4852 45.5.184.2
```
- `45.5.184.2` is CIAT, `70.32.83.92` and `205.186.128.185` are Macaroni Bros harvesters for CCAFS I think
- `195.201.104.240` is a new IP address in Germany with the following user agent:
```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
```
- This user was making 2060 requests per minute this morning... seems like I should try to block this type of behavior heuristically, regardless of user agent!
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
19 03/Feb/2019:07:42
20 03/Feb/2019:07:12
21 03/Feb/2019:07:27
21 03/Feb/2019:07:28
25 03/Feb/2019:07:23
25 03/Feb/2019:07:29
26 03/Feb/2019:07:33
28 03/Feb/2019:07:38
30 03/Feb/2019:07:31
33 03/Feb/2019:07:35
33 03/Feb/2019:07:37
38 03/Feb/2019:07:40
43 03/Feb/2019:07:24
43 03/Feb/2019:07:32
46 03/Feb/2019:07:36
47 03/Feb/2019:07:34
47 03/Feb/2019:07:39
47 03/Feb/2019:07:41
51 03/Feb/2019:07:26
59 03/Feb/2019:07:25
```
- At least they re-used their Tomcat session!
```
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
1
```
- This user was making requests to `/browse`, which is not currently under the existing rate limiting of dynamic pages in our nginx config
- I [extended the existing `dynamicpages` (12/m) rate limit to `/browse` and `/discover`](https://github.com/ilri/rmg-ansible-public/commit/36dfb072d6724fb5cdc81ef79cab08ed9ce427ad) with an allowance for bursting of up to five requests for "real" users
2019-02-03 18:05:28 +01:00
- Run all system updates on linode20 and reboot it
- This will be the new AReS repository explorer server soon
2019-02-03 10:33:02 +01:00
2019-02-04 08:32:27 +01:00
## 2019-02-04
- Generate a list of CTA subjects from CGSpace for Peter:
```
dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
```
2019-02-04 10:22:07 +01:00
- Skype with Michael Victor about CKM and CGSpace
2019-02-04 19:09:20 +01:00
- Discuss the new IITA research theme field with Abenet and decide that we should use `cg.identifier.iitatheme`
2019-02-04 10:22:07 +01:00
2019-02-01 20:46:09 +01:00
<!-- vim: set sw=2 ts=2: -->