2019-02-01
- Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
- The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
405 207.46.13.173
405 207.46.13.75
1117 66.249.66.219
1121 35.237.175.180
1546 5.9.6.51
2474 45.5.186.2
5490 85.25.237.71
- 85.25.237.71 is the “Linguee Bot” that I first saw last month
- The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
- There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243
real 0m19.873s
user 0m22.203s
sys 0m1.979s
2019-02-02
- Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
284 18.195.78.144
329 207.46.13.32
417 35.237.175.180
448 34.218.226.147
694 2a01:4f8:13b:1296::2
718 2a01:4f8:140:3192::2
786 137.108.70.14
1002 5.9.6.51
6077 85.25.237.71
8726 45.5.184.2
- 45.5.184.2 is CIAT and 85.25.237.71 is the new Linguee bot that I first noticed a few days ago
- I will increase the Linode alert threshold from 275% to 300% because this is becoming too much!
- I tested the Atmire Metadata Quality Module (MQM)’s duplicate checker on some WLE items that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!
2019-02-03
- This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!
- Here are the top IPs before, during, and after that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
325 85.25.237.71
340 45.5.184.72
431 5.143.231.8
756 5.9.6.51
1048 34.218.226.147
1203 66.249.66.219
1496 195.201.104.240
4658 205.186.128.185
4658 70.32.83.92
4852 45.5.184.2
- 45.5.184.2 is CIAT, and 70.32.83.92 and 205.186.128.185 are Macaroni Bros harvesters for CCAFS I think
- 195.201.104.240 is a new IP address in Germany with the following user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
- This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent!
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
19 03/Feb/2019:07:42
20 03/Feb/2019:07:12
21 03/Feb/2019:07:27
21 03/Feb/2019:07:28
25 03/Feb/2019:07:23
25 03/Feb/2019:07:29
26 03/Feb/2019:07:33
28 03/Feb/2019:07:38
30 03/Feb/2019:07:31
33 03/Feb/2019:07:35
33 03/Feb/2019:07:37
38 03/Feb/2019:07:40
43 03/Feb/2019:07:24
43 03/Feb/2019:07:32
46 03/Feb/2019:07:36
47 03/Feb/2019:07:34
47 03/Feb/2019:07:39
47 03/Feb/2019:07:41
51 03/Feb/2019:07:26
59 03/Feb/2019:07:25
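- A minimal sketch of the kind of heuristic I have in mind: count requests per IP per minute and flag any IP that reaches a threshold, ignoring the user agent entirely. It assumes nginx's default "combined" log format; the function names and the threshold of 20 are hypothetical and only for illustration, not something in our current setup:

```python
import re
from collections import Counter

# Captures the client IP and the timestamp truncated to the minute.
# Assumes nginx's default "combined" log format, e.g.:
#   1.2.3.4 - - [03/Feb/2019:07:25:12 +0000] "GET /browse HTTP/1.1" 200 ...
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} ')


def requests_per_minute(lines):
    """Count requests for each (ip, minute) pair found in the log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


def heavy_clients(lines, threshold=20):
    """Return {ip: peak requests per minute} for IPs that reach the threshold.

    The threshold is a hypothetical value for illustration only.
    """
    peaks = {}
    for (ip, _minute), count in requests_per_minute(lines).items():
        if count >= threshold:
            peaks[ip] = max(count, peaks.get(ip, 0))
    return peaks
```

- This only reports offenders after the fact; actually blocking them would still have to happen in nginx or a firewall.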
- At least they re-used their Tomcat session!
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
1
- This user was making requests to /browse, which is not currently under the existing rate limiting of dynamic pages in our nginx config
- Run all system updates on linode20 and reboot it
- This will be the new AReS repository explorer server soon
2019-02-04
- Generate a list of CTA subjects from CGSpace for Peter:
dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
- Skype with Michael Victor about CKM and CGSpace
- Discuss the new IITA research theme field with Abenet and decide that we should use cg.identifier.iitatheme
- This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
589 2a01:4f8:140:3192::2
762 66.249.66.219
889 35.237.175.180
1332 34.218.226.147
1393 5.9.6.51
1940 50.116.102.77
3578 85.25.237.71
4311 45.5.184.2
4658 205.186.128.185
4658 70.32.83.92
- At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there’s nothing we can do to improve REST API performance!
- Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?
2019-02-05
- Peter sent me corrections and deletions for the CTA subjects and as usual, there were encoding errors with some accents in his file
- In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use toString():
or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b4.*/)),
isNotNull(value.match(/.*\u007e.*/))
).toString()
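- For checking a CSV outside of OpenRefine, the same test can be sketched in Python. This is a rough equivalent and not the GREL itself: it looks for the same six code points, and the function names and the column name passed in are hypothetical:

```python
import csv
import re

# Same code points as the GREL expression: replacement character, no-break
# space, hair space, right single quotation mark, acute accent, and tilde.
SUSPICIOUS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4\u007E]")


def has_encoding_error(value):
    """Return True if the value contains any of the suspicious characters."""
    return SUSPICIOUS.search(value) is not None


def flag_rows(lines, column):
    """Yield (line number, value) for rows whose column looks suspicious.

    `lines` is any iterable of CSV lines, such as an open file. Numbering
    starts at 2 because line 1 is the header (assumes no multi-line fields).
    """
    for number, row in enumerate(csv.DictReader(lines), start=2):
        value = row.get(column) or ""
        if has_encoding_error(value):
            yield number, value
```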
$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
- I applied them on DSpace Test and CGSpace and started a full Discovery re-index:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
- Peter had marked several terms with || to indicate multiple values in his corrections so I will have to go back and do those manually:
EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
ENVIRONMENTAL PROTECTION AND NATURAL RESOURCES MANAGEMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
MARKETING AND TRADE,MARKETING||TRADE
MARKETING ET COMMERCE,MARKETING||COMMERCE
NATURAL RESOURCES AND ENVIRONMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
PÊCHES ET AQUACULTURE,PÊCHES||AQUACULTURE
PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
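- A small sketch of how rows like these could be separated from the simple one-to-one corrections before running the fix script. It assumes each correction is an (old value, new value) pair as in the listing above; the function names are hypothetical and this is not part of fix-metadata-values.py:

```python
def split_multi_value(correction, separator="||"):
    """Split a value such as 'FISHERIES||AQUACULTURE' into a list of values."""
    return [part.strip() for part in correction.split(separator) if part.strip()]


def partition_corrections(rows, separator="||"):
    """Split (old, new) pairs into simple and multi-value corrections.

    Returns (simple, multi): `simple` keeps the pairs unchanged, while `multi`
    pairs each old value with the list of new values it should become.
    """
    simple, multi = [], []
    for old, new in rows:
        if separator in new:
            multi.append((old, split_multi_value(new, separator)))
        else:
            simple.append((old, new))
    return simple, multi
```

- The multi-value rows still need to be applied as separate metadata values on each item, which a plain find-and-replace cannot do.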
2019-02-06
- I dumped the CTA community so I can try to fix the subjects with multiple values that Peter indicated in his corrections:
$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
- Then I used csvcut to get only the CTA subject columns:
$ csvcut -c "id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]" /tmp/cta.csv > /tmp/cta-subjects.csv
- After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values
- Then I imported it back into CGSpace:
$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
- Another day, another alert about high load on CGSpace (linode18) from Linode
- This time the load average was 370% and the top ten IPs before, during, and after that time were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
689 35.237.175.180
1236 5.9.6.51
1305 34.218.226.147
1580 66.249.66.219
1939 50.116.102.77
2313 108.212.105.35
4666 205.186.128.185
4666 70.32.83.92
4950 85.25.237.71
5158 45.5.186.2
- Looking closer at the top users, I see 45.5.186.2 is in Brazil and was making over 100 requests per minute to the REST API:
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
118 06/Feb/2019:05:46
119 06/Feb/2019:05:37
119 06/Feb/2019:05:47
120 06/Feb/2019:05:43
120 06/Feb/2019:05:44
121 06/Feb/2019:05:38
122 06/Feb/2019:05:39
125 06/Feb/2019:05:42
126 06/Feb/2019:05:40
126 06/Feb/2019:05:41
- I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
10411 200
1 301
7 302
3 404
18 499
2 500
- I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
328 220.247.212.35
372 66.249.66.221
380 207.46.13.2
519 2a01:4f8:140:3192::2
572 5.143.231.8
689 35.237.175.180
771 108.212.105.35
1236 5.9.6.51
1554 66.249.66.219
4942 85.25.237.71
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
10 66.249.66.221
26 66.249.66.219
69 5.143.231.8
340 45.5.184.72
1040 34.218.226.147
1542 108.212.105.35
1937 50.116.102.77
4661 205.186.128.185
4661 70.32.83.92
5102 45.5.186.2
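- If I end up doing this split regularly, it could be wrapped in a small script. A rough sketch, assuming the same log file names as in the two commands above and that the files are plain text (compressed rotations would need gzip handling); the helper names are hypothetical:

```python
import re
from collections import Counter

# Log groups taken from the two commands above.
WEB_LOGS = [f"/var/log/nginx/{name}.log{suffix}"
            for suffix in ("", ".1")
            for name in ("access", "error", "library-access")]
API_LOGS = [f"/var/log/nginx/{name}.log{suffix}"
            for suffix in ("", ".1")
            for name in ("oai", "rest", "statistics")]


def read_lines(paths):
    """Yield lines from each plain-text log file in turn."""
    for path in paths:
        with open(path, errors="replace") as handle:
            yield from handle


def top_ips(lines, window, limit=10):
    """Return the busiest (ip, count) pairs among lines matching `window`.

    `window` is a regular expression for the time range of interest, for
    example r"06/Feb/2019:0[5-9]". The IP is taken as the first field.
    """
    pattern = re.compile(window)
    counts = Counter(
        line.split(" ", 1)[0] for line in lines if pattern.search(line)
    )
    return counts.most_common(limit)


# Example usage (on the server):
#   top_ips(read_lines(WEB_LOGS), r"06/Feb/2019:0[5-9]")
#   top_ips(read_lines(API_LOGS), r"06/Feb/2019:0[5-9]")
```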