---
title: "February, 2019"
date: 2019-02-01T21:37:30+02:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2019-02-01

- Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high, despite my increasing the alert threshold last week from 250% to 275%; I might need to increase it again!
- The top IPs before, during, and after this latest alert tonight were:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    245
    332
    385
    405
    405
   1117
   1121
   1546
   2474
   5490
```

- `` is the "Linguee Bot" that I first saw last month
- The Solr statistics for the past few months have been very high and I was wondering if the web server logs also showed an increase
- There were just over 3 million accesses in the nginx logs last month:

```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243

real    0m19.873s
user    0m22.203s
sys     0m1.979s
```

- Normally I'd say this was very high, but [about this time last year]({{< relref "2018-02.md" >}}) I remember thinking the same thing when we had 3.1 million...
- I will have to keep an eye on this to see if there is some error in Solr...
- Atmire sent their [pull request to re-enable the Metadata Quality Module (MQM) on our `5_x-dev` branch](https://github.com/ilri/DSpace/pull/407) today
- I will test it next week and send them feedback

## 2019-02-02

- Another alert from Linode about CGSpace (linode18) this morning; here are the top IPs in the web server logs before, during, and after that time:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    284
    329
    417
    448
    694 2a01:4f8:13b:1296::2
    718 2a01:4f8:140:3192::2
    786
   1002
   6077
   8726
```

- `` is CIAT and `` is the new Linguee bot that I first noticed a few days ago
- I will increase the Linode alert threshold from 275% to 300% because this is becoming too much!
- I tested the Atmire Metadata Quality Module (MQM)'s duplicate checker on some [WLE items](https://dspacetest.cgiar.org/handle/10568/81268) that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!

## 2019-02-03

- This is seriously getting annoying: Linode sent another alert this morning that CGSpace (linode18) load was 377%!
- Here are the top IPs before, during, and after that time:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    325
    340
    431
    756
   1048
   1203
   1496
   4658
   4658
   4852
```

- `` is CIAT, `` and `` are Macaroni Bros harvesters for CCAFS I think
- `` is a new IP address in Germany with the following user agent:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
```

- This user was making 20–60 requests per minute this morning... seems like I should try to block this type of behavior heuristically, regardless of user agent!
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
     19 03/Feb/2019:07:42
     20 03/Feb/2019:07:12
     21 03/Feb/2019:07:27
     21 03/Feb/2019:07:28
     25 03/Feb/2019:07:23
     25 03/Feb/2019:07:29
     26 03/Feb/2019:07:33
     28 03/Feb/2019:07:38
     30 03/Feb/2019:07:31
     33 03/Feb/2019:07:35
     33 03/Feb/2019:07:37
     38 03/Feb/2019:07:40
     43 03/Feb/2019:07:24
     43 03/Feb/2019:07:32
     46 03/Feb/2019:07:36
     47 03/Feb/2019:07:34
     47 03/Feb/2019:07:39
     47 03/Feb/2019:07:41
     51 03/Feb/2019:07:26
     59 03/Feb/2019:07:25
```

- At least they re-used their Tomcat session!

```
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=' dspace.log.2019-02-03 | sort | uniq | wc -l
1
```

- This user was making requests to `/browse`, which is not currently under the existing rate limiting of dynamic pages in our nginx config
- I [extended the existing `dynamicpages` (12/m) rate limit to `/browse` and `/discover`](https://github.com/ilri/rmg-ansible-public/commit/36dfb072d6724fb5cdc81ef79cab08ed9ce427ad) with an allowance for bursting of up to five requests for "real" users
- Run all system updates on linode20 and reboot it
- This will be the new AReS repository explorer server soon

## 2019-02-04

- Generate a list of CTA subjects from CGSpace for Peter:

```
dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
```

- Skype with Michael Victor about CKM and CGSpace
- Discuss the new IITA research theme field with Abenet and decide that we should use `cg.identifier.iitatheme`
- This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' |
sort | uniq -c | sort -n | tail -n 10
    589 2a01:4f8:140:3192::2
    762
    889
   1332
   1393
   1940
   3578
   4311
   4658
   4658
```

- At this rate I think I just need to stop paying attention to these alerts: DSpace gets thrashed when people use the APIs properly and there's nothing we can do to improve REST API performance!
- Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?

## 2019-02-05

- Peter sent me corrections and deletions for the CTA subjects and, as usual, there were encoding errors with some accents in his file
- In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use `toString()`:

```
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/)),
  isNotNull(value.match(/.*\u007e.*/))
).toString()
```

- Testing the corrections for sixty-five items and sixteen deletions using my [fix-metadata-values.py](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897) and [delete-metadata-values.py](https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be) scripts:

```
$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
```

- I applied them on DSpace Test and CGSpace and started a full Discovery re-index:

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
```
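
- The encoding-error check from the GREL expression above could also be run on a corrections CSV before it goes anywhere near the database. This is only a rough Python sketch of the same idea, not something I actually ran: the function names and the sample CSV are made up for illustration, and I am assuming the file has a column named after the metadata field (`cg.subject.cta`):

```python
import csv
import io

# Same six code points that the GREL expression looks for
SUSPICIOUS = {
    "\uFFFD": "REPLACEMENT CHARACTER",
    "\u00A0": "NO-BREAK SPACE",
    "\u200A": "HAIR SPACE",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u00B4": "ACUTE ACCENT",
    "\u007E": "TILDE",
}


def suspicious_chars(value):
    """Return a sorted list of the flagged characters present in value."""
    return sorted(ch for ch in SUSPICIOUS if ch in value)


def flag_rows(csv_text, column):
    """Return (row number, value, flagged characters) for each suspect row.

    Row numbers count data rows from 1 and do not include the header.
    The column name is an assumption about how the CSV is laid out.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for row_number, row in enumerate(reader, start=1):
        found = suspicious_chars(row[column])
        if found:
            flagged.append((row_number, row[column], found))
    return flagged


if __name__ == "__main__":
    # Hypothetical two-row sample; the first value contains U+FFFD
    sample = "cg.subject.cta,correct\nCAF\uFFFD,CAF\u00C9\nMAIZE,MAIZE\n"
    for row_number, value, found in flag_rows(sample, "cg.subject.cta"):
        names = ", ".join(SUSPICIOUS[ch] for ch in found)
        print(f"row {row_number}: {value!r} contains {names}")
```

- Unlike the GREL version, this reports which character was found in each value, which makes it easier to decide whether a hit is a real encoding error or just a legitimate tilde or apostrophe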