CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2019

2019-02-01

  • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high, despite me increasing the alert threshold last week from 250% to 275%. I might need to increase it again!
  • The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    245 207.46.13.5
    332 54.70.40.11
    385 5.143.231.38
    405 207.46.13.173
    405 207.46.13.75
   1117 66.249.66.219
   1121 35.237.175.180
   1546 5.9.6.51
   2474 45.5.186.2
   5490 85.25.237.71
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
  • There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243

real    0m19.873s
user    0m22.203s
sys     0m1.979s
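
To see whether the web server logs show an increase over several months, rather than counting one month at a time, the requests could be tallied per month in a single pass. This is only a sketch: the `requests_per_month` function name is hypothetical, and it assumes nginx's default "combined" access log format, where the fourth whitespace-separated field starts with `[DD/Mon/YYYY:`. Lines that do not match that shape (for example error log entries) are skipped.

```shell
# requests_per_month: read nginx access log lines on stdin and print
# "count year month" for each month seen, in chronological order.
# Assumption: default "combined" log format, e.g.
#   1.2.3.4 - - [01/Feb/2019:17:00:01 +0000] "GET / HTTP/1.1" 200 123
requests_per_month() {
    awk '$4 ~ /^\[[0-9]+\/[A-Z][a-z]+\/[0-9]+:/ {
        # "[01/Feb/2019:17:00:01" -> t[2] = "Feb", t[3] = "2019"
        split($4, t, "[/:]")
        count[t[3] " " t[2]]++
    }
    END {
        for (m in count) print count[m], m
    }' | LC_ALL=C sort -k2,2n -k3,3M
}
```

It would be fed the same way as the commands above, for example `zcat --force /var/log/nginx/*.log* | requests_per_month`.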

2019-02-02

  • Another alert from Linode about CGSpace (linode18) this morning. Here are the top IPs in the web server logs before, during, and after that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    284 18.195.78.144
    329 207.46.13.32
    417 35.237.175.180
    448 34.218.226.147
    694 2a01:4f8:13b:1296::2
    718 2a01:4f8:140:3192::2
    786 137.108.70.14
   1002 5.9.6.51
   6077 85.25.237.71
   8726 45.5.184.2
  • 45.5.184.2 is CIAT and 85.25.237.71 is the new Linguee bot that I first noticed a few days ago
  • I will increase the Linode alert threshold from 275% to 300% because this is becoming too much!
  • I tested the Atmire Metadata Quality Module (MQM)’s duplicate checker on some WLE items that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!
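
Since the same top-IPs pipeline is being retyped for every alert with only the date and hour pattern changing, it could be wrapped in a small function. A minimal sketch, where the `top_ips` name and its argument order are my own invention and not an existing tool:

```shell
# top_ips PATTERN [N]: read access log lines on stdin, keep those that match
# the extended regex PATTERN, and print the N busiest client IPs (default 10),
# least busy first, in the same "count ip" layout as the listings above.
top_ips() {
    grep -E -- "$1" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n "${2:-10}"
}
```

For this morning's alert the call would be `zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | top_ips "02/Feb/2019:0(1|2|3|4|5)"`.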

2019-02-03

  • This is getting seriously annoying: Linode sent another alert this morning that CGSpace (linode18) load was 377%!
  • Here are the top IPs before, during, and after that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    325 85.25.237.71
    340 45.5.184.72
    431 5.143.231.8
    756 5.9.6.51
   1048 34.218.226.147
   1203 66.249.66.219
   1496 195.201.104.240
   4658 205.186.128.185
   4658 70.32.83.92
   4852 45.5.184.2
  • 45.5.184.2 is CIAT; 70.32.83.92 and 205.186.128.185 are, I think, Macaroni Bros harvesters for CCAFS
  • 195.201.104.240 is a new IP address in Germany with the following user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
  • This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent!
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
     19 03/Feb/2019:07:42
     20 03/Feb/2019:07:12
     21 03/Feb/2019:07:27
     21 03/Feb/2019:07:28
     25 03/Feb/2019:07:23
     25 03/Feb/2019:07:29
     26 03/Feb/2019:07:33
     28 03/Feb/2019:07:38
     30 03/Feb/2019:07:31
     33 03/Feb/2019:07:35
     33 03/Feb/2019:07:37
     38 03/Feb/2019:07:40
     43 03/Feb/2019:07:24
     43 03/Feb/2019:07:32
     46 03/Feb/2019:07:36
     47 03/Feb/2019:07:34
     47 03/Feb/2019:07:39
     47 03/Feb/2019:07:41
     51 03/Feb/2019:07:26
     59 03/Feb/2019:07:25
  • At least they re-used their Tomcat session!
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
1
  • This user was making requests to /browse, which is not currently covered by the existing rate limiting of dynamic pages in our nginx config
  • Run all system updates on linode20 and reboot it
    • This will be the new AReS repository explorer server soon
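
As a first step toward spotting this kind of behavior heuristically, regardless of user agent, the per-minute count above could be generalized to all clients at once. The sketch below is an assumption-laden illustration and not something running on the server: the `busy_minutes` name is hypothetical, the default limit of 30 requests per minute is arbitrary, and it assumes the default "combined" log format with zero-padded days.

```shell
# busy_minutes [LIMIT]: read access log lines on stdin and print
# "count ip minute" for every (ip, minute) pair with more than LIMIT requests.
# LIMIT defaults to 30, an arbitrary threshold chosen for illustration.
busy_minutes() {
    awk -v limit="${1:-30}" '{
        # $4 looks like "[03/Feb/2019:07:25:13"; keep "03/Feb/2019:07:25"
        minute = substr($4, 2, 17)
        hits[$1 " " minute]++
    }
    END {
        for (k in hits) if (hits[k] > limit) print hits[k], k
    }' | sort -n
}
```

Run as `zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | busy_minutes 40`, it would list each client and minute that went over 40 requests, which could then inform where to set a rate limit for /browse.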

2019-02-04

  • Generate a list of CTA subjects from CGSpace for Peter:
dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
  • Skype with Michael Victor about CKM and CGSpace
  • Discuss the new IITA research theme field with Abenet and decide that we should use cg.identifier.iitatheme