title: December, 2017
date: 2017-12-01T13:53:54+03:00
author: Alan Orth
tags: Notes

2017-12-01

  • Uptime Robot noticed that CGSpace went down
  • The logs say "Timeout waiting for idle object"
  • PostgreSQL activity says there are 115 connections currently (a sketch of the query to check this is below, after the connection list)
  • The list of connections to XMLUI and REST API for today:
# cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    763 2.86.122.76
    907 207.46.13.94
   1018 157.55.39.206
   1021 157.55.39.235
   1407 66.249.66.70
   1411 104.196.152.243
   1503 50.116.102.77
   1805 66.249.66.90
   4007 70.32.83.92
   6061 45.5.184.196
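  • For the record, the connection count above comes from PostgreSQL's pg_stat_activity view; a minimal sketch of checking it by hand, assuming local psql access (the database name dspace in the second query is an assumption):
$ psql -c 'SELECT count(*) FROM pg_stat_activity;'
$ psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'dspace';"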
  • The number of DSpace sessions isn't even that high:
$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
5815
  • Connections in the last two hours:
# cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail                                                      
     78 93.160.60.22
    101 40.77.167.122
    113 66.249.66.70
    129 157.55.39.206
    130 157.55.39.235
    135 40.77.167.58
    164 68.180.229.254
    177 87.100.118.220
    188 66.249.66.90
    314 2.86.122.76
  • What the fuck is going on?
  • I've never seen this 2.86.122.76 before; it has created quite a few unique Tomcat sessions today:
$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
822
  • Appears to be some new bot:
2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)"
  • I restarted Tomcat and everything came back up
  • I can add Indy Library to the Tomcat Crawler Session Manager Valve, but it would be nice if I could simply remap the user agent in nginx (a sketch of such a map is below, after the Drupal connection list)
  • I will also add 'Drupal' to the Tomcat Crawler Session Manager Valve, because there are Drupal sites out there harvesting us and they should be considered bots:
# cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      3 54.75.205.145
      6 70.32.83.92
     14 2a01:7e00::f03c:91ff:fe18:7396
     46 2001:4b99:1:1:216:3eff:fe2c:dc6c
    319 2001:4b99:1:1:216:3eff:fe76:205b
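  • For reference, a rough sketch of what that user agent remap could look like in nginx, using the standard map directive and proxy_set_header (the $ua variable name and the replacement value 'bot' are only illustrative, not our actual config):
# in nginx's http {} block: rewrite known library/harvester user agents to a generic bot string
map $http_user_agent $ua {
    default            $http_user_agent;
    "~Indy Library"    "bot";
    "~Drupal"          "bot";
}
# in the proxy location: send the possibly rewritten user agent upstream to Tomcat
proxy_set_header User-Agent $ua;
  • The Tomcat alternative would just mean extending the Crawler Session Manager Valve's crawlerUserAgents regex in server.xml; roughly (the regex here is only a sketch based on the valve's default value):
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Indy Library.*|.*Drupal.*" />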

2017-12-03

  • Linode alerted that CGSpace's load was 327.5% from 6 to 8 AM again

2017-12-04

  • Linode alerted that CGSpace's load was 255.5% from 8 to 10 AM again
  • I looked at the Munin stats on DSpace Test (linode02) again to see how the PostgreSQL tweaks from a few weeks ago were holding up:

DSpace Test PostgreSQL connections month

  • The results look fantastic! So the random_page_cost tweak is massively important for telling the PostgreSQL query planner that random page access is effectively no more expensive than sequential access, since we're on SSDs!
  • I guess we could probably even reduce the size of the PostgreSQL connection pool in DSpace after this
  • Run system updates on DSpace Test (linode02) and reboot it
  • I'm going to enable the PostgreSQL random_page_cost tweak on CGSpace
  • For reference, here are the past month's connections:

CGSpace PostgreSQL connections month
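  • The tweak itself is just a one-line change to the planner's cost settings; a sketch of applying it by hand with ALTER SYSTEM, assuming PostgreSQL 9.4+ and superuser access (it could equally be set in postgresql.conf), with 1.0 being a common recommendation for SSDs:
postgres=# ALTER SYSTEM SET random_page_cost = 1.0;
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();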

2017-12-05

2017-12-06

  • Linode alerted again that the CPU usage on CGSpace was high this morning from 6 to 8 AM
  • Uptime Robot alerted that the server went down and up around 8:53 this morning
  • Uptime Robot alerted that CGSpace was down and up again a few minutes later
  • I don't see any errors in the DSpace logs, but nginx's access.log shows UptimeRobot's requests being logged with HTTP 499 (Client Closed Request), which means the client gave up before the server responded
  • Looking at the REST API logs I see a new client IP that I haven't noticed before:
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     18 95.108.181.88
     19 68.180.229.254
     30 207.46.13.151
     33 207.46.13.110
     38 40.77.167.20
     41 157.55.39.223
     82 104.196.152.243
   1529 50.116.102.77
   4005 70.32.83.92
   6045 45.5.184.196
  • 50.116.102.77 is apparently in the US on websitewelcome.com
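  • For the record, a quick way to check who owns an IP like that (just standard whois and reverse DNS lookups; the grep fields are only the interesting ones):
$ whois 50.116.102.77 | grep -iE 'orgname|netname|country'
$ host 50.116.102.77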

2017-12-07

  • Uptime Robot reported a few times today that CGSpace was down and then up
  • At one point Tsega restarted Tomcat
  • I never got any alerts about high load from Linode though...
  • I looked just now and see that there are 121 PostgreSQL connections!
  • The top users right now are:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
    838 40.77.167.11
    939 66.249.66.223
   1149 66.249.66.206
   1316 207.46.13.110
   1322 207.46.13.151
   1323 2001:da8:203:2224:c912:1106:d94f:9189
   1414 157.55.39.223
   2378 104.196.152.243
   2662 66.249.66.219
   5110 124.17.34.60
  • We haven't seen 124.17.34.60 before, but it's really hammering us!
  • Apparently it is from China, and here is one of its user agents:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
  • It is responsible for about 4,500 Tomcat sessions today alone:
$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
4574
  • I've adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it's the same bot on the same subnet (a sketch of the adjusted map is at the end of this section)
  • I was running the DSpace cleanup task manually and it hit an error:
$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
...
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle".
  • The solution, as I discovered in 2017-04, is to set the primary_bitstream_id to null:
dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
UPDATE 1
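  • For reference, a rough sketch of how such a regex entry might look in that nginx map keyed on the client address (the $ua variable name and the 'bot' value are illustrative, not the actual config from last month):
# in nginx's http {} block: treat requests from this subnet's crawler IPs as a bot
map $remote_addr $ua {
    default                      $http_user_agent;
    "~^124\.17\.34\.(59|60)$"    "bot";
}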

2017-12-13

  • Linode alerted that CGSpace was using high CPU from 10:13 to 12:13 this morning