cgspace-notes/content/post/2018-01.md

252 lines
9.0 KiB
Markdown
Raw Normal View History

2018-01-02 17:52:14 +01:00
---
title: "January, 2018"
date: 2018-01-02T08:35:54-08:00
author: "Alan Orth"
tags: ["Notes"]
---
## 2018-01-02
- Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
- I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
2018-01-02 18:30:34 +01:00
- The nginx logs show HTTP 200s until `02/Jan/2018:11:27:17 +0000` when Uptime Robot got an HTTP 500
- In dspace.log around that time I see many errors like "Client closed the connection before file download was complete"
- And just before that I see this:
```
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
```
- Ah hah! So the pool was actually empty!
- I need to increase that, let's try to bump it up from 50 to 75
- After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
2018-01-02 17:52:14 +01:00
- I notice this error quite a few times in dspace.log:
```
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
```
- And there are many of these errors every day for the past month:
```
$ grep -c "Error while searching for sidebar facets" dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
```
- Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains
<!--more-->
2018-01-03 16:49:38 +01:00
## 2018-01-03
- I woke up to more up and down of CGSpace, this time UptimeRobot noticed a few rounds of up and down of a few minutes each and Linode also notified of high CPU load from 12 to 2 PM
- Looks like I need to increase the database pool size again:
```
$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
```
- For some reason there were a lot of "active" connections last night:
![CGSpace PostgreSQL connections](/cgspace-notes/2018/01/postgres_connections-day.png)
- The active IPs in XMLUI are:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
607 40.77.167.141
611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
663 188.226.169.37
759 157.55.39.245
887 68.180.229.254
1037 157.55.39.175
1068 216.244.66.245
1495 66.249.64.91
1934 104.196.152.243
2219 134.155.96.78
```
- 134.155.96.78 appears to be at the University of Mannheim in Germany
- They identify as: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://ifm.uni-mannheim.de)
- This appears to be the [Internet Archive's open source bot](https://github.com/internetarchive/heritrix3)
- They seem to be re-using their Tomcat session so I don't need to do anything to them just yet:
```
$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2
```
- The API logs show the normal users:
```
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
32 207.46.13.182
38 40.77.167.132
38 68.180.229.254
43 66.249.64.91
46 40.77.167.141
49 157.55.39.245
79 157.55.39.175
1533 50.116.102.77
4069 70.32.83.92
9355 45.5.184.196
```
- In other related news I see a sizeable amount of requests coming from python-requests
- For example, just in the last day there were 1700!
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
1773
```
- But they come from hundreds of IPs, many of which are 54.x.x.x:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
9 54.144.87.92
9 54.146.222.143
9 54.146.249.249
9 54.158.139.206
9 54.161.235.224
9 54.163.41.19
9 54.163.4.51
9 54.196.195.107
9 54.198.89.134
9 54.80.158.113
10 54.198.171.98
10 54.224.53.185
10 54.226.55.207
10 54.227.8.195
10 54.242.234.189
10 54.242.238.209
10 54.80.100.66
11 54.161.243.121
11 54.205.154.178
11 54.234.225.84
11 54.87.23.173
11 54.90.206.30
12 54.196.127.62
12 54.224.242.208
12 54.226.199.163
13 54.162.149.249
13 54.211.182.255
19 50.17.61.150
21 54.211.119.107
139 164.39.7.62
```
- I have no idea what these are but they seem to be coming from Amazon...
- I guess for now I just have to increase the database connection pool's max active
- It's currently 75 and normally I'd just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling
2018-01-04 17:02:36 +01:00
## 2018-01-04
- CGSpace went down and up a bunch of times last night and ILRI staff were complaining a lot last night
- The XMLUI logs show this activity:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
968 197.211.63.81
981 213.55.99.121
1039 66.249.64.93
1258 157.55.39.175
1273 207.46.13.182
1311 157.55.39.191
1319 157.55.39.197
1775 66.249.64.78
2216 104.196.152.243
3366 66.249.64.91
```
- Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!
```
2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
```
- So for this week that is the number one problem!
```
$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559
```
- I will just bump the connection limit to 300 because I'm fucking fed up with this shit
- Once I get back to Amman I will have to try to create different database pools for different web applications, like recently discussed on the dspace-tech mailing list
2018-01-04 17:33:26 +01:00
- Create accounts on CGSpace for two CTA staff km4ard@cta.int and bheenick@cta.int
2018-01-05 16:05:14 +01:00
## 2018-01-05
- Peter said that CGSpac was down last night and Tsega restarted Tomcat
- I don't see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:
```
$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559
dspace.log.2018-01-05:0
```
2018-01-05 16:10:49 +01:00
- Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space
- I had a look and there is one Apache 2 log file that is 73GB, with lots of this:
```
[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
```
- I will delete the log file for now and tell Danny
2018-01-05 16:26:23 +01:00
- Also, I'm still seeing a hundred or so of the "ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer" errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is