- I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
## 2018-01-31
- UptimeRobot says CGSpace went down at 7:57 AM, and indeed I see a lot of HTTP 499 codes in nginx logs
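- To quantify it, something like this should count the 499s during the incident window (a sketch; it assumes nginx's default combined log format, where the status code is the ninth whitespace-separated field):
```
# grep -h -E "31/Jan/2018:(07|08)" /var/log/nginx/access.log /var/log/nginx/access.log.1 | awk '$9 == 499' | wc -l
```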
- PostgreSQL activity shows 222 database connections
- Now PostgreSQL activity shows 265 database connections!
- I don't see any errors anywhere...
- Now PostgreSQL activity shows 308 connections!
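- A quick way to keep an eye on that number is to count the rows in pg_stat_activity (a sketch; it assumes the database is called dspace and that psql works locally as the postgres user):
```
# sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname='dspace';"
```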
- Well this is interesting, there are 400 Tomcat threads busy:
```
# munin-run tomcat_threads
busy.value 400
idle.value 0
max.value 400
```
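- A JVM thread dump would show what all of those busy threads are actually doing (a sketch; the tomcat7 user and the pgrep pattern for finding the Tomcat PID are assumptions about this setup):
```
# sudo -u tomcat7 jstack -l $(pgrep -f org.apache.catalina.startup.Bootstrap) > /tmp/tomcat-threads.txt
```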
- And wow, we have finally exhausted the database connections, as seen in dspace.log:
```
2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
```
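- Counting those errors gives a rough sense of how long the pool was exhausted (a sketch, run from the DSpace log directory, wherever that lives on this server):
```
# grep -c 'SQL connection Error' dspace.log.2018-01-31
```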
- Now even the nightly Atmire background thing is getting an HTTP 500 error:
```
Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
```
- For now I will restart Tomcat to clear this shit and bring the site back up
- The top IPs from this morning, between 7 and 8 AM, in XMLUI and REST/OAI:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
67 66.249.66.70
70 207.46.13.12
71 197.210.168.174
83 207.46.13.13
85 157.55.39.79
89 207.46.13.14
123 68.180.228.157
198 66.249.66.90
219 41.204.190.40
255 2405:204:a208:1e12:132:2a8e:ad28:46c0
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
2 65.55.210.187
2 66.249.66.90
3 157.55.39.79
4 197.232.39.92
4 34.216.252.127
6 104.196.152.243
6 213.55.85.89
15 122.52.115.13
16 213.55.107.186
596 45.5.184.196
```
- This looks reasonable to me, so I have no idea why we ran out of Tomcat threads
![Tomcat threads](/cgspace-notes/2018/01/tomcat-threads-day.png)
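- The only client that really stands out is 45.5.184.196 on the REST API, so it might be worth checking which user agent it is sending (a sketch; it assumes nginx's default combined log format, where the user agent is the sixth double-quoted field):
```
# grep -h 45.5.184.196 /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "31/Jan/2018:(07|08)" | awk -F'"' '{print $6}' | sort | uniq -c
```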
- We need to start graphing the Tomcat sessions as well, though that requires JMX
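- Exposing JMX should mostly be a matter of adding a few flags to CATALINA_OPTS in Tomcat's setenv.sh, something like this (a sketch; the port and the choice to allow unauthenticated access only on localhost are assumptions):
```
# expose a JMX listener on localhost only, so Munin can poll Tomcat sessions and threads
CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=true \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```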
- Also, I wonder if I could disable the nightly Atmire thing
- God, I don't know where this load is coming from
- Since I bumped up the Tomcat threads from 200 to 400, the load on the server has been sustained at about 200%:
![CPU usage week](/cgspace-notes/2018/01/cpu-week.png)
- I should make separate database pools for the web applications and the API applications like REST and OAI
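- If each pool then connects as its own PostgreSQL user, it would be easy to see which application is eating the connections (a sketch; the dspaceWeb and dspaceApi role names are hypothetical):
```
# sudo -u postgres psql -c 'SELECT * FROM pg_stat_activity;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c
```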