- I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
## 2018-01-31
- UptimeRobot says CGSpace went down at 7:57 AM, and indeed I see a lot of HTTP 499 codes in nginx logs
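- To quantify it, something like this should count the 499s during the incident window (a sketch; it assumes nginx's default combined log format, where the status code is the ninth whitespace-separated field):
```
# grep -h -E "31/Jan/2018:(07|08)" /var/log/nginx/access.log /var/log/nginx/access.log.1 | awk '$9 == 499' | wc -l
```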
- PostgreSQL activity shows 222 database connections
- Now PostgreSQL activity shows 265 database connections!
- I don't see any errors anywhere...
- Now PostgreSQL activity shows 308 connections!
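- A quick way to keep an eye on that number is to count the rows in pg_stat_activity (a sketch; it assumes the database is called dspace and that psql works locally as the postgres user):
```
# sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity WHERE datname='dspace';"
```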
- Well this is interesting, there are 400 Tomcat threads busy:
```
# munin-run tomcat_threads
busy.value 400
idle.value 0
max.value 400
```
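- A JVM thread dump would show what all of those busy threads are actually doing (a sketch; the tomcat7 user and the pgrep pattern for finding the Tomcat PID are assumptions about this setup):
```
# sudo -u tomcat7 jstack -l $(pgrep -f org.apache.catalina.startup.Bootstrap) > /tmp/tomcat-threads.txt
```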
- And wow, we have finally exhausted the database connections, as seen in dspace.log:
```
2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
```
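- Counting those errors gives a rough sense of how long the pool was exhausted (a sketch, run from the DSpace log directory, wherever that lives on this server):
```
# grep -c 'SQL connection Error' dspace.log.2018-01-31
```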
- Now even the nightly Atmire background thing is getting an HTTP 500 error:
```
Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
```
- For now I will restart Tomcat to clear this shit and bring the site back up
- The top IPs from this morning, between 7 and 8 AM, in XMLUI and REST/OAI:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
67 66.249.66.70
70 207.46.13.12
71 197.210.168.174
83 207.46.13.13
85 157.55.39.79
89 207.46.13.14
123 68.180.228.157
198 66.249.66.90
219 41.204.190.40
255 2405:204:a208:1e12:132:2a8e:ad28:46c0
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
2 65.55.210.187
2 66.249.66.90
3 157.55.39.79
4 197.232.39.92
4 34.216.252.127
6 104.196.152.243
6 213.55.85.89
15 122.52.115.13
16 213.55.107.186
596 45.5.184.196
```
- This looks reasonable to me, so I have no idea why we ran out of Tomcat threads
![Tomcat threads](/cgspace-notes/2018/01/tomcat-threads-day.png)
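- The only client that really stands out is 45.5.184.196 on the REST API, so it might be worth checking which user agent it is sending (a sketch; it assumes nginx's default combined log format, where the user agent is the sixth double-quoted field):
```
# grep -h 45.5.184.196 /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "31/Jan/2018:(07|08)" | awk -F'"' '{print $6}' | sort | uniq -c
```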
- We need to start graphing the Tomcat sessions as well, though that requires JMX
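- Exposing JMX should mostly be a matter of adding a few flags to CATALINA_OPTS in Tomcat's setenv.sh, something like this (a sketch; the port and the choice to allow unauthenticated access only on localhost are assumptions):
```
# expose a JMX listener on localhost only, so Munin can poll Tomcat sessions and threads
CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=true \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```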
- Also, I wonder if I could disable the nightly Atmire thing
- God, I don't know where this load is coming from
- Since I bumped up the Tomcat threads from 200 to 400, the load on the server has been sustained at about 200%:
![CPU usage week](/cgspace-notes/2018/01/cpu-week.png)
- I should make separate database pools for the web applications and the API applications like REST and OAI
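- If each pool then connects as its own PostgreSQL user, it would be easy to see which application is eating the connections (a sketch; the dspaceWeb and dspaceApi role names are hypothetical):
```
# sudo -u postgres psql -c 'SELECT * FROM pg_stat_activity;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c
```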