diff --git a/content/post/2018-01.md b/content/post/2018-01.md
index 7fae0a865..1a58bfb67 100644
--- a/content/post/2018-01.md
+++ b/content/post/2018-01.md
@@ -1288,3 +1288,75 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace" maxActiv
 ```
 - I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
+
+## 2018-01-31
+
+- UptimeRobot says CGSpace went down at 7:57 AM, and indeed I see a lot of HTTP 499 codes in nginx logs
+- PostgreSQL activity shows 222 database connections
+- Now PostgreSQL activity shows 265 database connections!
+- I don't see any errors anywhere...
+- Now PostgreSQL activity shows 308 connections!
+- Well this is interesting, there are 400 Tomcat threads busy:
+
+```
+# munin-run tomcat_threads
+busy.value 400
+idle.value 0
+max.value 400
+```
+
+- And wow, we finally exhausted the database connections, from dspace.log:
+
+```
+2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
+org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
+```
+
+- Now even the nightly Atmire background thing is getting HTTP 500 error:
+
+```
+Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
+SEVERE: Mapped exception to response: 500 (Internal Server Error)
+javax.ws.rs.WebApplicationException
+```
+
+- For now I will restart Tomcat to clear this shit and bring the site back up
+- The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:
+
+```
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+     67 66.249.66.70
+     70 207.46.13.12
+     71 197.210.168.174
+     83 207.46.13.13
+     85 157.55.39.79
+     89 207.46.13.14
+    123 68.180.228.157
+    198 66.249.66.90
+    219 41.204.190.40
+    255 2405:204:a208:1e12:132:2a8e:ad28:46c0
+# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+      2 65.55.210.187
+      2 66.249.66.90
+      3 157.55.39.79
+      4 197.232.39.92
+      4 34.216.252.127
+      6 104.196.152.243
+      6 213.55.85.89
+     15 122.52.115.13
+     16 213.55.107.186
+    596 45.5.184.196
+```
+
+- This looks reasonable to me, so I have no idea why we ran out of Tomcat threads
+
+![Tomcat threads](/cgspace-notes/2018/01/tomcat-threads-day.png)
+
+- We need to start graphing the Tomcat sessions as well, though that requires JMX
+- Also, I wonder if I could disable the nightly Atmire thing
+- God, I don't know where this load is coming from
+- Since I bumped up the Tomcat threads from 200 to 400 the load on the server has been sustained at about 200%:
+
+![CPU usage week](/cgspace-notes/2018/01/cpu-week.png)
+
+- I should make separate database pools for the web applications and the API applications like REST and OAI
diff --git a/public/2018-01/index.html b/public/2018-01/index.html
index a41cc2347..91f4c5de1 100644
--- a/public/2018-01/index.html
+++ b/public/2018-01/index.html
@@ -92,7 +92,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
-
+
@@ -194,9 +194,9 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
   "@type": "BlogPosting",
   "headline": "January, 2018",
   "url": "https://alanorth.github.io/cgspace-notes/2018-01/",
-  "wordCount": "7537",
+  "wordCount": "7885",
   "datePublished": "2018-01-02T08:35:54-08:00",
-  "dateModified": "2018-01-29T09:47:55+02:00",
+  "dateModified": "2018-01-29T12:25:30+02:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -1688,6 +1688,88 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"
# munin-run tomcat_threads
+busy.value 400
+idle.value 0
+max.value 400
+
+
+2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
+org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
+
+
+Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
+SEVERE: Mapped exception to response: 500 (Internal Server Error)
+javax.ws.rs.WebApplicationException
+
+
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 67 66.249.66.70
+ 70 207.46.13.12
+ 71 197.210.168.174
+ 83 207.46.13.13
+ 85 157.55.39.79
+ 89 207.46.13.14
+ 123 68.180.228.157
+ 198 66.249.66.90
+ 219 41.204.190.40
+ 255 2405:204:a208:1e12:132:2a8e:ad28:46c0
+# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 2 65.55.210.187
+ 2 66.249.66.90
+ 3 157.55.39.79
+ 4 197.232.39.92
+ 4 34.216.252.127
+ 6 104.196.152.243
+ 6 213.55.85.89
+ 15 122.52.115.13
+ 16 213.55.107.186
+ 596 45.5.184.196
+
+
+
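
The closing idea — separate database pools for the web applications and the API applications — could look something like this as JNDI resources in Tomcat's `context.xml`. The resource names, credentials, and pool sizes below are illustrative assumptions (only `jdbc/dspace` appears in the notes), and each DSpace web application would still need to be pointed at its own JNDI name:

```
<!-- Hypothetical per-application pools; sizes sum to less than PostgreSQL's
     max_connections so the API cannot starve the XMLUI, or vice versa -->
<Resource name="jdbc/dspace-xmlui" auth="Container"
          type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/dspace"
          username="dspace" password="dspace"
          maxActive="250" maxIdle="20"/>

<Resource name="jdbc/dspace-api" auth="Container"
          type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/dspace"
          username="dspace" password="dspace"
          maxActive="50" maxIdle="10"/>
```

With split pools, a crawler hammering REST/OAI would exhaust only `jdbc/dspace-api` while the XMLUI keeps serving readers, instead of the single shared pool hitting `busy:300` for everyone as in the dspace.log error above.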
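
The `munin-run tomcat_threads` output recorded in the notes can be turned into a quick saturation check; a minimal sketch with the observed values hard-coded (the threshold logic and variable names are mine, not from the notes — in practice you would capture live `munin-run tomcat_threads` output instead):

```shell
# munin-run values as recorded during the outage (hard-coded for illustration)
output="busy.value 400
idle.value 0
max.value 400"

# Extract the busy and max thread counts from the plugin output
busy=$(printf '%s\n' "$output" | awk '/^busy.value/ {print $2}')
max=$(printf '%s\n' "$output" | awk '/^max.value/ {print $2}')

# busy == max means the connector's thread pool is saturated and new
# requests will queue or be refused, which matches the outage symptoms
if [ "$busy" -ge "$max" ]; then
    echo "thread pool exhausted: $busy/$max"
fi
```

A check like this could run from cron or a monitoring agent to alert before UptimeRobot notices the site is down.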
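
The top-IPs pipeline used on the nginx logs works against any access log in common log format; a self-contained sketch with a fabricated three-line log at a hypothetical path, filtering the same 07:00–08:59 window:

```shell
# Fabricated sample log (IPs and timestamps are illustrative, not real traffic)
cat > /tmp/sample-access.log <<'EOF'
45.5.184.196 - - [31/Jan/2018:07:01:22 +0000] "GET /rest/items HTTP/1.1" 200 512
45.5.184.196 - - [31/Jan/2018:08:14:09 +0000] "GET /rest/items HTTP/1.1" 200 512
66.249.66.70 - - [31/Jan/2018:09:30:00 +0000] "GET /robots.txt HTTP/1.1" 200 24
EOF

# Same pipeline as in the notes: keep only 7-8 AM hits, then count per client IP
# (the 09:30 request falls outside the window and is dropped)
grep -E "31/Jan/2018:(07|08)" /tmp/sample-access.log | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
```

The first awk field is the client IP in common log format, so `uniq -c` after `sort` yields a per-IP hit count, and the final `sort -n | tail` surfaces the heaviest clients.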