diff --git a/content/post/2018-01.md b/content/post/2018-01.md index 7fae0a865..1a58bfb67 100644 --- a/content/post/2018-01.md +++ b/content/post/2018-01.md @@ -1288,3 +1288,75 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace" maxActiv ``` - I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566 + +## 2018-01-31 + +- UptimeRobot says CGSpace went down at 7:57 AM, and indeed I see a lot of HTTP 499 codes in nginx logs +- PostgreSQL activity shows 222 database connections +- Now PostgreSQL activity shows 265 database connections! +- I don't see any errors anywhere... +- Now PostgreSQL activity shows 308 connections! +- Well this is interesting, there are 400 Tomcat threads busy: + +``` +# munin-run tomcat_threads +busy.value 400 +idle.value 0 +max.value 400 +``` + +- And wow, we finally exhausted the database connections, from dspace.log: + +``` +2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - +org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000]. +``` + +- Now even the nightly Atmire background thing is getting HTTP 500 error: + +``` +Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException +SEVERE: Mapped exception to response: 500 (Internal Server Error) +javax.ws.rs.WebApplicationException +``` + +- For now I will restart Tomcat to clear this shit and bring the site back up +- The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI: + +``` +# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 + 67 66.249.66.70 + 70 207.46.13.12 + 71 197.210.168.174 + 83 207.46.13.13 + 85 157.55.39.79 + 89 207.46.13.14 + 123 68.180.228.157 + 198 66.249.66.90 + 219 41.204.190.40 + 255 2405:204:a208:1e12:132:2a8e:ad28:46c0 +# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 + 2 65.55.210.187 + 2 66.249.66.90 + 3 157.55.39.79 + 4 197.232.39.92 + 4 34.216.252.127 + 6 104.196.152.243 + 6 213.55.85.89 + 15 122.52.115.13 + 16 213.55.107.186 + 596 45.5.184.196 +``` + +- This looks reasonable to me, so I have no idea why we ran out of Tomcat threads + +![Tomcat threads](/cgspace-notes/2018/01/tomcat-threads-day.png) + +- We need to start graphing the Tomcat sessions as well, though that requires JMX +- Also, I wonder if I could disable the nightly Atmire thing +- God, I don't know where this load is coming from +- Since I bumped up the Tomcat threads from 200 to 400 the load on the server has been sustained at about 200%: + +![CPU usage week](/cgspace-notes/2018/01/cpu-week.png) + +- I should make separate database pools for the web applications and the API applications like REST and OAI diff --git a/public/2018-01/index.html b/public/2018-01/index.html index a41cc2347..91f4c5de1 100644 --- a/public/2018-01/index.html +++ b/public/2018-01/index.html @@ -92,7 +92,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv - + @@ -194,9 +194,9 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv "@type": "BlogPosting", "headline": "January, 2018", "url": "https://alanorth.github.io/cgspace-notes/2018-01/", - "wordCount": "7537", + "wordCount": "7885", "datePublished": "2018-01-02T08:35:54-08:00", - "dateModified": "2018-01-29T09:47:55+02:00", + "dateModified": "2018-01-29T12:25:30+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -1688,6 +1688,88 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"
  • I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
  • +

    2018-01-31

    + + + +
    # munin-run tomcat_threads
    +busy.value 400
    +idle.value 0
    +max.value 400
    +
    + + + +
    2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
    +org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
    +
    + + + +
    Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
    +SEVERE: Mapped exception to response: 500 (Internal Server Error)
    +javax.ws.rs.WebApplicationException
    +
    + + + +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +     67 66.249.66.70
    +     70 207.46.13.12
    +     71 197.210.168.174
    +     83 207.46.13.13
    +     85 157.55.39.79
    +     89 207.46.13.14
    +    123 68.180.228.157
    +    198 66.249.66.90
    +    219 41.204.190.40
    +    255 2405:204:a208:1e12:132:2a8e:ad28:46c0
    +# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +      2 65.55.210.187
    +      2 66.249.66.90
    +      3 157.55.39.79
    +      4 197.232.39.92
    +      4 34.216.252.127
    +      6 104.196.152.243
    +      6 213.55.85.89
    +     15 122.52.115.13
    +     16 213.55.107.186
    +    596 45.5.184.196
    +
    + + + +

    Tomcat threads

    + + + +

    CPU usage week

    + + + diff --git a/public/2018/01/cpu-week.png b/public/2018/01/cpu-week.png new file mode 100644 index 000000000..866cc3086 Binary files /dev/null and b/public/2018/01/cpu-week.png differ diff --git a/public/2018/01/tomcat-threads-day.png b/public/2018/01/tomcat-threads-day.png new file mode 100644 index 000000000..e14507228 Binary files /dev/null and b/public/2018/01/tomcat-threads-day.png differ diff --git a/public/sitemap.xml b/public/sitemap.xml index ef9d1ff7f..f45dd0f0f 100644 --- a/public/sitemap.xml +++ b/public/sitemap.xml @@ -4,7 +4,7 @@ https://alanorth.github.io/cgspace-notes/2018-01/ - 2018-01-29T09:47:55+02:00 + 2018-01-29T12:25:30+02:00 @@ -144,7 +144,7 @@ https://alanorth.github.io/cgspace-notes/ - 2018-01-29T09:47:55+02:00 + 2018-01-29T12:25:30+02:00 0 @@ -155,7 +155,7 @@ https://alanorth.github.io/cgspace-notes/tags/notes/ - 2018-01-29T09:47:55+02:00 + 2018-01-29T12:25:30+02:00 0 @@ -167,13 +167,13 @@ https://alanorth.github.io/cgspace-notes/post/ - 2018-01-29T09:47:55+02:00 + 2018-01-29T12:25:30+02:00 0 https://alanorth.github.io/cgspace-notes/tags/ - 2018-01-29T09:47:55+02:00 + 2018-01-29T12:25:30+02:00 0 diff --git a/static/2018/01/cpu-week.png b/static/2018/01/cpu-week.png new file mode 100644 index 000000000..866cc3086 Binary files /dev/null and b/static/2018/01/cpu-week.png differ diff --git a/static/2018/01/tomcat-threads-day.png b/static/2018/01/tomcat-threads-day.png new file mode 100644 index 000000000..e14507228 Binary files /dev/null and b/static/2018/01/tomcat-threads-day.png differ