From 389ab31e63766600c73980ffa0c9b9e5f5acddbb Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Wed, 7 Feb 2018 18:33:26 +0200 Subject: [PATCH] Update notes for 2018-02-07 --- content/post/2018-02.md | 53 +++++++++++++++++++++++++++++++ public/2018-02/index.html | 66 +++++++++++++++++++++++++++++++++++++-- public/sitemap.xml | 10 +++--- 3 files changed, 121 insertions(+), 8 deletions(-) diff --git a/content/post/2018-02.md b/content/post/2018-02.md index 43872534f..960f446bb 100644 --- a/content/post/2018-02.md +++ b/content/post/2018-02.md @@ -139,3 +139,56 @@ Tue Feb 6 09:30:32 UTC 2018 - According to [the announcement](https://groups.google.com/forum/#!topic/orcid-api-users/qfg-HwAB1bk) the v1 API was moved from `http://pub.orcid.org/` to `https://pub.orcid.org/v1.2` until March 1st when it will be discontinued for good - But the old URL is hard coded in DSpace and it doesn't work anyways, because it currently redirects you to `https://pub.orcid.org/v2.0/v1.2` - So I guess we have to disable that shit once and for all and switch to a controlled vocabulary +- CGSpace crashed again, this time around `Wed Feb 7 11:20:28 UTC 2018` +- I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own: + +``` +$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt +$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity* +/tmp/pg_stat_activity1.txt:300 +/tmp/pg_stat_activity2.txt:272 +/tmp/pg_stat_activity3.txt:168 +/tmp/pg_stat_activity4.txt:5 +/tmp/pg_stat_activity5.txt:6 +``` + +- Interestingly, all of those 751 connections were idle! + +``` +$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle +751 +``` + +- Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps +- Looking the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!): + +![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace-sessions-day.png) + +- Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM: + +``` +$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l +1828 +``` + +- CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme) +- What's interesting is that the DSpace log says the connections are all busy: + +``` +org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000]. +``` + +- ... but in PostgreSQL I see them `idle` or `idle in transaction`: + +``` +$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb +250 +$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle +250 +$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction" +187 +``` + +- What the fuck, does DSpace think all connections are busy? +- I suspect these are issues with abandoned connections or maybe a leak, so I'm going to try adding the `removeAbandoned='true'` parameter which is apparently off by default +- I will try `testOnReturn='true'` too, just to add more validation, because I'm fucking grasping at straws diff --git a/public/2018-02/index.html b/public/2018-02/index.html index 0797c6b07..c597bf194 100644 --- a/public/2018-02/index.html +++ b/public/2018-02/index.html @@ -23,7 +23,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl - + @@ -57,9 +57,9 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl "@type": "BlogPosting", "headline": "February, 2018", "url": "https://alanorth.github.io/cgspace-notes/2018-02/", - "wordCount": "995", + "wordCount": "1334", "datePublished": "2018-02-01T16:28:54+02:00", - "dateModified": "2018-02-06T14:03:46+02:00", + "dateModified": "2018-02-07T12:34:36+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -274,6 +274,66 @@ Tue Feb 6 09:30:32 UTC 2018
  • According to the announcement the v1 API was moved from http://pub.orcid.org/ to https://pub.orcid.org/v1.2 until March 1st when it will be discontinued for good
  • But the old URL is hard coded in DSpace and it doesn’t work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
  • So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
  • +
  • CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
  • +
  • I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:
  • + + +
    $ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
    +$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
    +/tmp/pg_stat_activity1.txt:300
    +/tmp/pg_stat_activity2.txt:272
    +/tmp/pg_stat_activity3.txt:168
    +/tmp/pg_stat_activity4.txt:5
    +/tmp/pg_stat_activity5.txt:6
    +
    + + + +
    $ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
    +751
    +
    + + + +

    DSpace Sessions

    + + + +
    $ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +1828
    +
    + + + +
    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    +
    + + + +
    $ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
    +250
    +$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
    +250
    +$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
    +187
    +
    + + diff --git a/public/sitemap.xml b/public/sitemap.xml index f660c27f1..3ff2c79ad 100644 --- a/public/sitemap.xml +++ b/public/sitemap.xml @@ -4,7 +4,7 @@ https://alanorth.github.io/cgspace-notes/2018-02/ - 2018-02-06T14:03:46+02:00 + 2018-02-07T12:34:36+02:00 @@ -149,7 +149,7 @@ https://alanorth.github.io/cgspace-notes/ - 2018-02-06T14:03:46+02:00 + 2018-02-07T12:34:36+02:00 0 @@ -160,7 +160,7 @@ https://alanorth.github.io/cgspace-notes/tags/notes/ - 2018-02-06T14:03:46+02:00 + 2018-02-07T12:34:36+02:00 0 @@ -172,13 +172,13 @@ https://alanorth.github.io/cgspace-notes/post/ - 2018-02-06T14:03:46+02:00 + 2018-02-07T12:34:36+02:00 0 https://alanorth.github.io/cgspace-notes/tags/ - 2018-02-06T14:03:46+02:00 + 2018-02-07T12:34:36+02:00 0