2015-11-22
- CGSpace went down
- Looks like DSpace exhausted its PostgreSQL connection pool
- Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
- For now I have increased the limit from 60 to 90, run updates, and rebooted the server
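The grep pipeline above counts every row that mentions both idle and cgspace, which also picks up sessions in the "idle in transaction" state. A minimal sketch of a per-state breakdown, assuming a PostgreSQL version whose pg_stat_activity view has a state column (9.2 or later); the helper name count_by_state is hypothetical:

```shell
# Count connections per state for one database.
# Expects "datname|state" rows on stdin, for example from:
#   psql -At -c 'SELECT datname, state FROM pg_stat_activity;'
# (-A: unaligned output, -t: tuples only; "|" is psql's default field separator)
count_by_state() {
    # $1 = database name to filter on
    awk -F'|' -v db="$1" '
        $1 == db { n[$2]++ }
        END { for (s in n) print s ": " n[s] }
    ' | LC_ALL=C sort
}
```

Usage would look something like: psql -At -c 'SELECT datname, state FROM pg_stat_activity;' | count_by_state cgspace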
2015-11-24
- CGSpace went down again
- Getting emails from uptimeRobot and uptimeButler that it’s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors
- Looks like there are still a bunch of idle PostgreSQL connections:
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
96
- For some reason the number of idle connections has been very high since we upgraded to DSpace 5
2015-11-25
- Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config
- The OAI application requests stylesheets and JavaScript files with the path /oai/static/css, which gets matched here:
# static assets we can load from the file system directly with nginx
location ~ /(themes|static|aspects/ReportingSuite) {
try_files $uri @tomcat;
...
- The document root is relative to the xmlui app, so this gets a 404; I’m not sure why it doesn’t pass to @tomcat
- Anyway, I can’t find any URIs with path /static, and the more important point is to handle all the static theme assets, so we can just remove static from the regex for now (who cares if we can’t use nginx to send ETags for OAI CSS!)
- Also, I noticed we aren’t setting CSP headers on the static assets: in nginx, headers are inherited by child blocks, but a child block that uses add_header itself no longer inherits the others
- We simply need to add include extra-security.conf; to the above location block (but research and test first)
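One way to test this before and after the change is to look at the response headers of a static asset. A minimal sketch, assuming the header of interest is Content-Security-Policy; the helper name has_header and the asset path in the usage comment are hypothetical:

```shell
# Succeeds if the response headers on stdin contain the named header.
# Header names are case-insensitive, hence grep -i.
has_header() {
    # $1 = header name, without the trailing colon
    grep -qi -- "^$1:"
}

# Example (the asset path is made up):
#   curl -sI https://cgspace.cgiar.org/themes/example/style.css \
#       | has_header Content-Security-Policy && echo "CSP present"
```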
- We should add WOFF assets to the list of things to set expires for:
location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
- We should also add aspects/Statistics to the location block for static assets (minus static from above):
location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
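The effect of the regex change can be checked outside nginx. This is only an approximation: nginx matches with PCRE, but for simple alternations like these grep -E behaves the same. The helper names and the second and third test paths are made up for illustration:

```shell
# Approximate nginx "location ~ <regex>" matching with grep -E.
matches() {
    # $1 = regex, $2 = request path
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

check() {
    # $1 = label, $2 = regex, $3 = request path
    if matches "$2" "$3"; then
        echo "$1: $3 -> match"
    else
        echo "$1: $3 -> no match"
    fi
}

old='/(themes|static|aspects/ReportingSuite)'
new='/(themes|aspects/ReportingSuite|aspects/Statistics)'

# /oai/static/css is from the notes above; the other two paths are hypothetical.
for path in /oai/static/css /themes/Example/style.css /aspects/Statistics/example.js; do
    check old "$old" "$path"
    check new "$new" "$path"
done
```

With the new regex the OAI path no longer matches, so it should fall through to whatever location would otherwise handle it.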
- Need to check /about on CGSpace, as it’s blank on my local test server and we might need to add something there
- CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
93
- I looked closer at the idle connections and saw that many have been idle for hours (current time on server is 2015-11-25T20:20:42+0000):
$ psql -c 'SELECT * from pg_stat_activity;' | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
20951 | cgspace | 10967 | 18205 | cgspace | | 127.0.0.1 | | 37737 | 2015-11-25 13:13:03.069421+00 | | 20
...
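Rather than scanning the wide table by eye, the idle time can be computed per session. A sketch, assuming the state and state_change columns of pg_stat_activity (PostgreSQL 9.2 or later); the helper name long_idle is hypothetical:

```shell
# Print "pid seconds" for sessions that have been idle for at least $1 seconds.
# Expects "pid|state|idle_seconds" rows on stdin, for example from:
#   psql -At -c "SELECT pid, state, extract(epoch FROM now() - state_change)::int
#                FROM pg_stat_activity WHERE datname = 'cgspace';"
long_idle() {
    # Only rows whose state is exactly "idle" are reported;
    # "idle in transaction" and "active" are skipped.
    awk -F'|' -v min="$1" '$2 == "idle" && $3 + 0 >= min + 0 { print $1, $3 }'
}
```

For example, piping the query above into long_idle 3600 would list sessions idle for an hour or more.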
- There is a relevant Jira issue about this: https://jira.duraspace.org/browse/DS-1458
- It seems there is some sense in changing DSpace’s default db.maxidle from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)
- Change db.maxidle from -1 to 10, reduce db.maxconnections from 90 to 50, and restart postgres and tomcat7
- Also redeploy DSpace Test with a clean sync of CGSpace and mirror these database settings there as well
- Also deploy the nginx fixes for the try_files location block as well as the expires block
2015-11-26
- CGSpace behaving much better since changing db.maxidle yesterday, but still two up/down notices from monitoring this morning (better than 50!)
- CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item
- Not as bad for me, but still unsustainable if you have to get many:
$ curl -o /dev/null -s -w '%{time_total}\n' 'https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all'
8.415
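A single timing is noisy, so it may be worth repeating the request and summarising the results. A minimal sketch; the helper name summarize_times and the choice of five runs are assumptions:

```shell
# Summarise request times (one value in seconds per line on stdin).
summarize_times() {
    LC_ALL=C awk '
        { sum += $1; if (NR == 1 || $1 + 0 > max) max = $1 + 0 }
        END { if (NR) printf "n=%d mean=%.3f max=%.3f\n", NR, sum / NR, max }
    '
}

# Example: time the same request five times.
#   for i in 1 2 3 4 5; do
#       curl -o /dev/null -s -w '%{time_total}\n' \
#           'https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all'
#   done | summarize_times
```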
- Monitoring e-mailed in the evening to say CGSpace was down
- Idle connections in PostgreSQL again:
$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
66
- At the time, the current DSpace pool size was 50…
- I reduced the pool back to the default of 30, and reduced the db.maxidle setting from 10 to 8
2015-11-29
- Still more alerts that CGSpace has been up and down all day
- Current database settings for DSpace:
db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true
$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
49
- Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace’s thirst can ever be quenched
- On another note, SUNScholar’s notes suggest adjusting some other postgres variables: http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database
- This might help with REST API speed (which I mentioned above; I still need to run real tests)