CGSpace Notes

February, 2016

Fri, 05 Feb 2016 13:18:00 +0300

2016-02-05

Looking at some DAGRIS data for Abenet Yabowork
Lots of issues with spaces, newlines, etc causing the import to fail
I noticed we have a very interesting list of countries on CGSpace:

Not only are there 49,000 countries, we have some blanks (25)…
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
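A quick way to get a picture of a list like this is to trim stray whitespace and tally the values. This is only a rough sketch: it assumes the country values have been exported one per line, and the function name `tally_values` is made up for illustration.

```shell
# Hypothetical helper: read one value per line on stdin, strip carriage
# returns, trim leading/trailing whitespace, collapse internal runs of
# whitespace, and print each distinct value with its count (highest first).
tally_values() {
    tr -d '\r' \
        | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//; s/[[:space:]]+/ /g' \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2
}
```

Blank values show up as a count with nothing after it. Variants that differ by more than whitespace (like the two spellings above) still need a manual mapping.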

January, 2016

Wed, 13 Jan 2016 13:18:00 +0300

2016-01-13

Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.

2016-01-14

Update CCAFS project identifiers in input-forms.xml
Run system updates and restart the server

2016-01-18

Change “Extension material” to “Extension Material” in input-forms.xml (a mistake that fell through the cracks when we fixed the others in the DSpace 4 era)

2016-01-19

Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme’s primary color (#157)
Tweak date-based facets to show more values in drill-down ranges (#162)
Need to remember to clear the Cocoon cache after deployment or else you don’t see the new ranges immediately
Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account
Altmetrics’ support for Handles is kinda weak, so they can’t associate our items with DOIs until the items have first been tweeted, blogged, etc.

2016-01-21

Still waiting for my IFTTT recipe to fire, two days later
It looks like the Atom feed on CGSpace hasn’t changed in two days, but there have definitely been new items
The RSS feed is nearly as stale, but it lists a different set of old items
On a hunch I cleared the Cocoon cache and now the feeds are fresh
Looks like there is a configuration option related to this, webui.feed.cache.age, which defaults to 48 hours, though I’m not sure what relation it has to the Cocoon cache
In any case, we should change this cache to be something more like 6 hours, as we publish new items several times per day.
Work around a CSS issue with long URLs in the item view (#172)
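The feed cache change mentioned above could be scripted roughly like this. It is only a sketch: the helper name `set_cfg` and the file name in the usage line are assumptions, and it only handles simple `key = value` lines whose values contain no `|` or `&` characters.

```shell
# Hypothetical helper: set "key = value" in a properties-style file.
# Rewrites an existing uncommented assignment, otherwise appends a new line.
set_cfg() {
    cfg_file=$1; cfg_key=$2; cfg_value=$3
    # Escape dots so they match literally in the regular expressions below.
    cfg_key_re=$(printf '%s' "$cfg_key" | sed 's/\./\\./g')
    if grep -Eq "^${cfg_key_re}[[:space:]]*=" "$cfg_file"; then
        # Keep a .bak copy of the file before editing it in place.
        sed -E -i.bak "s|^${cfg_key_re}[[:space:]]*=.*|${cfg_key} = ${cfg_value}|" "$cfg_file"
    else
        printf '%s = %s\n' "$cfg_key" "$cfg_value" >> "$cfg_file"
    fi
}
```

Usage would be something like `set_cfg dspace.cfg webui.feed.cache.age 6`, followed by a redeploy.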

2016-01-25

Re-deploy CGSpace and DSpace Test with latest 5_x-prod branch
This included the social icon fixes/updates, date-based facet tweaks, reducing the feed cache age, and fixing a layout issue in XMLUI item view when an item had long URLs

2016-01-26

Run nginx updates on CGSpace and DSpace Test (1.8.1 and 1.9.10, respectively)
Run updates on DSpace Test and reboot for new Linode kernel Linux 4.4.0-x86_64-linode63 (first update in months)

2016-01-28

Start looking at importing some Bioversity data that had been prepared earlier this week
While checking the data I noticed something strange, namely that there are 79 items but only 8 unique PDFs:

$ ls SimpleArchiveForBio/ | wc -l
79
$ find SimpleArchiveForBio/ -iname "*.pdf" -exec basename {} \; | sort -u | wc -l
8
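To see which file names are shared between items, the same pipeline can be extended with `uniq -c`. A minimal sketch (the function name is hypothetical, and a shared file name does not prove the contents are identical):

```shell
# Print how many times each PDF file name occurs under a directory,
# most frequent first.
pdf_name_counts() {
    find "$1" -type f -iname '*.pdf' -exec basename {} \; \
        | sort \
        | uniq -c \
        | sort -k1,1nr
}
```

For example, `pdf_name_counts SimpleArchiveForBio/` would list each PDF name with the number of item directories that contain it.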

2016-01-29

Add five missing center-specific subjects to XMLUI item view (#174)
This CCAFS item, before and after the change (screenshots not reproduced here)

December, 2015

Wed, 02 Dec 2015 13:18:00 +0300

2015-12-02

Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:

# cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz

I had used lrzip once, but it needs more memory and is harder to use as it requires the lrztar wrapper
Need to remember to go check if everything is ok in a few days and then change CGSpace
CGSpace went down again (due to PostgreSQL idle connections of course)
Current database settings for DSpace are db.maxconnections = 30 and db.maxidle = 8, yet idle connections are exceeding this:

$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
39
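One caveat with `grep idle` is that it also counts connections in the `idle in transaction` state, as well as any row whose query text happens to contain the word. A sketch of a stricter tally, assuming psql prints one `state` value per line (the `state` column exists in PostgreSQL 9.2 and later; the function name is made up here):

```shell
# Tally connection states read from stdin, one state per line, e.g. from:
#   psql -At -c "SELECT state FROM pg_stat_activity WHERE datname = 'cgspace'"
# Output is "count<TAB>state", highest count first.
tally_states() {
    awk 'NF { count[$0]++ }
         END { for (s in count) printf "%d\t%s\n", count[s], s }' \
        | sort -k1,1nr
}
```

Separating `idle` from `idle in transaction` matters because the latter still holds an open transaction.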

I restarted PostgreSQL and Tomcat and it’s back
On a related note of why CGSpace is so slow, I decided to finally try the pgtune script to tune the postgres settings:

# apt-get install pgtune
# pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
# mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig 
# mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf

It introduced the following new settings:

default_statistics_target = 50
maintenance_work_mem = 480MB
constraint_exclusion = on
checkpoint_completion_target = 0.9
effective_cache_size = 5632MB
work_mem = 48MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 1920MB
max_connections = 80
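To check exactly what pgtune changed, the two files can be compared with comments and blank lines stripped. A sketch, assuming the original was kept as postgresql.conf.orig as in the commands above (the function names are hypothetical, and values containing a literal `#` would be truncated):

```shell
# Print the active (uncommented) settings of a PostgreSQL config file, sorted.
active_settings() {
    sed -E 's/[[:space:]]*#.*$//' "$1" | grep -Ev '^[[:space:]]*$' | sort
}

# Print settings that appear only in the second file (new or changed values).
new_settings() {
    _old=$(mktemp); _new=$(mktemp)
    active_settings "$1" > "$_old"
    active_settings "$2" > "$_new"
    comm -13 "$_old" "$_new"
    rm -f "$_old" "$_new"
}
```

Something like `new_settings postgresql.conf.orig postgresql.conf` would then list only the lines pgtune added or altered.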

Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc
For what it’s worth, now the REST API should be faster (because of these PostgreSQL tweaks):

$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.474
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
2.141
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.685
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.995
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.786
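Averaging the five timings above gives about 1.8 seconds. A small sketch of the arithmetic (the function name is made up; it reads one `time_total` value per line):

```shell
# Print the mean of the numbers read from stdin (one per line), to 3 decimals.
mean() {
    LC_ALL=C awk 'NF { sum += $1; n++ } END { if (n) printf "%.3f\n", sum / n }'
}

printf '%s\n' 1.474 2.141 1.685 1.995 1.786 | mean   # prints 1.816
```

The same function can be fed from a loop of curl calls to compare runs before and after a change.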

Last week it was an average of 8 seconds… now this is ¹⁄₄ of that
CCAFS noticed that one of their items displays only the Atmire statlets: https://cgspace.cgiar.org/handle/10568/42445

The authorizations for the item are all public READ, and I don’t see any errors in dspace.log when browsing that item
I filed a ticket on Atmire’s issue tracker
I also filed a ticket on Atmire’s issue tracker for the PostgreSQL stuff

2015-12-03

CGSpace is very slow, and monitoring is emailing me to say it’s down, even though I can load the page (very slowly)
Idle postgres connections look like this (with no change in DSpace db settings lately):

$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
29

I restarted Tomcat and postgres…
Atmire commented that we should raise the JVM heap size by ~500M, so it is now -Xms3584m -Xmx3584m
We weren’t out of heap yet, but it’s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it’s ok
A possible side effect is that the REST API is now about twice as fast for the request above:

$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.368
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.968
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.006
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.849
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.806
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.854

2015-12-05

CGSpace has been up and down all day and REST API is completely unresponsive
PostgreSQL idle connections are currently:

postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
28

I have reverted all the pgtune tweaks from the other day, as they didn’t fix the stability issues, so I’d rather not have them introducing more variables into the equation
The PostgreSQL stats from Munin all point to something database-related with the DSpace 5 upgrade around mid–late November

2015-12-07

Atmire sent some fixes to DSpace’s REST API code that was leaving contexts open (causing the slow performance and database issues)
After deploying the fix to CGSpace the REST API is consistently faster:

$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.675
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.599
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.588
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.566
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.497

2015-12-08

Switch CGSpace log compression cron jobs from using lzop to xz—the compression isn’t as good, but it’s much faster and causes less IO/CPU load
Since we figured out (and fixed) the cause of the performance issue, I reverted Google Bot’s crawl rate to the “Let Google optimize” setting

November, 2015

Mon, 23 Nov 2015 17:00:57 +0300

2015-11-22

CGSpace went down
Looks like DSpace exhausted its PostgreSQL connection pool
Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:

$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78

For now I have increased the limit from 60 to 90, run updates, and rebooted the server

2015-11-24

CGSpace went down again
Getting emails from uptimeRobot and uptimeButler that it’s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors
Looks like there are still a bunch of idle PostgreSQL connections:

$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
96

For some reason the number of idle connections is very high since we upgraded to DSpace 5

2015-11-25

Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config
The OAI application requests stylesheets and javascript files with the path /oai/static/css, which gets matched here:

# static assets we can load from the file system directly with nginx
location ~ /(themes|static|aspects/ReportingSuite) {
    try_files $uri @tomcat;
...

The document root is relative to the xmlui app, so this gets a 404—I’m not sure why it doesn’t pass to @tomcat
Anyways, I can’t find any URIs with path /static, and the more important point is to handle all the static theme assets, so we can just remove static from the regex for now (who cares if we can’t use nginx to send ETags for OAI CSS!)
Also, I noticed we aren’t setting CSP headers on the static assets, because in nginx headers are inherited in child blocks, but if you use add_header in a child block it doesn’t inherit the others
We simply need to add include extra-security.conf; to the above location block (but research and test first)
We should add WOFF assets to the list of things to set expires for:

location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {

We should also add aspects/Statistics to the location block for static assets (minus static from above):

location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
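The reason /oai/static/css gets caught is easier to see when the location regex is tried against a few paths: the pattern is unanchored, so static matches anywhere in the URI. A rough sketch using `grep -E` as a stand-in for nginx’s regex matching (the sample paths other than /oai/static/css are invented for illustration):

```shell
# Report whether a path matches an unanchored, case-sensitive regex,
# roughly as an nginx "location ~" block would.
matches() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then echo yes; else echo no; fi
}

old='/(themes|static|aspects/ReportingSuite)'
new='/(themes|aspects/ReportingSuite|aspects/Statistics)'

for path in /oai/static/css /themes/example/style.css /aspects/Statistics/example.js; do
    printf '%-34s old=%s new=%s\n' "$path" \
        "$(matches "$old" "$path")" "$(matches "$new" "$path")"
done
```

With the new pattern the OAI path no longer matches and falls through to whatever location handles it otherwise.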

Need to check /about on CGSpace, as it’s blank on my local test server and we might need to add something there
CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):

$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
93

I looked closer at the idle connections and saw that many have been idle for hours (current time on server is 2015-11-25T20:20:42+0000):

$ psql -c 'SELECT * from pg_stat_activity;' | less -S
datid | datname  |  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace  | 10966 |    18205 | cgspace  |                  | 127.0.0.1   |                 |       37731 | 2015-11-25 13:13:02.837624+00 |                               | 20
20951 | cgspace  | 10967 |    18205 | cgspace  |                  | 127.0.0.1   |                 |       37737 | 2015-11-25 13:13:03.069421+00 |                               | 20
...

There is a relevant Jira issue about this: https://jira.duraspace.org/browse/DS-1458
It seems there is some sense in changing DSpace’s default db.maxidle from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)
Change db.maxidle from -1 to 10, reduce db.maxconnections from 90 to 50, and restart postgres and tomcat7
Also redeploy DSpace Test with a clean sync of CGSpace and mirror these database settings there as well
Also deploy the nginx fixes for the try_files location block as well as the expires block

2015-11-26

CGSpace behaving much better since changing db.maxidle yesterday, but still two up/down notices from monitoring this morning (better than 50!)
CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item
Not as bad for me, but still unsustainable if you have to get many:

$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
8.415

Monitoring e-mailed in the evening to say CGSpace was down
Idle connections in PostgreSQL again:

$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
66

At the time, the current DSpace pool size was 50…
I reduced the pool back to the default of 30, and reduced the db.maxidle setting from 10 to 8

2015-11-29

Still more alerts that CGSpace has been up and down all day
Current database settings for DSpace:

db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true

And idle connections:

$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
49

Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace’s thirst can ever be quenched
On another note, SUNScholar’s notes suggest adjusting some other postgres variables: http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database
This might help with REST API speed (which I mentioned above and still need to test properly)