Alan Orth d29bb00d69
Update notes for 2016-01
Signed-off-by: Alan Orth <>
2016-01-20 18:17:35 +02:00

<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="">
<title>CGSpace Notes</title>
<description>Recent content on CGSpace Notes</description>
<generator>Hugo --</generator>
<lastBuildDate>Wed, 13 Jan 2016 13:18:00 +0300</lastBuildDate>
<atom:link href="/cgspace-notes/index.xml" rel="self" type="application/rss+xml" />
<title>January, 2016</title>
<pubDate>Wed, 13 Jan 2016 13:18:00 +0300</pubDate>
&lt;h2 id=&#34;2016-01-13:3846b7fcbca60cdedafd373cb39cd76d&#34;&gt;2016-01-13&lt;/h2&gt;
&lt;li&gt;Move ILRI collection &lt;code&gt;10568/12503&lt;/code&gt; from &lt;code&gt;10568/27869&lt;/code&gt; to &lt;code&gt;10568/27629&lt;/code&gt; using the &lt;a href=&#34;;&gt;;/a&gt; script I wrote last year.&lt;/li&gt;
&lt;li&gt;I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.&lt;/li&gt;
&lt;li&gt;Update GitHub wiki for documentation of &lt;a href=&#34;;&gt;maintenance tasks&lt;/a&gt;.&lt;/li&gt;
&lt;h2 id=&#34;2016-01-14:3846b7fcbca60cdedafd373cb39cd76d&#34;&gt;2016-01-14&lt;/h2&gt;
&lt;li&gt;Update CCAFS project identifiers in input-forms.xml&lt;/li&gt;
&lt;li&gt;Run system updates and restart the server&lt;/li&gt;
&lt;h2 id=&#34;2016-01-18:3846b7fcbca60cdedafd373cb39cd76d&#34;&gt;2016-01-18&lt;/h2&gt;
&lt;li&gt;Change &amp;ldquo;Extension material&amp;rdquo; to &amp;ldquo;Extension Material&amp;rdquo; in input-forms.xml (a mistake that fell through the cracks when we fixed the others in the DSpace 4 era)&lt;/li&gt;
&lt;h2 id=&#34;2016-01-19:3846b7fcbca60cdedafd373cb39cd76d&#34;&gt;2016-01-19&lt;/h2&gt;
&lt;li&gt;Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme&amp;rsquo;s primary color (&lt;a href=&#34;;&gt;#157&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Tweak date-based facets to show more values in drill-down ranges (&lt;a href=&#34;;&gt;#162&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Need to remember to clear the Cocoon cache after deployment or else you don&amp;rsquo;t see the new ranges immediately&lt;/li&gt;
<title>December, 2015</title>
<pubDate>Wed, 02 Dec 2015 13:18:00 +0300</pubDate>
&lt;h2 id=&#34;2015-12-02:012a628feed6d64ae1151cbd6151ccd6&#34;&gt;2015-12-02&lt;/h2&gt;
&lt;li&gt;Replace &lt;code&gt;lzop&lt;/code&gt; with &lt;code&gt;xz&lt;/code&gt; in log compression cron jobs on DSpace Test—it uses less space:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;# cd /home/
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
&lt;/code&gt;&lt;/pre&gt;
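&lt;p&gt;To put the listing above in relative terms, here is a minimal sketch that expresses each compressed size as a percentage of the original. The helper name is hypothetical, and the inputs are the rounded sizes from &lt;code&gt;ls -lh&lt;/code&gt; (2.0M taken as 2048K), so the results are approximate.&lt;/p&gt;

```shell
# Hypothetical helper (not from the original notes): express a compressed
# size as a percentage of the original size. Sizes are in KiB, read off the
# rounded `ls -lh` listing above (2.0M is taken as 2048K).
pct_of_original() {
    awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.1f\n", comp / orig * 100 }'
}

pct_of_original 2048 387   # lzop
pct_of_original 2048 169   # xz
```

&lt;p&gt;With these inputs the lzop file is roughly 19% of the original size and the xz file roughly 8%.&lt;/p&gt;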
&lt;li&gt;I had used lrzip once, but it needs more memory and is harder to use as it requires the lrztar wrapper&lt;/li&gt;
&lt;li&gt;Need to remember to go check if everything is ok in a few days and then change CGSpace&lt;/li&gt;
&lt;li&gt;CGSpace went down again (due to PostgreSQL idle connections of course)&lt;/li&gt;
&lt;li&gt;Current database settings for DSpace are &lt;code&gt;db.maxconnections = 30&lt;/code&gt; and &lt;code&gt;db.maxidle = 8&lt;/code&gt;, yet idle connections are exceeding this:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
&lt;/code&gt;&lt;/pre&gt;
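&lt;p&gt;One caveat with the grep pipeline above: it counts every line containing the word idle, which also includes sessions in the &lt;code&gt;idle in transaction&lt;/code&gt; state. A possible alternative, sketched here under the assumption of PostgreSQL 9.2 or later (where &lt;code&gt;pg_stat_activity&lt;/code&gt; has a &lt;code&gt;state&lt;/code&gt; column), is to tally the states separately. The function name and the sample rows are made up for illustration.&lt;/p&gt;

```shell
# Sketch only: tally connection states for one database from
# "datname|state" rows, e.g. as produced by (PostgreSQL 9.2 or later):
#   psql -At -c 'SELECT datname, state FROM pg_stat_activity;'
# The function name and the sample rows below are made up for illustration.
count_states() {
    awk -F'|' -v db="$1" '$1 == db { n[$2]++ } END { for (s in n) print n[s], s }' |
        LC_ALL=C sort
}

# Made-up sample input, one row per connection.
printf '%s\n' \
    'cgspace|idle' \
    'cgspace|idle' \
    'cgspace|idle in transaction' \
    'cgspace|active' \
    'postgres|active' | count_states cgspace
```

&lt;p&gt;Splitting the count this way shows whether the pool is filling up with plain idle connections or with transactions that were opened and never finished.&lt;/p&gt;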
&lt;li&gt;I restarted PostgreSQL and Tomcat and it&amp;rsquo;s back&lt;/li&gt;
&lt;li&gt;On a related note of why CGSpace is so slow, I decided to finally try the &lt;code&gt;pgtune&lt;/code&gt; script to tune the postgres settings:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;# apt-get install pgtune
# pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
# mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig
# mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;It introduced the following new settings:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;default_statistics_target = 50
maintenance_work_mem = 480MB
constraint_exclusion = on
checkpoint_completion_target = 0.9
effective_cache_size = 5632MB
work_mem = 48MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 1920MB
max_connections = 80
&lt;/code&gt;&lt;/pre&gt;
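&lt;p&gt;When watching memory with these settings, it may help to keep in mind that &lt;code&gt;work_mem&lt;/code&gt; is a limit per sort or hash operation rather than a server-wide total. The following is a rough back-of-the-envelope estimate, not a PostgreSQL formula: it assumes every allowed connection runs exactly one such operation at full size at the same time.&lt;/p&gt;

```shell
# Rough back-of-the-envelope estimate, not a PostgreSQL formula: assume every
# allowed connection runs one sort or hash operation that uses its full
# work_mem at the same moment. Values are the pgtune settings listed above.
max_connections=80
work_mem_mb=48
shared_buffers_mb=1920

work_total_mb=$((max_connections * work_mem_mb))
echo "work_mem worst case: ${work_total_mb}MB"
echo "plus shared_buffers: $((work_total_mb + shared_buffers_mb))MB"
```

&lt;p&gt;Under that assumption the total comes to 3840MB of &lt;code&gt;work_mem&lt;/code&gt;, or 5760MB including &lt;code&gt;shared_buffers&lt;/code&gt;. Note that &lt;code&gt;effective_cache_size&lt;/code&gt; is only a planner hint and does not allocate memory.&lt;/p&gt;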
&lt;li&gt;Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc&lt;/li&gt;
&lt;li&gt;For what it&amp;rsquo;s worth, now the REST API should be faster (because of these PostgreSQL tweaks):&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Last week it was an average of 8 seconds&amp;hellip; now this is &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;4&lt;/sub&gt; of that&lt;/li&gt;
&lt;li&gt;CCAFS noticed that one of their items displays only the Atmire statlets: &lt;a href=&#34;;&gt;;/a&gt;&lt;/li&gt;
&lt;p&gt;&lt;img src=&#34;../images/2015/12/ccafs-item-no-metadata.png&#34; alt=&#34;CCAFS item&#34; /&gt;&lt;/p&gt;
&lt;li&gt;The authorizations for the item are all public READ, and I don&amp;rsquo;t see any errors in dspace.log when browsing that item&lt;/li&gt;
&lt;li&gt;I filed a ticket on Atmire&amp;rsquo;s issue tracker&lt;/li&gt;
&lt;li&gt;I also filed a ticket on Atmire&amp;rsquo;s issue tracker for the PostgreSQL stuff&lt;/li&gt;
&lt;h2 id=&#34;2015-12-03:012a628feed6d64ae1151cbd6151ccd6&#34;&gt;2015-12-03&lt;/h2&gt;
&lt;li&gt;CGSpace is very slow, and monitoring is emailing me to say it&amp;rsquo;s down, even though I can load the page (very slowly)&lt;/li&gt;
&lt;li&gt;Idle postgres connections look like this (with no change in DSpace db settings lately):&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;I restarted Tomcat and postgres&amp;hellip;&lt;/li&gt;
&lt;li&gt;Atmire commented that we should raise the JVM heap size by ~500M, so it is now &lt;code&gt;-Xms3584m -Xmx3584m&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We weren&amp;rsquo;t out of heap yet, but it&amp;rsquo;s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it&amp;rsquo;s ok&lt;/li&gt;
&lt;li&gt;A possible side effect: the REST API is now twice as fast for the request above:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;2015-12-05:012a628feed6d64ae1151cbd6151ccd6&#34;&gt;2015-12-05&lt;/h2&gt;
&lt;li&gt;CGSpace has been up and down all day and REST API is completely unresponsive&lt;/li&gt;
&lt;li&gt;PostgreSQL idle connections are currently:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;postgres@linode01:~$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;I have reverted all the pgtune tweaks from the other day, as they didn&amp;rsquo;t fix the stability issues, so I&amp;rsquo;d rather not have them introducing more variables into the equation&lt;/li&gt;
&lt;li&gt;The PostgreSQL stats from Munin all point to something database-related with the DSpace 5 upgrade around mid-to-late November&lt;/li&gt;
&lt;p&gt;&lt;img src=&#34;../images/2015/12/postgres_bgwriter-year.png&#34; alt=&#34;PostgreSQL bgwriter (year)&#34; /&gt;
&lt;img src=&#34;../images/2015/12/postgres_cache_cgspace-year.png&#34; alt=&#34;PostgreSQL cache (year)&#34; /&gt;
&lt;img src=&#34;../images/2015/12/postgres_locks_cgspace-year.png&#34; alt=&#34;PostgreSQL locks (year)&#34; /&gt;
&lt;img src=&#34;../images/2015/12/postgres_scans_cgspace-year.png&#34; alt=&#34;PostgreSQL scans (year)&#34; /&gt;&lt;/p&gt;
&lt;h2 id=&#34;2015-12-07:012a628feed6d64ae1151cbd6151ccd6&#34;&gt;2015-12-07&lt;/h2&gt;
&lt;li&gt;Atmire sent &lt;a href=&#34;;&gt;some fixes&lt;/a&gt; to DSpace&amp;rsquo;s REST API code that was leaving contexts open (causing the slow performance and database issues)&lt;/li&gt;
&lt;li&gt;After deploying the fix to CGSpace the REST API is consistently faster:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
$ curl -o /dev/null -s -w %{time_total}\\n
&lt;/code&gt;&lt;/pre&gt;
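&lt;p&gt;The repeated timing runs in these notes could be wrapped in a small helper so that the average is computed rather than eyeballed. This is only a sketch: the function names are hypothetical, and the URL argument stands in for the REST API endpoint, which is not preserved in these notes.&lt;/p&gt;

```shell
# Hypothetical helpers for repeating the timing test above; the function
# names are made up, and the URL argument stands in for the REST API
# endpoint, which is not preserved in these notes.
time_request() {
    # Print the total request time in seconds, as in the commands above.
    curl -o /dev/null -s -w '%{time_total}\n' "$1"
}

average() {
    # Average one number per line from standard input.
    awk '{ sum += $1; n++ } END { if (n) printf "%.3f\n", sum / n }'
}

# Example (needs network access and a real URL):
#   for i in 1 2 3 4 5; do time_request "$url"; done | average
```

&lt;p&gt;Keeping &lt;code&gt;average&lt;/code&gt; separate from the curl call means it can also be fed timings saved from earlier runs.&lt;/p&gt;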
&lt;h2 id=&#34;2015-12-08:012a628feed6d64ae1151cbd6151ccd6&#34;&gt;2015-12-08&lt;/h2&gt;
&lt;li&gt;Switch CGSpace log compression cron jobs from lzop to xz: xz is slower and uses more CPU than lzop, but it compresses much better (see the file sizes from 2015-12-02)&lt;/li&gt;
&lt;li&gt;Since we figured out (and fixed) the cause of the performance issue, I reverted Google Bot&amp;rsquo;s crawl rate to the &amp;ldquo;Let Google optimize&amp;rdquo; setting&lt;/li&gt;
<title>November, 2015</title>
<pubDate>Mon, 23 Nov 2015 17:00:57 +0300</pubDate>
&lt;h2 id=&#34;2015-11-22:3d03b850f8126f80d8144c2e17ea0ae7&#34;&gt;2015-11-22&lt;/h2&gt;
&lt;li&gt;CGSpace went down&lt;/li&gt;
&lt;li&gt;Looks like DSpace exhausted its PostgreSQL connection pool&lt;/li&gt;
&lt;li&gt;Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;For now I have increased the limit from 60 to 90, run updates, and rebooted the server&lt;/li&gt;
&lt;h2 id=&#34;2015-11-24:3d03b850f8126f80d8144c2e17ea0ae7&#34;&gt;2015-11-24&lt;/h2&gt;
&lt;li&gt;CGSpace went down again&lt;/li&gt;
&lt;li&gt;Getting emails from uptimeRobot and uptimeButler that it&amp;rsquo;s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors&lt;/li&gt;
&lt;li&gt;Looks like there are still a bunch of idle PostgreSQL connections:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;For some reason the number of idle connections is very high since we upgraded to DSpace 5&lt;/li&gt;
&lt;h2 id=&#34;2015-11-25:3d03b850f8126f80d8144c2e17ea0ae7&#34;&gt;2015-11-25&lt;/h2&gt;
&lt;li&gt;Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config&lt;/li&gt;
&lt;li&gt;The OAI application requests stylesheets and javascript files with the path &lt;code&gt;/oai/static/css&lt;/code&gt;, which gets matched here:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;# static assets we can load from the file system directly with nginx
location ~ /(themes|static|aspects/ReportingSuite) {
    try_files $uri @tomcat;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;The document root is relative to the xmlui app, so this gets a 404—I&amp;rsquo;m not sure why it doesn&amp;rsquo;t pass to &lt;code&gt;@tomcat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Anyways, I can&amp;rsquo;t find any URIs with path &lt;code&gt;/static&lt;/code&gt;, and the more important point is to handle all the static theme assets, so we can just remove &lt;code&gt;static&lt;/code&gt; from the regex for now (who cares if we can&amp;rsquo;t use nginx to send Etags for OAI CSS!)&lt;/li&gt;
&lt;li&gt;Also, I noticed we aren&amp;rsquo;t setting CSP headers on the static assets: nginx inherits &lt;code&gt;add_header&lt;/code&gt; directives from the parent level only when the child block defines none of its own, so using &lt;code&gt;add_header&lt;/code&gt; in a child block drops the inherited ones&lt;/li&gt;
&lt;li&gt;We simply need to add &lt;code&gt;include extra-security.conf;&lt;/code&gt; to the above location block (but research and test first)&lt;/li&gt;
&lt;li&gt;We should add WOFF assets to the list of things to set expires for:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;We should also add &lt;code&gt;aspects/Statistics&lt;/code&gt; to the location block for static assets (minus &lt;code&gt;static&lt;/code&gt; from above):&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
&lt;/code&gt;&lt;/pre&gt;
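&lt;p&gt;The reason &lt;code&gt;/oai/static/css&lt;/code&gt; was caught by the earlier location block is that the regex is unanchored, so it matches anywhere in the request path. The effect of dropping &lt;code&gt;static&lt;/code&gt; can be illustrated outside nginx with &lt;code&gt;grep -E&lt;/code&gt;, which behaves the same way for a pattern this simple. The helper name is hypothetical.&lt;/p&gt;

```shell
# Illustration with grep -E rather than nginx itself: the alternation in the
# location regex is unanchored, so it matches "static" anywhere in the path.
old_regex='/(themes|static|aspects/ReportingSuite)'
new_regex='/(themes|aspects/ReportingSuite|aspects/Statistics)'

matches() {
    # Succeed when path $2 matches extended regex $1.
    printf '%s\n' "$2" | grep -Eq "$1"
}

for re in "$old_regex" "$new_regex"; do
    if matches "$re" /oai/static/css; then
        echo "match:    $re"
    else
        echo "no match: $re"
    fi
done
```

&lt;p&gt;The old pattern matches the OAI path and the new one does not, so OAI requests for static files should no longer be routed to the file system lookup.&lt;/p&gt;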
&lt;li&gt;Need to check &lt;code&gt;/about&lt;/code&gt; on CGSpace, as it&amp;rsquo;s blank on my local test server and we might need to add something there&lt;/li&gt;
&lt;li&gt;CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;I looked closer at the idle connections and saw that many have been idle for hours (current time on server is &lt;code&gt;2015-11-25T20:20:42+0000&lt;/code&gt;):&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
20951 | cgspace | 10966 | 18205 | cgspace | | | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
20951 | cgspace | 10967 | 18205 | cgspace | | | | 37737 | 2015-11-25 13:13:03.069421+00 | | 20
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;There is a relevant Jira issue about this: &lt;a href=&#34;;&gt;;/a&gt;&lt;/li&gt;
&lt;li&gt;It seems there is some sense in changing DSpace&amp;rsquo;s default &lt;code&gt;db.maxidle&lt;/code&gt; from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)&lt;/li&gt;
&lt;li&gt;Change &lt;code&gt;db.maxidle&lt;/code&gt; from -1 to 10, reduce &lt;code&gt;db.maxconnections&lt;/code&gt; from 90 to 50, and restart postgres and tomcat7&lt;/li&gt;
&lt;li&gt;Also redeploy DSpace Test with a clean sync of CGSpace and mirror these database settings there as well&lt;/li&gt;
&lt;li&gt;Also deploy the nginx fixes for the &lt;code&gt;try_files&lt;/code&gt; location block as well as the expires block&lt;/li&gt;
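&lt;p&gt;Once the nginx changes are deployed, one way to check whether static assets carry the CSP header is to inspect the response headers. This is a minimal sketch: the function name is hypothetical, the asset URL is a placeholder, and it assumes &lt;code&gt;extra-security.conf&lt;/code&gt; sets the policy through the standard &lt;code&gt;Content-Security-Policy&lt;/code&gt; header.&lt;/p&gt;

```shell
# Sketch: report whether a set of HTTP response headers (read from standard
# input) includes a Content-Security-Policy header. The function name is
# hypothetical, and the header name assumes extra-security.conf sets CSP via
# the standard Content-Security-Policy header.
has_csp_header() {
    grep -qi '^content-security-policy:'
}

# Example (needs network access; the asset URL is a placeholder):
#   curl -sI "$asset_url" | has_csp_header || echo "CSP header missing"
```

&lt;p&gt;Running the same check against a theme asset and a regular page makes it easy to see whether the location block lost the inherited headers.&lt;/p&gt;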
&lt;h2 id=&#34;2015-11-26:3d03b850f8126f80d8144c2e17ea0ae7&#34;&gt;2015-11-26&lt;/h2&gt;
&lt;li&gt;CGSpace behaving much better since changing &lt;code&gt;db.maxidle&lt;/code&gt; yesterday, but still two up/down notices from monitoring this morning (better than 50!)&lt;/li&gt;
&lt;li&gt;CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item&lt;/li&gt;
&lt;li&gt;Not as bad for me, but still unsustainable if you have to get many:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ curl -o /dev/null -s -w %{time_total}\\n
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Monitoring e-mailed in the evening to say CGSpace was down&lt;/li&gt;
&lt;li&gt;Idle connections in PostgreSQL again:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;At the time, the current DSpace pool size was 50&amp;hellip;&lt;/li&gt;
&lt;li&gt;I reduced the pool back to the default of 30, and reduced the &lt;code&gt;db.maxidle&lt;/code&gt; settings from 10 to 8&lt;/li&gt;
&lt;h2 id=&#34;2015-11-29:3d03b850f8126f80d8144c2e17ea0ae7&#34;&gt;2015-11-29&lt;/h2&gt;
&lt;li&gt;Still more alerts that CGSpace has been up and down all day&lt;/li&gt;
&lt;li&gt;Current database settings for DSpace:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;And idle connections:&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace&amp;rsquo;s thirst can ever be quenched&lt;/li&gt;
&lt;li&gt;On another note, SUNScholar&amp;rsquo;s notes suggest adjusting some other postgres variables: &lt;a href=&#34;;&gt;;/a&gt;&lt;/li&gt;
&lt;li&gt;This might help with REST API speed (which I mentioned above, and for which I still need to run real tests)&lt;/li&gt;