CGSpace Notes /cgspace-notes/ Recent content on CGSpace Notes Hugo -- gohugo.io en-us Wed, 02 Mar 2016 16:50:00 +0300 March, 2016 /cgspace-notes/2016-03/ Wed, 02 Mar 2016 16:50:00 +0300 /cgspace-notes/2016-03/ <h2 id="2016-03-02:5a28ddf3ee658c043c064ccddb151717">2016-03-02</h2> <ul> <li>Looking at issues with author authorities on CGSpace</li> <li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li> <li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li> </ul> <h2 id="2016-03-07:5a28ddf3ee658c043c064ccddb151717">2016-03-07</h2> <ul> <li>Troubleshooting the issues with the slew of commits for Atmire modules in <a href="https://github.com/ilri/DSpace/pull/182">#182</a></li> <li>Their changes on <code>5_x-dev</code> branch work, but it is messy as hell with merge commits and old branch base</li> <li>When I rebase their branch on the latest <code>5_x-prod</code> I get blank white pages</li> <li>I identified one commit that causes the issue and let them know</li> <li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li> </ul> <pre><code>Exception in thread &quot;Lucene Merge Thread #19&quot; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device </code></pre> <h2 id="2016-03-08:5a28ddf3ee658c043c064ccddb151717">2016-03-08</h2> <ul> <li>Add a few new filters to Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li> <li>We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were</li> </ul> <h2 id="2016-03-10:5a28ddf3ee658c043c064ccddb151717">2016-03-10</h2> <ul> <li>Disable the lucene cron job on CGSpace as it 
shouldn&rsquo;t be needed anymore</li> <li>Discuss ORCiD and duplicate authors on Yammer</li> <li>Request new documentation for Atmire CUA and L&amp;R modules, as ours are from 2013</li> <li>Walk Sisay through some data cleaning workflows in OpenRefine</li> <li>Start cleaning up the configuration for Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/issues/185">#184</a>)</li> <li>It is very messed up because some labels are incorrect, fields are missing, etc</li> </ul> <p><img src="../images/2016/03/cua-label-mixup.png" alt="Mixed up label in Atmire CUA" /></p> February, 2016 /cgspace-notes/2016-02/ Fri, 05 Feb 2016 13:18:00 +0300 /cgspace-notes/2016-02/ <h2 id="2016-02-05:124a59adbaa8ef13e1518d003fc03981">2016-02-05</h2> <ul> <li>Looking at some DAGRIS data for Abenet Yabowork</li> <li>Lots of issues with spaces, newlines, etc causing the import to fail</li> <li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li> </ul> <p><img src="../images/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p> <ul> <li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li> <li>Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li> </ul> <h2 id="2016-02-06:124a59adbaa8ef13e1518d003fc03981">2016-02-06</h2> <ul> <li>Found a way to get items with null/empty metadata values from SQL</li> <li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li> </ul> <pre><code>dspacetest=# select * from metadatafieldregistry; </code></pre> <ul> <li>In this case our country field is 78</li> <li>Now find all resources with type 2 (item) that have null/empty values for that field:</li> </ul> <pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL); </code></pre> <ul> <li>Then you can find the handle that owns it from its 
<code>resource_id</code>:</li> </ul> <pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678'; </code></pre> <ul> <li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li> </ul> <pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value=''; DELETE 25 </code></pre> <ul> <li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li> <li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 &ldquo;|||&rdquo; countries are still there</li> <li>Maybe I need to do a full re-index&hellip;</li> <li>Yep! The full re-index seems to work.</li> <li>Process the empty countries on CGSpace</li> </ul> <h2 id="2016-02-07:124a59adbaa8ef13e1518d003fc03981">2016-02-07</h2> <ul> <li>Working on cleaning up Abenet&rsquo;s DAGRIS data with OpenRefine</li> <li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape(&quot;javascript&quot;)</code> which shows whitespace characters like <code>\r\n</code>!</li> <li>For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV</li> <li>I re-import the resulting CSV and run a GREL on the date issued column: <code>value.replace(&quot;\.0&quot;, &quot;&quot;)</code></li> <li>I need to start running DSpace in Mac OS X instead of a Linux VM</li> <li>Install PostgreSQL from homebrew, then configure and import CGSpace database dump:</li> </ul> <pre><code>$ postgres -D /opt/brew/var/postgres $ createuser --superuser postgres $ createuser --pwprompt dspacetest $ createdb -O dspacetest --encoding=UNICODE dspacetest $ psql postgres postgres=# alter user dspacetest createuser; postgres=# \q $ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup $ psql postgres postgres=# alter user dspacetest nocreateuser; postgres=# \q $ vacuumdb 
dspacetest $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost </code></pre> <ul> <li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li> </ul> <pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig $ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT $ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest $ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui $ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai $ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start </code></pre> <ul> <li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li> <li>For example:</li> </ul> <pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot; </code></pre> <ul> <li>After verifying that the site is working, start a full index:</li> </ul> <pre><code>$ ~/dspace/bin/dspace index-discovery -b </code></pre> <h2 id="2016-02-08:124a59adbaa8ef13e1518d003fc03981">2016-02-08</h2> <ul> <li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li> <li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme&rsquo;s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li> </ul> <p><img src="../images/2016/02/submit-button-ilri.png" alt="ILRI submission buttons" /> <img src="../images/2016/02/submit-button-drylands.png" alt="Drylands submission buttons" /></p> <h2 id="2016-02-09:124a59adbaa8ef13e1518d003fc03981">2016-02-09</h2> <ul> <li>Re-sync 
DSpace Test with CGSpace</li> <li>Help Sisay with OpenRefine</li> <li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li> </ul> <pre><code>$ cd ~/src/git $ git clone https://github.com/letsencrypt/letsencrypt $ cd letsencrypt $ sudo service nginx stop # add port 443 to firewall rules $ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org $ sudo service nginx start $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass </code></pre> <ul> <li>We should install it in /opt/letsencrypt and then script the renewal script, but first we have to wire up some variables and template stuff based on the script here: <a href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li> <li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs&hellip;</li> <li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li> <li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li> <li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li> </ul> <pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space </code></pre> <ul> <li>or</li> </ul> <pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object </code></pre> <ul> <li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li> </ul> <pre><code># free -m total used free shared buffers cached Mem: 3950 3902 48 9 37 1311 -/+ buffers/cache: 2552 1397 Swap: 255 57 198 </code></pre> <ul> <li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li> </ul> <h2 
id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2> <ul> <li>Massaging some CIAT data in OpenRefine</li> <li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li> <li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li> </ul> <pre><code>value.split('/')[-1]
</code></pre> <ul> <li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li> </ul> <pre><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&gt; Downloading 64195.pdf
&gt; Creating thumbnail for 64195.pdf
</code></pre> <h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2> <ul> <li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li> <li>A few items are using the same exact PDF</li> <li>A few items are using HTM or DOC files</li> <li>A few items link to PDFs on IFPRI&rsquo;s e-Library or Research Gate</li> <li>A few items have no item</li> <li>Also, I&rsquo;m not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li> </ul> <h2 id="2016-02-12-1:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2> <ul> <li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li> <li>265 items have dirty, URL-encoded filenames:</li> </ul> <pre><code>$ ls | grep -c -E &quot;%&quot;
265
</code></pre> <ul> <li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li> <li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li> 
</ul> <pre><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre> <ul> <li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li> <li>They will be deployed on CGSpace the next time I re-deploy</li> </ul> <h2 id="2016-02-16:124a59adbaa8ef13e1518d003fc03981">2016-02-16</h2> <ul> <li>Turns out OpenRefine has an unescape function!</li> </ul> <pre><code>value.unescape(&quot;url&quot;)
</code></pre> <ul> <li>This turns the URLs into human-readable versions that we can use as proper filenames</li> <li>Run web server and system updates on DSpace Test and reboot</li> <li>To merge <code>dc.identifier.url</code> and <code>dc.identifier.url[]</code>, rename the second column so it doesn&rsquo;t have the brackets, like <code>dc.identifier.url2</code></li> <li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between</li> <li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li> <li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li> <li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li> </ul> <h2 id="2016-02-17:124a59adbaa8ef13e1518d003fc03981">2016-02-17</h2> <ul> <li>Re-deploy CGSpace, run all system updates, and reboot</li> <li>More work 
on CIAT data, cleaning and doing a last metadata-only import into DSpace Test</li> <li>SAFBuilder has a bug preventing it from processing filenames containing more than one underscore</li> <li>Need to re-process the filename column to replace multiple underscores with one: <code>value.replace(/_{2,}/, &quot;_&quot;)</code></li> </ul> <h2 id="2016-02-20:124a59adbaa8ef13e1518d003fc03981">2016-02-20</h2> <ul> <li>Turns out the &ldquo;bug&rdquo; in SAFBuilder isn&rsquo;t a bug, it&rsquo;s a feature that allows you to encode extra information like the destination bundle in the filename</li> <li>Also, it seems DSpace&rsquo;s SAF import tool doesn&rsquo;t like importing filenames that have accents in them:</li> </ul> <pre><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
</code></pre> <ul> <li>Need to rename files to have no accents or umlauts, etc&hellip;</li> <li>Useful custom text facet for URLs ending with &ldquo;.pdf&rdquo;: <code>value.endsWith(&quot;.pdf&quot;)</code></li> </ul> <h2 id="2016-02-22:124a59adbaa8ef13e1518d003fc03981">2016-02-22</h2> <ul> <li>To change Spanish accents to ASCII in OpenRefine:</li> </ul> <pre><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
</code></pre> <ul> <li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li> <li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li> </ul> <pre><code>Bitstream: tést.pdf
Bitstream: tést señora.pdf
Bitstream: tést señora alimentación.pdf
</code></pre> <ul> <li>Seems it could be something with the HFS+ filesystem actually, as it&rsquo;s not UTF-8 (<a href="http://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-probably-the-worst-file-system-ever.html">it&rsquo;s something like UCS-2</a>)</li> <li>HFS+ stores 
filenames as a string, and filenames with accents get stored as <a href="https://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/">character+accent</a> whereas Linux&rsquo;s ext4 stores them as an array of bytes</li> <li>Running the SAFBuilder on Mac OS X works if you&rsquo;re going to import the resulting bundle on Mac OS X, but if your DSpace is running on Linux you need to run the SAFBuilder there where the filesystem&rsquo;s encoding matches</li> </ul> <h2 id="2016-02-29:124a59adbaa8ef13e1518d003fc03981">2016-02-29</h2> <ul> <li>Got notified by some CIFOR colleagues that the Google Scholar team had contacted them about CGSpace&rsquo;s incorrect ordering of authors in Google Scholar metadata</li> <li>Turns out there is a patch, and it was merged in DSpace 5.4: <a href="https://jira.duraspace.org/browse/DS-2679">https://jira.duraspace.org/browse/DS-2679</a></li> <li>I&rsquo;ve merged it into our <code>5_x-prod</code> branch that is currently based on DSpace 5.1</li> <li>We found a bug when a user searches from the homepage, sorts the results, and then tries to click &ldquo;View More&rdquo; in a sidebar facet</li> <li>I am not sure what causes it yet, but I opened an issue for it: <a href="https://github.com/ilri/DSpace/issues/179">https://github.com/ilri/DSpace/issues/179</a></li> <li>Have more problems with SAFBuilder on Mac OS X</li> <li>Now it doesn&rsquo;t recognize description hints in the filename column, like: <code>test.pdf__description:Blah</code></li> <li>But on Linux it works fine</li> <li>Trying to test Atmire&rsquo;s series of stats and CUA fixes from January and February, but their branch history is really messy and it&rsquo;s hard to see what&rsquo;s going on</li> <li>Rebasing their branch on top of our production branch results in a broken Tomcat, so I&rsquo;m going to tell them to fix their history and make a proper pull request</li> <li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: 
<code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li> <li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li> </ul> <pre><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_') </code></pre> <ul> <li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li> <li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting on the Atmire fixes for now, as the branch history is ugly</li> </ul> January, 2016 /cgspace-notes/2016-01/ Wed, 13 Jan 2016 13:18:00 +0300 /cgspace-notes/2016-01/ <h2 id="2016-01-13:3846b7fcbca60cdedafd373cb39cd76d">2016-01-13</h2> <ul> <li>Move ILRI collection <code>10568/12503</code> from <code>10568/27869</code> to <code>10568/27629</code> using the <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">move_collections.sh</a> script I wrote last year.</li> <li>I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.</li> <li>Update GitHub wiki for documentation of <a href="https://github.com/ilri/DSpace/wiki/Maintenance-Tasks">maintenance tasks</a>.</li> </ul> <h2 id="2016-01-14:3846b7fcbca60cdedafd373cb39cd76d">2016-01-14</h2> <ul> <li>Update CCAFS project identifiers in input-forms.xml</li> <li>Run system updates and restart the server</li> </ul> <h2 id="2016-01-18:3846b7fcbca60cdedafd373cb39cd76d">2016-01-18</h2> <ul> <li>Change &ldquo;Extension material&rdquo; to &ldquo;Extension Material&rdquo; in input-forms.xml (a mistake that fell through the cracks when 
we fixed the others in DSpace 4 era)</li> </ul> <h2 id="2016-01-19:3846b7fcbca60cdedafd373cb39cd76d">2016-01-19</h2> <ul> <li>Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme&rsquo;s primary color (<a href="https://github.com/ilri/DSpace/issues/157">#157</a>)</li> <li>Tweak date-based facets to show more values in drill-down ranges (<a href="https://github.com/ilri/DSpace/issues/162">#162</a>)</li> <li>Need to remember to clear the Cocoon cache after deployment or else you don&rsquo;t see the new ranges immediately</li> <li>Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my Twitter account</li> <li>Altmetrics&rsquo; support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc first.</li> </ul> <h2 id="2016-01-21:3846b7fcbca60cdedafd373cb39cd76d">2016-01-21</h2> <ul> <li>Still waiting for my IFTTT recipe to fire, two days later</li> <li>It looks like the Atom feed on CGSpace hasn&rsquo;t changed in two days, but there have definitely been new items</li> <li>The RSS feed is nearly as old, but has different old items there</li> <li>On a hunch I cleared the Cocoon cache and now the feeds are fresh</li> <li>Looks like there is a configuration option related to this, <code>webui.feed.cache.age</code>, which defaults to 48 hours, though I&rsquo;m not sure what relation it has to the Cocoon cache</li> <li>In any case, we should change this cache to be something more like 6 hours, as we publish new items several times per day.</li> <li>Work around a CSS issue with long URLs in the item view (<a href="https://github.com/ilri/DSpace/issues/172">#172</a>)</li> </ul> <h2 id="2016-01-25:3846b7fcbca60cdedafd373cb39cd76d">2016-01-25</h2> <ul> <li>Re-deploy CGSpace and DSpace Test with latest <code>5_x-prod</code> branch</li> <li>This included the social icon 
fixes/updates, date-based facet tweaks, reducing the feed cache age, and fixing a layout issue in XMLUI item view when an item had long URLs</li> </ul> <h2 id="2016-01-26:3846b7fcbca60cdedafd373cb39cd76d">2016-01-26</h2> <ul> <li>Run nginx updates on CGSpace and DSpace Test (<a href="http://mailman.nginx.org/pipermail/nginx/2016-January/049700.html">1.8.1 and 1.9.10, respectively</a>)</li> <li>Run updates on DSpace Test and reboot for new Linode kernel <code>Linux 4.4.0-x86_64-linode63</code> (first update in months)</li> </ul> <h2 id="2016-01-28:3846b7fcbca60cdedafd373cb39cd76d">2016-01-28</h2> <ul> <li>Start looking at importing some Bioversity data that had been prepared earlier this week</li> <li><p>While checking the data I noticed something strange, there are 79 items but only 8 unique PDFs:</p> <pre><code>$ ls SimpleArchiveForBio/ | wc -l
79
$ find SimpleArchiveForBio/ -iname &quot;*.pdf&quot; -exec basename {} \; | sort -u | wc -l
8
</code></pre></li> </ul> <h2 id="2016-01-29:3846b7fcbca60cdedafd373cb39cd76d">2016-01-29</h2> <ul> <li>Add five missing center-specific subjects to XMLUI item view (<a href="https://github.com/ilri/DSpace/issues/174">#174</a>)</li> <li>This <a href="https://cgspace.cgiar.org/handle/10568/67062">CCAFS item</a> Before:</li> </ul> <p><img src="../images/2016/01/xmlui-subjects-before.png" alt="XMLUI subjects before" /></p> <ul> <li>After:</li> </ul> <p><img src="../images/2016/01/xmlui-subjects-after.png" alt="XMLUI subjects after" /></p> December, 2015 /cgspace-notes/2015-12/ Wed, 02 Dec 2015 13:18:00 +0300 /cgspace-notes/2015-12/ <h2 id="2015-12-02:012a628feed6d64ae1151cbd6151ccd6">2015-12-02</h2> <ul> <li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li> </ul> <pre><code># cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo 
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz </code></pre> <ul> <li>I had used lrzip once, but it needs more memory and is harder to use as it requires the lrztar wrapper</li> <li>Need to remember to go check if everything is ok in a few days and then change CGSpace</li> <li>CGSpace went down again (due to PostgreSQL idle connections of course)</li> <li>Current database settings for DSpace are <code>db.maxconnections = 30</code> and <code>db.maxidle = 8</code>, yet idle connections are exceeding this:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle 39 </code></pre> <ul> <li>I restarted PostgreSQL and Tomcat and it&rsquo;s back</li> <li>On a related note of why CGSpace is so slow, I decided to finally try the <code>pgtune</code> script to tune the postgres settings:</li> </ul> <pre><code># apt-get install pgtune # pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune # mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig # mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf </code></pre> <ul> <li>It introduced the following new settings:</li> </ul> <pre><code>default_statistics_target = 50 maintenance_work_mem = 480MB constraint_exclusion = on checkpoint_completion_target = 0.9 effective_cache_size = 5632MB work_mem = 48MB wal_buffers = 8MB checkpoint_segments = 16 shared_buffers = 1920MB max_connections = 80 </code></pre> <ul> <li>Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc</li> <li>For what it&rsquo;s worth, now the REST API should be faster (because of these PostgreSQL tweaks):</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.474 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 2.141 $ curl -o /dev/null -s -w %{time_total}\\n 
https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.685 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.995 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.786 </code></pre> <ul> <li>Last week it was an average of 8 seconds&hellip; now this is <sup>1</sup>&frasl;<sub>4</sub> of that</li> <li>CCAFS noticed that one of their items displays only the Atmire statlets: <a href="https://cgspace.cgiar.org/handle/10568/42445">https://cgspace.cgiar.org/handle/10568/42445</a></li> </ul> <p><img src="../images/2015/12/ccafs-item-no-metadata.png" alt="CCAFS item" /></p> <ul> <li>The authorizations for the item are all public READ, and I don&rsquo;t see any errors in dspace.log when browsing that item</li> <li>I filed a ticket on Atmire&rsquo;s issue tracker</li> <li>I also filed a ticket on Atmire&rsquo;s issue tracker for the PostgreSQL stuff</li> </ul> <h2 id="2015-12-03:012a628feed6d64ae1151cbd6151ccd6">2015-12-03</h2> <ul> <li>CGSpace is very slow, and monitoring is emailing me to say it&rsquo;s down, even though I can load the page (very slowly)</li> <li>Idle postgres connections look like this (with no change in DSpace db settings lately):</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
29
</code></pre> <ul> <li>I restarted Tomcat and postgres&hellip;</li> <li>Atmire commented that we should raise the JVM heap size by ~500M, so it is now <code>-Xms3584m -Xmx3584m</code></li> <li>We weren&rsquo;t out of heap yet, but it&rsquo;s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it&rsquo;s ok</li> <li>A possible side effect is that I see that the REST API is twice as fast for the request above now:</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.368 $ curl -o /dev/null -s -w %{time_total}\\n 
https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.968 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.006 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.849 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.806 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.854 </code></pre> <h2 id="2015-12-05:012a628feed6d64ae1151cbd6151ccd6">2015-12-05</h2> <ul> <li>CGSpace has been up and down all day and REST API is completely unresponsive</li> <li>PostgreSQL idle connections are currently:</li> </ul> <pre><code>postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle 28 </code></pre> <ul> <li>I have reverted all the pgtune tweaks from the other day, as they didn&rsquo;t fix the stability issues, so I&rsquo;d rather not have them introducing more variables into the equation</li> <li>The PostgreSQL stats from Munin all point to something database-related with the DSpace 5 upgrade around mid–late November</li> </ul> <p><img src="../images/2015/12/postgres_bgwriter-year.png" alt="PostgreSQL bgwriter (year)" /> <img src="../images/2015/12/postgres_cache_cgspace-year.png" alt="PostgreSQL cache (year)" /> <img src="../images/2015/12/postgres_locks_cgspace-year.png" alt="PostgreSQL locks (year)" /> <img src="../images/2015/12/postgres_scans_cgspace-year.png" alt="PostgreSQL scans (year)" /></p> <h2 id="2015-12-07:012a628feed6d64ae1151cbd6151ccd6">2015-12-07</h2> <ul> <li>Atmire sent <a href="https://github.com/ilri/DSpace/pull/161">some fixes</a> to DSpace&rsquo;s REST API code that was leaving contexts open (causing the slow performance and database issues)</li> <li>After deploying the fix to CGSpace the REST API is consistently faster:</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n 
https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.675 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.599 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.588 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.566 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.497 </code></pre> <h2 id="2015-12-08:012a628feed6d64ae1151cbd6151ccd6">2015-12-08</h2> <ul> <li>Switch CGSpace log compression cron jobs from using lzop to xz—the compression isn&rsquo;t as good, but it&rsquo;s much faster and causes less IO/CPU load</li> <li>Since we figured out (and fixed) the cause of the performance issue, I reverted Google Bot&rsquo;s crawl rate to the &ldquo;Let Google optimize&rdquo; setting</li> </ul> November, 2015 /cgspace-notes/2015-11/ Mon, 23 Nov 2015 17:00:57 +0300 /cgspace-notes/2015-11/ <h2 id="2015-11-22:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-22</h2> <ul> <li>CGSpace went down</li> <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 </code></pre> <ul> <li>For now I have increased the limit from 60 to 90, run updates, and rebooted the server</li> </ul> <h2 id="2015-11-24:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-24</h2> <ul> <li>CGSpace went down again</li> <li>Getting emails from uptimeRobot and uptimeButler that it&rsquo;s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors</li> <li>Looks like there are still a bunch of idle PostgreSQL connections:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 96 </code></pre> 
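<p>A rough sketch of a more precise way to count these, assuming PostgreSQL 9.2 or newer, where <code>pg_stat_activity</code> has a <code>state</code> column. The <code>count_states</code> helper is a name I made up for illustration, not part of DSpace or PostgreSQL. It groups one database&rsquo;s connections by state instead of grepping for &ldquo;idle&rdquo;, which also matches &ldquo;idle in transaction&rdquo; and any query text containing that word:</p>

```shell
#!/bin/sh
# Hypothetical helper (not from DSpace or PostgreSQL): summarize
# connections per state for a single database.
#
# Expects "datname|state" rows on stdin, which is what psql prints in
# unaligned, tuples-only mode (-A -t), for example:
#   psql -At -c 'SELECT datname, state FROM pg_stat_activity;' | count_states cgspace
count_states() {
    db="$1"
    # Count rows per state for the requested database, then sort by state name.
    awk -F'|' -v db="$db" '
        $1 == db { n[$2]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}
```

<p>Each output line is a state followed by its count, so the truly idle connections can be told apart from those stuck in a transaction.</p>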
<ul> <li>For some reason the number of idle connections is very high since we upgraded to DSpace 5</li> </ul> <h2 id="2015-11-25:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-25</h2> <ul> <li>Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config</li> <li>The OAI application requests stylesheets and javascript files with the path <code>/oai/static/css</code>, which gets matched here:</li> </ul> <pre><code># static assets we can load from the file system directly with nginx location ~ /(themes|static|aspects/ReportingSuite) { try_files $uri @tomcat; ... </code></pre> <ul> <li>The document root is relative to the xmlui app, so this gets a 404—I&rsquo;m not sure why it doesn&rsquo;t pass to <code>@tomcat</code></li> <li>Anyways, I can&rsquo;t find any URIs with path <code>/static</code>, and the more important point is to handle all the static theme assets, so we can just remove <code>static</code> from the regex for now (who cares if we can&rsquo;t use nginx to send Etags for OAI CSS!)</li> <li>Also, I noticed we aren&rsquo;t setting CSP headers on the static assets, because in nginx headers are inherited in child blocks, but if you use <code>add_header</code> in a child block it doesn&rsquo;t inherit the others</li> <li>We simply need to add <code>include extra-security.conf;</code> to the above location block (but research and test first)</li> <li>We should add WOFF assets to the list of things to set expires for:</li> </ul> <pre><code>location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ { </code></pre> <ul> <li>We should also add <code>aspects/Statistics</code> to the location block for static assets (minus <code>static</code> from above):</li> </ul> <pre><code>location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) { </code></pre> <ul> <li>Need to check <code>/about</code> on CGSpace, as it&rsquo;s blank on my local test server and we might need to add something there</li> <li>CGSpace has been up and down all day due to PostgreSQL idle connections 
(current DSpace pool is 90):</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
93
</code></pre> <ul> <li>I looked closer at the idle connections and saw that many have been idle for hours (current time on server is <code>2015-11-25T20:20:42+0000</code>):</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | less -S
 datid | datname  |  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
 20951 | cgspace  | 10966 |    18205 | cgspace  |                  | 127.0.0.1   |                 |       37731 | 2015-11-25 13:13:02.837624+00 |                               | 20
 20951 | cgspace  | 10967 |    18205 | cgspace  |                  | 127.0.0.1   |                 |       37737 | 2015-11-25 13:13:03.069421+00 |                               | 20
...
</code></pre> <ul> <li>There is a relevant Jira issue about this: <a href="https://jira.duraspace.org/browse/DS-1458">https://jira.duraspace.org/browse/DS-1458</a></li> <li>It seems there is some sense in changing DSpace&rsquo;s default <code>db.maxidle</code> from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)</li> <li>Change <code>db.maxidle</code> from -1 to 10, reduce <code>db.maxconnections</code> from 90 to 50, and restart postgres and tomcat7</li> <li>Also redeploy DSpace Test with a clean sync of CGSpace and mirror these database settings there</li> <li>Also deploy the nginx fixes for the <code>try_files</code> location block as well as the expires block</li> </ul> <h2 id="2015-11-26:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-26</h2> <ul> <li>CGSpace behaving much better since changing <code>db.maxidle</code> yesterday, but still two up/down notices from monitoring this morning (better than 50!)</li> <li>CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item</li> <li>Not as bad for me, but still 
unsustainable if you have to get many:</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
8.415
</code></pre> <ul> <li>Monitoring e-mailed in the evening to say CGSpace was down</li> <li>Idle connections in PostgreSQL again:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
66
</code></pre> <ul> <li>At the time, the current DSpace pool size was 50&hellip;</li> <li>I reduced the pool back to the default of 30, and reduced the <code>db.maxidle</code> setting from 10 to 8</li> </ul> <h2 id="2015-11-29:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-29</h2> <ul> <li>Still more alerts that CGSpace has been up and down all day</li> <li>Current database settings for DSpace:</li> </ul> <pre><code>db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true
</code></pre> <ul> <li>And idle connections:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
49
</code></pre> <ul> <li>Perhaps I need to start drastically increasing the connection limits (to something like 300) to see if DSpace&rsquo;s thirst can ever be quenched</li> <li>On another note, SUNScholar&rsquo;s notes suggest adjusting some other postgres variables: <a href="http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database">http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database</a></li> <li>This might help with REST API speed (mentioned above; I still need to run real tests)</li> </ul>
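<p>For the REST API speed testing mentioned above, a single <code>curl</code> sample is noisy. A minimal sketch that repeats the same timing command from these notes and summarises the samples. The helper names <code>time_request</code>, <code>summarize_times</code>, and <code>benchmark_url</code> are hypothetical, and the default of 10 runs is an arbitrary assumption:</p>

```shell
# Sketch only: repeat the curl timing and report count, min, mean, and max.
# time_request, summarize_times and benchmark_url are hypothetical helpers.

time_request() {
    # Prints the total request time in seconds, one value per line.
    curl -o /dev/null -s -w '%{time_total}\n' "$1"
}

summarize_times() {
    # Reads one time (in seconds) per line on stdin.
    LC_ALL=C awk '
        { t = $1 + 0; sum += t }
        NR == 1 || t < min { min = t }
        NR == 1 || t > max { max = t }
        END { if (NR > 0) printf "n=%d min=%.3f mean=%.3f max=%.3f\n", NR, min, sum / NR, max }
    '
}

benchmark_url() {
    local url="$1"
    local runs="${2:-10}"
    local i=0
    # Issue the requests sequentially and pipe all samples into the summary.
    while [ "$i" -lt "$runs" ]; do
        time_request "$url"
        i=$((i + 1))
    done | summarize_times
}
```

<p>For example, <code>benchmark_url 'https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all' 10</code> would print one summary line for ten sequential requests, which makes it easier to compare results before and after a database tuning change.</p>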