CGSpace Notes July, 2016 Fri, 01 Jul 2016 10:53:00 +0300 <h2 id="2016-07-01:edc14796891c14dec087b4bb89c38aa9">2016-07-01</h2> <ul> <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> </ul> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and text_value ~ '^.+?,$'; UPDATE 95 dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and text_value ~ '^.+?,$'; text_value ------------ (0 rows) </code></pre> <ul> <li>In this case the select query was showing 95 results before the update</li> </ul> <h2 id="2016-07-02:edc14796891c14dec087b4bb89c38aa9">2016-07-02</h2> <ul> <li>Comment on DSpace Jira ticket about author lookup search text (<a href="https://jira.duraspace.org/browse/DS-2329">DS-2329</a>)</li> </ul> <h2 id="2016-07-04:edc14796891c14dec087b4bb89c38aa9">2016-07-04</h2> <ul> <li>Seems the database&rsquo;s author authority values mean nothing without the <code>authority</code> Solr core from the host where they were created!</li> </ul> <h2 id="2016-07-05:edc14796891c14dec087b4bb89c38aa9">2016-07-05</h2> <ul> <li>Amend <code>backup-solr.sh</code> script so it backs up the entire Solr folder</li> <li>We <em>really</em> only need <code>statistics</code> and <code>authority</code> but meh</li> <li>Fix metadata for species on DSpace Test:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu' </code></pre> <ul> <li>Will run later on CGSpace</li> <li>A user is still having problems 
with Sherpa/Romeo causing crashes during the submission process when the journal is &ldquo;ungraded&rdquo;</li> <li>I tested the <a href="https://jira.duraspace.org/browse/DS-2740">patch for DS-2740</a> that I had found last month and it seems to work</li> <li>I will merge it to <code>5_x-prod</code></li> </ul> <h2 id="2016-07-06:edc14796891c14dec087b4bb89c38aa9">2016-07-06</h2> <ul> <li>Delete 23 blank metadata values from CGSpace:</li> </ul> <pre><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value=''; DELETE 23 </code></pre> <ul> <li>Complete phase three of metadata migration, for the following fields: <ul> <li>dc.title.jtitle → dc.source</li> <li>dc.crsubject.crpsubject → cg.contributor.crp</li> <li>dc.contributor.affiliation → cg.contributor.affiliation</li> <li>dc.Species → cg.species</li> <li>dc.srplace.subregion → cg.coverage.subregion</li> <li>dc.contributor.corporate → dc.contributor.author</li> <li>dc.identifier.url → cg.identifier.url</li> <li>dc.identifier.doi → cg.identifier.doi</li> <li>dc.identifier.googleurl → cg.identifier.googleurl</li> <li>dc.identifier.dataurl → cg.identifier.dataurl</li> </ul></li> <li>Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)</li> </ul> <pre><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu' $ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu' $ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu' </code></pre> <ul> <li>I then ran all server updates and rebooted the server</li> </ul> <h2 id="2016-07-11:edc14796891c14dec087b4bb89c38aa9">2016-07-11</h2> <ul> <li>Doing some author cleanups from Peter and Abenet:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f 
dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu </code></pre> <h2 id="2016-07-13:edc14796891c14dec087b4bb89c38aa9">2016-07-13</h2> <ul> <li>Run the author cleanups on CGSpace and start a full Discovery re-index</li> </ul> <h2 id="2016-07-18:edc14796891c14dec087b4bb89c38aa9">2016-07-18</h2> <ul> <li>Adjust identifiers in XMLUI item display to be more prominent</li> <li>Add species and breed to the XMLUI item display</li> <li>CGSpace crashed late at night and the DSpace logs were showing:</li> </ul> <pre><code>2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object ... </code></pre> <ul> <li>I suspect it&rsquo;s someone hitting REST too much:</li> </ul> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3 710 66.249.78.38 1781 181.118.144.29 24904 70.32.99.142 </code></pre> <ul> <li>I just blocked access to <code>/rest</code> for that last IP for now:</li> </ul> <pre><code> # log rest requests location /rest { access_log /var/log/nginx/rest.log; proxy_pass http://127.0.0.1:8443; deny 70.32.99.142; } </code></pre> <h2 id="2016-07-21:edc14796891c14dec087b4bb89c38aa9">2016-07-21</h2> <ul> <li>Mitigate the <a href="https://httpoxy.org">HTTPoxy</a> vulnerability for Tomcat etc in nginx: <a href="https://github.com/ilri/rmg-ansible-public/pull/38">https://github.com/ilri/rmg-ansible-public/pull/38</a></li> <li>Unblock 70.32.99.142 from <code>/rest</code> as it has been blocked for a few days</li> </ul> <h2 id="2016-07-22:edc14796891c14dec087b4bb89c38aa9">2016-07-22</h2> <ul> <li>Help Paola from CCAFS with thumbnails for batch uploads</li> <li>She has been struggling to get the dimensions right, and manually enlarging smaller 
thumbnails, renaming PNGs to JPG, etc</li> <li>Altmetric reports having an issue with some of our authors being doubled&hellip;</li> <li>This is related to authority and confidence!</li> <li>We might need to use <code>index.authority.ignore-prefered=true</code> to tell the Discovery index to prefer the variation that exists in the metadatavalue rather than what it finds in the authority cache.</li> <li>Trying these on DSpace Test after a discussion by Daniel Scharon on the dspace-tech mailing list:</li> </ul> <pre><code>index.authority.ignore-prefered.dc.contributor.author=true index.authority.ignore-variants.dc.contributor.author=false </code></pre> <ul> <li>After reindexing I don&rsquo;t see any change in Discovery&rsquo;s display of authors, and still have entries like:</li> </ul> <pre><code>Grace, D. (464) Grace, D. (62) </code></pre> <ul> <li>I asked for clarification of the following options on the DSpace mailing list:</li> </ul> <pre><code>index.authority.ignore index.authority.ignore-prefered index.authority.ignore-variants </code></pre> <ul> <li>In the meantime, I will try these on DSpace Test (plus a reindex):</li> </ul> <pre><code>index.authority.ignore=true index.authority.ignore-prefered=true index.authority.ignore-variants=true </code></pre> <ul> <li>Enabled usage of <code>X-Forwarded-For</code> in DSpace admin control panel (<a href="https://github.com/ilri/DSpace/pull/255">#255</a>)</li> <li>It was misconfigured and disabled, but already working for some reason <em>sigh</em></li> </ul> June, 2016 Wed, 01 Jun 2016 10:53:00 +0300 <h2 id="2016-06-01:6783872e82b68b1517e00f494e6b6504">2016-06-01</h2> <ul> <li>Experimenting with IFPRI OAI (we want to harvest their publications)</li> <li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI&rsquo;s OAI endpoint: <a 
href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> <li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li> <li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc</a></li> <li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li> <li>Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in <code>dc.identifier.fund</code> to <code>cg.identifier.cpwfproject</code> and then the rest to <code>dc.description.sponsorship</code></li> </ul> <pre><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA'); UPDATE 497 dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75; UPDATE 14 </code></pre> <ul> <li>Fix a few minor miscellaneous issues in <code>dspace.cfg</code> (<a href="https://github.com/ilri/DSpace/pull/227">#227</a>)</li> </ul> <h2 id="2016-06-02:6783872e82b68b1517e00f494e6b6504">2016-06-02</h2> <ul> <li>Testing the configuration and theme changes for the upcoming metadata migration and I found some issues with <code>cg.coverage.admin-unit</code></li> <li>Seems that the Browse configuration in <code>dspace.cfg</code> can&rsquo;t handle the &lsquo;-&rsquo; in the field name:</li> </ul> <pre><code>webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text </code></pre> <ul> 
<li>But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error</li> <li>I&rsquo;ve sent a message to the DSpace mailing list to ask about the Browse index definition</li> <li>A user was having problems with submission and from the stacktrace it looks like a Sherpa/Romeo issue</li> <li>I found a thread on the mailing list talking about it and there is a bug report and a patch: <a href="https://jira.duraspace.org/browse/DS-2740">https://jira.duraspace.org/browse/DS-2740</a></li> <li>The patch applies successfully on DSpace 5.1 so I will try it later</li> </ul> <h2 id="2016-06-03:6783872e82b68b1517e00f494e6b6504">2016-06-03</h2> <ul> <li>Investigating the CCAFS authority issue, I exported the metadata for the Videos collection</li> <li>The top two authors are:</li> </ul> <pre><code>CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500 CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600 </code></pre> <ul> <li>So the only difference is the &ldquo;confidence&rdquo;</li> <li>Ok, well THAT is interesting:</li> </ul> <pre><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %'; text_value | authority | confidence ------------+--------------------------------------+------------ Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, Alan | | -1 Orth, Alan | | -1 Orth, Alan | | -1 Orth, Alan | | -1 Orth, A. | 05c2c622-d252-4efb-b9ed-95a07d3adf11 | -1 Orth, A. | 05c2c622-d252-4efb-b9ed-95a07d3adf11 | -1 Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, A. 
| ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 | 600 Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 | 600 (13 rows) </code></pre> <ul> <li>And now an actually relevant example:</li> </ul> <pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500; count ------- 707 (1 row) dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500; count ------- 253 (1 row) </code></pre> <ul> <li>Trying something experimental:</li> </ul> <pre><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security'; UPDATE 960 </code></pre> <ul> <li>And then re-indexing authority and Discovery&hellip;?</li> <li>After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet</li> <li>The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:</li> </ul> <pre><code>webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority </code></pre> <ul> <li>That would only be for the &ldquo;Browse by&rdquo; function&hellip; so we&rsquo;ll have to see what effect that has later</li> </ul> <h2 id="2016-06-04:6783872e82b68b1517e00f494e6b6504">2016-06-04</h2> <ul> <li>Re-sync DSpace Test with CGSpace and perform test of metadata migration again</li> <li>Run phase two of metadata migrations on CGSpace (see the <a href="https://gist.github.com/alanorth/1a730bec5ac9457a8fb0e3e72c98d09c">migration notes</a>)</li> <li>Run all system updates and reboot CGSpace server</li> </ul> <h2 id="2016-06-07:6783872e82b68b1517e00f494e6b6504">2016-06-07</h2> <ul> <li>Figured out how to export a 
list of the unique values from a metadata field ordered by count:</li> </ul> <pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv; </code></pre> <ul> <li><p>Identified the next round of fields to migrate:</p> <ul> <li>dc.title.jtitle → dc.source</li> <li>dc.crsubject.crpsubject → cg.contributor.crp</li> <li>dc.contributor.affiliation → cg.contributor.affiliation</li> <li>dc.Species → cg.species</li> <li>dc.contributor.corporate → dc.contributor</li> <li>dc.identifier.url → cg.identifier.url</li> <li>dc.identifier.doi → cg.identifier.doi</li> <li>dc.identifier.googleurl → cg.identifier.googleurl</li> <li>dc.identifier.dataurl → cg.identifier.dataurl</li> </ul></li> <li><p>Discuss pulling data from IFPRI&rsquo;s ContentDM with Ryan Miller</p></li> <li><p>Looks like OAI is kinda obtuse for this, and if we use ContentDM&rsquo;s API we&rsquo;ll be able to access their internal field names (rather than trying to figure out how they stuffed them into various, repeated Dublin Core fields)</p></li> </ul> <h2 id="2016-06-08:6783872e82b68b1517e00f494e6b6504">2016-06-08</h2> <ul> <li>Discuss controlled vocabularies for ~28 fields</li> <li>Looks like this is all we need: <a href="https://wiki.duraspace.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies">https://wiki.duraspace.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies</a></li> <li>I wrote an XPath expression to extract the ILRI subjects from <code>input-forms.xml</code> (uses xmlstarlet):</li> </ul> <pre><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' 
-n dspace/config/input-forms.xml </code></pre> <ul> <li>Write to Atmire about the use of <code>atmire.orcid.id</code> to see if we can change it</li> <li>Seems to be a virtual field that is queried from the authority cache&hellip; hmm</li> <li>In other news, I found out that the About page that we haven&rsquo;t been using lives in <code>dspace/config/about.xml</code>, so now we can update the text</li> <li>File bug about <code>closed=&quot;true&quot;</code> attribute of controlled vocabularies not working: <a href="https://jira.duraspace.org/browse/DS-3238">https://jira.duraspace.org/browse/DS-3238</a></li> </ul> <h2 id="2016-06-09:6783872e82b68b1517e00f494e6b6504">2016-06-09</h2> <ul> <li>Atmire explained that the <code>atmire.orcid.id</code> field doesn&rsquo;t exist in the schema, as it actually comes from the authority cache during XMLUI run time</li> <li>This means we don&rsquo;t see it when harvesting via OAI or REST, for example</li> <li>They opened a feature ticket on the DSpace tracker to ask for support of this: <a href="https://jira.duraspace.org/browse/DS-3239">https://jira.duraspace.org/browse/DS-3239</a></li> </ul> <h2 id="2016-06-10:6783872e82b68b1517e00f494e6b6504">2016-06-10</h2> <ul> <li>Investigating authority confidences</li> <li>It looks like the values are documented in <code>Choices.java</code></li> <li>Experiment with setting all 960 CCAFS author values to be 500:</li> </ul> <pre><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security'; dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security'; UPDATE 960 </code></pre> <ul> <li>After the database edit, I did a full Discovery re-index</li> <li>And now there are exactly 960 items in the authors facet for 
&lsquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rsquo;</li> <li>Now I ran the same on CGSpace</li> <li>Merge controlled vocabulary functionality for animal breeds to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/236">#236</a>)</li> <li>Write python script to update metadata values in batch via PostgreSQL: <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a></li> <li>We need to use this to correct some pretty ugly values in fields like <code>dc.description.sponsorship</code></li> <li>Merge item display tweaks from earlier this week (<a href="https://github.com/ilri/DSpace/pull/231">#231</a>)</li> <li>Merge controlled vocabulary functionality for subregions (<a href="https://github.com/ilri/DSpace/pull/238">#238</a>)</li> </ul> <h2 id="2016-06-11:6783872e82b68b1517e00f494e6b6504">2016-06-11</h2> <ul> <li>Merge controlled vocabulary for sponsorship field (<a href="https://github.com/ilri/DSpace/pull/239">#239</a>)</li> <li>Fix character encoding issues for animal breed lookup that I merged yesterday</li> </ul> <h2 id="2016-06-17:6783872e82b68b1517e00f494e6b6504">2016-06-17</h2> <ul> <li>Linode has free RAM upgrades for their 13th birthday so I migrated DSpace Test (4→8GB of RAM)</li> </ul> <h2 id="2016-06-18:6783872e82b68b1517e00f494e6b6504">2016-06-18</h2> <ul> <li>Clean up titles and hints in <code>input-forms.xml</code> to use title/sentence case and a few more consistency things (<a href="https://github.com/ilri/DSpace/pull/241">#241</a>)</li> <li><p>The final list of fields to migrate in the third phase of metadata migrations is:</p> <ul> <li>dc.title.jtitle → dc.source</li> <li>dc.crsubject.crpsubject → cg.contributor.crp</li> <li>dc.contributor.affiliation → cg.contributor.affiliation</li> <li>dc.srplace.subregion → cg.coverage.subregion</li> <li>dc.Species → cg.species</li> <li>dc.contributor.corporate → dc.contributor</li> <li>dc.identifier.url → 
cg.identifier.url</li> <li>dc.identifier.doi → cg.identifier.doi</li> <li>dc.identifier.googleurl → cg.identifier.googleurl</li> <li>dc.identifier.dataurl → cg.identifier.dataurl</li> </ul></li> <li><p>Interesting &ldquo;Sunburst&rdquo; visualization on a Digital Commons page: <a href="http://www.repository.law.indiana.edu/sunburst.html">http://www.repository.law.indiana.edu/sunburst.html</a></p></li> <li><p>Final testing on metadata fix/delete for <code>dc.description.sponsorship</code> cleanup</p></li> <li><p>Need to run <code>fix-metadata-values.py</code> and then <code>delete-metadata-values.py</code></p></li> </ul> <h2 id="2016-06-20:6783872e82b68b1517e00f494e6b6504">2016-06-20</h2> <ul> <li>CGSpace&rsquo;s HTTPS certificate expired last night and I didn&rsquo;t notice, had to renew:</li> </ul> <pre><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &quot;/usr/bin/service nginx stop&quot; --post-hook &quot;/usr/bin/service nginx start&quot; </code></pre> <ul> <li>I really need to fix that cron job&hellip;</li> </ul> <h2 id="2016-06-24:6783872e82b68b1517e00f494e6b6504">2016-06-24</h2> <ul> <li>Run the replacements/deletes for <code>dc.description.sponsorship</code> (investors) on CGSpace:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace </code></pre> <ul> <li>The scripts for this are here: <ul> <li><a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a></li> <li><a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a></li> </ul></li> <li>Add new sponsors to controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/244">#244</a>)</li> <li>Refine submission form labels and hints</li> </ul> 
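<ul> <li>For reference, the 2016-06-24 fix step can be sketched roughly as below. This is only a minimal sketch and not the actual <code>fix-metadata-values.py</code> from the gist: it assumes the CSV has one column named after the field (holding the old value) and one column named by <code>-t</code> (holding the correction), and it uses an in-memory SQLite table as a stand-in for the PostgreSQL <code>metadatavalue</code> table. The function name <code>fix_values</code> and the sample rows are hypothetical.</li> </ul>

```python
import csv
import io
import sqlite3


def fix_values(conn, rows, field_id, from_col, to_col):
    """Replace each from_col value with its to_col correction for one field."""
    changed = 0
    for row in rows:
        old, new = row[from_col], row[to_col]
        # Skip rows that have no correction or where nothing would change
        if not new or old == new:
            continue
        cur = conn.execute(
            "UPDATE metadatavalue SET text_value = ? "
            "WHERE metadata_field_id = ? AND text_value = ?",
            (new, field_id, old),
        )
        changed += cur.rowcount
    conn.commit()
    return changed


# Hypothetical stand-in data; the real table lives in PostgreSQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (metadata_field_id INTEGER, text_value TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?, ?)",
    [(29, "usaid"), (29, "usaid"), (29, "BMGF"), (3, "usaid")],
)

# Hypothetical CSV: old value in the field column, correction in the -t column
sample_csv = io.StringIO("dc.description.sponsorship,correct investor\nusaid,USAID\nBMGF,\n")
rows = list(csv.DictReader(sample_csv))
changed = fix_values(conn, rows, 29, "dc.description.sponsorship", "correct investor")
```

<ul> <li>Restricting the update to a single <code>metadata_field_id</code> keeps identical strings in other fields untouched, which is presumably why the real scripts take <code>-m</code></li> </ul>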
<h2 id="2016-06-28:6783872e82b68b1517e00f494e6b6504">2016-06-28</h2> <ul> <li>Testing the cleanup of <code>dc.contributor.corporate</code> with 13 deletions and 121 replacements</li> <li>There are still ~97 fields that weren&rsquo;t indicated to do anything</li> <li>After the above deletions and replacements I regenerated a CSV and sent it to Peter <em>et al</em> to have a look</li> </ul> <pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv; </code></pre> <ul> <li>Re-evaluate <code>dc.contributor.corporate</code> and it seems we will move it to <code>dc.contributor.author</code> as this is more in line with how editors are actually using it</li> </ul> <h2 id="2016-06-29:6783872e82b68b1517e00f494e6b6504">2016-06-29</h2> <ul> <li>Test run of <code>migrate-fields.sh</code> with the following re-mappings:</li> </ul> <pre><code>72 55 #dc.source 86 230 #cg.contributor.crp 91 211 #cg.contributor.affiliation 94 212 #cg.species 107 231 #cg.coverage.subregion 126 3 #dc.contributor.author 73 219 #cg.identifier.url 74 220 #cg.identifier.doi 79 222 #cg.identifier.googleurl 89 223 #cg.identifier.dataurl </code></pre> <ul> <li>Run all cleanups and deletions of <code>dc.contributor.corporate</code> on CGSpace:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu' $ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu' $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu' </code></pre> <ul> <li>Re-deploy CGSpace and DSpace Test with latest June changes</li> <li>Now the sharing and Altmetric bits are more prominent:</li> </ul> <p><img 
src="../images/2016/06/xmlui-altmetric-sharing.png" alt="DSpace 5.1 XMLUI With Altmetric Badge" /></p> <ul> <li>Run all system updates on the servers and reboot</li> <li>Start working on config changes for phase three of the metadata migrations</li> </ul> <h2 id="2016-06-30:6783872e82b68b1517e00f494e6b6504">2016-06-30</h2> <ul> <li>Wow, there are 95 authors in the database who have &lsquo;,&rsquo; at the end of their name:</li> </ul> <pre><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,'; </code></pre> <ul> <li>We need to use something like this to fix them, need to write a proper regex later:</li> </ul> <pre><code># update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,'; </code></pre> May, 2016 Sun, 01 May 2016 23:06:00 +0300 <h2 id="2016-05-01:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-01</h2> <ul> <li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li> <li>I have blocked access to the API now</li> <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> </ul> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168 </code></pre> <ul> <li>The two most frequent requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29</li> <li>100% of the requests coming from Ethiopia are like this and result in an HTTP 500:</li> </ul> <pre><code>GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1 </code></pre> <ul> <li>For now I&rsquo;ll block just the Ethiopian IP</li> <li>The owner of that application has said that the <code>NaN</code> (not a number) is an error in his code and he&rsquo;ll fix it</li> </ul> <h2 id="2016-05-03:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-03</h2> <ul> <li>Update nginx to 1.10.x branch on CGSpace</li> <li>Fix a reference to <code>dc.type.output</code> in Discovery that 
I had missed when we migrated to <code>dc.type</code> last month (<a href="https://github.com/ilri/DSpace/pull/223">#223</a>)</li> </ul> <p><img src="../images/2016/05/discovery-types.png" alt="Item type in Discovery results" /></p> <h2 id="2016-05-06:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-06</h2> <ul> <li>DSpace Test is down, <code>catalina.out</code> has lots of messages about heap space from some time yesterday (!)</li> <li>It looks like Sisay was doing some batch imports</li> <li>Hmm, also disk space is full</li> <li>I decided to blow away the solr indexes, since they are 50GB and we don&rsquo;t really need all the Atmire stuff there right now</li> <li>I will re-generate the Discovery indexes after re-deploying</li> <li>Testing <code>renew-letsencrypt.sh</code> script for nginx</li> </ul> <pre><code>#!/usr/bin/env bash readonly SERVICE_BIN=/usr/sbin/service readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto # stop nginx so LE can listen on port 443 $SERVICE_BIN nginx stop $LETSENCRYPT_BIN renew -nvv --standalone --standalone-supported-challenges tls-sni-01 &gt; /var/log/letsencrypt/renew.log 2&gt;&amp;1 LE_RESULT=$? 
$SERVICE_BIN nginx start if [[ &quot;$LE_RESULT&quot; != 0 ]]; then echo 'Automated renewal failed:' cat /var/log/letsencrypt/renew.log exit 1 fi </code></pre> <ul> <li>Seems to work well</li> </ul> <h2 id="2016-05-10:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-10</h2> <ul> <li>Start looking at more metadata migrations</li> <li>There are lots of fields in <code>dcterms</code> namespace that look interesting, like: <ul> <li>dcterms.type</li> <li>dcterms.spatial</li> </ul></li> <li>Not sure what <code>dcterms</code> is&hellip;</li> <li>Looks like these were <a href="https://wiki.duraspace.org/display/DSDOC5x/Metadata+and+Bitstream+Format+Registries#MetadataandBitstreamFormatRegistries-DublinCoreTermsRegistry(DCTERMS)">added in DSpace 4</a> to allow for future work to make DSpace more flexible</li> <li>CGSpace&rsquo;s <code>dc</code> registry has 96 items, and the default DSpace one has 73.</li> </ul> <h2 id="2016-05-11:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-11</h2> <ul> <li><p>Identify and propose the next phase of CGSpace fields to migrate:</p> <ul> <li>dc.title.jtitle → cg.title.journal</li> <li>dc.identifier.status → cg.identifier.status</li> <li>dc.river.basin → cg.river.basin</li> <li>dc.Species → cg.species</li> <li>dc.targetaudience → cg.targetaudience</li> <li>dc.fulltextstatus → cg.fulltextstatus</li> <li>dc.editon → cg.edition</li> <li>dc.isijournal → cg.isijournal</li> </ul></li> <li><p>Start a test rebase of the <code>5_x-prod</code> branch on top of the <code>dspace-5.5</code> tag</p></li> <li><p>There were a handful of conflicts that I didn&rsquo;t understand</p></li> <li><p>After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:</p></li> </ul> <pre><code>[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in 
sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -&gt; [Help 1] </code></pre> <ul> <li>I&rsquo;ve sent them a question about it</li> <li>A user mentioned having problems with uploading a 33 MB PDF</li> <li>I told her I would increase the limit temporarily tomorrow morning</li> <li>Turns out she was able to decrease the size of the PDF so we didn&rsquo;t have to do anything</li> </ul> <h2 id="2016-05-12:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-12</h2> <ul> <li>Looks like the issue that Abenet was having a few days ago with &ldquo;Connection Reset&rdquo; in Firefox might be due to a Firefox 46 issue: <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1268775">https://bugzilla.mozilla.org/show_bug.cgi?id=1268775</a></li> <li>I finally found a copy of the latest CG Core metadata guidelines and it looks like we can add a few more fields to our next migration: <ul> <li>dc.rplace.region → cg.coverage.region</li> <li>dc.cplace.country → cg.coverage.country</li> </ul></li> <li>Questions for CG people: <ul> <li>Our <code>dc.place</code> and <code>dc.srplace.subregion</code> could both map to <code>cg.coverage.admin-unit</code>?</li> <li>Should we use <code>dc.contributor.crp</code> or <code>cg.contributor.crp</code> for the CRP (ours is <code>dc.crsubject.crpsubject</code>)?</li> <li>Our <code>dc.contributor.affiliation</code> and <code>dc.contributor.corporate</code> could both map to <code>dc.contributor</code> and possibly <code>dc.contributor.center</code> depending on if it&rsquo;s a CG center or not</li> <li><code>dc.title.jtitle</code> could either map to <code>dc.publisher</code> or <code>dc.source</code> depending on how you read things</li> </ul></li> <li>Found ~200 messed up CIAT values in <code>dc.publisher</code>:</li> </ul> <pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &quot;% %&quot;; </code></pre> <h2 
id="2016-05-13:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-13</h2> <ul> <li>More theorizing about CGcore</li> <li>Add two new fields: <ul> <li>dc.srplace.subregion → cg.coverage.admin-unit</li> <li>dc.place → cg.place</li> </ul></li> <li><code>dc.place</code> is our own field, so it&rsquo;s easy to move</li> <li>I&rsquo;ve removed <code>dc.title.jtitle</code> from the list for now because there&rsquo;s no use moving it out of DC until we know where it will go (see discussion yesterday)</li> </ul> <h2 id="2016-05-18:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-18</h2> <ul> <li>Work on 707 CCAFS records</li> <li>They have thumbnails on Flickr and elsewhere</li> <li>In OpenRefine I created a new <code>filename</code> column based on the <code>thumbnail</code> column with the following GREL:</li> </ul> <pre><code>if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1]) </code></pre> <ul> <li>Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL</li> <li>So for the <code>hqdefault.jpg</code> ones I just take the UUID (-2) and use it as the filename</li> <li>Before importing with SAFBuilder I tested adding &ldquo;__bundle:THUMBNAIL&rdquo; to the <code>filename</code> column and it works fine</li> </ul> <h2 id="2016-05-19:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-19</h2> <ul> <li>More quality control on <code>filename</code> field of CCAFS records to make processing in shell and SAFBuilder more reliable:</li> </ul> <pre><code>value.replace('_','').replace('-','') </code></pre> <ul> <li>We need to hold off on moving <code>dc.Species</code> to <code>cg.species</code> because it is only used for plants, and might be better to move it to something like <code>cg.species.plant</code></li> <li>And <code>dc.identifier.fund</code> is MOSTLY used for CPWF project identifier but has some other sponsorship things <ul> <li>We should move PN*, SG*, 
CBA, IA, and PHASE* values to <code>cg.identifier.cpwfproject</code></li> <li>The rest, like BMGF and USAID etc, might have to go to either <code>dc.description.sponsorship</code> or <code>cg.identifier.fund</code> (not sure yet)</li> <li>There are also some mistakes in CPWF&rsquo;s things, like &ldquo;PN 47&rdquo;</li> <li>This ought to catch all the CPWF values (there don&rsquo;t appear to be any SG* values):</li> </ul></li> </ul> <pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA'); </code></pre> <h2 id="2016-05-20:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-20</h2> <ul> <li>More work on CCAFS Video and Images records</li> <li>For SAFBuilder we need to modify filename column to have the thumbnail bundle: <br /></li> </ul> <pre><code>value + &quot;__bundle:THUMBNAIL&quot; </code></pre> <ul> <li>Also, I fixed some weird characters using OpenRefine&rsquo;s transform with the following GREL:</li> </ul> <pre><code>value.replace(/\u0081/,'') </code></pre> <ul> <li>Write shell script to resize thumbnails with height larger than 400: <a href="https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256">https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256</a></li> <li>Upload 707 CCAFS records to DSpace Test</li> <li>A few miscellaneous fixes for XMLUI display niggles (spaces in item lists and link target <code>_black</code>): <a href="https://github.com/ilri/DSpace/pull/224">#224</a></li> <li>Work on configuration changes for Phase 2 metadata migrations</li> </ul> <h2 id="2016-05-23:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-23</h2> <ul> <li>Try to import the CCAFS Images and Videos to CGSpace but had some issues with LibreOffice and OpenRefine</li> <li>LibreOffice excludes empty cells when it exports and all the fields shift over to the left and cause URLs to go to Subjects, etc.</li> <li>Google Docs does this 
better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don&rsquo;t match!</li> <li>I will have to try later</li> </ul> <h2 id="2016-05-30:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-30</h2> <ul> <li>Export CCAFS video and image records from DSpace Test using the migrate option (<code>-m</code>):</li> </ul> <pre><code>$ mkdir ~/ccafs-images $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m </code></pre> <ul> <li>And then import to CGSpace:</li> </ul> <pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log </code></pre> <ul> <li>But now we have double authors for &ldquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rdquo; in the authority</li> <li>I&rsquo;m trying to do a Discovery index before messing with the authority index</li> <li>Looks like we are missing the <code>index-authority</code> cron job, so who knows what&rsquo;s up with our authority index</li> <li>Run system updates on DSpace Test, re-deploy code, and reboot the server</li> <li>Clean up and import ~200 CTA records to CGSpace via CSV like:</li> </ul> <pre><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &amp;&gt; ~/CTA-May30/CTA-42229.log </code></pre> <ul> <li>Discovery indexing took a few hours for some reason, and after that I started the <code>index-authority</code> script</li> </ul> <pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace index-authority </code></pre> <h2 id="2016-05-31:b7bf1a0f8f2415a40e1e11e343b04c0d">2016-05-31</h2> <ul> <li>The <code>index-authority</code> script ran overnight and was finished 
in the morning</li> <li>Hopefully this was because we haven&rsquo;t been running it regularly and it will speed up next time</li> <li>I am running it again with a timer to see:</li> </ul> <pre><code>$ time /home/cgspace.cgiar.org/bin/dspace index-authority Retrieving all data Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer Cleaning the old index Writing new data All done ! real 37m26.538s user 2m24.627s sys 0m20.540s </code></pre> <ul> <li>Update <code>tomcat7</code> crontab on CGSpace and DSpace Test to have the <code>index-authority</code> script that we were missing</li> <li>Add new ILRI subject and CCAFS project tags to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/226">#226</a>, <a href="https://github.com/ilri/DSpace/pull/225">#225</a>)</li> <li>Manually mapped the authors of a few old CCAFS records to the new CCAFS authority UUID and re-indexed authority indexes to see if it helps correct those items.</li> <li>Re-sync DSpace Test data with CGSpace</li> <li>Clean up and import ~65 more CTA items into CGSpace</li> </ul> April, 2016 /cgspace-notes/2016-04/ Mon, 04 Apr 2016 11:06:00 +0300 /cgspace-notes/2016-04/ <h2 id="2016-04-04:c88be15f5b2f07c85f7742556a955e47">2016-04-04</h2> <ul> <li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li> <li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li> <li>After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, let alone one from last year!</li> <li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li> <li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li> </ul> <pre><code>Run start time: 03/06/2016 04:00:22 Error retrieving bitstream ID 71274 from asset store. 
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.&lt;init&gt;(FileInputStream.java:146) at edu.sdsc.grid.io.local.LocalFileInputStream.open(LocalFileInputStream.java:171) at edu.sdsc.grid.io.GeneralFileInputStream.&lt;init&gt;(GeneralFileInputStream.java:145) at edu.sdsc.grid.io.local.LocalFileInputStream.&lt;init&gt;(LocalFileInputStream.java:139) at edu.sdsc.grid.io.FileFactory.newFileInputStream(FileFactory.java:630) at org.dspace.storage.bitstore.BitstreamStorageManager.retrieve(BitstreamStorageManager.java:525) at org.dspace.checker.BitstreamDAO.getBitstream(BitstreamDAO.java:60) at org.dspace.checker.CheckerCommand.processBitstream(CheckerCommand.java:303) at org.dspace.checker.CheckerCommand.checkBitstream(CheckerCommand.java:171) at org.dspace.checker.CheckerCommand.process(CheckerCommand.java:120) at org.dspace.app.checker.ChecksumChecker.main(ChecksumChecker.java:236) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225) at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77) ****************************************************** </code></pre> <ul> <li>So this would be the <code>tomcat7</code> Unix user, who seems to have a default limit of 1024 files in its shell</li> <li>For what it&rsquo;s worth, we have been setting the actual Tomcat 7 process&rsquo; limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li> <li>Looks like cron will read limits from <code>/etc/security/limits.*</code> so we can do something for the tomcat7 user there</li> <li>Submit pull 
request for Tomcat 7 limits in Ansible dspace role (<a href="https://github.com/ilri/rmg-ansible-public/pull/30">#30</a>)</li> </ul> <h2 id="2016-04-05:c88be15f5b2f07c85f7742556a955e47">2016-04-05</h2> <ul> <li>Reduce Amazon S3 storage used for logs from 46 GB to 6GB by deleting a bunch of logs we don&rsquo;t need!</li> </ul> <pre><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt # grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del # grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del # grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del # grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del </code></pre> <ul> <li>Also, adjust the cron jobs for backups so they only backup <code>dspace.log</code> and some stats files (.dat)</li> <li>Try to do some metadata field migrations using the Atmire batch UI (<code>dc.Species</code> → <code>cg.species</code>) but it took several hours and even missed a few records</li> </ul> <h2 id="2016-04-06:c88be15f5b2f07c85f7742556a955e47">2016-04-06</h2> <ul> <li>A better way to move metadata on this scale is via SQL, for example <code>dc.type.output</code> → <code>dc.type</code> (their IDs in the metadatafieldregistry are 66 and 109, respectively):</li> </ul> <pre><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66; UPDATE 40852 </code></pre> <ul> <li>After that an <code>index-discovery -bf</code> is required</li> <li>Start working on metadata migrations, add 25 or so new metadata fields to CGSpace</li> </ul> <h2 id="2016-04-07:c88be15f5b2f07c85f7742556a955e47">2016-04-07</h2> <ul> <li>Write shell script to do the migration of fields: <a href="https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b">https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b</a></li> <li>Testing with a few fields it seems to work well:</li> </ul> <pre><code>$ ./migrate-fields.sh UPDATE metadatavalue SET 
metadata_field_id=109 WHERE metadata_field_id=66 UPDATE 40883 UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72 UPDATE 21420 UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76 UPDATE 51258 </code></pre> <h2 id="2016-04-08:c88be15f5b2f07c85f7742556a955e47">2016-04-08</h2> <ul> <li>Discuss metadata renaming with Abenet, we decided it&rsquo;s better to start with the center-specific subjects like ILRI, CIFOR, CCAFS, IWMI, and CPWF</li> <li>I&rsquo;ve e-mailed CCAFS and CPWF people to ask them how much time it will take for them to update their systems to cope with this change</li> </ul> <h2 id="2016-04-10:c88be15f5b2f07c85f7742556a955e47">2016-04-10</h2> <ul> <li>Looking at the DOI issue <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860">reported by Leroy from CIAT a few weeks ago</a></li> <li>It seems the <code>dx.doi.org</code> URLs are much more common in our repository!</li> </ul> <pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%'; count ------- 5638 (1 row) dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%'; count ------- 3 </code></pre> <ul> <li>I will manually edit the <code>dc.identifier.doi</code> in <a href="https://cgspace.cgiar.org/handle/10568/72509?show=full"><sup>10568</sup>&frasl;<sub>72509</sub></a> and tweet the link, then check back in a week to see if the donut gets updated</li> </ul> <h2 id="2016-04-11:c88be15f5b2f07c85f7742556a955e47">2016-04-11</h2> <ul> <li>The donut is already updated and shows the correct number now</li> <li>CCAFS people say it will only take them an hour to update their code for the metadata renames, so I proposed we&rsquo;d do it tentatively on Monday the 18th.</li> </ul> <h2 id="2016-04-12:c88be15f5b2f07c85f7742556a955e47">2016-04-12</h2> <ul> <li>Looking at quality of WLE data (<code>cg.subject.iwmi</code>) 
in SQL:</li> </ul> <pre><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc; </code></pre> <ul> <li>Listings and Reports is still not returning reliable data for <code>dc.type</code></li> <li>I think we need to ask Atmire, as their documentation isn&rsquo;t too clear on the format of the filter configs</li> <li>Alternatively, I want to see if I move all the data from <code>dc.type.output</code> to <code>dc.type</code> and then re-index, if it behaves better</li> <li>Looking at our <code>input-forms.xml</code> I see we have two sets of ILRI subjects, but one has a few extra subjects</li> <li>Remove one set of ILRI subjects and remove duplicate <code>VALUE CHAINS</code> from existing list (<a href="https://github.com/ilri/DSpace/pull/216">#216</a>)</li> <li>I decided to keep the set of subjects that had <code>FMD</code> and <code>RANGELANDS</code> added, as those appear to have been added by request, and it might be the newer list</li> <li>I found 226 blank metadatavalues:</li> </ul> <pre><code>dspacetest=# select * from metadatavalue where resource_type_id=2 and text_value=''; </code></pre> <ul> <li>I think we should delete them and do a full re-index:</li> </ul> <pre><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value=''; DELETE 226 </code></pre> <ul> <li>I deleted them on CGSpace but I&rsquo;ll wait to do the re-index as we&rsquo;re going to be doing one in a few days for the metadata changes anyways</li> <li>In other news, moving the <code>dc.type.output</code> to <code>dc.type</code> and re-indexing seems to have fixed the Listings and Reports issue from above</li> <li>Unfortunately this isn&rsquo;t a very good solution, because Listings and Reports config should allow us to filter on <code>dc.type.*</code> but the documentation isn&rsquo;t very clear and I couldn&rsquo;t reach Atmire today</li> <li>We want to do the 
<code>dc.type.output</code> move on CGSpace anyways, but we should wait as it might affect other external people!</li> </ul> <h2 id="2016-04-14:c88be15f5b2f07c85f7742556a955e47">2016-04-14</h2> <ul> <li>Communicate with Macaroni Bros again about <code>dc.type</code></li> <li>Help Sisay with some rsync and Linux stuff</li> <li>Notify CIAT people of metadata changes (I had forgotten them last week)</li> </ul> <h2 id="2016-04-15:c88be15f5b2f07c85f7742556a955e47">2016-04-15</h2> <ul> <li>DSpace Test had crashed, so I ran all system updates, rebooted, and re-deployed DSpace code</li> </ul> <h2 id="2016-04-18:c88be15f5b2f07c85f7742556a955e47">2016-04-18</h2> <ul> <li>Talk to CIAT people about their portal again</li> <li>Start looking more at the fields we want to delete</li> <li>The following metadata fields have 0 items using them, so we can just remove them from the registry and any references in XMLUI, input forms, etc: <ul> <li>dc.description.abstractother</li> <li>dc.whatwasknown</li> <li>dc.whatisnew</li> <li>dc.description.nationalpartners</li> <li>dc.peerreviewprocess</li> <li>cg.species.animal</li> </ul></li> <li>Deleted!</li> <li>The following fields have some items using them and I have to decide what to do with them (delete or move): <ul> <li>dc.icsubject.icrafsubject: 6 items, mostly in CPWF collections</li> <li>dc.type.journal: 11 items, mostly in ILRI collections</li> <li>dc.publicationcategory: 1 item, in CPWF</li> <li>dc.GRP: 2 items, CPWF</li> <li>dc.Species.animal: 6 items, in ILRI and AnGR</li> <li>cg.livestock.agegroup: 9 items, in ILRI collections</li> <li>cg.livestock.function: 20 items, mostly in EADD</li> </ul></li> <li>Test metadata migration on local instance again:</li> </ul> <pre><code>$ ./migrate-fields.sh UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66 UPDATE 40885 UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76 UPDATE 51330 UPDATE metadatavalue SET metadata_field_id=208 WHERE 
metadata_field_id=82 UPDATE 5986 UPDATE metadatavalue SET metadata_field_id=210 WHERE metadata_field_id=88 UPDATE 2456 UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106 UPDATE 3872 UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108 UPDATE 46075 $ JAVA_OPTS=&quot;-Xms512m -Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace index-discovery -bf </code></pre> <ul> <li>CGSpace was down but I&rsquo;m not sure why, this was in <code>catalina.out</code>:</li> </ul> <pre><code>Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at org.dspace.rest.Resource.processFinally(Resource.java:163) at org.dspace.rest.HandleResource.getObject(HandleResource.java:81) at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) ... </code></pre> <ul> <li>Everything else in the system looked normal (50GB disk space available, nothing weird in dmesg, etc)</li> <li>After restarting Tomcat a few more of these errors were logged but the application was up</li> </ul> <h2 id="2016-04-19:c88be15f5b2f07c85f7742556a955e47">2016-04-19</h2> <ul> <li>Get handles for items that are using a given metadata field, ie <code>dc.Species.animal</code> (105):</li> </ul> <pre><code># select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105); handle ------------- 10568/10298 10568/16413 10568/16774 10568/34487 </code></pre> <ul> <li>Delete metadata values for <code>dc.GRP</code> and <code>dc.icsubject.icrafsubject</code>:</li> </ul> <pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=96; # delete from metadatavalue where resource_type_id=2 and metadata_field_id=83; </code></pre> <ul> <li>They are old ICRAF fields and we haven&rsquo;t used them since 2011 or so</li> <li>Also delete them from the metadata registry</li> <li>CGSpace went down again, <code>dspace.log</code> had this:</li> </ul> <pre><code>2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object </code></pre> <ul> <li>I restarted Tomcat and PostgreSQL 
and now it&rsquo;s back up</li> <li>I bet this is the same crash as yesterday, but I only saw the errors in <code>catalina.out</code></li> <li>Looks to be related to this, from <code>dspace.log</code>:</li> </ul> <pre><code>2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement. </code></pre> <ul> <li>We have 18,000 of these errors right now&hellip;</li> <li>Delete a few more old metadata values: <code>dc.Species.animal</code>, <code>dc.type.journal</code>, and <code>dc.publicationcategory</code>:</li> </ul> <pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=105; # delete from metadatavalue where resource_type_id=2 and metadata_field_id=85; # delete from metadatavalue where resource_type_id=2 and metadata_field_id=95; </code></pre> <ul> <li>And then remove them from the metadata registry</li> </ul> <h2 id="2016-04-20:c88be15f5b2f07c85f7742556a955e47">2016-04-20</h2> <ul> <li>Re-deploy DSpace Test with the new subject and type fields, run all system updates, and reboot the server</li> <li>Migrate fields and re-deploy CGSpace with the new subject and type fields, run all system updates, and reboot the server</li> <li>Field migration went well:</li> </ul> <pre><code>$ ./migrate-fields.sh UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66 UPDATE 40909 UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76 UPDATE 51419 UPDATE metadatavalue SET metadata_field_id=208 WHERE metadata_field_id=82 UPDATE 5986 UPDATE metadatavalue SET metadata_field_id=210 WHERE metadata_field_id=88 UPDATE 2458 UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106 UPDATE 3872 UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108 UPDATE 46075 </code></pre> <ul> <li>Also, I migrated CGSpace to using the PGDG PostgreSQL repo as the infrastructure playbooks had been using it for a while and it seemed to be working 
well</li> <li>Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)</li> <li>Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:</li> </ul> <pre><code>$ grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-20 21252 </code></pre> <ul> <li>I found a recent discussion on the DSpace mailing list and I&rsquo;ve asked for advice there</li> <li>Looks like this issue was noted and fixed in DSpace 5.5 (we&rsquo;re on 5.1): <a href="https://jira.duraspace.org/browse/DS-2936">https://jira.duraspace.org/browse/DS-2936</a></li> <li>I&rsquo;ve sent a message to Atmire asking about compatibility with DSpace 5.5</li> </ul> <h2 id="2016-04-21:c88be15f5b2f07c85f7742556a955e47">2016-04-21</h2> <ul> <li>Fix a bunch of metadata consistency issues with IITA Journal Articles (Peer review, Formally published, messed up DOIs, etc)</li> <li>Atmire responded with DSpace 5.5 compatible versions for their modules, so I&rsquo;ll start testing those in a few weeks</li> </ul> <h2 id="2016-04-22:c88be15f5b2f07c85f7742556a955e47">2016-04-22</h2> <ul> <li>Import 95 records into <a href="https://cgspace.cgiar.org/handle/10568/42219">CTA&rsquo;s Agrodok collection</a></li> </ul> <h2 id="2016-04-26:c88be15f5b2f07c85f7742556a955e47">2016-04-26</h2> <ul> <li>Test embargo during item upload</li> <li>Seems to be working but the help text is misleading as to the date format</li> <li>It turns out the <code>robots.txt</code> issue we thought we solved last month isn&rsquo;t solved because you can&rsquo;t use wildcards in URL patterns: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> <li>Write some nginx rules to add <code>X-Robots-Tag</code> HTTP headers to the dynamic requests from <code>robots.txt</code> instead</li> <li>A few URLs to test with: <ul> <li><a 
href="https://dspacetest.cgiar.org/handle/10568/440/browse?type=bioversity">https://dspacetest.cgiar.org/handle/10568/440/browse?type=bioversity</a></li> <li><a href="https://dspacetest.cgiar.org/handle/10568/913/discover">https://dspacetest.cgiar.org/handle/10568/913/discover</a></li> <li><a href="https://dspacetest.cgiar.org/handle/10568/1/search-filter?filtertype_0=country&amp;filter_0=VIETNAM&amp;filter_relational_operator_0=equals&amp;field=country">https://dspacetest.cgiar.org/handle/10568/1/search-filter?filtertype_0=country&amp;filter_0=VIETNAM&amp;filter_relational_operator_0=equals&amp;field=country</a></li> </ul></li> </ul> <h2 id="2016-04-27:c88be15f5b2f07c85f7742556a955e47">2016-04-27</h2> <ul> <li>I woke up to ten or fifteen &ldquo;up&rdquo; and &ldquo;down&rdquo; emails from the monitoring website</li> <li>Looks like the last one was &ldquo;down&rdquo; from about four hours ago</li> <li>I think there must be something with this REST stuff:</li> </ul> <pre><code># grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-* dspace.log.2016-04-01:0 dspace.log.2016-04-02:0 dspace.log.2016-04-03:0 dspace.log.2016-04-04:0 dspace.log.2016-04-05:0 dspace.log.2016-04-06:0 dspace.log.2016-04-07:0 dspace.log.2016-04-08:0 dspace.log.2016-04-09:0 dspace.log.2016-04-10:0 dspace.log.2016-04-11:0 dspace.log.2016-04-12:235 dspace.log.2016-04-13:44 dspace.log.2016-04-14:0 dspace.log.2016-04-15:35 dspace.log.2016-04-16:0 dspace.log.2016-04-17:0 dspace.log.2016-04-18:11942 dspace.log.2016-04-19:28496 dspace.log.2016-04-20:28474 dspace.log.2016-04-21:28654 dspace.log.2016-04-22:28763 dspace.log.2016-04-23:28773 dspace.log.2016-04-24:28775 dspace.log.2016-04-25:28626 dspace.log.2016-04-26:28655 dspace.log.2016-04-27:7271 </code></pre> <ul> <li>I restarted tomcat and it is back up</li> <li>Add Spanish XMLUI strings so those users see &ldquo;CGSpace&rdquo; instead of &ldquo;DSpace&rdquo; in the user interface (<a 
href="https://github.com/ilri/DSpace/pull/222">#222</a>)</li> <li>Submit patch to upstream DSpace for the misleading help text in the embargo step of the item submission: <a href="https://jira.duraspace.org/browse/DS-3172">https://jira.duraspace.org/browse/DS-3172</a></li> <li>Update infrastructure playbooks for nginx 1.10.x (stable) release: <a href="https://github.com/ilri/rmg-ansible-public/issues/32">https://github.com/ilri/rmg-ansible-public/issues/32</a></li> <li>Currently running on DSpace Test, we&rsquo;ll give it a few days before we adjust CGSpace</li> <li>CGSpace down, restarted tomcat and it&rsquo;s back up</li> </ul> <h2 id="2016-04-28:c88be15f5b2f07c85f7742556a955e47">2016-04-28</h2> <ul> <li>Problems with stability again. I&rsquo;ve blocked access to <code>/rest</code> for now to see if the number of errors in the log files drops</li> <li>Later we could maybe start logging access to <code>/rest</code> and perhaps whitelist some IPs&hellip;</li> </ul> <h2 id="2016-04-30:c88be15f5b2f07c85f7742556a955e47">2016-04-30</h2> <ul> <li>Logs for today and yesterday have zero references to this REST error, so I&rsquo;m going to open back up the REST API but log all requests</li> </ul> <pre><code>location /rest { access_log /var/log/nginx/rest.log; proxy_pass http://127.0.0.1:8443; } </code></pre> <ul> <li>I will check the logs again in a few days to look for patterns, see who is accessing it, etc</li> </ul> March, 2016 /cgspace-notes/2016-03/ Wed, 02 Mar 2016 16:50:00 +0300 /cgspace-notes/2016-03/ <h2 id="2016-03-02:5a28ddf3ee658c043c064ccddb151717">2016-03-02</h2> <ul> <li>Looking at issues with author authorities on CGSpace</li> <li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li> <li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match 
environment on CGSpace server</li> </ul> <h2 id="2016-03-07:5a28ddf3ee658c043c064ccddb151717">2016-03-07</h2> <ul> <li>Troubleshooting the issues with the slew of commits for Atmire modules in <a href="https://github.com/ilri/DSpace/pull/182">#182</a></li> <li>Their changes on <code>5_x-dev</code> branch work, but it is messy as hell with merge commits and old branch base</li> <li>When I rebase their branch on the latest <code>5_x-prod</code> I get blank white pages</li> <li>I identified one commit that causes the issue and let them know</li> <li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li> </ul> <pre><code>Exception in thread &quot;Lucene Merge Thread #19&quot; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device </code></pre> <h2 id="2016-03-08:5a28ddf3ee658c043c064ccddb151717">2016-03-08</h2> <ul> <li>Add a few new filters to Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li> <li>We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were</li> </ul> <h2 id="2016-03-10:5a28ddf3ee658c043c064ccddb151717">2016-03-10</h2> <ul> <li>Disable the lucene cron job on CGSpace as it shouldn&rsquo;t be needed anymore</li> <li>Discuss ORCiD and duplicate authors on Yammer</li> <li>Request new documentation for Atmire CUA and L&amp;R modules, as ours are from 2013</li> <li>Walk Sisay through some data cleaning workflows in OpenRefine</li> <li>Start cleaning up the configuration for Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/issues/185">#184</a>)</li> <li>It is very messed up because some labels are incorrect, fields are missing, etc</li> </ul> <p><img src="../images/2016/03/cua-label-mixup.png" alt="Mixed up label in Atmire CUA" /></p> <ul> <li>Update documentation for Atmire modules</li> </ul> <h2 
id="2016-03-11:5a28ddf3ee658c043c064ccddb151717">2016-03-11</h2> <ul> <li>As I was looking at the CUA config I realized our Discovery config is all messed up and confusing</li> <li>I&rsquo;ve opened an issue to track some of that work (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li> <li>I did some major cleanup work on Discovery and XMLUI stuff related to the <code>dc.type</code> indexes (<a href="https://github.com/ilri/DSpace/pull/187">#187</a>)</li> <li>We had been confusing <code>dc.type</code> (a Dublin Core value) with <code>dc.type.output</code> (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.</li> <li>There is still some more work to be done to remove references to old <code>outputtype</code> and <code>output</code></li> </ul> <h2 id="2016-03-14:5a28ddf3ee658c043c064ccddb151717">2016-03-14</h2> <ul> <li>Fix some items that had invalid dates (I noticed them in the log during a re-indexing)</li> <li>Reset <code>search.index.*</code> to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): <a href="https://github.com/ilri/DSpace/pull/188">#188</a></li> <li>Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li> <li>Also four or so center-specific subject strings were missing for Discovery</li> </ul> <p><img src="../images/2016/03/missing-xmlui-string.png" alt="Missing XMLUI string" /></p> <h2 id="2016-03-15:5a28ddf3ee658c043c064ccddb151717">2016-03-15</h2> <ul> <li>Create simple theme for new AVCD community just for a unique Google Tracking ID (<a href="https://github.com/ilri/DSpace/pull/191">#191</a>)</li> </ul> <h2 id="2016-03-16:5a28ddf3ee658c043c064ccddb151717">2016-03-16</h2> <ul> <li>Still having problems deploying Atmire&rsquo;s CUA updates and fixes from January!</li> <li>More discussion on the GitHub issue here: <a 
href="https://github.com/ilri/DSpace/pull/182">https://github.com/ilri/DSpace/pull/182</a></li> <li>Clean up Atmire CUA config (<a href="https://github.com/ilri/DSpace/pull/193">#193</a>)</li> <li>Help Sisay with some PostgreSQL queries to clean up the incorrect <code>dc.contributor.corporateauthor</code> field</li> <li>I noticed that we have some weird values in <code>dc.language</code>:</li> </ul> <pre><code># select * from metadatavalue where metadata_field_id=37; metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------ 1942571 | 35342 | 37 | hi | | 1 | | -1 | 2 1942468 | 35345 | 37 | hi | | 1 | | -1 | 2 1942479 | 35337 | 37 | hi | | 1 | | -1 | 2 1942505 | 35336 | 37 | hi | | 1 | | -1 | 2 1942519 | 35338 | 37 | hi | | 1 | | -1 | 2 1942535 | 35340 | 37 | hi | | 1 | | -1 | 2 1942555 | 35341 | 37 | hi | | 1 | | -1 | 2 1942588 | 35343 | 37 | hi | | 1 | | -1 | 2 1942610 | 35346 | 37 | hi | | 1 | | -1 | 2 1942624 | 35347 | 37 | hi | | 1 | | -1 | 2 1942639 | 35339 | 37 | hi | | 1 | | -1 | 2 </code></pre> <ul> <li>It seems this <code>dc.language</code> field isn&rsquo;t really used, but we should delete these values</li> <li>Also, <code>dc.language.iso</code> has some weird values, like &ldquo;En&rdquo; and &ldquo;English&rdquo;</li> </ul> <h2 id="2016-03-17:5a28ddf3ee658c043c064ccddb151717">2016-03-17</h2> <ul> <li>It turns out <code>hi</code> is the ISO 639 language code for Hindi, but these should be in <code>dc.language.iso</code> instead of <code>dc.language</code></li> <li>I fixed the eleven items with <code>hi</code> as well as some using the incorrect <code>vn</code> for Vietnamese</li> <li>Start discussing CG core with Abenet and Sisay</li> <li>Re-sync CGSpace database to DSpace Test for Atmire to do some tests about the problematic CUA patches</li> <li>The 
patches work fine with a clean database, so the error was caused by some mismatch in CUA versions and the database during my testing</li> </ul> <h2 id="2016-03-18:5a28ddf3ee658c043c064ccddb151717">2016-03-18</h2> <ul> <li>Merge Atmire fixes into <code>5_x-prod</code></li> <li>Discuss thumbnails with Francesca from Bioversity</li> <li>Some of their items end up with thumbnails that have a big white border around them:</li> </ul> <p><img src="../images/2016/03/bioversity-thumbnail-bad.jpg" alt="Excessive whitespace in thumbnail" /></p> <ul> <li>Turns out we can add <code>-trim</code> to the GraphicsMagick options to trim the whitespace</li> </ul> <p><img src="../images/2016/03/bioversity-thumbnail-good.jpg" alt="Trimmed thumbnail" /></p> <ul> <li>Command used:</li> </ul> <pre><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg </code></pre> <ul> <li>Also, it looks like adding <code>-sharpen 0x1.0</code> really improves the quality of the image for only a few KB</li> </ul> <h2 id="2016-03-21:5a28ddf3ee658c043c064ccddb151717">2016-03-21</h2> <ul> <li>Fix 66 site errors in Google&rsquo;s webmaster tools</li> <li>I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc, so I just marked them all as fixed</li> <li>We also have 1,300 &ldquo;soft 404&rdquo; errors for URLs like: <a href="https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity">https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity</a></li> <li>I&rsquo;ve marked them as fixed as well since the ones I tested were working fine</li> <li>This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem&hellip;</li> <li>Results pages like this give items that Google already knows from the sitemap: <a 
href="https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A">https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A</a>.</li> <li>There are some access denied errors on JSPUI links (of course! we forbid them!), but I&rsquo;m not sure why Google is trying to index them&hellip;</li> <li>For example: <ul> <li>This: <a href="https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf">https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf</a></li> <li>Linked from: <a href="https://cgspace.cgiar.org/jspui/handle/10568/809">https://cgspace.cgiar.org/jspui/handle/10568/809</a></li> </ul></li> <li>I will mark these errors as resolved because they are returning HTTP 403 on purpose, for a long time!</li> <li>Google says the first time it saw this particular error was September 29, 2015&hellip; so maybe it accidentally saw it somehow&hellip;</li> <li>On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content</li> </ul> <p><img src="../images/2016/03/google-index.png" alt="CGSpace pages in Google index" /></p> <ul> <li>Turns out this is a problem with DSpace&rsquo;s <code>robots.txt</code>, and there&rsquo;s a Jira ticket since December, 2015: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> <li>I am not sure if I want to apply it yet</li> <li>For now I&rsquo;ve just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools</li> </ul> <p><img src="../images/2016/03/url-parameters.png" alt="URL parameters cause millions of dynamic pages" /> <img src="../images/2016/03/url-parameters2.png" alt="Setting pages with the filter_0 param not to show in search results" /></p> <ul> <li>Move AVCD collection to new community and update 
<code>move_collection.sh</code> script: <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">https://gist.github.com/alanorth/392c4660e8b022d99dfa</a></li> <li>It seems Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs</li> <li>Re-deploy CGSpace with latest <code>5_x-prod</code> branch</li> <li>Run updates on CGSpace and reboot server (new kernel, <code>4.5.0</code>)</li> <li>Deploy Let&rsquo;s Encrypt certificate for cgspace.cgiar.org, but still need to work it into the ansible playbooks</li> </ul> <h2 id="2016-03-22:5a28ddf3ee658c043c064ccddb151717">2016-03-22</h2> <ul> <li>Merge robots.txt patch and disallow indexing of browse pages as our sitemap is consumed correctly (<a href="https://github.com/ilri/DSpace/issues/198">#198</a>)</li> </ul> <h2 id="2016-03-23:5a28ddf3ee658c043c064ccddb151717">2016-03-23</h2> <ul> <li>Abenet is having problems saving group memberships, and she gets this error: <a href="https://gist.github.com/alanorth/87281c061c2de57b773e">https://gist.github.com/alanorth/87281c061c2de57b773e</a></li> </ul> <pre><code>Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). 
(resource://aspects/Administrative/administrative.js#967) </code></pre> <ul> <li>I can reproduce the same error on DSpace Test and on my Mac</li> <li>Looks to be an issue with the Atmire modules, I&rsquo;ve submitted a ticket to their tracker.</li> </ul> <h2 id="2016-03-24:5a28ddf3ee658c043c064ccddb151717">2016-03-24</h2> <ul> <li>Atmire sent a patch for the group saving issue: <a href="https://github.com/ilri/DSpace/pull/201">https://github.com/ilri/DSpace/pull/201</a></li> <li>I tested it locally and it works, so I merged it to <code>5_x-prod</code> and will deploy on CGSpace this week</li> </ul> <h2 id="2016-03-25:5a28ddf3ee658c043c064ccddb151717">2016-03-25</h2> <ul> <li>Having problems with Listings and Reports, seems to be caused by a rogue reference to <code>dc.type.output</code></li> <li>This is the error we get when we proceed to the second page of Listings and Reports: <a href="https://gist.github.com/alanorth/b2d7fb5b82f94898caaf">https://gist.github.com/alanorth/b2d7fb5b82f94898caaf</a></li> <li>Commenting out the line works, but I haven&rsquo;t figured out the proper syntax for referring to <code>dc.type.*</code></li> </ul> <h2 id="2016-03-28:5a28ddf3ee658c043c064ccddb151717">2016-03-28</h2> <ul> <li>Look into enabling the embargo during item submission, see: <a href="https://wiki.duraspace.org/display/DSDOC5x/Embargo#Embargo-SubmissionProcess">https://wiki.duraspace.org/display/DSDOC5x/Embargo#Embargo-SubmissionProcess</a></li> <li>Seems we only want <code>AccessStep</code> because <code>UploadWithEmbargoStep</code> disables the ability to edit embargos at the item level</li> <li>This pull request enables the ability to set an item-level embargo during submission: <a href="https://github.com/ilri/DSpace/pull/203">https://github.com/ilri/DSpace/pull/203</a></li> <li>I figured out that the problem with Listings and Reports was because I disabled the <code>search.index.*</code> last week, and they are still used by JSPUI apparently</li> <li>This pull 
request re-enables them: <a href="https://github.com/ilri/DSpace/pull/202">https://github.com/ilri/DSpace/pull/202</a></li> <li>Re-deploy DSpace Test, run all system updates, and restart the server</li> <li>Looks like the Listings and Reports problem was NOT due to the search indexes (which are actually not used), but rather due to the filter configuration in the Listings and Reports config</li> <li>This pull request simply updates the config for the dc.type.output → dc.type change that was made last week: <a href="https://github.com/ilri/DSpace/pull/204">https://github.com/ilri/DSpace/pull/204</a></li> <li>Deploy robots.txt fix, embargo for item submissions, and listings and reports fix on CGSpace</li> </ul> <h2 id="2016-03-29:5a28ddf3ee658c043c064ccddb151717">2016-03-29</h2> <ul> <li>Skype meeting with Peter and Addis team to discuss metadata changes for Dublin Core, CGcore, and CGSpace-specific fields</li> <li>We decided to proceed with some deletes first, then identify CGSpace-specific fields to clean/move to <code>cg.*</code>, and then worry about broader changes to DC</li> <li>Before we move or rename any fields we need to circulate a list of the fields we intend to change to CCAFS, CWPF, etc., who might be harvesting them</li> <li>After all of this we need to start implementing controlled vocabularies for fields, either with the Javascript lookup or like existing ILRI subjects</li> </ul> February, 2016 /cgspace-notes/2016-02/ Fri, 05 Feb 2016 13:18:00 +0300 /cgspace-notes/2016-02/ <h2 id="2016-02-05:124a59adbaa8ef13e1518d003fc03981">2016-02-05</h2> <ul> <li>Looking at some DAGRIS data for Abenet Yabowork</li> <li>Lots of issues with spaces, newlines, etc causing the import to fail</li> <li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li> </ul> <p><img src="../images/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p> <ul> <li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li> <li>Also, lots of 
things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li> </ul> <h2 id="2016-02-06:124a59adbaa8ef13e1518d003fc03981">2016-02-06</h2> <ul> <li>Found a way to get items with null/empty metadata values from SQL</li> <li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li> </ul> <pre><code>dspacetest=# select * from metadatafieldregistry; </code></pre> <ul> <li>In this case our country field is 78</li> <li>Now find all resources with type 2 (item) that have null/empty values for that field:</li> </ul> <pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL); </code></pre> <ul> <li>Then you can find the handle that owns it from its <code>resource_id</code>:</li> </ul> <pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678'; </code></pre> <ul> <li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li> </ul> <pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value=''; DELETE 25 </code></pre> <ul> <li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li> <li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 &ldquo;|||&rdquo; countries are still there</li> <li>Maybe I need to do a full re-index&hellip;</li> <li>Yep! 
The full re-index seems to work.</li> <li>Process the empty countries on CGSpace</li> </ul> <h2 id="2016-02-07:124a59adbaa8ef13e1518d003fc03981">2016-02-07</h2> <ul> <li>Working on cleaning up Abenet&rsquo;s DAGRIS data with OpenRefine</li> <li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape(&quot;javascript&quot;)</code> which shows whitespace characters like <code>\r\n</code>!</li> <li>For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV</li> <li>I re-import the resulting CSV and run a GREL on the date issued column: <code>value.replace(&quot;\.0&quot;, &quot;&quot;)</code></li> <li>I need to start running DSpace in Mac OS X instead of a Linux VM</li> <li>Install PostgreSQL from homebrew, then configure and import CGSpace database dump:</li> </ul> <pre><code>$ postgres -D /opt/brew/var/postgres $ createuser --superuser postgres $ createuser --pwprompt dspacetest $ createdb -O dspacetest --encoding=UNICODE dspacetest $ psql postgres postgres=# alter user dspacetest createuser; postgres=# \q $ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup $ psql postgres postgres=# alter user dspacetest nocreateuser; postgres=# \q $ vacuumdb dspacetest $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost </code></pre> <ul> <li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li> </ul> <pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig $ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT $ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest $ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui $ ln -sfv ~/dspace/webapps/oai 
/opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai $ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start </code></pre> <ul> <li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li> <li>For example:</li> </ul> <pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot; </code></pre> <ul> <li>After verifying that the site is working, start a full index:</li> </ul> <pre><code>$ ~/dspace/bin/dspace index-discovery -b </code></pre> <h2 id="2016-02-08:124a59adbaa8ef13e1518d003fc03981">2016-02-08</h2> <ul> <li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li> <li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme&rsquo;s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li> </ul> <p><img src="../images/2016/02/submit-button-ilri.png" alt="ILRI submission buttons" /> <img src="../images/2016/02/submit-button-drylands.png" alt="Drylands submission buttons" /></p> <h2 id="2016-02-09:124a59adbaa8ef13e1518d003fc03981">2016-02-09</h2> <ul> <li>Re-sync DSpace Test with CGSpace</li> <li>Help Sisay with OpenRefine</li> <li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li> </ul> <pre><code>$ cd ~/src/git $ git clone https://github.com/letsencrypt/letsencrypt $ cd letsencrypt $ sudo service nginx stop # add port 443 to firewall rules $ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org $ sudo service nginx start $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass </code></pre> <ul> <li>We should install it in /opt/letsencrypt and then script the renewal script, but first we have to wire up some variables and template stuff based on the script here: <a 
href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li> <li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs&hellip;</li> <li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li> <li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li> <li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li> </ul> <pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space </code></pre> <ul> <li>or</li> </ul> <pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object </code></pre> <ul> <li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li> </ul> <pre><code># free -m total used free shared buffers cached Mem: 3950 3902 48 9 37 1311 -/+ buffers/cache: 2552 1397 Swap: 255 57 198 </code></pre> <ul> <li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li> </ul> <h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2> <ul> <li>Massaging some CIAT data in OpenRefine</li> <li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li> <li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li> </ul> <pre><code>value.split('/')[-1] </code></pre> <ul> <li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li> </ul> <pre><code>$ ./generate-thumbnails.py ciat-reports.csv Processing 64661.pdf &gt; Downloading 
64661.pdf &gt; Creating thumbnail for 64661.pdf Processing 64195.pdf &gt; Downloading 64195.pdf &gt; Creating thumbnail for 64195.pdf </code></pre> <h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2> <ul> <li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li> <li>A few items are using the exact same PDF</li> <li>A few items are using HTM or DOC files</li> <li>A few items link to PDFs on IFPRI&rsquo;s e-Library or Research Gate</li> <li>A few items have no item</li> <li>Also, I&rsquo;m not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li> </ul> <h2 id="2016-02-12-1:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2> <ul> <li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li> <li>265 items have dirty, URL-encoded filenames:</li> </ul> <pre><code>$ ls | grep -c -E &quot;%&quot; 265 </code></pre> <ul> <li>I suggest that we import the ~850 clean ones first, then do the rest once I find a clean/reliable way to decode the filenames</li> <li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li> </ul> <pre><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf </code></pre> <ul> <li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li> <li>They will be deployed on CGSpace the next time I re-deploy</li> </ul> <h2 id="2016-02-16:124a59adbaa8ef13e1518d003fc03981">2016-02-16</h2> <ul> <li>Turns out OpenRefine has an 
unescape function!</li> </ul> <pre><code>value.unescape(&quot;url&quot;) </code></pre> <ul> <li>This turns the URLs into human-readable versions that we can use as proper filenames</li> <li>Run web server and system updates on DSpace Test and reboot</li> <li>To merge <code>dc.identifier.url</code> and <code>dc.identifier.url[]</code>, rename the second column so it doesn&rsquo;t have the brackets, like <code>dc.identifier.url2</code></li> <li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between</li> <li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li> <li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li> <li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li> </ul> <h2 id="2016-02-17:124a59adbaa8ef13e1518d003fc03981">2016-02-17</h2> <ul> <li>Re-deploy CGSpace, run all system updates, and reboot</li> <li>More work on CIAT data, cleaning and doing a last metadata-only import into DSpace Test</li> <li>SAFBuilder has a bug preventing it from processing filenames containing two or more consecutive underscores</li> <li>Need to re-process the filename column to replace runs of underscores with a single one: <code>value.replace(/_{2,}/, &quot;_&quot;)</code></li> </ul> <h2 id="2016-02-20:124a59adbaa8ef13e1518d003fc03981">2016-02-20</h2> <ul> <li>Turns out the &ldquo;bug&rdquo; in SAFBuilder isn&rsquo;t a bug, it&rsquo;s a feature that allows you to encode extra information like the destination bundle in the filename</li> <li>Also, it seems DSpace&rsquo;s SAF import tool doesn&rsquo;t like importing filenames that have accents in them:</li> </ul> <pre><code>java.io.FileNotFoundException: 
/usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory) </code></pre> <ul> <li>Need to rename files to have no accents or umlauts, etc&hellip;</li> <li>Useful custom text facet for URLs ending with &ldquo;.pdf&rdquo;: <code>value.endsWith(&quot;.pdf&quot;)</code></li> </ul> <h2 id="2016-02-22:124a59adbaa8ef13e1518d003fc03981">2016-02-22</h2> <ul> <li>To change Spanish accents to ASCII in OpenRefine:</li> </ul> <pre><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n') </code></pre> <ul> <li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li> <li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li> </ul> <pre><code>Bitstream: tést.pdf Bitstream: tést señora.pdf Bitstream: tést señora alimentación.pdf </code></pre> <ul> <li>Seems it could be something with the HFS+ filesystem actually, as it&rsquo;s not UTF-8 (<a href="http://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-probably-the-worst-file-system-ever.html">it&rsquo;s something like UCS-2</a>)</li> <li>HFS+ stores filenames as a string, and filenames with accents get stored as <a href="https://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/">character+accent</a> whereas Linux&rsquo;s ext4 stores them as an array of bytes</li> <li>Running the SAFBuilder on Mac OS X works if you&rsquo;re going to import the resulting bundle on Mac OS X, but if your DSpace is running on Linux you need to run the SAFBuilder there where the filesystem&rsquo;s encoding matches</li> </ul> <h2 id="2016-02-29:124a59adbaa8ef13e1518d003fc03981">2016-02-29</h2> <ul> <li>Got notified by some CIFOR colleagues that the Google Scholar team had contacted them about CGSpace&rsquo;s incorrect ordering of authors in Google Scholar metadata</li> <li>Turns out there is a patch, and it 
was merged in DSpace 5.4: <a href="https://jira.duraspace.org/browse/DS-2679">https://jira.duraspace.org/browse/DS-2679</a></li> <li>I&rsquo;ve merged it into our <code>5_x-prod</code> branch that is currently based on DSpace 5.1</li> <li>We found a bug when a user searches from the homepage, sorts the results, and then tries to click &ldquo;View More&rdquo; in a sidebar facet</li> <li>I am not sure what causes it yet, but I opened an issue for it: <a href="https://github.com/ilri/DSpace/issues/179">https://github.com/ilri/DSpace/issues/179</a></li> <li>Have more problems with SAFBuilder on Mac OS X</li> <li>Now it doesn&rsquo;t recognize description hints in the filename column, like: <code>test.pdf__description:Blah</code></li> <li>But on Linux it works fine</li> <li>Trying to test Atmire&rsquo;s series of stats and CUA fixes from January and February, but their branch history is really messy and it&rsquo;s hard to see what&rsquo;s going on</li> <li>Rebasing their branch on top of our production branch results in a broken Tomcat, so I&rsquo;m going to tell them to fix their history and make a proper pull request</li> <li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li> <li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li> </ul> <pre><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_') </code></pre> <ul> <li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li> <li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting 
on the Atmire fixes for now, as the branch history is ugly</li> </ul> January, 2016 /cgspace-notes/2016-01/ Wed, 13 Jan 2016 13:18:00 +0300 /cgspace-notes/2016-01/ <h2 id="2016-01-13:3846b7fcbca60cdedafd373cb39cd76d">2016-01-13</h2> <ul> <li>Move ILRI collection <code>10568/12503</code> from <code>10568/27869</code> to <code>10568/27629</code> using the <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">move_collections.sh</a> script I wrote last year.</li> <li>I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.</li> <li>Update GitHub wiki for documentation of <a href="https://github.com/ilri/DSpace/wiki/Maintenance-Tasks">maintenance tasks</a>.</li> </ul> <h2 id="2016-01-14:3846b7fcbca60cdedafd373cb39cd76d">2016-01-14</h2> <ul> <li>Update CCAFS project identifiers in input-forms.xml</li> <li>Run system updates and restart the server</li> </ul> <h2 id="2016-01-18:3846b7fcbca60cdedafd373cb39cd76d">2016-01-18</h2> <ul> <li>Change &ldquo;Extension material&rdquo; to &ldquo;Extension Material&rdquo; in input-forms.xml (a mistake that fell through the cracks when we fixed the others in DSpace 4 era)</li> </ul> <h2 id="2016-01-19:3846b7fcbca60cdedafd373cb39cd76d">2016-01-19</h2> <ul> <li>Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme&rsquo;s primary color (<a href="https://github.com/ilri/DSpace/issues/157">#157</a>)</li> <li>Tweak date-based facets to show more values in drill-down ranges (<a href="https://github.com/ilri/DSpace/issues/162">#162</a>)</li> <li>Need to remember to clear the Cocoon cache after deployment or else you don&rsquo;t see the new ranges immediately</li> <li>Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account</li> 
<li>Altmetrics&rsquo; support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc first.</li> </ul> <h2 id="2016-01-21:3846b7fcbca60cdedafd373cb39cd76d">2016-01-21</h2> <ul> <li>Still waiting for my IFTTT recipe to fire, two days later</li> <li>It looks like the Atom feed on CGSpace hasn&rsquo;t changed in two days, but there have definitely been new items</li> <li>The RSS feed is nearly as old, but has different old items there</li> <li>On a hunch I cleared the Cocoon cache and now the feeds are fresh</li> <li>Looks like there is a configuration option related to this, <code>webui.feed.cache.age</code>, which defaults to 48 hours, though I&rsquo;m not sure what relation it has to the Cocoon cache</li> <li>In any case, we should change this cache to be something more like 6 hours, as we publish new items several times per day.</li> <li>Work around a CSS issue with long URLs in the item view (<a href="https://github.com/ilri/DSpace/issues/172">#172</a>)</li> </ul> <h2 id="2016-01-25:3846b7fcbca60cdedafd373cb39cd76d">2016-01-25</h2> <ul> <li>Re-deploy CGSpace and DSpace Test with latest <code>5_x-prod</code> branch</li> <li>This included the social icon fixes/updates, date-based facet tweaks, reducing the feed cache age, and fixing a layout issue in XMLUI item view when an item had long URLs</li> </ul> <h2 id="2016-01-26:3846b7fcbca60cdedafd373cb39cd76d">2016-01-26</h2> <ul> <li>Run nginx updates on CGSpace and DSpace Test (<a href="http://mailman.nginx.org/pipermail/nginx/2016-January/049700.html">1.8.1 and 1.9.10, respectively</a>)</li> <li>Run updates on DSpace Test and reboot for new Linode kernel <code>Linux 4.4.0-x86_64-linode63</code> (first update in months)</li> </ul> <h2 id="2016-01-28:3846b7fcbca60cdedafd373cb39cd76d">2016-01-28</h2> <ul> <li>Start looking at importing some Bioversity data that had been prepared earlier this week</li> <li><p>While checking the data I noticed something strange, 
there are 79 items but only 8 unique PDFs:</p> <pre><code>$ ls SimpleArchiveForBio/ | wc -l 79 $ find SimpleArchiveForBio/ -iname &quot;*.pdf&quot; -exec basename {} \; | sort -u | wc -l 8 </code></pre></li> </ul> <h2 id="2016-01-29:3846b7fcbca60cdedafd373cb39cd76d">2016-01-29</h2> <ul> <li>Add five missing center-specific subjects to XMLUI item view (<a href="https://github.com/ilri/DSpace/issues/174">#174</a>)</li> <li>This <a href="https://cgspace.cgiar.org/handle/10568/67062">CCAFS item</a>, before:</li> </ul> <p><img src="../images/2016/01/xmlui-subjects-before.png" alt="XMLUI subjects before" /></p> <ul> <li>After:</li> </ul> <p><img src="../images/2016/01/xmlui-subjects-after.png" alt="XMLUI subjects after" /></p> December, 2015 /cgspace-notes/2015-12/ Wed, 02 Dec 2015 13:18:00 +0300 /cgspace-notes/2015-12/ <h2 id="2015-12-02:012a628feed6d64ae1151cbd6151ccd6">2015-12-02</h2> <ul> <li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test, as it uses less space:</li> </ul> <pre><code># cd /home/dspacetest.cgiar.org/log # ls -lh dspace.log.2015-11-18* -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18 -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz </code></pre> <ul> <li>I had used lrzip once, but it needs more memory and is harder to use as it requires the lrztar wrapper</li> <li>Need to remember to go check if everything is ok in a few days and then change CGSpace</li> <li>CGSpace went down again (due to PostgreSQL idle connections of course)</li> <li>Current database settings for DSpace are <code>db.maxconnections = 30</code> and <code>db.maxidle = 8</code>, yet idle connections are exceeding this:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle 39 </code></pre> <ul> <li>I restarted PostgreSQL and Tomcat and it&rsquo;s back</li> <li>On a related note of why CGSpace is so slow, I 
decided to finally try the <code>pgtune</code> script to tune the postgres settings:</li> </ul> <pre><code># apt-get install pgtune # pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune # mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig # mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf </code></pre> <ul> <li>It introduced the following new settings:</li> </ul> <pre><code>default_statistics_target = 50 maintenance_work_mem = 480MB constraint_exclusion = on checkpoint_completion_target = 0.9 effective_cache_size = 5632MB work_mem = 48MB wal_buffers = 8MB checkpoint_segments = 16 shared_buffers = 1920MB max_connections = 80 </code></pre> <ul> <li>Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc</li> <li>For what it&rsquo;s worth, now the REST API should be faster (because of these PostgreSQL tweaks):</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.474 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 2.141 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.685 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.995 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.786 </code></pre> <ul> <li>Last week it was an average of 8 seconds&hellip; now this is <sup>1</sup>&frasl;<sub>4</sub> of that</li> <li>CCAFS noticed that one of their items displays only the Atmire statlets: <a href="https://cgspace.cgiar.org/handle/10568/42445">https://cgspace.cgiar.org/handle/10568/42445</a></li> </ul> <p><img src="../images/2015/12/ccafs-item-no-metadata.png" alt="CCAFS item" /></p> <ul> <li>The authorizations for the item are all public READ, and I don&rsquo;t 
see any errors in dspace.log when browsing that item</li> <li>I filed a ticket on Atmire&rsquo;s issue tracker</li> <li>I also filed a ticket on Atmire&rsquo;s issue tracker for the PostgreSQL stuff</li> </ul> <h2 id="2015-12-03:012a628feed6d64ae1151cbd6151ccd6">2015-12-03</h2> <ul> <li>CGSpace is very slow, and monitoring is emailing me to say it&rsquo;s down, even though I can load the page (very slowly)</li> <li>Idle postgres connections look like this (with no change in DSpace db settings lately):</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle 29 </code></pre> <ul> <li>I restarted Tomcat and postgres&hellip;</li> <li>Atmire commented that we should raise the JVM heap size by ~500M, so it is now <code>-Xms3584m -Xmx3584m</code></li> <li>We weren&rsquo;t out of heap yet, but it&rsquo;s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it&rsquo;s ok</li> <li>A possible side effect is that the REST API is now about twice as fast for the request above:</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.368 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.968 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 1.006 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.849 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.806 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.854 </code></pre> <h2 id="2015-12-05:012a628feed6d64ae1151cbd6151ccd6">2015-12-05</h2> <ul> <li>CGSpace has been up and down all day and REST API is completely unresponsive</li> <li>PostgreSQL idle connections are currently:</li> </ul> <pre><code>postgres@linode01:~$ 
psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle 28 </code></pre> <ul> <li>I have reverted all the pgtune tweaks from the other day, as they didn&rsquo;t fix the stability issues, so I&rsquo;d rather not have them introducing more variables into the equation</li> <li>The PostgreSQL stats from Munin all point to something database-related with the DSpace 5 upgrade around mid–late November</li> </ul> <p><img src="../images/2015/12/postgres_bgwriter-year.png" alt="PostgreSQL bgwriter (year)" /> <img src="../images/2015/12/postgres_cache_cgspace-year.png" alt="PostgreSQL cache (year)" /> <img src="../images/2015/12/postgres_locks_cgspace-year.png" alt="PostgreSQL locks (year)" /> <img src="../images/2015/12/postgres_scans_cgspace-year.png" alt="PostgreSQL scans (year)" /></p> <h2 id="2015-12-07:012a628feed6d64ae1151cbd6151ccd6">2015-12-07</h2> <ul> <li>Atmire sent <a href="https://github.com/ilri/DSpace/pull/161">some fixes</a> to DSpace&rsquo;s REST API code that was leaving contexts open (causing the slow performance and database issues)</li> <li>After deploying the fix to CGSpace the REST API is consistently faster:</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.675 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.599 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.588 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.566 $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 0.497 </code></pre> <h2 id="2015-12-08:012a628feed6d64ae1151cbd6151ccd6">2015-12-08</h2> <ul> <li>Switch CGSpace log compression cron jobs from using xz to lzop; the compression isn&rsquo;t as good, but it&rsquo;s much faster and causes less IO/CPU load</li> <li>Since we 
figured out (and fixed) the cause of the performance issue, I reverted Google Bot&rsquo;s crawl rate to the &ldquo;Let Google optimize&rdquo; setting</li> </ul> November, 2015 /cgspace-notes/2015-11/ Mon, 23 Nov 2015 17:00:57 +0300 /cgspace-notes/2015-11/ <h2 id="2015-11-22:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-22</h2> <ul> <li>CGSpace went down</li> <li>Looks like DSpace exhausted its PostgreSQL connection pool</li> <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 78 </code></pre> <ul> <li>For now I have increased the limit from 60 to 90, run updates, and rebooted the server</li> </ul> <h2 id="2015-11-24:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-24</h2> <ul> <li>CGSpace went down again</li> <li>Getting emails from uptimeRobot and uptimeButler that it&rsquo;s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors</li> <li>Looks like there are still a bunch of idle PostgreSQL connections:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 96 </code></pre> <ul> <li>For some reason the number of idle connections is very high since we upgraded to DSpace 5</li> </ul> <h2 id="2015-11-25:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-25</h2> <ul> <li>Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config</li> <li>The OAI application requests stylesheets and javascript files with the path <code>/oai/static/css</code>, which gets matched here:</li> </ul> <pre><code># static assets we can load from the file system directly with nginx location ~ /(themes|static|aspects/ReportingSuite) { try_files $uri @tomcat; ... 
</code></pre> <ul> <li>The document root is relative to the xmlui app, so this gets a 404; I&rsquo;m not sure why it doesn&rsquo;t pass to <code>@tomcat</code></li> <li>Anyway, I can&rsquo;t find any URIs with path <code>/static</code>, and the more important point is to handle all the static theme assets, so we can just remove <code>static</code> from the regex for now (who cares if we can&rsquo;t use nginx to send ETags for OAI CSS!)</li> <li>Also, I noticed we aren&rsquo;t setting CSP headers on the static assets, because nginx only inherits <code>add_header</code> directives from the parent level when the child block defines none of its own; as soon as a child block uses <code>add_header</code>, it doesn&rsquo;t inherit the others</li> <li>We simply need to add <code>include extra-security.conf;</code> to the above location block (but research and test first)</li> <li>We should add WOFF assets to the list of things to set expires for:</li> </ul> <pre><code>location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ { </code></pre> <ul> <li>We should also add <code>aspects/Statistics</code> to the location block for static assets (minus <code>static</code> from above):</li> </ul> <pre><code>location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) { </code></pre> <ul> <li>Need to check <code>/about</code> on CGSpace, as it&rsquo;s blank on my local test server and we might need to add something there</li> <li>CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace 93 </code></pre> <ul> <li>I looked closer at the idle connections and saw that many have been idle for hours (current time on server is <code>2015-11-25T20:20:42+0000</code>):</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | less -S datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | 
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+--- 20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20 20951 | cgspace | 10967 | 18205 | cgspace | | 127.0.0.1 | | 37737 | 2015-11-25 13:13:03.069421+00 | | 20 ... </code></pre> <ul> <li>There is a relevant Jira issue about this: <a href="https://jira.duraspace.org/browse/DS-1458">https://jira.duraspace.org/browse/DS-1458</a></li> <li>It seems there is some sense in changing DSpace&rsquo;s default <code>db.maxidle</code> from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)</li> <li>Change <code>db.maxidle</code> from -1 to 10, reduce <code>db.maxconnections</code> from 90 to 50, and restart postgres and tomcat7</li> <li>Also redeploy DSpace Test with a clean sync of CGSpace and mirror these database settings there as well</li> <li>Also deploy the nginx fixes for the <code>try_files</code> location block as well as the expires block</li> </ul> <h2 id="2015-11-26:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-26</h2> <ul> <li>CGSpace behaving much better since changing <code>db.maxidle</code> yesterday, but still two up/down notices from monitoring this morning (better than 50!)</li> <li>CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item</li> <li>Not as bad for me, but still unsustainable if you have to fetch many items:</li> </ul> <pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all 8.415 </code></pre> <ul> <li>Monitoring e-mailed in the evening to say CGSpace was down</li> <li>Idle connections in PostgreSQL again:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle 66 </code></pre> <ul> <li>At the time, the current DSpace pool size was 50&hellip;</li> <li>I reduced the pool back to the 
default of 30, and reduced the <code>db.maxidle</code> setting from 10 to 8</li> </ul> <h2 id="2015-11-29:3d03b850f8126f80d8144c2e17ea0ae7">2015-11-29</h2> <ul> <li>Still more alerts that CGSpace has been up and down all day</li> <li>Current database settings for DSpace:</li> </ul> <pre><code>db.maxconnections = 30 db.maxwait = 5000 db.maxidle = 8 db.statementpool = true </code></pre> <ul> <li>And idle connections:</li> </ul> <pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle 49 </code></pre> <ul> <li>Perhaps I need to start drastically increasing the connection limits (to something like 300) to see if DSpace&rsquo;s thirst can ever be quenched</li> <li>On another note, SUNScholar&rsquo;s notes suggest adjusting some other postgres variables: <a href="http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database">http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database</a></li> <li>This might help with REST API speed (which I mentioned above but still need to test properly)</li> </ul>
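<ul> <li>One caveat with the <code>grep -c idle</code> counts in these notes: they also match sessions in the <code>idle in transaction</code> state, which hold a transaction open and are a different problem from plain idle pool connections</li> <li>A rough, untested sketch of a query that splits the count by state and shows how long the oldest connection in each state has been sitting there; it assumes PostgreSQL 9.2 or newer (where <code>pg_stat_activity</code> has the <code>state</code> and <code>state_change</code> columns) and the <code>cgspace</code> database name seen in the output above, and the column aliases are only illustrative:</li> </ul> <pre><code>-- Count connections to the cgspace database per state
-- (e.g. active, idle, idle in transaction) and report how long
-- the oldest connection in each state has been in that state.
SELECT state,
       count(*) AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE datname = 'cgspace'
GROUP BY state
ORDER BY connections DESC;
</code></pre> <ul> <li>If most of the connections turn out to be plain <code>idle</code> with a very old <code>state_change</code>, that would be consistent with the pool never releasing them rather than with stuck transactions</li> </ul>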