Posts on CGSpace Notes https://alanorth.github.io/cgspace-notes/post/index.xml Recent content in Posts on CGSpace Notes Hugo -- gohugo.io en-us Wed, 01 Mar 2017 17:08:52 +0200 March, 2017 https://alanorth.github.io/cgspace-notes/2017-03/ Wed, 01 Mar 2017 17:08:52 +0200 https://alanorth.github.io/cgspace-notes/2017-03/ <h2 id="2017-03-01">2017-03-01</h2> <ul> <li>Run the 279 CIAT author corrections on CGSpace</li> </ul> <h2 id="2017-03-02">2017-03-02</h2> <ul> <li>Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace</li> <li>CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles</li> <li>They might come in at the top level in one &ldquo;CGIAR System&rdquo; community, or with several communities</li> <li>I need to spend a bit of time looking at the multiple handle support in DSpace and see if new content can be minted in both handles, or just one?</li> <li>Need to send Peter and Michael some notes about this in a few days</li> <li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li> <li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li> <li>Discovered that the ImageMagick <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li> <li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999"><sup>10568</sup>&frasl;<sub>51999</sub></a>):</li> </ul> <pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 
</code></pre> <p></p> <ul> <li>This results in discolored thumbnails when compared to the original PDF, for example sRGB and CMYK:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/03/thumbnail-srgb.jpg" alt="Thumbnail in sRGB colorspace" /></p> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/03/thumbnail-cmyk.jpg" alt="Thumbnail in CMYK colorspace" /></p> <ul> <li>I filed an issue for the color space thing: <a href="https://jira.duraspace.org/browse/DS-3517">DS-3517</a></li> </ul> <h2 id="2017-03-03">2017-03-03</h2> <ul> <li>I created a patch for DS-3517 and made a pull request against upstream <code>dspace-5_x</code>: <a href="https://github.com/DSpace/DSpace/pull/1669">https://github.com/DSpace/DSpace/pull/1669</a></li> <li>Looks like <code>-colorspace sRGB</code> alone isn&rsquo;t enough, we need to use profiles:</li> </ul> <pre><code>$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg </code></pre> <ul> <li>This reads the input file, applies the CMYK profile, applies the RGB profile, then writes the file</li> <li>Note that you should set the first profile immediately after the input file</li> <li>Also, it is better to use profiles than setting <code>-colorspace</code></li> <li>This is a great resource describing the color stuff: <a href="http://www.imagemagick.org/Usage/formats/#profiles">http://www.imagemagick.org/Usage/formats/#profiles</a></li> <li>Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)</li> <li>This is trivial with <code>identify</code> (even by the <a href="http://im4java.sourceforge.net/api/org/im4java/core/IMOps.html#identify">Java ImageMagick API</a>):</li> </ul> <pre><code>$ 
identify -format '%r\n' alc_contrastes_desafios.pdf\[0\] DirectClass CMYK $ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\] DirectClass sRGB Alpha </code></pre> <h2 id="2017-03-04">2017-03-04</h2> <ul> <li>Spent more time looking at the ImageMagick CMYK issue</li> <li>The <code>default_cmyk.icc</code> and <code>default_rgb.icc</code> files are both part of the Ghostscript GPL distribution, but according to DSpace&rsquo;s <code>LICENSES_THIRD_PARTY</code> file, DSpace doesn&rsquo;t allow distribution of dependencies that are licensed solely under the GPL</li> <li>So this issue is kinda pointless now, as the ICC profiles are absolutely necessary to make a meaningful CMYK→sRGB conversion</li> </ul> <h2 id="2017-03-05">2017-03-05</h2> <ul> <li>Look into helping developers from landportal.info with a query for items related to LAND on the REST API</li> <li>They want something like the items that are returned by the general &ldquo;LAND&rdquo; query in the search interface, but we cannot do that</li> <li>We can only return specific results for metadata fields, like:</li> </ul> <pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;LAND REFORM&quot;, &quot;language&quot;: null}' | json_pp </code></pre> <ul> <li>But there are hundreds of combinations of fields and values (like <code>dc.subject</code> and all the center subjects), and we can&rsquo;t use wildcards in REST!</li> <li>Reading about enabling multiple handle prefixes in DSpace</li> <li>There is a mailing list thread from 2011 about it: <a href="http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html">http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html</a></li> <li>And a comment from Atmire&rsquo;s Bram 
about it on the DSpace wiki: <a href="https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296">https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296</a></li> <li>Bram mentions an undocumented configuration option <code>handle.plugin.checknameauthority</code>, but I noticed another one in <code>dspace.cfg</code>:</li> </ul> <pre><code># List any additional prefixes that need to be managed by this handle server # (as for examle handle prefix coming from old dspace repository merged in # that repository) # handle.additional.prefixes = prefix1[, prefix2] </code></pre> <ul> <li>Because of this I noticed that our Handle server&rsquo;s <code>config.dct</code> was potentially misconfigured!</li> <li>We had some default values still present:</li> </ul> <pre><code>&quot;300:0.NA/YOUR_NAMING_AUTHORITY&quot; </code></pre> <ul> <li>I&rsquo;ve changed them to the following and restarted the handle server:</li> </ul> <pre><code>&quot;300:0.NA/10568&quot; </code></pre> <ul> <li>In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk</li> <li>From <code>dspace/config/crosswalks/google-metadata.properties</code>:</li> </ul> <pre><code>google.citation_doi = cg.identifier.doi </code></pre> <ul> <li>This works, and makes DSpace output the following metadata on the item view page:</li> </ul> <pre><code>&lt;meta content=&quot;https://dx.doi.org/10.1186/s13059-017-1153-y&quot; name=&quot;citation_doi&quot;&gt; </code></pre> <ul> <li>Submitted and merged pull request for this: <a href="https://github.com/ilri/DSpace/pull/305">https://github.com/ilri/DSpace/pull/305</a></li> <li>Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of &ldquo;,&rdquo;: <a href="https://github.com/ilri/DSpace/pull/306">https://github.com/ilri/DSpace/pull/306</a></li> <li>I want to show it briefly to Abenet and 
Peter to get feedback</li> </ul> <h2 id="2017-03-06">2017-03-06</h2> <ul> <li>Someone on the mailing list said that <code>handle.plugin.checknameauthority</code> should be false if we&rsquo;re using multiple handle prefixes</li> </ul> <h2 id="2017-03-07">2017-03-07</h2> <ul> <li>I set up a top-level community as a test for the CGIAR Library and imported one item with the 10947 handle prefix</li> <li>When testing the Handle resolver locally it shows the item to be on the local repository</li> <li>So this seems to work, with the following caveats: <ul> <li>New items will have the default handle</li> <li>Communities and collections will have the default handle</li> <li>Only items imported manually can have the other handles</li> </ul></li> <li>I need to talk to Michael and Peter to share the news, and discuss the structure of their community(s) and try some actual test data</li> <li>We&rsquo;ll need to do some data cleaning to make sure they are using the same fields we are, like <code>dc.type</code> and <code>cg.identifier.status</code></li> <li>Another thing is that the import process creates new <code>dc.date.accessioned</code> and <code>dc.date.available</code> fields, so we end up with duplicates (is it important to preserve the originals for these?)</li> <li>Report DS-3520 issue to Atmire</li> </ul> <h2 id="2017-03-08">2017-03-08</h2> <ul> <li>Merge the author separator changes to <code>5_x-prod</code>, as everyone has responded positively about it, and it&rsquo;s the default in Mirage2 after all!</li> <li>Cherry pick the <code>commons-collections</code> patch from DSpace&rsquo;s <code>dspace-5_x</code> branch to address DS-3520: <a href="https://jira.duraspace.org/browse/DS-3520">https://jira.duraspace.org/browse/DS-3520</a></li> </ul> <h2 id="2017-03-09">2017-03-09</h2> <ul> <li>Export list of sponsors so Peter can clean it up:</li> </ul> <pre><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id 
IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv; COPY 285 </code></pre> <h2 id="2017-03-12">2017-03-12</h2> <ul> <li>Test the sponsorship fixes and deletes from Peter:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu </code></pre> <ul> <li>Generate a new list of unique sponsors so we can update the controlled vocabulary:</li> </ul> <pre><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv; </code></pre> <ul> <li>Pull request for controlled vocabulary if Peter approves: <a href="https://github.com/ilri/DSpace/pull/308">https://github.com/ilri/DSpace/pull/308</a></li> <li>Review Sisay&rsquo;s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: <a href="https://github.com/ilri/DSpace/pull/307">https://github.com/ilri/DSpace/pull/307</a></li> <li>Created an issue to track the progress on the Livestock CRP theme: <a href="https://github.com/ilri/DSpace/issues/309">https://github.com/ilri/DSpace/issues/309</a></li> <li>Created a basic theme for the Livestock CRP community</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/03/livestock-theme.png" alt="Livestock CRP theme" /></p> <h2 id="2017-03-15">2017-03-15</h2> <ul> <li>Merge pull request for controlled vocabulary updates for sponsor: <a href="https://github.com/ilri/DSpace/pull/308">https://github.com/ilri/DSpace/pull/308</a></li> <li>Merge pull request for Livestock CRP theme: <a 
href="https://github.com/ilri/DSpace/issues/309">https://github.com/ilri/DSpace/issues/309</a></li> <li>Create pull request for PABRA subjects (waiting for confirmation from Abenet before merging): <a href="https://github.com/ilri/DSpace/pull/310">https://github.com/ilri/DSpace/pull/310</a></li> <li>Create pull request for CCAFS Phase II migrations (waiting for confirmation from CCAFS people): <a href="https://github.com/ilri/DSpace/pull/311">https://github.com/ilri/DSpace/pull/311</a></li> <li>I also need to ask if either of these new fields need to be added to Discovery facets, search, and Atmire modules</li> <li>Run all system updates on DSpace Test and re-deploy CGSpace</li> </ul> <h2 id="2017-03-16">2017-03-16</h2> <ul> <li>Merge pull request for PABRA subjects: <a href="https://github.com/ilri/DSpace/pull/310">https://github.com/ilri/DSpace/pull/310</a></li> <li>Abenet and Peter say we can add them to Discovery, Atmire modules, etc, but I might not have time to do it now</li> <li>Help Sisay with RTB theme again</li> <li>Remove ICARDA subject from Discovery sidebar facets: <a href="https://github.com/ilri/DSpace/pull/312">https://github.com/ilri/DSpace/pull/312</a></li> <li>Remove ICARDA subject from Browse and item submission form: <a href="https://github.com/ilri/DSpace/pull/313">https://github.com/ilri/DSpace/pull/313</a></li> <li>Merge the CCAFS Phase II changes but hold off on doing the flagship metadata updates until Macaroni Bros gets their importer updated</li> <li>Deploy latest changes and investor fixes/deletions on CGSpace</li> <li>Run system updates on CGSpace and reboot server</li> </ul> <h2 id="2017-03-20">2017-03-20</h2> <ul> <li>Create basic XMLUI theme for PABRA community: <a href="https://github.com/ilri/DSpace/pull/315">https://github.com/ilri/DSpace/pull/315</a></li> </ul> February, 2017 https://alanorth.github.io/cgspace-notes/2017-02/ Tue, 07 Feb 2017 07:04:52 -0800 https://alanorth.github.io/cgspace-notes/2017-02/ <h2 
id="2017-02-07">2017-02-07</h2> <ul> <li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li> </ul> <pre><code>dspace=# select * from collection2item where item_id = '80278'; id | collection_id | item_id -------+---------------+--------- 92551 | 313 | 80278 92550 | 313 | 80278 90774 | 1051 | 80278 (3 rows) dspace=# delete from collection2item where id = 92551 and item_id = 80278; DELETE 1 </code></pre> <ul> <li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li> <li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li> </ul> <p></p> <h2 id="2017-02-08">2017-02-08</h2> <ul> <li>We also need to rename some of the CCAFS Phase I flagships: <ul> <li>CLIMATE-SMART AGRICULTURAL PRACTICES → CLIMATE-SMART TECHNOLOGIES AND PRACTICES</li> <li>CLIMATE RISK MANAGEMENT → CLIMATE SERVICES AND SAFETY NETS</li> <li>LOW EMISSIONS AGRICULTURE → LOW EMISSIONS DEVELOPMENT</li> <li>POLICIES AND INSTITUTIONS → PRIORITIES AND POLICIES FOR CSA</li> </ul></li> <li>The climate risk management one doesn&rsquo;t exist, so I will have to ask Magdalena if they want me to add it to the input forms</li> <li>Start testing nearly 500 author corrections that CCAFS sent me:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu </code></pre> <h2 id="2017-02-09">2017-02-09</h2> <ul> <li>More work on CCAFS Phase II stuff</li> <li>Looks like simply adding a new metadata field to <code>dspace/config/registries/cgiar-types.xml</code> and restarting DSpace causes the field to get added to the registry</li> <li>It requires a restart but at least it allows you to manage the registry programmatically</li> <li>It&rsquo;s not a very good way to manage the registry, though, as removing one there doesn&rsquo;t cause it to be 
removed from the registry, and we always restore from database backups so there would never be a scenario when we needed these to be created</li> <li>Testing some corrections on CCAFS Phase II flagships (<code>cg.subject.ccafs</code>):</li> </ul> <pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu </code></pre> <h2 id="2017-02-10">2017-02-10</h2> <ul> <li>CCAFS said they want to wait on the flagship updates (<code>cg.subject.ccafs</code>) on CGSpace, perhaps for a month or so</li> <li>Help Marianne Gadeberg (WLE) with some user permissions as it seems she had previously been using a personal email account, and is now on a CGIAR one</li> <li>I manually added her new account to ~25 authorizations that her old user was on</li> </ul> <h2 id="2017-02-14">2017-02-14</h2> <ul> <li>Add <code>SCALING</code> to ILRI subjects (<a href="https://github.com/ilri/DSpace/pull/304">#304</a>), as Sisay&rsquo;s attempts were all sloppy</li> <li>Cherry pick some patches from the DSpace 5.7 branch: <ul> <li>DS-3363 CSV import error says &ldquo;row&rdquo;, means &ldquo;column&rdquo;: f7b6c83e991db099003ee4e28ca33d3c7bab48c0</li> <li>DS-3479 avoid adding empty metadata values during import: 329f3b48a6de7fad074d825fd12118f7e181e151</li> <li>[DS-3456] 5x Clarify command line options for statisics import/export tools (#1623): 567ec083c8a94eb2bcc1189816eb4f767745b278</li> <li>[DS-3458]5x Allow Shard Process to Append to an existing repo: 3c8ecb5d1fd69a1dcfee01feed259e80abbb7749</li> </ul></li> <li>I still need to test these, especially the last two, which change some stuff with Solr maintenance</li> </ul> <h2 id="2017-02-15">2017-02-15</h2> <ul> <li>Update rvm on DSpace Test and CGSpace as there was a <a href="https://github.com/justinsteven/advisories/blob/master/2017_rvm_cd_command_execution.md">security disclosure about versions less than 1.28.0</a></li> </ul> <h2 id="2017-02-16">2017-02-16</h2> <ul> 
<li>Looking at memory info from munin on CGSpace:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/02/meminfo_phisical-week.png" alt="CGSpace meminfo" /></p> <ul> <li>We are using only ~8GB of RAM for applications, and 16GB for caches!</li> <li>The Linode machine we&rsquo;re on has 24GB of RAM but only because that&rsquo;s the only instance that had enough disk space for us (384GB)&hellip;</li> <li>We should probably look into Google Compute Engine or Digital Ocean where we can get more storage without having to follow a linear increase in instance pricing for CPU/memory as well</li> <li>Especially because we only use 2 out of 8 CPUs basically:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/02/cpu-week.png" alt="CGSpace CPU" /></p> <ul> <li>Fix issue with duplicate declaration of in atmire-dspace-xmlui <code>pom.xml</code> (causing non-fatal warnings during the maven build)</li> <li>Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site&rsquo;s properties file:</li> </ul> <pre><code>handle.canonical.prefix = https://hdl.handle.net/ </code></pre> <ul> <li>And then a SQL command to update existing records:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri'); UPDATE 58193 </code></pre> <ul> <li>Seems to work fine!</li> <li>I noticed a few items that have incorrect DOI links (<code>dc.identifier.doi</code>), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:</li> </ul> <pre><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and 
qualifier = 'doi') and text_value not like 'http%://%'; </code></pre> <ul> <li>This will replace any that begin with <code>10.</code> and change them to <code>https://dx.doi.org/10.</code>:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%'; </code></pre> <ul> <li>This will get any that begin with <code>doi:10.</code> and change them to <code>https://dx.doi.org/10.x</code>:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%'; </code></pre> <ul> <li>Fix DOIs like <code>dx.doi.org/10.</code> to be <code>https://dx.doi.org/10.</code>:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^dx\.doi\.org/(.+)$', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%'; </code></pre> <ul> <li>Fix DOIs like <code>http//</code>:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//dx\.doi\.org/(.+)$', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%'; </code></pre> <ul> <li>Fix DOIs like <code>dx.doi.org./</code>:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^dx\.doi\.org\./(.+)$', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from 
metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'; </code></pre> <ul> <li>Delete some invalid DOIs:</li> </ul> <pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb'); </code></pre> <ul> <li>Fix some other random outliers:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003'; dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200'; dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898'; dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012'; dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570'; dspace=# update metadatavalue 
set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052'; </code></pre> <ul> <li>And do another round of <code>http://</code> → <code>https://</code> cleanups:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%'; </code></pre> <ul> <li>Run all DOI corrections on CGSpace</li> <li>Something to think about here is to write a <a href="https://wiki.duraspace.org/display/DSDOC5x/Curation+System#CurationSystem-ScriptedTasks">Curation Task</a> in Java to do these sanity checks / corrections every night</li> <li>Then we could add a cron job for them and run them from the command line like:</li> </ul> <pre><code>[dspace]/bin/dspace curate -t noop -i 10568/79891 </code></pre> <h2 id="2017-02-20">2017-02-20</h2> <ul> <li>Run all system updates on DSpace Test and reboot the server</li> <li>Run CCAFS author corrections on DSpace Test and CGSpace and force a full discovery reindex</li> <li>Fix label of CCAFS subjects in Atmire Listings and Reports module</li> <li>Help Sisay with SQL commands</li> <li>Help Paola from CCAFS with the Atmire Listings and Reports module</li> <li>Testing the <code>fix-metadata-values.py</code> script on macOS and it seems like we don&rsquo;t need to use <code>.encode('utf-8')</code> anymore when printing strings to the screen</li> <li>It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string &ldquo;Entwicklung &amp; Ländlicher Raum&rdquo; without the <code>encode()</code> call, but print it as bytes 
when it <em>is</em> used:</li> </ul> <pre><code>$ python Python 3.6.0 (default, Dec 25 2016, 17:30:53) &gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum') Entwicklung &amp; Ländlicher Raum &gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum'.encode()) b'Entwicklung &amp; L\xc3\xa4ndlicher Raum' </code></pre> <ul> <li>So for now I will remove the encode call from the script (though it was never used on the versions on the Linux hosts), leading me to believe it really <em>was</em> a temporary problem, perhaps due to macOS or the Python build I was using.</li> </ul> <h2 id="2017-02-21">2017-02-21</h2> <ul> <li>Testing regenerating PDF thumbnails, like I started in 2016-11</li> <li>It seems there is a bug in <code>filter-media</code> that causes it to process formats that aren&rsquo;t part of its configuration:</li> </ul> <pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &quot;ImageMagick PDF Thumbnail&quot; File: earlywinproposal_esa_postharvest.pdf.jpg FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg' File: postHarvest.jpg.jpg FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg' </code></pre> <ul> <li>According to <code>dspace.cfg</code> the ImageMagick PDF Thumbnail plugin should only process PDFs:</li> </ul> <pre><code>filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000 filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF </code></pre> <ul> <li>I&rsquo;ve sent a message to the mailing list and might file a Jira issue</li> <li>Ask Atmire about the failed interpolation of the <code>dspace.internalUrl</code> variable in <code>atmire-cua.cfg</code></li> </ul> <h2 id="2017-02-22">2017-02-22</h2> <ul> <li>Atmire said I can add <code>dspace.internalUrl</code> to my build properties and the error will go away</li> <li>It should be the local URL for accessing 
Tomcat from the server&rsquo;s own perspective, ie: <a href="http://localhost:8080">http://localhost:8080</a></li> </ul> <h2 id="2017-02-26">2017-02-26</h2> <ul> <li>Find all fields with &ldquo;http://hdl.handle.net&rdquo; values (most are in <code>dc.identifier.uri</code>, but some are in other URL-related fields like <code>cg.link.reference</code>, <code>cg.identifier.dataurl</code>, and <code>cg.identifier.url</code>):</li> </ul> <pre><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%'; UPDATE 58633 </code></pre> <ul> <li>This works but I&rsquo;m thinking I&rsquo;ll wait on the replacement as there are perhaps some other places that rely on <code>http://hdl.handle.net</code> (grep the code, it&rsquo;s scary how many things are hard coded)</li> <li>Send message to dspace-tech mailing list with concerns about this</li> </ul> <h2 id="2017-02-27">2017-02-27</h2> <ul> <li>LDAP users cannot log in today, looks to be an issue with CGIAR&rsquo;s LDAP server:</li> </ul> <pre><code>$ openssl s_client -connect svcgroot2.cgiarad.org:3269 CONNECTED(00000003) depth=0 CN = SVCGROOT2.CGIARAD.ORG verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 CN = SVCGROOT2.CGIARAD.ORG verify error:num=21:unable to verify the first certificate verify return:1 --- Certificate chain 0 s:/CN=SVCGROOT2.CGIARAD.ORG i:/CN=CGIARAD-RDWA-CA --- </code></pre> <ul> <li>For some reason it is now signed by a private certificate authority</li> <li>This error seems to have started on 2017-02-25:</li> </ul> <pre><code>$ grep -c &quot;unable to find valid certification path&quot; 
[dspace]/log/dspace.log.2017-02-* [dspace]/log/dspace.log.2017-02-01:0 [dspace]/log/dspace.log.2017-02-02:0 [dspace]/log/dspace.log.2017-02-03:0 [dspace]/log/dspace.log.2017-02-04:0 [dspace]/log/dspace.log.2017-02-05:0 [dspace]/log/dspace.log.2017-02-06:0 [dspace]/log/dspace.log.2017-02-07:0 [dspace]/log/dspace.log.2017-02-08:0 [dspace]/log/dspace.log.2017-02-09:0 [dspace]/log/dspace.log.2017-02-10:0 [dspace]/log/dspace.log.2017-02-11:0 [dspace]/log/dspace.log.2017-02-12:0 [dspace]/log/dspace.log.2017-02-13:0 [dspace]/log/dspace.log.2017-02-14:0 [dspace]/log/dspace.log.2017-02-15:0 [dspace]/log/dspace.log.2017-02-16:0 [dspace]/log/dspace.log.2017-02-17:0 [dspace]/log/dspace.log.2017-02-18:0 [dspace]/log/dspace.log.2017-02-19:0 [dspace]/log/dspace.log.2017-02-20:0 [dspace]/log/dspace.log.2017-02-21:0 [dspace]/log/dspace.log.2017-02-22:0 [dspace]/log/dspace.log.2017-02-23:0 [dspace]/log/dspace.log.2017-02-24:0 [dspace]/log/dspace.log.2017-02-25:7 [dspace]/log/dspace.log.2017-02-26:8 [dspace]/log/dspace.log.2017-02-27:90 </code></pre> <ul> <li>Also, it seems that we need to use a different user for LDAP binds, as we&rsquo;re still using the temporary one from the root migration, so maybe we can go back to the previous user we were using</li> <li>So it looks like the certificate is invalid AND the bind users we had been using were deleted</li> <li>Biruk Debebe recreated the bind user and now we are just waiting for CGNET to update their certificates</li> <li>Regarding the <code>filter-media</code> issue I found earlier, it seems that the ImageMagick PDF plugin will also process JPGs if they are in the &ldquo;Content Files&rdquo; (aka <code>ORIGINAL</code>) bundle</li> <li>The problem likely lies in the logic of <code>ImageMagickThumbnailFilter.java</code>, as <code>ImageMagickPdfThumbnailFilter.java</code> extends it</li> <li>Run CIAT corrections on CGSpace</li> </ul> <pre><code>dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', 
confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture'; </code></pre> <ul> <li>CGNET has fixed the certificate chain on their LDAP server</li> <li>Redeploy CGSpace and DSpace Test on the latest <code>5_x-prod</code> branch with fixes for LDAP bind user</li> <li>Run all system updates on CGSpace server and reboot</li> </ul> <h2 id="2017-02-28">2017-02-28</h2> <ul> <li>After running the CIAT corrections and updating the Discovery and authority indexes, there is still no change in the number of items listed for CIAT in Discovery</li> <li>Ah, this is probably because some items have the <code>International Center for Tropical Agriculture</code> author twice, which I first noticed in 2016-12 but couldn&rsquo;t figure out how to fix</li> <li>I think I can do it by first exporting all metadatavalues that have the author <code>International Center for Tropical Agriculture</code></li> </ul> <pre><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv; COPY 1968 </code></pre> <ul> <li>And then use awk to print the duplicate lines to a separate file:</li> </ul> <pre><code>$ awk -F',' 'seen[$1]++' /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv </code></pre> <ul> <li>From that file I can create a list of 279 deletes and put them in a batch script like:</li> </ul> <pre><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061; </code></pre> January, 2017 https://alanorth.github.io/cgspace-notes/2017-01/ Mon, 02 Jan 2017 10:43:00 +0300 https://alanorth.github.io/cgspace-notes/2017-01/ <h2 id="2017-01-02">2017-01-02</h2> <ul> <li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li> <li>I tested on DSpace Test as well and it doesn&rsquo;t 
work there either</li> <li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years</li> </ul> <p></p> <h2 id="2017-01-04">2017-01-04</h2> <ul> <li>I tried to shard my local dev instance and it fails the same way:</li> </ul> <pre><code>$ JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace stats-util -s Moving: 9318 into core statistics-2016 Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016 org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2291) at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226) at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78) Caused by: org.apache.http.client.ClientProtocolException at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448) ... 10 more Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request failed. at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) ... 14 more Caused by: java.net.SocketException: Broken pipe (Write failed) at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) at java.net.SocketOutputStream.write(SocketOutputStream.java:153) at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181) at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124) at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181) at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132) at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89) at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108) at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117) at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265) at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203) at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121) at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685) ... 16 more </code></pre> <ul> <li>And the DSpace log shows:</li> </ul> <pre><code>2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016 2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016 2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}-&gt;http://localhost:8081: Broken pipe (Write failed) 2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-&gt;http://localhost:8081 </code></pre> <ul> <li>Despite failing instantly, a <code>statistics-2016</code> directory was created, but it only has a data dir (no conf)</li> <li>The Tomcat access logs show more:</li> </ul> <pre><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107 127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 423 127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 77 127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63 127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;GET 
/solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&quot; 200 4359517 127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 16248 127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.f
ilterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 409 156 
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &quot;POST /solr/datatables/update?wt=javabin&amp;version=2 HTTP/1.1&quot; 200 41 127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &quot;POST /solr/datatables/update HTTP/1.1&quot; 200 40 </code></pre> <ul> <li>Very interesting&hellip; it creates the core and then fails somehow</li> </ul> <h2 id="2017-01-08">2017-01-08</h2> <ul> <li>Put Sisay&rsquo;s <code>item-view.xsl</code> code to show mapped collections on CGSpace (<a href="https://github.com/ilri/DSpace/pull/295">#295</a>)</li> </ul> <h2 id="2017-01-09">2017-01-09</h2> <ul> <li>A user wrote to tell me that the new display of an item&rsquo;s mappings had a crazy bug for at least one item: <a href="https://cgspace.cgiar.org/handle/10568/78596">https://cgspace.cgiar.org/handle/10568/78596</a></li> <li>She said she only mapped it once, but it appears to be mapped 184 times</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2017/01/mapping-crazy-duplicate.png" alt="Crazy item mapping" /></p> <h2 id="2017-01-10">2017-01-10</h2> <ul> <li>I tried to clean up the duplicate mappings by exporting the item&rsquo;s metadata to CSV, editing, and re-importing, but DSpace said &ldquo;no changes were detected&rdquo;</li> <li>I&rsquo;ve asked on the dspace-tech mailing list to see if anyone can help</li> <li>I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help</li> <li>For example, this shows 186 mappings for the item, the first three of which are real:</li> </ul> <pre><code>dspace=# select * from collection2item where item_id = '80596'; </code></pre> <ul> <li>Then I deleted the others:</li> </ul> <pre><code>dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807); </code></pre> <ul> <li>And in the item view it now shows the correct mappings</li> <li>I will have to ask the DSpace people if this is a valid approach</li> <li>Finish looking at the Journal Title 
corrections of the top 500 Journal Titles so we can make a controlled vocabulary from it</li> </ul> <h2 id="2017-01-11">2017-01-11</h2> <ul> <li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li> <li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li> </ul> <pre><code>Traceback (most recent call last): File &quot;./fix-metadata-values.py&quot;, line 80, in &lt;module&gt; print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0])) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128) </code></pre> <ul> <li>Seems we need to encode as UTF-8 before printing to screen, ie:</li> </ul> <pre><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8'))) </code></pre> <ul> <li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li> <li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li> <li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu' </code></pre> <ul> <li>Now get the top 500 journal titles:</li> </ul> <pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv; </code></pre> <ul> <li>The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November</li> <li>I will have to go through these and fix some more before making the controlled vocabulary</li> 
<li>Added 30 or so more corrections; there are now 49 in total, and I&rsquo;ll have to regenerate the top 500 after applying them</li> </ul> <h2 id="2017-01-13">2017-01-13</h2> <ul> <li>Add <code>FOOD SYSTEMS</code> to CIAT subjects, waiting to merge: <a href="https://github.com/ilri/DSpace/pull/296">https://github.com/ilri/DSpace/pull/296</a></li> </ul> <h2 id="2017-01-16">2017-01-16</h2> <ul> <li>Fix the two items Maria found with duplicate mappings with this script:</li> </ul> <pre><code>/* 184 incorrect mappings: https://cgspace.cgiar.org/handle/10568/78596 */ delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807); /* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */ delete from collection2item where id = '91082'; </code></pre> <h2 id="2017-01-17">2017-01-17</h2> <ul> <li>Helping clean up some file names in the 232 CIAT records that Sisay worked on last week</li> <li>There are about 30 files with <code>%20</code> (space) and Spanish accents in the file name</li> <li>At first I thought we should fix these, but actually it is <a href="https://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1">prescribed by the W3C to convert these to UTF-8 and URL-encode them</a>!</li> <li>And the file names don&rsquo;t really matter either, as long as the SAF Builder tool can read them; after that DSpace renames them with a hash in the assetstore</li> <li>Seems like the only ones I should replace are the <code>'</code> apostrophe characters, as <code>%27</code>:</li> </ul> <pre><code>value.replace(&quot;'&quot;,'%27') </code></pre> <ul> <li>Add the item&rsquo;s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:</li> </ul> <pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value </code></pre> <ul> <li>Test importing of the new CIAT records (actually there are 232, not 234):</li> </ul> <pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; 
/home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log </code></pre> <ul> <li>Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB</li> <li>These are scanned from paper and likely have no compression, so we should test whether these compression techniques help without compromising the quality too much:</li> </ul> <pre><code>$ convert -compress Zip -density 150x150 input.pdf output.pdf $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf </code></pre> <ul> <li>Someone on the Internet suggested using a DPI of 144</li> </ul> <h2 id="2017-01-19">2017-01-19</h2> <ul> <li>In testing a random sample of CIAT&rsquo;s PDFs for compressibility, it looks like all of these methods generally increase the file size so we will just import them as they are</li> <li>Import 232 CIAT records into CGSpace:</li> </ul> <pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log </code></pre> <h2 id="2017-01-22">2017-01-22</h2> <ul> <li>Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel&rsquo;s CSV exporter)</li> <li>There were also some issues with an invalid <code>dc.date.issued</code> field, and I trimmed leading / trailing whitespace and cleaned up some URLs with unneeded parameters like <code>?show=full</code></li> </ul> <h2 id="2017-01-23">2017-01-23</h2> <ul> <li>I merged Atmire&rsquo;s pull request into the development branch so they can deploy it on DSpace Test</li> <li>Move some old ILRI Program communities to a new subcommunity for former programs 
(<sup>10568</sup>&frasl;<sub>79164</sub>):</li> </ul> <pre><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&quot;$community&quot; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&quot;$community&quot;; done </code></pre> <ul> <li>Move some collections with <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move-collections.sh</code></a> using the following config:</li> </ul> <pre><code>10568/42161 10568/171 10568/79341 10568/41914 10568/171 10568/79340 </code></pre> <h2 id="2017-01-24">2017-01-24</h2> <ul> <li>Run all updates on DSpace Test and reboot the server</li> <li>Run fixes for Journal titles on CGSpace:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password' </code></pre> <ul> <li>Create a new list of the top 500 journal titles from the database:</li> </ul> <pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv; </code></pre> <ul> <li>Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup, pull request (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li> <li>This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (<a href="https://github.com/ilri/DSpace/pull/69">#69</a>)</li> </ul> <h2 id="2017-01-25">2017-01-25</h2> <ul> <li>Atmire says the <code>com.atmire.statistics.util.UpdateSolrStorageReports</code> and <code>com.atmire.utils.ReportSender</code> are no longer necessary because they are using a Spring scheduler for these tasks now</li> <li>Pull request to 
remove them from the Ansible templates: <a href="https://github.com/ilri/rmg-ansible-public/pull/80">https://github.com/ilri/rmg-ansible-public/pull/80</a></li> <li>Still testing the Atmire modules on DSpace Test, and it looks like a few issues we had reported are now fixed: <ul> <li>XLS Export from Content statistics</li> <li>Most popular items</li> <li>Show statistics on collection pages</li> </ul></li> <li>But now we have a new issue with the &ldquo;Types&rdquo; in Content statistics not being respected: we only get the defaults, despite having custom settings in <code>dspace/config/modules/atmire-cua.cfg</code></li> </ul> <h2 id="2017-01-27">2017-01-27</h2> <ul> <li>Magdalena pointed out that somehow the Anonymous group had been added to the Administrators group on CGSpace (!)</li> <li>Discuss plans to update CCAFS metadata and communities for their new flagships and phase II project identifiers</li> <li>The flagships are in <code>cg.subject.ccafs</code>, and we probably need to make a new field for the phase II project identifiers</li> </ul> <h2 id="2017-01-28">2017-01-28</h2> <ul> <li>Merge controlled vocabulary for journal titles (<code>dc.source</code>) into CGSpace (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li> <li>Merge new CIAT subject into CGSpace (<a href="https://github.com/ilri/DSpace/pull/296">#296</a>)</li> </ul> <h2 id="2017-01-29">2017-01-29</h2> <ul> <li>Run all system updates on DSpace Test, redeploy DSpace code, and reboot the server</li> <li>Run all system updates on CGSpace, redeploy DSpace code, and reboot the server</li> </ul> December, 2016 https://alanorth.github.io/cgspace-notes/2016-12/ Fri, 02 Dec 2016 10:43:00 +0300 https://alanorth.github.io/cgspace-notes/2016-12/ <h2 id="2016-12-02">2016-12-02</h2> <ul> <li>CGSpace was down for five hours in the morning while I was sleeping</li> <li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li> </ul> <pre><code>2016-12-02 03:00:32,352 WARN 
com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;) 2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, 
TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;) </code></pre> <ul> <li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li> <li>I&rsquo;ve raised a ticket with Atmire to ask</li> <li>Another worrying error from dspace.log is:</li> </ul> <p></p> <pre><code>org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery; at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972) at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789) at javax.servlet.http.HttpServlet.service(HttpServlet.java:646) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:111) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:274) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at 
org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170) at com.googlecode.psiprobe.Tomcat70AgentValve.invoke(Tomcat70AgentValve.java:44) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1041) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:313) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NoSuchMethodError: 
com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery; at com.atmire.statistics.generator.TopNDSODatasetGenerator.toDatasetQuery(SourceFile:39) at com.atmire.statistics.display.StatisticsDataVisitsMultidata.createDataset(SourceFile:108) at org.dspace.statistics.content.StatisticsDisplay.createDataset(SourceFile:384) at org.dspace.statistics.content.StatisticsDisplay.getDataset(SourceFile:404) at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generateJsonData(SourceFile:170) at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generate(SourceFile:246) at com.atmire.app.xmlui.aspect.statistics.JSONStatsMostPopular.generate(JSONStatsMostPopular.java:145) at sun.reflect.GeneratedMethodAccessor296.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71) at com.sun.proxy.$Proxy96.process(Unknown Source) at org.apache.cocoon.components.treeprocessor.sitemap.ReadNode.invoke(ReadNode.java:94) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55) at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55) at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78) at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143) at 
org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78) at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81) at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239) at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171) at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247) at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55) at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78) at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143) at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78) at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81) at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239) at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171) at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247) at org.apache.cocoon.servlet.RequestProcessor.process(RequestProcessor.java:351) at org.apache.cocoon.servlet.RequestProcessor.service(RequestProcessor.java:169) at org.apache.cocoon.sitemap.SitemapServlet.service(SitemapServlet.java:84) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at 
org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:468) at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443) at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172) at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202) at com.sun.proxy.$Proxy89.service(Unknown Source) at org.dspace.springmvc.CocoonView.render(CocoonView.java:113) at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1180) at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:950) ... 35 more </code></pre> <ul> <li>The first error I see in dspace.log this morning is:</li> </ul> <pre><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&quot;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&quot; org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority </code></pre> <ul> <li>Looking through DSpace&rsquo;s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries</li> <li>The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:</li> </ul> <pre><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select 
params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&quot;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&quot;++AND+subject_mtdt:&quot;PARTNERSHIPS&quot;+AND+subject_mtdt:&quot;RESEARCH&quot;+AND+subject_mtdt:&quot;AGRICULTURE&quot;+AND+subject_mtdt:&quot;DEVELOPMENT&quot;++AND+iso_mtdt:&quot;en&quot;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19 2016-12-02 08:28:23,908 INFO org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init() </code></pre> <ul> <li>DSpace&rsquo;s own Solr logs don&rsquo;t give IP addresses, so I will have to enable Nginx&rsquo;s logging of <code>/solr</code> so I can see where this request came from</li> <li>I enabled logging of <code>/rest/</code> and I think I&rsquo;ll leave it on for good</li> <li>Also, the disk is nearly full because of log file issues, so I&rsquo;m running some compression on DSpace logs</li> <li>Normally these stay uncompressed for a month just in case we need to look at them, so now I&rsquo;ve just compressed anything older than 2 weeks so we can get some disk space back</li> </ul> <h2 id="2016-12-04">2016-12-04</h2> <ul> <li>I got a weird report from the CGSpace checksum checker this morning</li> <li>It says 732 bitstreams have potential issues, for example:</li> </ul> <pre><code>------------------------------------------------ Bitstream Id = 6 Process Start Date = Dec 4, 2016 Process End Date = Dec 4, 2016 Checksum Expected = a1d9eef5e2d85f50f67ce04d0329e96a Checksum Calculated = a1d9eef5e2d85f50f67ce04d0329e96a Result = Bitstream marked deleted in bitstream table ----------------------------------------------- ... 
------------------------------------------------ Bitstream Id = 77581 Process Start Date = Dec 4, 2016 Process End Date = Dec 4, 2016 Checksum Expected = 9959301aa4ca808d00957dff88214e38 Checksum Calculated = Result = The bitstream could not be found ----------------------------------------------- </code></pre> <ul> <li>The first one seems ok, but I don&rsquo;t know what to make of the second one&hellip;</li> <li>I had a look and there is indeed no file with the second checksum in the assetstore (ie, looking in <code>[dspace-dir]/assetstore/99/59/30/...</code>)</li> <li>For what it&rsquo;s worth, there is no item on DSpace Test or S3 backups with that checksum either&hellip;</li> <li>In other news, I&rsquo;m looking at JVM settings from the Solr 4.10.2 release, from <code>bin/solr.in.sh</code>:</li> </ul> <pre><code># These GC settings have shown to work well for a number of common Solr workloads GC_TUNE=&quot;-XX:-UseSuperWord \ -XX:NewRatio=3 \ -XX:SurvivorRatio=4 \ -XX:TargetSurvivorRatio=90 \ -XX:MaxTenuringThreshold=8 \ -XX:+UseConcMarkSweepGC \ -XX:+UseParNewGC \ -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \ -XX:+CMSScavengeBeforeRemark \ -XX:PretenureSizeThreshold=64m \ -XX:CMSFullGCsBeforeCompaction=1 \ -XX:+UseCMSInitiatingOccupancyOnly \ -XX:CMSInitiatingOccupancyFraction=50 \ -XX:CMSTriggerPermRatio=80 \ -XX:CMSMaxAbortablePrecleanTime=6000 \ -XX:+CMSParallelRemarkEnabled \ -XX:+ParallelRefProcEnabled \ -XX:+AggressiveOpts&quot; </code></pre> <ul> <li>I need to try these because they are recommended by the Solr project itself</li> <li>Also, as always, I need to read <a href="https://wiki.apache.org/solr/ShawnHeisey">Shawn Heisey&rsquo;s wiki page on Solr</a></li> </ul> <h2 id="2016-12-05">2016-12-05</h2> <ul> <li>I did some basic benchmarking on a local DSpace before and after the JVM settings above, but there wasn&rsquo;t anything amazingly obvious</li> <li>I want to make the changes on DSpace Test and monitor the JVM heap graphs for a few days to see 
if they change the JVM GC patterns or anything (munin graphs)</li> <li>Spin up new CGSpace server on Linode</li> <li>I did a few traceroutes from Jordan and Kenya and it seems that Linode&rsquo;s Frankfurt datacenter is a few hops closer and perhaps has less packet loss than the London one, so I put the new server in Frankfurt</li> <li>Do initial provisioning</li> <li>Atmire responded about the MQM warnings in the DSpace logs</li> <li>Apparently we need to change the batch edit consumers in <code>dspace/config/dspace.cfg</code>:</li> </ul> <pre><code>event.consumer.batchedit.filters = Community|Collection+Create </code></pre> <ul> <li>I haven&rsquo;t tested it yet, but I created a pull request: <a href="https://github.com/ilri/DSpace/pull/289">#289</a></li> </ul> <h2 id="2016-12-06">2016-12-06</h2> <ul> <li>Some author authority corrections and name standardizations for Peter:</li> </ul> <pre><code>dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%'; UPDATE 11 dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%'; UPDATE 36 dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$'; UPDATE 14 dspace=# update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%'; UPDATE 42 dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%'; UPDATE 360 dspace=# update 
metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%'; UPDATE 561 </code></pre> <ul> <li>Pay attention to the regex to prevent false positives in tricky cases with Dutch names!</li> <li>I will run these updates on DSpace Test and then force a Discovery reindex, and then run them on CGSpace next week</li> <li>More work on the KM4Dev Journal article</li> <li>In other news, it seems the batch edit patch is working, there are no more WARN errors in the logs and the batch edit seems to work</li> <li>I need to check the CGSpace logs to see if there are still errors there, and then deploy/monitor it there</li> <li>Paola from CCAFS mentioned she also has the &ldquo;take task&rdquo; bug on CGSpace</li> <li>Reading about <a href="https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html"><code>shared_buffers</code> in PostgreSQL configuration</a> (default is 128MB)</li> <li>Looks like we have ~5GB of memory used by caches on the test server (after OS and JVM heap!), so we might as well bump up the buffers for Postgres</li> <li>The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn&rsquo;t dedicated (also runs Solr, which can benefit from OS cache) so let&rsquo;s try 1024MB</li> <li>In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):</li> </ul> <pre><code>$ time JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace index-authority Retrieving all data Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer Exception: null java.lang.NullPointerException at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82) at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39) at 
org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201) at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132) at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144) at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144) at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159) at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144) at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144) at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226) at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78) real 8m39.913s user 1m54.190s sys 0m22.647s </code></pre> <h2 id="2016-12-07">2016-12-07</h2> <ul> <li>For what it&rsquo;s worth, after running the same SQL updates on my local test server, <code>index-authority</code> runs and completes just fine</li> <li>I will have to test more</li> <li>Anyways, I noticed that some of the authority values I set actually have versions of author names we don&rsquo;t want, ie &ldquo;Grace, D.&rdquo;</li> <li>For example, do a Solr query for &ldquo;first_name:Grace&rdquo; and look at the results</li> <li>Querying that ID shows the fields that need to be changed:</li> </ul> <pre><code>{ &quot;responseHeader&quot;: { &quot;status&quot;: 0, &quot;QTime&quot;: 1, &quot;params&quot;: { &quot;q&quot;: &quot;id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b&quot;, 
&quot;indent&quot;: &quot;true&quot;, &quot;wt&quot;: &quot;json&quot;, &quot;_&quot;: &quot;1481102189244&quot; } }, &quot;response&quot;: { &quot;numFound&quot;: 1, &quot;start&quot;: 0, &quot;docs&quot;: [ { &quot;id&quot;: &quot;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&quot;, &quot;field&quot;: &quot;dc_contributor_author&quot;, &quot;value&quot;: &quot;Grace, D.&quot;, &quot;deleted&quot;: false, &quot;creation_date&quot;: &quot;2016-11-10T15:13:40.318Z&quot;, &quot;last_modified_date&quot;: &quot;2016-11-10T15:13:40.318Z&quot;, &quot;authority_type&quot;: &quot;person&quot;, &quot;first_name&quot;: &quot;D.&quot;, &quot;last_name&quot;: &quot;Grace&quot; } ] } } </code></pre> <ul> <li>I think I can just update the <code>value</code>, <code>first_name</code>, and <code>last_name</code> fields&hellip;</li> <li>The update syntax should be something like this, but I&rsquo;m getting errors from Solr:</li> </ul> <pre><code>$ curl 'localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true' -H 'Content-type:application/json' -d '[{&quot;id&quot;:&quot;1&quot;,&quot;price&quot;:{&quot;set&quot;:100}}]' { &quot;responseHeader&quot;:{ &quot;status&quot;:400, &quot;QTime&quot;:0}, &quot;error&quot;:{ &quot;msg&quot;:&quot;Unexpected character '[' (code 91) in prolog; expected '&lt;'\n at [row,col {unknown-source}]: [1,1]&quot;, &quot;code&quot;:400}} </code></pre> <ul> <li>When I try using the XML format I get an error that the <code>updateLog</code> needs to be configured for that core</li> <li>Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?</li> </ul> <pre><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%'; UPDATE 561 </code></pre> <ul> <li>Then I&rsquo;ll reindex discovery and authority and see how the authority Solr core looks</li> <li>After 
this, now there are authorities for some of the &ldquo;Grace, D.&rdquo; and &ldquo;Grace, Delia&rdquo; text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):</li> </ul> <pre><code>$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true' { &quot;responseHeader&quot;:{ &quot;status&quot;:0, &quot;QTime&quot;:0, &quot;params&quot;:{ &quot;q&quot;:&quot;id:18ea1525-2513-430a-8817-a834cd733fbc&quot;, &quot;indent&quot;:&quot;true&quot;, &quot;wt&quot;:&quot;json&quot;}}, &quot;response&quot;:{&quot;numFound&quot;:1,&quot;start&quot;:0,&quot;docs&quot;:[ { &quot;id&quot;:&quot;18ea1525-2513-430a-8817-a834cd733fbc&quot;, &quot;field&quot;:&quot;dc_contributor_author&quot;, &quot;value&quot;:&quot;Grace, Delia&quot;, &quot;deleted&quot;:false, &quot;creation_date&quot;:&quot;2016-12-07T10:54:34.356Z&quot;, &quot;last_modified_date&quot;:&quot;2016-12-07T10:54:34.356Z&quot;, &quot;authority_type&quot;:&quot;person&quot;, &quot;first_name&quot;:&quot;Delia&quot;, &quot;last_name&quot;:&quot;Grace&quot;}] }} </code></pre> <ul> <li>So now I could set them all to this ID and the name would be ok, but there has to be a better way!</li> <li>In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!</li> <li>Better to use:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%'; </code></pre> <ul> <li>This proves that unifying author name variants in authorities is easy, but fixing the name in the authority is tricky!</li> <li>Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get 
synced from PostgreSQL to Solr, then set the other text_values to use that authority ID</li> <li>Deploy MQM WARN fix on CGSpace (<a href="https://github.com/ilri/DSpace/pull/289">#289</a>)</li> <li>Deploy &ldquo;take task&rdquo; hack/fix on CGSpace (<a href="https://github.com/ilri/DSpace/pull/290">#290</a>)</li> <li>I ran the following author corrections and then reindexed discovery:</li> </ul> <pre><code>update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%'; update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%'; update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$'; update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%'; update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%'; update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%'; </code></pre> <h2 id="2016-12-08">2016-12-08</h2> <ul> <li>Something weird happened and Peter Thorne&rsquo;s names all ended up as &ldquo;Thorne&rdquo;, I guess because the original authority had that as its name value:</li> </ul> <pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%'; text_value | 
authority | confidence ------------------+--------------------------------------+------------ Thorne, P.J. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600 Thorne | 18349f29-61b1-44d7-ac60-89e55546e812 | 600 Thorne-Lyman, A. | 0781e13a-1dc8-4e3f-82e8-5c422b44a344 | -1 Thorne, M. D. | 54c52649-cefd-438d-893f-3bcef3702f07 | -1 Thorne, P.J | 18349f29-61b1-44d7-ac60-89e55546e812 | 600 Thorne, P. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600 (6 rows) </code></pre> <ul> <li>I generated a new UUID using <code>uuidgen | tr [A-Z] [a-z]</code> and set it along with correct name variation for all records:</li> </ul> <pre><code>dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812'; UPDATE 43 </code></pre> <ul> <li>Apparently we also need to normalize Phil Thornton&rsquo;s names to <code>Thornton, Philip K.</code>: <br /></li> </ul> <pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*'; text_value | authority | confidence ---------------------+--------------------------------------+------------ Thornton, P | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton, P K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton, P K | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton. P.K. | 3e1e6639-d4fb-449e-9fce-ce06b5b0f702 | -1 Thornton, P K . | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton, P.K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton, P.K | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton, Philip K | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton, Philip K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 Thornton, P. K. 
| 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600 (10 rows) </code></pre> <ul> <li>Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:</li> </ul> <pre><code>dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*'; UPDATE 362 </code></pre> <ul> <li>It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)</li> <li>Everything looks ok after authority and discovery reindex</li> <li>In other news, I think we should really be using more RAM for PostgreSQL&rsquo;s <code>shared_buffers</code></li> <li>The <a href="https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html">PostgreSQL documentation</a> recommends using 25% of the system&rsquo;s RAM on dedicated systems, but we should use a bit less since we also have a massive JVM heap and also benefit from some RAM being used by the OS cache</li> </ul> <h2 id="2016-12-09">2016-12-09</h2> <ul> <li>More work on finishing rough draft of KM4Dev article</li> <li>Set PostgreSQL&rsquo;s <code>shared_buffers</code> on CGSpace to 10% of system RAM (1200MB)</li> <li>Run the following author corrections on CGSpace:</li> </ul> <pre><code>dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' 
where resource_type_id=2 and metadata_field_id=3 and (authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993'); dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*'; </code></pre> <ul> <li>The authority IDs are now different from when I looked a few days ago, so I had to adjust them here</li> </ul> <h2 id="2016-12-11">2016-12-11</h2> <ul> <li>After enabling a sizable <code>shared_buffers</code> for CGSpace&rsquo;s PostgreSQL configuration the number of connections to the database dropped significantly</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/12/postgres_bgwriter-week.png" alt="postgres_bgwriter-week" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/12/postgres_connections_ALL-week.png" alt="postgres_connections_ALL-week" /></p> <ul> <li>Looking at CIAT records from last week again, they have a lot of double authors like:</li> </ul> <pre><code>International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600 International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::500 International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::0 </code></pre> <ul> <li>Some in the same <code>dc.contributor.author</code> field, and some in others like <code>dc.contributor.author[en_US]</code> etc</li> <li>Removing the duplicates in OpenRefine and uploading a CSV to DSpace says &ldquo;no changes detected&rdquo;</li> <li>Seems like the only way to sort of clean these up would be to start in SQL:</li> </ul> <pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture'; text_value | authority | 
confidence -----------------------------------------------+--------------------------------------+------------ International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | -1 International Center for Tropical Agriculture | | 600 International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 500 International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | 600 International Center for Tropical Agriculture | | -1 International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | 500 International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 600 International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | -1 International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 0 dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture'; UPDATE 1693 dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', text_value='International Center for Tropical Agriculture', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%CIAT%'; UPDATE 35 </code></pre> <ul> <li>Work on article for KM4Dev journal</li> </ul> <h2 id="2016-12-13">2016-12-13</h2> <ul> <li>Checking in on CGSpace postgres stats again, looks like the <code>shared_buffers</code> change from a few days ago really made a big impact:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/12/postgres_bgwriter-week-2016-12-13.png" alt="postgres_bgwriter-week" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/12/postgres_connections_ALL-week-2016-12-13.png" alt="postgres_connections_ALL-week" /></p> <ul> <li>Looking at logs, it seems we need to evaluate which logs we keep and for how long</li> 
<li>Basically the only ones we <em>need</em> are <code>dspace.log</code> because those are used for legacy statistics (need to keep for 1 month)</li> <li>Other logs will be an issue because they don&rsquo;t have date stamps</li> <li>I will add date stamps to the logs we&rsquo;re storing from the tomcat7 user&rsquo;s cron jobs at least, using: <code>$(date --iso-8601)</code></li> <li>Would probably be better to make custom logrotate files for them in the future</li> <li>Clean up some unneeded log files from 2014 (they weren&rsquo;t large, just don&rsquo;t need them)</li> <li>So basically, new cron jobs for logs should look something like this:</li> <li>Find any file named <code>*.log*</code> that isn&rsquo;t <code>dspace.log*</code>, isn&rsquo;t already zipped, and is older than one day, and zip it:</li> </ul> <pre><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &quot;.*\.log.*&quot; ! -iregex &quot;.*dspace\.log.*&quot; ! -iregex &quot;.*\.(gz|lrz|lzo|xz)&quot; ! 
-newermt &quot;Yesterday&quot; -exec schedtool -B -e ionice -c2 -n7 xz {} \; </code></pre> <ul> <li>Since there is <code>xzgrep</code> and <code>xzless</code> we can actually just zip them after one day, why not?!</li> <li>We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that</li> <li>I use <code>schedtool -B</code> and <code>ionice -c2 -n7</code> to set the CPU scheduling to <code>SCHED_BATCH</code> and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less</li> <li>When the tasks are running you can see that the policies do apply:</li> </ul> <pre><code>$ schedtool $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}') &amp;&amp; ionice -p $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}') PID 17049: PRIO 0, POLICY B: SCHED_BATCH , NICE 0, AFFINITY 0xf best-effort: prio 7 </code></pre> <ul> <li>All in all this should free up a few gigs (we were at 9.3GB free when I started)</li> <li>Next thing to look at is whether we need Tomcat&rsquo;s access logs</li> <li>I just looked and it seems that we saved 10GB by zipping these logs</li> <li>Some users pointed out issues with the &ldquo;most popular&rdquo; stats on a community or collection</li> <li>This error appears in the logs when you try to view them:</li> </ul> <pre><code>2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request! 
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery; at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972) at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852) at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882) at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789) at javax.servlet.http.HttpServlet.service(HttpServlet.java:650) at javax.servlet.http.HttpServlet.service(HttpServlet.java:731) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:111) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:274) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NoSuchMethodError: 
com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery; at com.atmire.statistics.generator.TopNDSODatasetGenerator.toDatasetQuery(SourceFile:39) at com.atmire.statistics.display.StatisticsDataVisitsMultidata.createDataset(SourceFile:108) at org.dspace.statistics.content.StatisticsDisplay.createDataset(SourceFile:384) at org.dspace.statistics.content.StatisticsDisplay.getDataset(SourceFile:404) at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generateJsonData(SourceFile:170) at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generate(SourceFile:246) at com.atmire.app.xmlui.aspect.statistics.JSONStatsMostPopular.generate(JSONStatsMostPopular.java:145) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) </code></pre> <ul> <li>It happens on development and production, so I will have to ask Atmire</li> <li>Most likely an issue with installation/configuration</li> </ul> <h2 id="2016-12-14">2016-12-14</h2> <ul> <li>Atmire sent a quick fix for the <code>last-update.txt</code> file not found error</li> <li>After applying pull request <a href="https://github.com/ilri/DSpace/pull/291">#291</a> on DSpace Test I no longer see the error in the logs after the <code>UpdateSolrStorageReports</code> task runs</li> <li>Also, I&rsquo;m toying with the idea of moving the <code>tomcat7</code> user&rsquo;s cron jobs to <code>/etc/cron.d</code> so we can manage them in Ansible</li> <li>Made a pull request with a template for the cron jobs (<a href="https://github.com/ilri/rmg-ansible-public/pull/75">#75</a>)</li> <li>Testing SMTP from the new CGSpace server and it&rsquo;s not working, I&rsquo;ll have to tell James</li> </ul> <h2 id="2016-12-15">2016-12-15</h2> <ul> <li>Start planning for server migration this weekend, letting users know</li> <li>I am trying to figure out what the process is to <a href="http://handle.net/hnr_support.html">update the 
server&rsquo;s IP in the Handle system</a>, and emailing the hdladmin account bounces(!)</li> <li>I will contact Jane Euler directly as I know I&rsquo;ve corresponded with her in the past</li> <li>She said that I should indeed just re-run the <code>[dspace]/bin/dspace make-handle-server</code> command and submit the new <code>sitebndl.zip</code> file to the CNRI website</li> <li>Also I was troubleshooting some workflow issues from Bizuwork</li> <li>I re-created the same scenario by adding a non-admin account and submitting an item, but I was able to successfully approve and commit it</li> <li>So it turns out it&rsquo;s not a bug, it&rsquo;s just that Peter was added as a reviewer/admin AFTER the items were submitted</li> <li>This is how DSpace works, and I need to ask if there is a way to override someone&rsquo;s submission, as the other reviewer seems to not be paying attention, or has perhaps taken the item from the task pool?</li> <li>Run a batch edit to add &ldquo;RANGELANDS&rdquo; ILRI subject to all items containing the word &ldquo;RANGELANDS&rdquo; in their metadata for Peter Ballantyne</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/12/batch-edit1.png" alt="Select all items with &quot;rangelands&quot; in metadata" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/12/batch-edit2.png" alt="Add RANGELANDS ILRI subject" /></p> <h2 id="2016-12-18">2016-12-18</h2> <ul> <li>Add four new CRP subjects for 2017 and sort the input forms alphabetically (<a href="https://github.com/ilri/DSpace/pull/294">#294</a>)</li> <li>Test the SMTP on the new server and it&rsquo;s working</li> <li>Last week, when we asked CGNET to update the DNS records this weekend, they misunderstood and did it immediately</li> <li>We quickly told them to undo it, but I just realized they didn&rsquo;t undo the IPv6 AAAA record!</li> <li>None of our users in African institutes will have IPv6, but some Europeans might, so I need to check 
if any submissions have been added since then</li> <li>Update some names and authorities in the database:</li> </ul> <pre><code>dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*'; UPDATE 204 dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa9', confidence=600, text_value='Mekonnen, Kindu' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Mekonnen, K%'; UPDATE 89 dspace=# update metadatavalue set authority='f840da02-26e7-4a74-b7ba-3e2b723f3684', confidence=600, text_value='Lukuyu, Ben A.' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Lukuyu, B%'; UPDATE 140 </code></pre> <ul> <li>Generated a new UUID for Ben using <code>uuidgen | tr [A-Z] [a-z]</code> as the one in Solr had his ORCID but the name format was incorrect</li> <li>In theory DSpace should be able to check names from ORCID and update the records in the database, but I find that this doesn&rsquo;t work (see Jira bug <a href="https://jira.duraspace.org/browse/DS-3302">DS-3302</a>)</li> <li>I need to run these updates along with the other one for CIAT that I found last week</li> <li>Enable OCSP stapling for hosts &gt;= Ubuntu 16.04 in our Ansible playbooks (<a href="https://github.com/ilri/rmg-ansible-public/pull/76">#76</a>)</li> <li>Working for DSpace Test on the second response:</li> </ul> <pre><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status ... OCSP response: no response sent $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status ... OCSP Response Data: ... 
Cert Status: good </code></pre> <ul> <li>Migrate CGSpace to new server, roughly following these steps:</li> <li>On old server:</li> </ul> <pre><code># service tomcat7 stop # /home/backup/scripts/postgres_backup.sh </code></pre> <ul> <li>On new server:</li> </ul> <pre><code># systemctl stop tomcat7 # rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/assetstore/ /home/cgspace.cgiar.org/assetstore/ # rsync -4 -av --delete 178.79.187.182:/home/backup/ /home/backup/ # rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/solr/ /home/cgspace.cgiar.org/solr # su - postgres $ dropdb cgspace $ createdb -O cgspace --encoding=UNICODE cgspace $ psql cgspace -c 'alter user cgspace createuser;' $ pg_restore -O -U cgspace -d cgspace -W -h localhost /home/backup/postgres/cgspace_2016-12-18.backup $ psql cgspace -c 'alter user cgspace nocreateuser;' $ psql -U cgspace -f ~tomcat7/src/git/DSpace/dspace/etc/postgres/update-sequences.sql cgspace -h localhost $ vacuumdb cgspace $ psql cgspace postgres=# \i /tmp/author-authority-updates-2016-12-11.sql postgres=# \q $ exit # chown -R tomcat7:tomcat7 /home/cgspace.cgiar.org # rsync -4 -av 178.79.187.182:/home/cgspace.cgiar.org/log/*.dat /home/cgspace.cgiar.org/log/ # rsync -4 -av 178.79.187.182:/home/cgspace.cgiar.org/log/dspace.log.2016-1[12]* /home/cgspace.cgiar.org/log/ # su - tomcat7 $ cd src/git/DSpace/dspace/target/dspace-installer $ ant update clean_backups $ exit # systemctl start tomcat7 </code></pre> <ul> <li>It took about twenty minutes and afterwards I had to check a few things, like: <ul> <li>check and enable systemd timer for let&rsquo;s encrypt</li> <li>enable root cron jobs</li> <li>disable root cron jobs on old server after!</li> <li>enable tomcat7 cron jobs</li> <li>disable tomcat7 cron jobs on old server after!</li> <li>regenerate <code>sitebndl.zip</code> with new IP for handle server and submit it to Handle.net</li> </ul></li> </ul> <h2 id="2016-12-22">2016-12-22</h2> <ul> <li>Abenet wanted a CSV of 
the IITA community, but the web export doesn&rsquo;t include the <code>dc.date.accessioned</code> field</li> <li>I had to export it from the command line using the <code>-a</code> flag:</li> </ul> <pre><code>$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616 </code></pre> <h2 id="2016-12-28">2016-12-28</h2> <ul> <li>We&rsquo;ve been getting two alerts per day about CPU usage on the new server from Linode</li> <li>These are caused by the batch jobs for Solr etc that run in the early morning hours</li> <li>The Linode default is to alert at 90% CPU usage for two hours, but I see the old server was at 150%, so maybe we just need to adjust it</li> <li>Speaking of the old server (linode01), I think we can decommission it now</li> <li>I checked the S3 logs on the new server (linode18) to make sure the backups have been running and everything looks good</li> <li>In other news, I was looking at the Munin graphs for PostgreSQL on the new server and it looks slightly worrying:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/12/postgres_size_ALL-week.png" alt="munin postgres stats" /></p> <ul> <li>I will have to check later why the size keeps increasing</li> </ul> November, 2016 https://alanorth.github.io/cgspace-notes/2016-11/ Tue, 01 Nov 2016 09:21:00 +0300 https://alanorth.github.io/cgspace-notes/2016-11/ <h2 id="2016-11-01">2016-11-01</h2> <ul> <li>Add <code>dc.type</code> to the output options for Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type" /></p> <p></p> <h2 id="2016-11-02">2016-11-02</h2> <ul> <li>Migrate DSpace Test to DSpace 5.5 (<a href="https://gist.github.com/alanorth/61013895c6efe7095d7f81000953d1cf">notes</a>)</li> <li>Run all updates on DSpace Test and reboot the server</li> <li>Looks like 
the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (<a href="https://github.com/ilri/DSpace/issues/63">#63</a>)</li> <li>Indexing Discovery on DSpace Test took 332 minutes, which is like five times as long as it usually takes</li> <li>At the end it appeared to finish correctly but there were lots of errors right after it finished:</li> </ul> <pre><code>2016-11-02 15:09:48,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index 2016-11-02 15:09:48,584 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index 2016-11-02 15:09:48,589 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76455 to Index 2016-11-02 15:09:48,590 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/51693 to Index 2016-11-02 15:09:48,590 INFO org.dspace.discovery.IndexClient @ Done with indexing 2016-11-02 15:09:48,600 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76456 to Index 2016-11-02 15:09:48,613 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/55536 to Index 2016-11-02 15:09:48,616 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76457 to Index 2016-11-02 15:09:48,634 ERROR com.atmire.dspace.discovery.AtmireSolrService @ java.lang.NullPointerException at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57) at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824) at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821) at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898) at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370) at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945) </code></pre> <ul> <li>DSpace is still up, and a few minutes later I see the default 
DSpace indexer is still running</li> <li>Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:</li> </ul> <pre><code>2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index 2016-11-02 15:09:28,633 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index 2016-11-02 15:09:28,678 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557 2016-11-02 15:09:28,688 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476 </code></pre> <ul> <li>I will raise a ticket with Atmire to ask them</li> </ul> <h2 id="2016-11-06">2016-11-06</h2> <ul> <li>After re-deploying and re-indexing I didn&rsquo;t see the same issue, and the indexing completed in 85 minutes, which is about how long it is supposed to take</li> </ul> <h2 id="2016-11-07">2016-11-07</h2> <ul> <li>Horrible one liner to get Linode ID from certain Ansible host vars:</li> </ul> <pre><code>$ grep -A 3 contact_info * | grep -E &quot;(Orth|Sisay|Peter|Daniel|Tsega)&quot; | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id </code></pre> <ul> <li>I noticed some weird CRPs in the database, and they don&rsquo;t show up in Discovery for some reason, perhaps the <code>:</code></li> <li>I&rsquo;ll export these and fix them in batch:</li> </ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv; COPY 22 </code></pre> <ul> <li>Test running the replacements:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu' </code></pre> <ul> <li>Add <code>AMR</code> to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/288">#288</a>)</li> </ul> 
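<p>A minimal sketch of how a CSV-driven batch correction like the CRP one above could be turned into SQL updates. This is not the actual <code>fix-metadata-values.py</code> script: the function name <code>build_updates</code>, the sample rows, the <code>resource_type_id=2</code> filter, and the assumption that the CSV has one column named after the field and one named <code>correct</code> are all hypothetical and only for illustration:</p>

```python
import csv
import io

# Hypothetical helper: not part of DSpace or of fix-metadata-values.py.
# The resource_type_id=2 filter (items only) is an assumption.
UPDATE_SQL = (
    "UPDATE metadatavalue SET text_value=%s "
    "WHERE resource_type_id=2 AND metadata_field_id=%s AND text_value=%s"
)


def build_updates(csv_text, from_column, to_column, metadata_field_id):
    """Return (sql, params) pairs for rows whose correction differs from the original."""
    updates = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        old = row[from_column]
        new = row[to_column].strip()
        # Skip rows with no correction, or where nothing would change
        if new and new != old:
            updates.append((UPDATE_SQL, (new, metadata_field_id, old)))
    return updates


# Made-up sample rows, only for illustration
sample = (
    "cg.contributor.crp,correct\n"
    "Old Value A,New Value A\n"
    "Unchanged Value,Unchanged Value\n"
    "Old Value B,\n"
)

for sql, params in build_updates(sample, "cg.contributor.crp", "correct", 230):
    print(params)
```

<p>Each pair keeps the values separate from the SQL text, so it could be handed to a parameterized call such as psycopg2&rsquo;s <code>cursor.execute(sql, params)</code> rather than being interpolated into the query by hand.</p>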
<h2 id="2016-11-08">2016-11-08</h2> <ul> <li>Atmire&rsquo;s Listings and Reports module seems to be broken on DSpace 5.5</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/listings-and-reports-55.png" alt="Listings and Reports broken in DSpace 5.5" /></p> <ul> <li>I&rsquo;ve filed a ticket with Atmire</li> <li>Thinking about batch updates for ORCIDs and authors</li> <li>Playing with <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> in Python to query Solr</li> <li>All records in the authority core are either <code>authority_type:orcid</code> or <code>authority_type:person</code></li> <li>There is a <code>deleted</code> field and it seems to be <code>false</code> for all records, but it might be an important sanity check to remember</li> <li>The way to go is probably to have a CSV of author names and authority IDs, then to batch update them in PostgreSQL</li> <li>Dump of the top ~200 authors in CGSpace:</li> </ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv; </code></pre> <h2 id="2016-11-09">2016-11-09</h2> <ul> <li>CGSpace crashed so I quickly ran system updates, applied one or two of the waiting changes from the <code>5_x-prod</code> branch, and rebooted the server</li> <li>The error was <code>Timeout waiting for idle object</code> but I haven&rsquo;t looked into the Tomcat logs to see what happened</li> <li>Also, I ran the corrections for CRPs from earlier this week</li> </ul> <h2 id="2016-11-10">2016-11-10</h2> <ul> <li>Helping Megan Zandstra and CIAT with some questions about the REST API</li> <li>Playing with <code>find-by-metadata-field</code>, this works:</li> </ul> <pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: 
&quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}' </code></pre> <ul> <li>But the results are deceiving because metadata fields can have text languages and your query must match exactly!</li> </ul> <pre><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS'; text_value | text_lang ------------+----------- SEEDS | SEEDS | SEEDS | en_US (3 rows) </code></pre> <ul> <li>So basically, the text language here could be null, blank, or en_US</li> <li>To query metadata with these properties, you can do:</li> </ul> <pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}' | jq length 55 $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;&quot;}' | jq length 34 $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;en_US&quot;}' | jq length </code></pre> <ul> <li>The results (55+34=89) don&rsquo;t seem to match those from the database:</li> </ul> <pre><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null; count ------- 15 dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang=''; count ------- 4 dspace=# select 
count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US'; count ------- 66 </code></pre> <ul> <li>So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85&hellip;</li> <li>And the <code>find-by-metadata-field</code> endpoint doesn&rsquo;t seem to have a way to get all items with the field, or a wildcard value</li> <li>I&rsquo;ll ask a question on the dspace-tech mailing list</li> <li>And speaking of <code>text_lang</code>, this is interesting:</li> </ul> <pre><code>dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2; text_lang ----------- ethnob en spa EN es frn en_ en_US EN_US eng en_U fr (14 rows) </code></pre> <ul> <li>Generate a list of all these so I can maybe fix them in batch:</li> </ul> <pre><code>dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv; COPY 14 </code></pre> <ul> <li>Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:</li> </ul> <pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS'; UPDATE 85 </code></pre> <ul> <li>The <code>fix-metadata.py</code> script I have is meant for specific metadata values, so if I want to update some <code>text_lang</code> values I should just do it directly in the database</li> <li>For example, on a limited set:</li> </ul> <pre><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang=''; UPDATE 420 </code></pre> <ul> <li>And assuming I want to do it for all fields:</li> </ul> <pre><code>dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang=''; UPDATE 183726 </code></pre> <ul> <li>After that restarted Tomcat and 
PostgreSQL (because I&rsquo;m superstitious about caches) and now I see the following in REST API query:</li> </ul> <pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}' | jq length 71 $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;&quot;}' | jq length 0 $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;en_US&quot;}' | jq length </code></pre> <ul> <li>Not sure what&rsquo;s going on, but Discovery shows 83 values, and database shows 85, so I&rsquo;m going to reindex Discovery just in case</li> </ul> <h2 id="2016-11-14">2016-11-14</h2> <ul> <li>I applied Atmire&rsquo;s suggestions to fix Listings and Reports for DSpace 5.5 and now it works</li> <li>There were some issues with the <code>dspace/modules/jspui/pom.xml</code>, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire&rsquo;s installation procedure must have changed</li> <li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li> <li>After adding that to <code>server.xml</code> bots matching the pattern in the configuration will all use ONE session, just like normal users:</li> </ul> <pre><code>$ http --print h 
https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' HTTP/1.1 200 OK Connection: keep-alive Content-Encoding: gzip Content-Language: en-US Content-Type: text/html;charset=utf-8 Date: Mon, 14 Nov 2016 19:47:29 GMT Server: nginx Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly Transfer-Encoding: chunked Vary: Accept-Encoding X-Cocoon-Version: 2.2.0 X-Robots-Tag: none $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' HTTP/1.1 200 OK Connection: keep-alive Content-Encoding: gzip Content-Language: en-US Content-Type: text/html;charset=utf-8 Date: Mon, 14 Nov 2016 19:47:35 GMT Server: nginx Transfer-Encoding: chunked Vary: Accept-Encoding X-Cocoon-Version: 2.2.0 </code></pre> <ul> <li>The first one gets a session, and any after that — within 60 seconds — will be internally mapped to the same session by Tomcat</li> <li>This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!</li> </ul> <h2 id="2016-11-15">2016-11-15</h2> <ul> <li>The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/dspacetest-tomcat-jvm-day.png" alt="Tomcat JVM heap (day) after setting up the Crawler Session Manager" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/11/dspacetest-tomcat-jvm-week.png" alt="Tomcat JVM heap (week) after setting up the Crawler Session Manager" /></p> <ul> <li>Seems the default regex doesn&rsquo;t catch Baidu, though:</li> </ul> <pre><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' HTTP/1.1 200 OK Connection: keep-alive Content-Encoding: gzip 
Content-Language: en-US Content-Type: text/html;charset=utf-8 Date: Tue, 15 Nov 2016 08:49:54 GMT Server: nginx Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly Transfer-Encoding: chunked Vary: Accept-Encoding X-Cocoon-Version: 2.2.0 $ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' HTTP/1.1 200 OK Connection: keep-alive Content-Encoding: gzip Content-Language: en-US Content-Type: text/html;charset=utf-8 Date: Tue, 15 Nov 2016 08:49:59 GMT Server: nginx Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly Transfer-Encoding: chunked Vary: Accept-Encoding X-Cocoon-Version: 2.2.0 </code></pre> <ul> <li>Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:</li> </ul> <pre><code>&lt;!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers --&gt; &lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot; crawlerUserAgents=&quot;.*[bB]ot.*|.*Yahoo! 
Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*&quot; /&gt; </code></pre> <ul> <li>Looking at the bots that were active yesterday it seems the above regex should be sufficient:</li> </ul> <pre><code>$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\&quot;' /var/log/nginx/access.log | sort | uniq Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot; &quot;-&quot; Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)&quot; &quot;-&quot; Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot; &quot;-&quot; Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot; &quot;-&quot; Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)&quot; &quot;-&quot; </code></pre> <ul> <li>Neat Maven trick to exclude some modules from being built:</li> </ul> <pre><code>$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package </code></pre> <ul> <li>We absolutely don&rsquo;t use those modules, so we shouldn&rsquo;t build them in the first place</li> </ul> <h2 id="2016-11-17">2016-11-17</h2> <ul> <li>Generate a list of journal titles for Peter and Abenet to look through so we can make a controlled vocabulary out of them:</li> </ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc) to /tmp/journal-titles.csv with csv; COPY 2515 </code></pre> <ul> <li>Send a message to users of the CGSpace REST API to notify them of the upcoming upgrade so they can test their apps against DSpace Test</li> <li>Test an update of old, non-HTTPS links to the CCAFS website in CGSpace metadata:</li> </ul> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%'; 
UPDATE 164 dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%'; UPDATE 7 </code></pre> <ul> <li>Had to run it twice to get all (not sure about &ldquo;global&rdquo; regex in PostgreSQL)</li> <li>Run the updates on CGSpace as well</li> <li>Run through some collections and manually regenerate some PDF thumbnails for items from before 2016 on DSpace Test to compare with CGSpace</li> <li>I&rsquo;m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn&rsquo;t as good</li> <li>The results were very good, I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:</li> </ul> <pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p &quot;ImageMagick PDF Thumbnail&quot; </code></pre> <ul> <li>In related news, I&rsquo;m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace&rsquo;s media filter has made thumbnails of THEM):</li> </ul> <pre><code>dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg'; </code></pre> <ul> <li>I&rsquo;m not sure if there&rsquo;s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore&hellip;</li> </ul> <h2 id="2016-11-18">2016-11-18</h2> <ul> <li>Enable Tomcat Crawler Session Manager on CGSpace</li> </ul> <h2 id="2016-11-21">2016-11-21</h2> <ul> <li>More work on Ansible playbooks for PostgreSQL 9.3→9.5 and Java 7→8 work</li> <li>CGSpace virtual managers meeting</li> <li>I need to look into making the item thumbnail clickable</li> <li>Macaroni Bros said they tested the DSpace Test (DSpace 5.5) REST API for CCAFS and WLE sites and it works as expected</li> </ul> <h2 
id="2016-11-23">2016-11-23</h2> <ul> <li>Upgrade Java from 7 to 8 on CGSpace</li> <li>I had started planning the in-place PostgreSQL 9.3→9.5 upgrade but decided that I will have to <code>pg_dump</code> and <code>pg_restore</code> when I move to the new server soon anyways, so there&rsquo;s no need to upgrade the database right now</li> <li>Chat with Carlos about CGCore and the CGSpace metadata registry</li> <li>Dump CGSpace metadata field registry for Carlos: <a href="https://gist.github.com/alanorth/8cbd0bb2704d4bbec78025b4742f8e70">https://gist.github.com/alanorth/8cbd0bb2704d4bbec78025b4742f8e70</a></li> <li>Send some feedback to Carlos on CG Core so they can better understand how DSpace/CGSpace uses metadata</li> <li>Notes about PostgreSQL tuning from James: <a href="https://paste.fedoraproject.org/488776/14798952/">https://paste.fedoraproject.org/488776/14798952/</a></li> <li>Play with Creative Commons stuff in DSpace submission step</li> <li>It seems to work but it doesn&rsquo;t let you choose a version of CC (like 4.0), and we would need to customize the XMLUI item display so it doesn&rsquo;t display the gross CC badges</li> </ul> <h2 id="2016-11-24">2016-11-24</h2> <ul> <li>Bizuwork was testing DSpace Test on DSpace 5.5 and noticed that the Listings and Reports module seems to be case sensitive, whereas CGSpace&rsquo;s Listings and Reports isn&rsquo;t (ie, a search for &ldquo;orth, alan&rdquo; vs &ldquo;Orth, Alan&rdquo; returns the same results on CGSpace, but different on DSpace Test)</li> <li>I have raised a ticket with Atmire</li> <li>Looks like this issue is actually the new Listings and Reports module honoring the Solr search queries more correctly</li> </ul> <h2 id="2016-11-27">2016-11-27</h2> <ul> <li>Run system updates on DSpace Test and reboot the server</li> <li>Deploy DSpace 5.5 on CGSpace: <ul> <li>maven package</li> <li>stop tomcat</li> <li>backup postgresql</li> <li>run Atmire 5.5 schema deletions</li> <li>delete the deployed spring 
folder</li> <li>ant update</li> <li>run system updates</li> <li>reboot server</li> </ul></li> <li>Need to do updates for Ansible infrastructure role defaults, and switch the GitHub branch to the new 5.5 one</li> <li>Testing DSpace 5.5 on CGSpace, it seems CUA&rsquo;s export as XLS works for Usage statistics, but not Content statistics</li> <li>I will raise a bug with Atmire</li> </ul> <h2 id="2016-11-28">2016-11-28</h2> <ul> <li>One user says they are still getting a blank page when they log in (just CGSpace header, but no community list)</li> <li>Looking at the Catalina logs I see there is some super long-running indexing process going on:</li> </ul> <pre><code>INFO: FrameworkServlet 'oai': initialization completed in 2600 ms [&gt; ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18 [&gt; ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19 [&gt; ] 0% time remaining: 23 hour(s) 4 minute(s) 28 seconds. timestamp: 2016-11-28 03:00:19 [&gt; ] 0% time remaining: 15 hour(s) 35 minute(s) 23 seconds. timestamp: 2016-11-28 03:00:19 [&gt; ] 0% time remaining: 14 hour(s) 5 minute(s) 56 seconds. timestamp: 2016-11-28 03:00:19 [&gt; ] 0% time remaining: 11 hour(s) 23 minute(s) 49 seconds. timestamp: 2016-11-28 03:00:19 [&gt; ] 0% time remaining: 11 hour(s) 21 minute(s) 57 seconds. 
timestamp: 2016-11-28 03:00:20 </code></pre> <ul> <li>It says OAI, and seems to start at 3:00 AM, but I only see the <code>filter-media</code> cron job set to start then</li> <li>Double checking the <a href="https://wiki.duraspace.org/display/DSDOC5x/Upgrading+DSpace">DSpace 5.x upgrade notes</a> for anything I missed, or troubleshooting tips</li> <li>Running some manual processes just in case:</li> </ul> <pre><code>$ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dcterms-types.xml $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dublin-core-types.xml $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/eperson-types.xml $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/workflow-types.xml </code></pre> <ul> <li>Start working on paper for KM4Dev journal</li> <li>Wow, Bram from Atmire pointed out this solution for using multiple handles with one DSpace instance: <a href="https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296">https://wiki.duraspace.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296</a></li> <li>We might be able to migrate the <a href="http://library.cgiar.org/">CGIAR Library</a> now, as they had wanted to keep their handles</li> </ul> <h2 id="2016-11-29">2016-11-29</h2> <ul> <li>Sisay tried deleting and re-creating Goshu&rsquo;s account but he still can&rsquo;t see any communities on the homepage after he logs in</li> <li>Around the time of his login I see this in the DSpace logs:</li> </ul> <pre><code>2016-11-29 07:56:36,350 INFO org.dspace.authenticate.LDAPAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:no DN found for user g.cherinet@cgiar.org 2016-11-29 
07:56:36,350 INFO org.dspace.authenticate.PasswordAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:authenticate:attempting password auth of user=g.cherinet@cgiar.org 2016-11-29 07:56:36,352 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:email=g.cherinet@cgiar.org, realm=null, result=2 2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Start processing item 10568/50391 id:51744 2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item stats 2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date 2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item's bitstream stats 2016-11-29 07:56:36,608 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date 2016-11-29 07:56:36,701 INFO org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23 2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets org.dspace.discovery.SearchServiceException: Error executing query at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1618) at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1600) at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1583) at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.performSearch(SidebarFacetsTransformer.java:165) at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.addOptions(SidebarFacetsTransformer.java:174) at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228) at sun.reflect.GeneratedMethodAccessor277.invoke(Unknown Source) ... 
</code></pre> <ul> <li>At about the same time in the solr log I see a super long query:</li> </ul> <pre><code>2016-11-29 07:56:36,734 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&amp;fl=dateIssued.year,handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=dateIssued.year:[*+TO+*]&amp;fq=read:(g0+OR+e574+OR+g0+OR+g3+OR+g9+OR+g10+OR+g14+OR+g16+OR+g18+OR+g20+OR+g23+OR+g24+OR+g2072+OR+g2074+OR+g28+OR+g2076+OR+g29+OR+g2078+OR+g2080+OR+g34+OR+g2082+OR+g2084+OR+g38+OR+g2086+OR+g2088+OR+g2091+OR+g43+OR+g2092+OR+g2093+OR+g2095+OR+g2097+OR+g50+OR+g2099+OR+g51+OR+g2103+OR+g62+OR+g65+OR+g2115+OR+g2117+OR+g2119+OR+g2121+OR+g2123+OR+g2125+OR+g77+OR+g78+OR+g79+OR+g2127+OR+g80+OR+g2129+OR+g2131+OR+g2133+OR+g2134+OR+g2135+OR+g2136+OR+g2137+OR+g2138+OR+g2139+OR+g2140+OR+g2141+OR+g2142+OR+g2148+OR+g2149+OR+g2150+OR+g2151+OR+g2152+OR+g2153+OR+g2154+OR+g2156+OR+g2165+OR+g2167+OR+g2171+OR+g2174+OR+g2175+OR+g129+OR+g2182+OR+g2186+OR+g2189+OR+g153+OR+g158+OR+g166+OR+g167+OR+g168+OR+g169+OR+g2225+OR+g179+OR+g2227+OR+g2229+OR+g183+OR+g2231+OR+g184+OR+g2233+OR+g186+OR+g2235+OR+g2237+OR+g191+OR+g192+OR+g193+OR+g202+OR+g203+OR+g204+OR+g205+OR+g207+OR+g208+OR+g218+OR+g219+OR+g222+OR+g223+OR+g230+OR+g231+OR+g238+OR+g241+OR+g244+OR+g254+OR+g255+OR+g262+OR+g265+OR+g268+OR+g269+OR+g273+OR+g276+OR+g277+OR+g279+OR+g282+OR+g2332+OR+g2335+OR+g2338+OR+g292+OR+g293+OR+g2341+OR+g296+OR+g2344+OR+g297+OR+g2347+OR+g301+OR+g2350+OR+g303+OR+g305+OR+g2356+OR+g310+OR+g311+OR+g2359+OR+g313+OR+g2362+OR+g2365+OR+g2368+OR+g321+OR+g2371+OR+g325+OR+g2374+OR+g328+OR+g2377+OR+g2380+OR+g333+OR+g2383+OR+g2386+OR+g2389+OR+g342+OR+g343+OR+g2392+OR+g345+OR+g2395+OR+g348+OR+g2398+OR+g2401+OR+g2404+OR+g2407+OR+g364+OR+g366+OR+g2425+OR+g2427+OR+g385+OR+g387+OR+g388+OR+g389+OR+g2442+OR+g395+OR+g2443+OR+g2444+OR+g401+OR+g403+OR+g405+OR+g408+OR+g2457+OR+g2458+OR+g411+OR+g2459+OR+g414+OR+g2463+OR+g417+OR+g2465
+OR+g2467+OR+g421+OR+g2469+OR+g2471+OR+g424+OR+g2473+OR+g2475+OR+g2476+OR+g429+OR+g433+OR+g2481+OR+g2482+OR+g2483+OR+g443+OR+g444+OR+g445+OR+g446+OR+g448+OR+g453+OR+g455+OR+g456+OR+g457+OR+g458+OR+g459+OR+g461+OR+g462+OR+g463+OR+g464+OR+g465+OR+g467+OR+g468+OR+g469+OR+g474+OR+g476+OR+g477+OR+g480+OR+g483+OR+g484+OR+g493+OR+g496+OR+g497+OR+g498+OR+g500+OR+g502+OR+g504+OR+g505+OR+g2559+OR+g2560+OR+g513+OR+g2561+OR+g515+OR+g516+OR+g518+OR+g519+OR+g2567+OR+g520+OR+g521+OR+g522+OR+g2570+OR+g523+OR+g2571+OR+g524+OR+g525+OR+g2573+OR+g526+OR+g2574+OR+g527+OR+g528+OR+g2576+OR+g529+OR+g531+OR+g2579+OR+g533+OR+g534+OR+g2582+OR+g535+OR+g2584+OR+g538+OR+g2586+OR+g540+OR+g2588+OR+g541+OR+g543+OR+g544+OR+g545+OR+g546+OR+g548+OR+g2596+OR+g549+OR+g551+OR+g555+OR+g556+OR+g558+OR+g561+OR+g569+OR+g570+OR+g571+OR+g2619+OR+g572+OR+g2620+OR+g573+OR+g2621+OR+g2622+OR+g575+OR+g578+OR+g581+OR+g582+OR+g584+OR+g585+OR+g586+OR+g587+OR+g588+OR+g590+OR+g591+OR+g593+OR+g595+OR+g596+OR+g598+OR+g599+OR+g601+OR+g602+OR+g603+OR+g604+OR+g605+OR+g606+OR+g608+OR+g609+OR+g610+OR+g612+OR+g614+OR+g616+OR+g620+OR+g621+OR+g623+OR+g630+OR+g635+OR+g636+OR+g646+OR+g649+OR+g683+OR+g684+OR+g687+OR+g689+OR+g691+OR+g695+OR+g697+OR+g698+OR+g699+OR+g700+OR+g701+OR+g707+OR+g708+OR+g709+OR+g710+OR+g711+OR+g712+OR+g713+OR+g714+OR+g715+OR+g716+OR+g717+OR+g719+OR+g720+OR+g729+OR+g732+OR+g733+OR+g734+OR+g736+OR+g737+OR+g738+OR+g2786+OR+g752+OR+g754+OR+g2804+OR+g757+OR+g2805+OR+g2806+OR+g760+OR+g761+OR+g2810+OR+g2815+OR+g769+OR+g771+OR+g773+OR+g776+OR+g786+OR+g787+OR+g788+OR+g789+OR+g791+OR+g792+OR+g793+OR+g794+OR+g795+OR+g796+OR+g798+OR+g800+OR+g802+OR+g803+OR+g806+OR+g808+OR+g810+OR+g814+OR+g815+OR+g817+OR+g829+OR+g830+OR+g849+OR+g893+OR+g895+OR+g898+OR+g902+OR+g903+OR+g917+OR+g919+OR+g921+OR+g922+OR+g923+OR+g924+OR+g925+OR+g926+OR+g927+OR+g928+OR+g929+OR+g930+OR+g932+OR+g933+OR+g934+OR+g938+OR+g939+OR+g944+OR+g945+OR+g946+OR+g947+OR+g948+OR+g949+OR+g950+OR+g951+OR+g953+OR+g954+OR+g955+OR+g956+OR+g958+OR+g959+OR+g960+OR+g9
63+OR+g964+OR+g965+OR+g968+OR+g969+OR+g970+OR+g971+OR+g972+OR+g973+OR+g974+OR+g976+OR+g978+OR+g979+OR+g984+OR+g985+OR+g987+OR+g988+OR+g991+OR+g993+OR+g994+OR+g999+OR+g1000+OR+g1003+OR+g1005+OR+g1006+OR+g1007+OR+g1012+OR+g1013+OR+g1015+OR+g1016+OR+g1018+OR+g1023+OR+g1024+OR+g1026+OR+g1028+OR+g1030+OR+g1032+OR+g1033+OR+g1035+OR+g1036+OR+g1038+OR+g1039+OR+g1041+OR+g1042+OR+g1044+OR+g1045+OR+g1047+OR+g1048+OR+g1050+OR+g1051+OR+g1053+OR+g1054+OR+g1056+OR+g1057+OR+g1058+OR+g1059+OR+g1060+OR+g1061+OR+g1062+OR+g1063+OR+g1064+OR+g1065+OR+g1066+OR+g1068+OR+g1071+OR+g1072+OR+g1074+OR+g1075+OR+g1076+OR+g1077+OR+g1078+OR+g1080+OR+g1081+OR+g1082+OR+g1084+OR+g1085+OR+g1087+OR+g1088+OR+g1089+OR+g1090+OR+g1091+OR+g1092+OR+g1093+OR+g1094+OR+g1095+OR+g1096+OR+g1097+OR+g1106+OR+g1108+OR+g1110+OR+g1112+OR+g1114+OR+g1117+OR+g1120+OR+g1121+OR+g1126+OR+g1128+OR+g1129+OR+g1131+OR+g1136+OR+g1138+OR+g1140+OR+g1141+OR+g1143+OR+g1145+OR+g1146+OR+g1148+OR+g1152+OR+g1154+OR+g1156+OR+g1158+OR+g1159+OR+g1160+OR+g1162+OR+g1163+OR+g1165+OR+g1166+OR+g1168+OR+g1170+OR+g1172+OR+g1175+OR+g1177+OR+g1179+OR+g1181+OR+g1185+OR+g1191+OR+g1193+OR+g1197+OR+g1199+OR+g1201+OR+g1203+OR+g1204+OR+g1215+OR+g1217+OR+g1219+OR+g1221+OR+g1224+OR+g1226+OR+g1227+OR+g1228+OR+g1230+OR+g1231+OR+g1232+OR+g1233+OR+g1234+OR+g1235+OR+g1236+OR+g1237+OR+g1238+OR+g1240+OR+g1241+OR+g1242+OR+g1243+OR+g1244+OR+g1246+OR+g1248+OR+g1250+OR+g1252+OR+g1254+OR+g1256+OR+g1257+OR+g1259+OR+g1261+OR+g1263+OR+g1275+OR+g1276+OR+g1277+OR+g1278+OR+g1279+OR+g1282+OR+g1284+OR+g1288+OR+g1290+OR+g1293+OR+g1296+OR+g1297+OR+g1299+OR+g1303+OR+g1304+OR+g1306+OR+g1309+OR+g1310+OR+g1311+OR+g1312+OR+g1313+OR+g1316+OR+g1318+OR+g1320+OR+g1322+OR+g1323+OR+g1324+OR+g1325+OR+g1326+OR+g1329+OR+g1331+OR+g1347+OR+g1348+OR+g1361+OR+g1362+OR+g1363+OR+g1364+OR+g1367+OR+g1368+OR+g1369+OR+g1370+OR+g1371+OR+g1374+OR+g1376+OR+g1377+OR+g1378+OR+g1380+OR+g1381+OR+g1386+OR+g1389+OR+g1391+OR+g1392+OR+g1393+OR+g1395+OR+g1396+OR+g1397+OR+g1400+OR+g1402+OR+g1406+OR+g1408+OR+g1415+O
R+g1417+OR+g1433+OR+g1435+OR+g1441+OR+g1442+OR+g1443+OR+g1444+OR+g1446+OR+g1448+OR+g1450+OR+g1451+OR+g1452+OR+g1453+OR+g1454+OR+g1456+OR+g1458+OR+g1460+OR+g1462+OR+g1464+OR+g1466+OR+g1468+OR+g1470+OR+g1471+OR+g1475+OR+g1476+OR+g1477+OR+g1478+OR+g1479+OR+g1481+OR+g1482+OR+g1483+OR+g1484+OR+g1485+OR+g1486+OR+g1487+OR+g1488+OR+g1489+OR+g1490+OR+g1491+OR+g1492+OR+g1493+OR+g1495+OR+g1497+OR+g1499+OR+g1501+OR+g1503+OR+g1504+OR+g1506+OR+g1508+OR+g1511+OR+g1512+OR+g1513+OR+g1516+OR+g1522+OR+g1535+OR+g1536+OR+g1537+OR+g1539+OR+g1540+OR+g1541+OR+g1542+OR+g1547+OR+g1549+OR+g1551+OR+g1553+OR+g1555+OR+g1557+OR+g1559+OR+g1561+OR+g1563+OR+g1565+OR+g1567+OR+g1569+OR+g1571+OR+g1573+OR+g1580+OR+g1583+OR+g1588+OR+g1590+OR+g1592+OR+g1594+OR+g1595+OR+g1596+OR+g1598+OR+g1599+OR+g1600+OR+g1601+OR+g1602+OR+g1604+OR+g1606+OR+g1610+OR+g1611+OR+g1612+OR+g1613+OR+g1616+OR+g1619+OR+g1622+OR+g1624+OR+g1625+OR+g1626+OR+g1628+OR+g1629+OR+g1631+OR+g1632+OR+g1692+OR+g1694+OR+g1695+OR+g1697+OR+g1705+OR+g1706+OR+g1707+OR+g1708+OR+g1711+OR+g1715+OR+g1717+OR+g1719+OR+g1721+OR+g1722+OR+g1723+OR+g1724+OR+g1725+OR+g1726+OR+g1727+OR+g1731+OR+g1732+OR+g1736+OR+g1737+OR+g1738+OR+g1740+OR+g1742+OR+g1743+OR+g1753+OR+g1755+OR+g1758+OR+g1759+OR+g1764+OR+g1766+OR+g1769+OR+g1774+OR+g1782+OR+g1794+OR+g1796+OR+g1797+OR+g1814+OR+g1818+OR+g1826+OR+g1853+OR+g1855+OR+g1857+OR+g1858+OR+g1859+OR+g1860+OR+g1861+OR+g1863+OR+g1864+OR+g1865+OR+g1867+OR+g1869+OR+g1871+OR+g1873+OR+g1875+OR+g1877+OR+g1879+OR+g1881+OR+g1883+OR+g1884+OR+g1885+OR+g1887+OR+g1889+OR+g1891+OR+g1892+OR+g1894+OR+g1896+OR+g1898+OR+g1900+OR+g1902+OR+g1907+OR+g1910+OR+g1915+OR+g1916+OR+g1917+OR+g1918+OR+g1929+OR+g1931+OR+g1932+OR+g1933+OR+g1934+OR+g1936+OR+g1937+OR+g1938+OR+g1939+OR+g1940+OR+g1942+OR+g1944+OR+g1945+OR+g1948+OR+g1950+OR+g1955+OR+g1961+OR+g1962+OR+g1964+OR+g1966+OR+g1968+OR+g1970+OR+g1972+OR+g1974+OR+g1976+OR+g1979+OR+g1982+OR+g1984+OR+g1985+OR+g1986+OR+g1987+OR+g1989+OR+g1991+OR+g1996+OR+g2003+OR+g2007+OR+g2011+OR+g2019+OR+g2020+OR+g2046)&am
p;sort=dateIssued.year_sort+desc&amp;rows=1&amp;wt=javabin&amp;version=2} hits=56080 status=0 QTime=3 </code></pre> <ul> <li>Which, according to some old threads on DSpace Tech, means that the user has a lot of permissions (from groups or on the individual eperson) which increases the Solr query size / query URL</li> <li>It might be fixed by increasing the Tomcat <code>maxHttpHeaderSize</code>, which is <a href="http://tomcat.apache.org/tomcat-7.0-doc/config/http.html">8192 (or 8KB) by default</a></li> <li>I&rsquo;ve increased the <code>maxHttpHeaderSize</code> to 16384 on DSpace Test and the user said he is now able to see the communities on the homepage</li> <li>I will make the changes on CGSpace soon</li> <li>A few users are reporting having issues with their workflows, they get the following message: &ldquo;You are not allowed to perform this task&rdquo;</li> <li>Might be the same as <a href="https://jira.duraspace.org/browse/DS-2920">DS-2920</a> on the bug tracker</li> </ul> <h2 id="2016-11-30">2016-11-30</h2> <ul> <li>The <code>maxHttpHeaderSize</code> fix worked on CGSpace (user is able to see the community list on the homepage)</li> <li>The &ldquo;take task&rdquo; cache fix worked on DSpace Test but it&rsquo;s not an official patch, so I&rsquo;ll have to report the bug to DSpace people and try to get advice</li> <li>More work on the KM4Dev Journal article</li> </ul> October, 2016 https://alanorth.github.io/cgspace-notes/2016-10/ Mon, 03 Oct 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-10/ <h2 id="2016-10-03">2016-10-03</h2> <ul> <li>Testing adding <a href="https://wiki.duraspace.org/display/DSDOC5x/ORCID+Integration#ORCIDIntegration-EditingexistingitemsusingBatchCSVEditing">ORCIDs to a CSV</a> file for a single item to see if the author orders get messed up</li> <li>Need to test the following scenarios to see how author order is affected: <ul> <li>ORCIDs only</li> <li>ORCIDs plus normal authors</li> </ul></li> <li>I exported a random 
item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li> </ul> <pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X </code></pre> <p></p> <ul> <li>Hmm, with the <code>dc.contributor.author</code> column removed, DSpace doesn&rsquo;t detect any changes</li> <li>With a blank <code>dc.contributor.author</code> column, DSpace wants to remove all non-ORCID authors and add the new ORCID authors</li> <li>I added the <a href="https://github.com/ilri/DSpace/issues/234">disclaimer text</a> to the About page, then added a footer link to the disclaimer&rsquo;s ID, but there is a Bootstrap issue that causes the page content to disappear when using in-page anchors: <a href="https://github.com/twbs/bootstrap/issues/1768">https://github.com/twbs/bootstrap/issues/1768</a></li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/10/bootstrap-issue.png" alt="Bootstrap issue with in-page anchors" /></p> <ul> <li>Looks like we&rsquo;ll just have to add the text to the About page (without a link) or add a separate page</li> </ul> <h2 id="2016-10-04">2016-10-04</h2> <ul> <li>Start testing cleanups of authors that Peter sent last week</li> <li>Out of 40,000+ rows, Peter had indicated corrections for ~3,200 of them, which is too many to look through carefully, so I did some basic quality checking: <ul> <li>Trim leading/trailing whitespace</li> <li>Find invalid characters</li> <li>Cluster values to merge obvious authors</li> </ul></li> <li>That left us with 3,180 valid corrections and 3 deletions:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu </code></pre> <ul> <li>Remove old 
about page (<a href="https://github.com/ilri/DSpace/pull/284">#284</a>)</li> <li>CGSpace crashed a few times today</li> <li>Generate list of unique authors in CCAFS collections:</li> </ul> <pre><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv; </code></pre> <h2 id="2016-10-05">2016-10-05</h2> <ul> <li>Work on more infrastructure cleanups for Ansible DSpace role</li> <li>Clean up Let&rsquo;s Encrypt plumbing and submit pull request for rmg-ansible-public (<a href="https://github.com/ilri/rmg-ansible-public/pull/60">#60</a>)</li> </ul> <h2 id="2016-10-06">2016-10-06</h2> <ul> <li>Nice! DSpace Test (linode02) is now having <code>java.lang.OutOfMemoryError: Java heap space</code> errors&hellip;</li> <li>Heap space is 2048m, and we have 5GB of RAM being used for OS cache (Solr!) 
so let&rsquo;s just bump the memory to 3072m</li> <li>Magdalena from CCAFS asked why the colors in the thumbnails for these <a href="https://cgspace.cgiar.org/handle/10568/71249">two</a> <a href="https://cgspace.cgiar.org/handle/10568/71259">items</a> look different, even though they are the same in the PDF itself</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/10/cmyk-vs-srgb.jpg" alt="CMYK vs sRGB colors" /></p> <ul> <li>Turns out the first PDF was exported from InDesign using CMYK and the second one was using sRGB</li> <li>Run all system updates on DSpace Test and reboot it</li> </ul> <h2 id="2016-10-08">2016-10-08</h2> <ul> <li>Re-deploy CGSpace with latest changes from late September and early October</li> <li>Run fixes for ILRI subjects and delete blank metadata values:</li> </ul> <pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=''; DELETE 11 </code></pre> <ul> <li>Run all system updates and reboot CGSpace</li> <li>Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?):</li> </ul> <pre><code>root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l 47 </code></pre> <ul> <li>Delete 2GB <code>cron-filter-media.log</code> file, as it is just a log from a cron job and it doesn&rsquo;t get rotated like normal log files (almost a year now maybe)</li> </ul> <h2 id="2016-10-14">2016-10-14</h2> <ul> <li>Run all system updates on DSpace Test and reboot server</li> <li>Looking into some issues with Discovery filters in Atmire&rsquo;s content and usage analysis module after adjusting the filter class</li> <li>Looks like changing the filters from <code>configuration.DiscoverySearchFilterFacet</code> to <code>configuration.DiscoverySearchFilter</code> breaks them in Atmire CUA module</li> </ul> <h2 id="2016-10-17">2016-10-17</h2> <ul> <li>A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i 
ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu </code></pre> <ul> <li>One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)</li> </ul> <h2 id="2016-10-18">2016-10-18</h2> <ul> <li><p>Resume the DSpace 5.5 porting work:</p> <p><code>$ git checkout -b 5_x-55 5_x-prod</code><br /> <code>$ git rebase -i dspace-5.5</code></p></li> <li><p>Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme</p></li> <li><p>Skip 1e34751b8cf17021f45d4cf2b9a5800c93fb4cb2 in favor of upstream&rsquo;s 55e623d1c2b8b7b1fa45db6728e172e06bfa8598 (fixes X-Forwarded-For header) because I had made the same fix myself and it&rsquo;s better to use the upstream one</p></li> <li><p>I notice this rebase gets rid of GitHub merge commits&hellip; which actually might be fine because merges are fucking annoying to deal with when remote people merge without pulling and rebasing their branch first</p></li> <li><p>Finished up applying the 5.5 sitemap changes to all themes</p></li> <li><p>Merge the <code>discovery.xml</code> cleanups (<a href="https://github.com/ilri/DSpace/pull/278">#278</a>)</p></li> <li><p>Merge some minor edits to the distribution license (<a href="https://github.com/ilri/DSpace/pull/285">#285</a>)</p></li> </ul> <h2 id="2016-10-19">2016-10-19</h2> <ul> <li>When we move to DSpace 5.5 we should also cherry-pick some patches from the 5.6 branch: <ul> <li><a href="https://jira.duraspace.org/browse/DS-3246">memory cleanup</a>: 9f0f5940e7921765c6a22e85337331656b18a403</li> <li>SQL injection: c6fda557f731dbc200d7d58b8b61563f86fe6d06</li> <li>PDFBox security issue: b5330b78153b2052ed3dc2fd65917ccdbfcc0439</li> </ul></li> </ul> <h2 id="2016-10-20">2016-10-20</h2> <ul> <li>Run CCAFS author corrections on CGSpace</li> <li>Discovery reindexing took forever and kinda caused CGSpace to crash, so I ran all system 
updates and rebooted the server</li> </ul> <h2 id="2016-10-25">2016-10-25</h2> <ul> <li>Move the LIVES community from the top level to the ILRI projects community</li> </ul> <pre><code>$ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101 </code></pre> <ul> <li>Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA</li> <li>Start looking at batch fixing of &ldquo;old&rdquo; ILRI website links without www or https, for example:</li> </ul> <pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%'; </code></pre> <ul> <li>Also CCAFS has HTTPS and their links should use it where possible:</li> </ul> <pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%'; </code></pre> <ul> <li>And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):</li> </ul> <pre><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%'; </code></pre> <ul> <li>Turns out there are shit tons of varieties of this, like with http, https, www, separate <code>&lt;/img&gt;</code> tags, alignments, etc</li> <li>Had to find all variations and replace them individually:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;','&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; 
src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%'; dspace=# update metadatavalue 
set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img 
align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; 
src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;%'; </code></pre> <ul> <li>Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyway!)</li> <li>And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, FeedBurner, DOI, etc</li> <li>I should check whether each of those domains sends an HTTP 301 redirect to its HTTPS version or sets HSTS headers, then just replace the links</li> </ul> <h2 id="2016-10-27">2016-10-27</h2> <ul> <li>Run Font Awesome fixes on DSpace Test:</li> </ul> <pre><code>dspace=# \i /tmp/font-awesome-text-replace.sql UPDATE 17 UPDATE 17 UPDATE 3 UPDATE 3 UPDATE 30 UPDATE 30 UPDATE 1 UPDATE 1 UPDATE 7 UPDATE 7 UPDATE 1 UPDATE 1 UPDATE 1 UPDATE 1 UPDATE 1 UPDATE 1 UPDATE 0 </code></pre> <ul> <li>Looks much better now:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/10/cgspace-icons.png" alt="CGSpace with old icons" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/10/dspacetest-fontawesome-icons.png" alt="DSpace Test with Font Awesome icons" /></p> <ul> <li>Run the same replacements on CGSpace</li> </ul> <h2 id="2016-10-30">2016-10-30</h2> <ul> <li>Fix some messed-up authors on CGSpace:</li> </ul> <pre><code>dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%'; UPDATE 10 dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and 
resource_type_id=2 and text_value like 'Knight-Jones%'; UPDATE 36 </code></pre> <ul> <li>I updated the authority index but nothing seemed to change, so I&rsquo;ll wait and do it again after I update Discovery below</li> <li>Skype chat with Tsega about the <a href="https://github.com/ilri/ckm-cgspace-contentdm-bridge">IFPRI CONTENTdm bridge</a></li> <li>We tested harvesting OAI in an example collection to see how it works</li> <li>Talk to Carlos Quiros about CG Core metadata in CGSpace</li> <li>Get a list of countries from CGSpace so I can do some batch corrections:</li> </ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv; </code></pre> <ul> <li>Fix a bunch of countries in OpenRefine and run the corrections on CGSpace:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu </code></pre> <ul> <li>Run a shit ton of author fixes from Peter Ballantyne that we&rsquo;ve been cleaning up for two months:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu </code></pre> <ul> <li>Run a few URL corrections for ilri.org and doi.org, etc:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%'; dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28); dspace=# update metadatavalue set text_value = regexp_replace(text_value, 
'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111); dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111); </code></pre> <ul> <li>I skipped metadata fields like citation and description</li> </ul> September, 2016 https://alanorth.github.io/cgspace-notes/2016-09/ Thu, 01 Sep 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-09/ <h2 id="2016-09-01">2016-09-01</h2> <ul> <li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li> <li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li> <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li> <li>It looks like we might be able to use OUs now, instead of DCs:</li> </ul> <pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot; </code></pre> <p></p> <ul> <li>User who has been migrated to the root vs user still in the hierarchical structure:</li> </ul> <pre><code>distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG </code></pre> <ul> <li>Changing the DSpace LDAP config to use <code>OU=ILRIHUB</code> seems to work:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/09/ilri-ldap-users.png" alt="DSpace groups based on LDAP DN" /></p> <ul> <li>Notes for local PostgreSQL database recreation from production snapshot:</li> </ul> <pre><code>$ dropdb dspacetest $ createdb -O dspacetest 
--encoding=UNICODE dspacetest $ psql dspacetest -c 'alter user dspacetest createuser;' $ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup $ psql dspacetest -c 'alter user dspacetest nocreateuser;' $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost $ vacuumdb dspacetest </code></pre> <ul> <li>Some names that I thought I fixed in July seem not to be:</li> </ul> <pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %'; text_value | authority | confidence -----------------------+--------------------------------------+------------ Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb | 600 Poole, Elizabeth Jane | 41628f42-fc38-4b38-b473-93aec9196326 | 600 Poole, Elizabeth Jane | 83b82da0-f652-4ebc-babc-591af1697919 | 600 Poole, Elizabeth Jane | c3a22456-8d6a-41f9-bba0-de51ef564d45 | 600 Poole, E.J. | c3a22456-8d6a-41f9-bba0-de51ef564d45 | 600 Poole, E.J. | 0fbd91b9-1b71-4504-8828-e26885bf8b84 | 600 (6 rows) </code></pre> <ul> <li>At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45</li> </ul> <pre><code>dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %'; UPDATE 69 </code></pre> <ul> <li>And for Peter Ballantyne:</li> </ul> <pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %'; text_value | authority | confidence -------------------+--------------------------------------+------------ Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 | 600 Ballantyne, Peter | 4f04ca06-9a76-4206-bd9c-917ca75d278e | 600 Ballantyne, P.G. 
| 4f04ca06-9a76-4206-bd9c-917ca75d278e | 600 Ballantyne, Peter | ba5f205b-b78b-43e5-8e80-0c9a1e1ad2ca | 600 Ballantyne, Peter | 20f21160-414c-4ecf-89ca-5f2cb64e75c1 | 600 (5 rows) </code></pre> <ul> <li>Again, a few have the correct ORCID, but there should only be one authority&hellip;</li> </ul> <pre><code>dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %'; UPDATE 58 </code></pre> <ul> <li>And for me:</li> </ul> <pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%'; text_value | authority | confidence ------------+--------------------------------------+------------ Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600 Orth, A. | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600 Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600 (3 rows) dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %'; UPDATE 11 </code></pre> <ul> <li>And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:</li> </ul> <pre><code>dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%'; UPDATE 166 dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%'; text_value | authority | confidence ------------------------+--------------------------------------+------------ Campbell, Bruce | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600 Campbell, Bruce Morgan | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600 Campbell, B. 
| 0e414b4c-4671-4a23-b570-6077aca647d8 | 600 Campbell, B.M. | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600 (4 rows) </code></pre> <ul> <li>After updating the Authority indexes (<code>bin/dspace index-authority</code>) everything looks good</li> <li>Run authority updates on CGSpace</li> </ul> <h2 id="2016-09-05">2016-09-05</h2> <ul> <li>After one week of logging TLS connections on CGSpace:</li> </ul> <pre><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l 217 # zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l 1164376 # zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq TLSv1/DES-CBC3-SHA TLSv1/EDH-RSA-DES-CBC3-SHA </code></pre> <ul> <li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li> <li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li> </ul> <pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value </code></pre> <ul> <li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li> </ul> <h2 id="2016-09-06">2016-09-06</h2> <ul> <li>Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file</li> <li>Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF: <ul> <li>Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf</li> <li>Imports fine on DSpace running on Mac OS X</li> <li>Fails to import on DSpace running on Linux with error <code>No such file or directory</code></li> </ul></li> <li>Change diacritic in file name from á to a and re-create SAF bundle and zip <ul> <li>Success on both Mac OS X and Linux&hellip;</li> </ul></li> <li>Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)</li> <li>See: <a 
href="http://www.fileformat.info/info/unicode/char/e1/index.htm">http://www.fileformat.info/info/unicode/char/e1/index.htm</a></li> <li>See: <a href="http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0">http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0</a></li> <li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li> <li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li> </ul> <pre><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','') </code></pre> <ul> <li>I need to write a Python script to match that for renaming files in the file system</li> <li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li> <li>Seems that the latter method causes a null pointer exception, so I will just have to use the former method</li> <li>In the end I was able to import the files after unzipping them ONLY on Linux <ul> <li>The CSV file was giving file names in UTF-8, and unzipping the zip on Mac OS X and transferring it was converting the file names to Unicode equivalence like I saw above</li> </ul></li> <li>Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the <code>tomcat7</code> user, and deleting the bundle, for each collection&rsquo;s items:</li> </ul> <pre><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv $ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map $ rm -rf 
~/ciat-gender-2016-09-06/SimpleArchiveFormat/ </code></pre> <h2 id="2016-09-07">2016-09-07</h2> <ul> <li>Erase and rebuild DSpace Test based on latest Ubuntu 16.04, PostgreSQL 9.5, and Java 8 stuff</li> <li>Reading about PostgreSQL maintenance and it seems manual vacuuming is only for certain workloads, such as heavy update/write loads</li> <li>I suggest we disable our nightly manual vacuum task, as we&rsquo;re a mostly read workload, and I&rsquo;d rather stick as close to the documentation as possible since we haven&rsquo;t done any testing/observation of PostgreSQL</li> <li>See: <a href="https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html">https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html</a></li> <li>CGSpace went down and the error seems to be the same as always (lately):</li> </ul> <pre><code>2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object ... 
</code></pre> <ul> <li>Since CGSpace had crashed I quickly deployed the new LDAP settings before restarting Tomcat</li> </ul> <h2 id="2016-09-13">2016-09-13</h2> <ul> <li>CGSpace crashed twice today, errors from <code>catalina.out</code>:</li> </ul> <pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114) </code></pre> <ul> <li>I enabled logging of requests to <code>/rest</code> again</li> </ul> <h2 id="2016-09-14">2016-09-14</h2> <ul> <li>CGSpace crashed again, errors from <code>catalina.out</code>:</li> </ul> <pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114) </code></pre> <ul> <li>I restarted Tomcat and it was ok again</li> <li>CGSpace crashed a few hours later, errors from <code>catalina.out</code>:</li> </ul> <pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space at java.lang.StringCoding.decode(StringCoding.java:215) </code></pre> <ul> <li>We haven&rsquo;t seen that in quite a while&hellip;</li> <li>Indeed, in a month of logs it only occurs 15 times:</li> </ul> <pre><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l 15 </code></pre> <ul> <li>I also see a bunch of errors from dspace.log:</li> </ul> <pre><code>2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object </code></pre> <ul> <li>Looking at REST requests, it seems there is one IP hitting us nonstop:</li> </ul> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3 820 50.87.54.15 12872 70.32.99.142 25744 
70.32.83.92 # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3 7966 181.118.144.29 54706 70.32.99.142 109412 70.32.83.92 </code></pre> <ul> <li>Those are the same IPs that were hitting us heavily in July, 2016 as well&hellip;</li> <li>I think the stability issues are definitely from REST</li> <li>Crashed AGAIN, errors from dspace.log:</li> </ul> <pre><code>2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object </code></pre> <ul> <li>And more heap space errors:</li> </ul> <pre><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l 19 </code></pre> <ul> <li>There are no more rest requests since the last crash, so maybe there are other things causing this.</li> <li>Hmm, I noticed a shitload of IPs from 180.76.0.0/16 are connecting to both CGSpace and DSpace Test (58 unique IPs concurrently!)</li> <li>They seem to be coming from Baidu, and so far during today alone account for <sup>1</sup>&frasl;<sub>6</sub> of every connection:</li> </ul> <pre><code># grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14 29084 # grep -c ip_addr=180.76.15 /home/cgspace.cgiar.org/log/dspace.log.2016-09-14 5192 </code></pre> <ul> <li>Other recent days are the same&hellip; hmmm.</li> <li>From the activity control panel I can see 58 unique IPs hitting the site <em>concurrently</em>, which has GOT to hurt our stability</li> <li>A list of all 2000 unique IPs from CGSpace logs today:</li> </ul> <pre><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100 </code></pre> <ul> <li>Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc&hellip; do we have any real users?</li> <li>Generate a list of all author affiliations for Peter Ballantyne to go 
through, make corrections, and create a lookup list from:</li> </ul> <pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv; </code></pre> <ul> <li>Looking into the Catalina logs again around the time of the first crash, I see:</li> </ul> <pre><code>Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2 Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs. Commit Commit done dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org Exception in thread &quot;http-bio-127.0.0.1-8081-exec-193&quot; java.lang.OutOfMemoryError: Java heap space </code></pre> <ul> <li>And after that I see a bunch of &ldquo;pool error Timeout waiting for idle object&rdquo;</li> <li>Later, near the time of the next crash I see:</li> </ul> <pre><code>dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org Wed Sep 14 11:29:55 UTC 2016 | Query:id: 79078 AND type:2 Wed Sep 14 11:30:20 UTC 2016 | Updating : 6/6 docs. Commit Commit done Sep 14, 2016 11:32:22 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator buildModelAndSchemas SEVERE: Failed to generate the schema for the JAX-B elements com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions java.util.Map is an interface, and JAXB can't handle interfaces. this problem is related to the following location: at java.util.Map at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender() at com.atmire.dspace.rest.common.Statlet java.util.Map does not have a no-arg default constructor. 
this problem is related to the following location: at java.util.Map at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender() at com.atmire.dspace.rest.common.Statlet </code></pre> <ul> <li>Then 20 minutes later another outOfMemoryError:</li> </ul> <pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space at java.lang.StringCoding.decode(StringCoding.java:215) </code></pre> <ul> <li>Perhaps these particular issues <em>are</em> memory issues, the munin graphs definitely show some weird purging/allocating behavior starting this week</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/09/tomcat_jvm-day.png" alt="Tomcat JVM usage day" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/09/tomcat_jvm-week.png" alt="Tomcat JVM usage week" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/09/tomcat_jvm-month.png" alt="Tomcat JVM usage month" /></p> <ul> <li>And really, we did reduce the memory of CGSpace in late 2015, so maybe we should just increase it again, now that our usage is higher and we are having memory errors in the logs</li> <li>Oh great, the configuration on the actual server is different than in configuration management!</li> <li>Seems we added a bunch of settings to the <code>/etc/default/tomcat7</code> in December, 2015 and never updated our ansible repository:</li> </ul> <pre><code>JAVA_OPTS=&quot;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&quot; </code></pre> <ul> <li>So I&rsquo;m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)</li> <li>Increased JVM heap to 4096m on CGSpace 
(linode01)</li> </ul> <h2 id="2016-09-15">2016-09-15</h2> <ul> <li>Looking at Google Webmaster Tools again, the work I did on URL query parameters and blocking via the <code>X-Robots-Tag</code> HTTP header in March, 2016 seems to have had a positive effect on Google&rsquo;s index for CGSpace</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/09/google-webmaster-tools-index.png" alt="Google Webmaster Tools for CGSpace" /></p> <h2 id="2016-09-16">2016-09-16</h2> <ul> <li>CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren&rsquo;t on those lines so I&rsquo;m not sure if they are from yesterday:</li> </ul> <pre><code>dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org Thu Sep 15 18:45:25 UTC 2016 | Query:id: 55785 AND type:2 Thu Sep 15 18:45:26 UTC 2016 | Updating : 100/218 docs. Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs. Thu Sep 15 18:45:27 UTC 2016 | Updating : 218/218 docs. 
Commit Commit done Exception in thread &quot;http-bio-127.0.0.1-8081-exec-247&quot; java.lang.OutOfMemoryError: Java heap space Exception in thread &quot;http-bio-127.0.0.1-8081-exec-241&quot; java.lang.OutOfMemoryError: Java heap space Exception in thread &quot;http-bio-127.0.0.1-8081-exec-243&quot; java.lang.OutOfMemoryError: Java heap space Exception in thread &quot;http-bio-127.0.0.1-8081-exec-258&quot; java.lang.OutOfMemoryError: Java heap space Exception in thread &quot;http-bio-127.0.0.1-8081-exec-268&quot; java.lang.OutOfMemoryError: Java heap space Exception in thread &quot;http-bio-127.0.0.1-8081-exec-263&quot; java.lang.OutOfMemoryError: Java heap space Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOfMemoryError: Java heap space Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb -e14ef82ee224 to the index; possible analysis error. 
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102) at com.atmire.statistics.SolrLogThread.run(SourceFile:25) </code></pre> <ul> <li>I bumped the heap space from 4096m to 5120m to see if this is <em>really</em> about heap space or not.</li> <li>Looking into some of these errors that I&rsquo;ve seen this week but haven&rsquo;t noticed before:</li> </ul> <pre><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements' 113 </code></pre> <ul> <li>I&rsquo;ve sent a message to Atmire about the Solr error to see if it&rsquo;s related to their batch update module</li> </ul> <h2 id="2016-09-19">2016-09-19</h2> <ul> <li>Work on cleanups for author affiliations after Peter sent me his list of corrections/deletions:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2-deletions.csv -m 211 -u dspace -d dspace -p fuuu </code></pre> <ul> <li>After that we need to take the top ~300 and make a controlled vocabulary for it</li> <li>I dumped a list of the top 300 affiliations from the database, sorted it alphabetically in OpenRefine, and created a controlled vocabulary for it (<a href="https://github.com/ilri/DSpace/pull/267">#267</a>)</li> </ul> <h2 id="2016-09-20">2016-09-20</h2> <ul> <li>Run all system updates on DSpace Test and reboot the server</li> <li>Merge changes for sponsorship and affiliation 
controlled vocabularies (<a href="https://github.com/ilri/DSpace/pull/267">#267</a>, <a href="https://github.com/ilri/DSpace/pull/268">#268</a>)</li> <li>Merge minor changes to <code>messages.xml</code> to reconcile it with the stock DSpace 5.1 one (<a href="https://github.com/ilri/DSpace/pull/269">#269</a>)</li> <li>Peter asked about adding title search to Discovery</li> <li>The index was already defined, so I just added it to the search filters</li> <li>It works but CGSpace apparently uses <code>OR</code> for search terms, which makes the search results basically useless</li> <li>I need to read the docs and ask on the mailing list to see if we can tweak that</li> <li>Generate a new list of sponsors from the database for Peter Ballantyne so we can clean them up and update the controlled vocabulary</li> </ul> <h2 id="2016-09-21">2016-09-21</h2> <ul> <li>Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: <a href="https://jira.duraspace.org/browse/DS-2809">https://jira.duraspace.org/browse/DS-2809</a></li> <li>We just need to set this in <code>dspace/solr/search/conf/schema.xml</code>:</li> </ul> <pre><code>&lt;solrQueryParser defaultOperator=&quot;AND&quot;/&gt; </code></pre> <ul> <li>It actually works really well, and search results return far fewer hits now (before, after):</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/09/cgspace-search.png" alt="CGSpace search with &quot;OR&quot; boolean logic" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/09/dspacetest-search.png" alt="DSpace Test search with &quot;AND&quot; boolean logic" /></p> <ul> <li>Found a way to improve the configuration of Atmire&rsquo;s Content and Usage Analysis (CUA) module for date fields</li> </ul> <pre><code>-content.analysis.dataset.option.8=metadata:dateAccessioned:discovery +content.analysis.dataset.option.8=metadata:dc.date.accessioned:date(month) </code></pre> <ul> <li>This 
allows the module to treat the field as a date rather than a text string, so we can interrogate it more intelligently</li> <li>Add <code>dc.date.accessioned</code> to XMLUI Discovery search filters</li> <li>Major CGSpace crash because ILRI forgot to pay the Linode bill</li> <li>45 minutes of downtime!</li> <li>Start processing the fixes to <code>dc.description.sponsorship</code> from Peter Ballantyne:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu </code></pre> <ul> <li>I need to run these and the others from a few days ago on CGSpace the next time we run updates</li> <li>Also, I need to update the controlled vocab for sponsors based on these</li> </ul> <h2 id="2016-09-22">2016-09-22</h2> <ul> <li>Update controlled vocabulary for sponsorship based on the latest corrected values from the database</li> </ul> <h2 id="2016-09-25">2016-09-25</h2> <ul> <li>Merge accession date improvements for CUA module (<a href="https://github.com/ilri/DSpace/pull/275">#275</a>)</li> <li>Merge addition of accession date to Discovery search filters (<a href="https://github.com/ilri/DSpace/pull/276">#276</a>)</li> <li>Merge updates to sponsorship controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/277">#277</a>)</li> <li>I&rsquo;ve been trying to add a search filter for <code>dc.description</code> so the IITA people can search for some tags they use there, but for some reason the filter never shows up in Atmire&rsquo;s CUA</li> <li>Not sure if it&rsquo;s something like we already have too many filters there (30), or the filter name is reserved, etc&hellip;</li> <li>Generate a list of ILRI subjects for Peter and Abenet to look through/fix:</li> </ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where 
resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv; </code></pre> <ul> <li>Regenerate Discovery indexes a few times after playing with <code>discovery.xml</code> index definitions (syntax, parameters, etc).</li> <li>Merge changes to boolean logic in Solr search (<a href="https://github.com/ilri/DSpace/pull/274">#274</a>)</li> <li>Run all sponsorship and affiliation fixes on CGSpace, deploy latest <code>5_x-prod</code> branch, and re-index Discovery on CGSpace</li> <li>Tested OCSP stapling on DSpace Test&rsquo;s nginx and it works:</li> </ul> <pre><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status ... OCSP response: ====================================== OCSP Response Data: ... Cert Status: good </code></pre> <ul> <li>I&rsquo;ve been monitoring this for almost two years in this GitHub issue: <a href="https://github.com/ilri/DSpace/issues/38">https://github.com/ilri/DSpace/issues/38</a></li> </ul> <h2 id="2016-09-27">2016-09-27</h2> <ul> <li>Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman</li> <li>This author has a few variations:</li> </ul> <pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%'; </code></pre> <ul> <li>And it looks like <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code> is the authority with the correct ORCID linked</li> </ul> <pre><code>dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%'; UPDATE 101 </code></pre> <ul> <li>Hmm, now her name is missing from the authors facet and only shows the authority ID</li> <li>On the production server there is an item with her ORCID but it is using a different authority: 
f01f7b7b-be3f-4df7-a61d-b73c067de88d</li> <li>Maybe I used the wrong one&hellip; I need to look again at the production database</li> <li>On a clean snapshot of the database I see the correct authority should be <code>f01f7b7b-be3f-4df7-a61d-b73c067de88d</code>, not <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code></li> <li>Updating her authorities again and reindexing:</li> </ul> <pre><code>dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%'; UPDATE 101 </code></pre> <ul> <li>Use GitHub icon from Font Awesome instead of a PNG to save one extra network request</li> <li>We can also replace the RSS and mail icons in community text!</li> <li>Fix reference to <code>dc.type.*</code> in Atmire CUA module, as we now only index <code>dc.type</code> for &ldquo;Output type&rdquo;</li> </ul> <h2 id="2016-09-28">2016-09-28</h2> <ul> <li>Make a placeholder pull request for <code>discovery.xml</code> changes (<a href="https://github.com/ilri/DSpace/pull/278">#278</a>), as I still need to test their effect on Atmire content analysis module</li> <li>Make a placeholder pull request for Font Awesome changes (<a href="https://github.com/ilri/DSpace/pull/279">#279</a>), which replaces the GitHub image in the footer with an icon, and add style for RSS and @ icons that I will start replacing in community/collection HTML intros</li> <li>Had some issues with local test server after messing with Solr too much, had to blow everything away and re-install from CGSpace</li> <li>Going to try to update Sonja Vermeulen&rsquo;s authority to 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0, as that seems to be one of her authorities that has an ORCID</li> <li>Merge Font Awesome changes (<a href="https://github.com/ilri/DSpace/pull/279">#279</a>)</li> <li>Minor fix to a string in Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/pull/280">#280</a>)</li> 
<li>This seems to be what I&rsquo;ll need to do for Sonja Vermeulen (but with <code>2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0</code> instead on the live site):</li> </ul> <pre><code>dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%'; dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%'; </code></pre> <ul> <li>And then update Discovery and Authority indexes</li> <li>Minor fix for &ldquo;Subject&rdquo; string in Discovery search and Atmire modules (<a href="https://github.com/ilri/DSpace/pull/281">#281</a>)</li> <li>Start testing batch fixes for ILRI subject from Peter:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -m 203 -d dspace -u dspace -p fuuu </code></pre> <h2 id="2016-09-29">2016-09-29</h2> <ul> <li>Add <code>cg.identifier.ciatproject</code> to metadata registry in preparation for CIAT project tag</li> <li>Merge changes for CIAT project tag (<a href="https://github.com/ilri/DSpace/pull/282">#282</a>)</li> <li>DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console</li> <li>People on DSpace mailing list gave me a query to get authors from certain collections:</li> </ul> <pre><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473'))); </code></pre> <h2 
id="2016-09-30">2016-09-30</h2> <ul> <li>Deny access to REST API&rsquo;s <code>find-by-metadata-field</code> endpoint to protect against an upstream security issue (DS-3250)</li> <li>There is a patch but it is only for 5.5 and doesn&rsquo;t apply cleanly to 5.1</li> </ul> August, 2016 https://alanorth.github.io/cgspace-notes/2016-08/ Mon, 01 Aug 2016 15:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-08/ <h2 id="2016-08-01">2016-08-01</h2> <ul> <li>Add updated distribution license from Sisay (<a href="https://github.com/ilri/DSpace/issues/259">#259</a>)</li> <li>Play with upgrading Mirage 2 dependencies in <code>bower.json</code> because most are several versions out of date</li> <li>Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more</li> <li>bower stuff is a dead end, waste of time, too many issues</li> <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li> <li>Start working on DSpace 5.1 → 5.5 port:</li> </ul> <pre><code>$ git checkout -b 55new 5_x-prod $ git reset --hard ilri/5_x-prod $ git rebase -i dspace-5.5 </code></pre> <p></p> <ul> <li>Lots of conflicts that don&rsquo;t make sense (ie, shouldn&rsquo;t conflict!)</li> <li>This file in particular conflicts almost 10 times: <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/CGIAR/styles/_style.scss</code></li> <li>Checking out a clean branch at 5.5 and cherry-picking our commits works where that file would normally have a conflict</li> <li>Seems to be related to merge commits</li> <li><code>git rebase --preserve-merges</code> doesn&rsquo;t seem to help</li> <li>Eventually I just turned on git rerere and solved the conflicts and completed the 403 commit rebase</li> <li>The 5.5 code now builds but doesn&rsquo;t run (white page in Tomcat)</li> </ul> <h2 id="2016-08-02">2016-08-02</h2> <ul> <li>Ask Atmire for help with DSpace 5.5 issue</li> 
<li>Vanilla DSpace 5.5 deploys and runs fine</li> <li>Playing with DSpace in Ubuntu 16.04 and Tomcat 7</li> <li>Everything is still fucked up, even vanilla DSpace 5.5</li> </ul> <h2 id="2016-08-04">2016-08-04</h2> <ul> <li>Ask on DSpace mailing list about duplicate authors, Discovery and author text values</li> <li>Atmire responded with some new DSpace 5.5 ready versions to try for their modules</li> </ul> <h2 id="2016-08-05">2016-08-05</h2> <ul> <li>Fix item display incorrectly displaying Species when Breeds were present (<a href="https://github.com/ilri/DSpace/pull/260">#260</a>)</li> <li>Experiment with fixing more authors, like Delia Grace:</li> </ul> <pre><code>dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.'; </code></pre> <h2 id="2016-08-06">2016-08-06</h2> <ul> <li>Finally figured out how to remove &ldquo;View/Open&rdquo; and &ldquo;Bitstreams&rdquo; from the item view</li> </ul> <h2 id="2016-08-07">2016-08-07</h2> <ul> <li>Start working on Ubuntu 16.04 Ansible playbook for Tomcat 8, PostgreSQL 9.5, Oracle 8, etc</li> </ul> <h2 id="2016-08-08">2016-08-08</h2> <ul> <li>Still troubleshooting Atmire modules on DSpace 5.5</li> <li>Vanilla DSpace 5.5 works on Tomcat 7&hellip;</li> <li>Ooh, and vanilla DSpace 5.5 works on Tomcat 8 with Java 8!</li> <li>Some notes about setting up Tomcat 8, since it&rsquo;s new on this machine&hellip;</li> <li>Install latest Oracle Java 8 JDK</li> <li>Create <code>setenv.sh</code> in Tomcat 8 <code>libexec/bin</code> directory: <br /></li> </ul> <pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&quot; CATALINA_OPTS=&quot;$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib&quot; JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home </code></pre> <ul> <li>Edit Tomcat 8 
<code>server.xml</code> to add regular HTTP listener for solr</li> <li>Symlink webapps:</li> </ul> <pre><code>$ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT $ ln -sv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT $ ln -sv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/oai $ ln -sv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/jspui $ ln -sv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/rest $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/solr </code></pre> <h2 id="2016-08-09">2016-08-09</h2> <ul> <li>More tests of Atmire&rsquo;s 5.5 modules on a clean, working instance of <code>5_x-prod</code></li> <li>Still fails, though perhaps differently than before (Flyway): <a href="https://gist.github.com/alanorth/5d49c45a16efd7c6bc1e6642e66118b2">https://gist.github.com/alanorth/5d49c45a16efd7c6bc1e6642e66118b2</a></li> <li>More work on Tomcat 8 and Java 8 stuff for Ansible playbooks</li> </ul> <h2 id="2016-08-10">2016-08-10</h2> <ul> <li>Turns out DSpace 5.x isn&rsquo;t ready for Tomcat 8: <a href="https://jira.duraspace.org/browse/DS-3092">https://jira.duraspace.org/browse/DS-3092</a></li> <li>So we&rsquo;ll need to use Tomcat 7 + Java 8 on Ubuntu 16.04</li> <li>More work on the Ansible stuff for this, allowing Tomcat 7 to use Java 8</li> <li>Merge pull request for fixing the type Discovery index to use <code>dc.type</code> (<a href="https://github.com/ilri/DSpace/pull/262">#262</a>)</li> <li>Merge pull request for removing &ldquo;Bitstream&rdquo; text from item display, as it confuses users and isn&rsquo;t necessary (<a href="https://github.com/ilri/DSpace/pull/263">#263</a>)</li> </ul> <h2 id="2016-08-11">2016-08-11</h2> <ul> <li>Finally got DSpace (5.5) running on Ubuntu 16.04, Tomcat 7, Java 8, PostgreSQL 9.5 via the updated Ansible stuff</li> </ul> <p><img 
src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/08/dspace55-ubuntu16.04.png" alt="DSpace 5.5 on Ubuntu 16.04, Tomcat 7, Java 8, PostgreSQL 9.5" /></p> <h2 id="2016-08-14">2016-08-14</h2> <ul> <li>Update Mirage 2 build notes for Ubuntu 16.04: <a href="https://gist.github.com/alanorth/2cf9c15834dc68a514262fcb04004cb0">https://gist.github.com/alanorth/2cf9c15834dc68a514262fcb04004cb0</a></li> </ul> <h2 id="2016-08-15">2016-08-15</h2> <ul> <li>Notes on NodeJS + nginx + systemd: <a href="https://gist.github.com/alanorth/51acd476891c67dfe27725848cf5ace1">https://gist.github.com/alanorth/51acd476891c67dfe27725848cf5ace1</a></li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/08/nodejs-nginx.png" alt="ExpressJS running behind nginx" /></p> <h2 id="2016-08-16">2016-08-16</h2> <ul> <li>Troubleshoot Paramiko connection issues with Ansible on ILRI servers: <a href="https://github.com/ilri/rmg-ansible-public/issues/37">#37</a></li> <li>Turns out we need to add some MACs to our <code>sshd_config</code>: hmac-sha2-512,hmac-sha2-256</li> <li>Update DSpace Test&rsquo;s Java to version 8 to start testing this configuration (<a href="https://wiki.apache.org/solr/ShawnHeisey">seeing as Solr recommends it</a>)</li> </ul> <h2 id="2016-08-17">2016-08-17</h2> <ul> <li>More work on Let&rsquo;s Encrypt stuff for Ansible roles</li> <li>Yesterday Atmire responded about DSpace 5.5 issues and asked me to try the <code>dspace database repair</code> command to fix Flyway issues</li> <li>The <code>dspace database</code> command doesn&rsquo;t even run: <a href="https://gist.github.com/alanorth/c43c8d89e8df346d32c0ee938be90cd5">https://gist.github.com/alanorth/c43c8d89e8df346d32c0ee938be90cd5</a></li> <li>Oops, it looks like the missing classes causing <code>dspace database</code> to fail were coming from the old <code>~/dspace/config/spring</code> folder</li> <li>After removing the spring folder and running ant install again, <code>dspace 
database</code> works</li> <li>I see there are missing and pending Flyway migrations, but running <code>dspace database repair</code> and <code>dspace database migrate</code> does nothing: <a href="https://gist.github.com/alanorth/41ed5abf2ff32d8ac9eedd1c3d015d70">https://gist.github.com/alanorth/41ed5abf2ff32d8ac9eedd1c3d015d70</a></li> </ul> <h2 id="2016-08-18">2016-08-18</h2> <ul> <li>Fix &ldquo;CONGO,DR&rdquo; country name in <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/264">#264</a>)</li> <li>Also need to fix existing records using the incorrect form in the database:</li> </ul> <pre><code>dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR'; </code></pre> <ul> <li>I asked a question on the DSpace mailing list about updating &ldquo;preferred&rdquo; forms of author names from ORCID</li> </ul> <h2 id="2016-08-21">2016-08-21</h2> <ul> <li>A few days ago someone on the DSpace mailing list suggested I try <code>dspace dsrun org.dspace.authority.UpdateAuthorities</code> to update preferred author names from ORCID</li> <li>If you set <code>auto-update-items=true</code> in <code>dspace/config/modules/solrauthority.cfg</code> it is supposed to update records it finds automatically</li> <li>I updated my name format on ORCID and I&rsquo;ve been running that script a few times per day since then but nothing has changed</li> <li>Still troubleshooting Atmire modules on DSpace 5.5</li> <li>I sent them some new verbose logs: <a href="https://gist.github.com/alanorth/700748995649688148ceba89d760253e">https://gist.github.com/alanorth/700748995649688148ceba89d760253e</a></li> </ul> <h2 id="2016-08-22">2016-08-22</h2> <ul> <li>Database migrations are fine on DSpace 5.1:</li> </ul> <pre><code>$ ~/dspace/bin/dspace database info Database URL: jdbc:postgresql://localhost:5432/dspacetest Database Schema: public Database Software: PostgreSQL version 9.3.14 Database Driver: 
PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 901) +----------------+----------------------------+---------------------+---------+ | Version | Description | Installed on | State | +----------------+----------------------------+---------------------+---------+ | 1.1 | Initial DSpace 1.1 databas | | PreInit | | 1.2 | Upgrade to DSpace 1.2 sche | | PreInit | | 1.3 | Upgrade to DSpace 1.3 sche | | PreInit | | 1.3.9 | Drop constraint for DSpace | | PreInit | | 1.4 | Upgrade to DSpace 1.4 sche | | PreInit | | 1.5 | Upgrade to DSpace 1.5 sche | | PreInit | | 1.5.9 | Drop constraint for DSpace | | PreInit | | 1.6 | Upgrade to DSpace 1.6 sche | | PreInit | | 1.7 | Upgrade to DSpace 1.7 sche | | PreInit | | 1.8 | Upgrade to DSpace 1.8 sche | | PreInit | | 3.0 | Upgrade to DSpace 3.x sche | | PreInit | | 4.0 | Initializing from DSpace 4 | 2015-11-20 12:42:52 | Success | | 5.0.2014.08.08 | DS-1945 Helpdesk Request a | 2015-11-20 12:42:53 | Success | | 5.0.2014.09.25 | DS 1582 Metadata For All O | 2015-11-20 12:42:55 | Success | | 5.0.2014.09.26 | DS-1582 Metadata For All O | 2015-11-20 12:42:55 | Success | | 5.0.2015.01.27 | MigrateAtmireExtraMetadata | 2015-11-20 12:43:29 | Success | | 5.1.2015.12.03 | Atmire CUA 4 migration | 2016-03-21 17:10:41 | Success | | 5.1.2015.12.03 | Atmire MQM migration | 2016-03-21 17:10:42 | Success | +----------------+----------------------------+---------------------+---------+ </code></pre> <ul> <li>So I&rsquo;m not sure why they have problems when we move to DSpace 5.5 (even the 5.1 migrations themselves show as &ldquo;Missing&rdquo;)</li> </ul> <h2 id="2016-08-23">2016-08-23</h2> <ul> <li>Help Paola from CCAFS with her thumbnails again</li> <li>Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB</li> <li>They said I should delete the Atmire migrations <br /></li> </ul> <pre><code>dspacetest=# delete from schema_version where description = 'Atmire CUA 4 migration' and 
version='5.1.2015.12.03.2'; dspacetest=# delete from schema_version where description = 'Atmire MQM migration' and version='5.1.2015.12.03.3'; </code></pre> <ul> <li>After that DSpace starts up but XMLUI now has unrelated issues that I need to solve!</li> </ul> <pre><code>org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77 context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77 </code></pre> <ul> <li>Looks like we&rsquo;re missing some stuff in the XMLUI module&rsquo;s <code>sitemap.xmap</code>, as well as in each of our XMLUI themes</li> <li>Diff them with these to get the <code>ThemeResourceReader</code> changes: <ul> <li><code>dspace-xmlui/src/main/webapp/sitemap.xmap</code></li> <li><code>dspace-xmlui-mirage2/src/main/webapp/sitemap.xmap</code></li> </ul></li> <li>Then we had a NullPointerException from the SolrLogger class, which is apparently part of Atmire&rsquo;s CUA module</li> <li>I tried with a small version bump to CUA but it didn&rsquo;t work (version <code>5.5-4.1.1-0</code>)</li> <li>Also, I started looking into huge pages to prepare for PostgreSQL 9.5, but it seems Linode&rsquo;s kernels don&rsquo;t enable them</li> </ul> <h2 id="2016-08-24">2016-08-24</h2> <ul> <li>Clean up and import 48 CCAFS records into DSpace Test</li> <li>SQL to get all journal titles from dc.source (55), since it&rsquo;s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:</li> </ul> <pre><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*'; </code></pre> <h2 id="2016-08-25">2016-08-25</h2> <ul> <li>Atmire suggested adding a missing bean 
to <code>dspace/config/spring/api/atmire-cua.xml</code> but it doesn&rsquo;t help:</li> </ul> <pre><code>... Error creating bean with name 'MetadataStorageInfoService' ... </code></pre> <ul> <li>Atmire sent an updated version of <code>dspace/config/spring/api/atmire-cua.xml</code> and now XMLUI starts but gives a null pointer exception:</li> </ul> <pre><code>Java stacktrace: java.lang.NullPointerException at org.dspace.app.xmlui.aspect.statistics.Navigation.addOptions(Navigation.java:129) at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228) at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71) at com.sun.proxy.$Proxy103.startElement(Unknown Source) at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140) at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140) at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94) ... 
</code></pre> <ul> <li>Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:</li> </ul> <pre><code>$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv $ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map </code></pre> <ul> <li>Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs</li> </ul> <h2 id="2016-08-26">2016-08-26</h2> <ul> <li>CGSpace had issues tonight, not entirely crashing, but becoming unresponsive</li> <li>The dspace log had this:</li> </ul> <pre><code>2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object </code></pre> <ul> <li>Related to /rest no doubt</li> </ul> <h2 id="2016-08-27">2016-08-27</h2> <ul> <li>Run corrections for Delia Grace and <code>CONGO, DR</code>, and deploy August changes to CGSpace</li> <li>Run all system updates and reboot the server</li> </ul> July, 2016 https://alanorth.github.io/cgspace-notes/2016-07/ Fri, 01 Jul 2016 10:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-07/ <h2 id="2016-07-01">2016-07-01</h2> <ul> <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li> <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li> </ul> <pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$'; UPDATE 95 dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and 
resource_type_id=2 and text_value ~ '^.+?,$'; text_value ------------ (0 rows) </code></pre> <ul> <li>In this case the select query was showing 95 results before the update</li> </ul> <p></p> <h2 id="2016-07-02">2016-07-02</h2> <ul> <li>Comment on DSpace Jira ticket about author lookup search text (<a href="https://jira.duraspace.org/browse/DS-2329">DS-2329</a>)</li> </ul> <h2 id="2016-07-04">2016-07-04</h2> <ul> <li>Seems the database&rsquo;s author authority values mean nothing without the <code>authority</code> Solr core from the host where they were created!</li> </ul> <h2 id="2016-07-05">2016-07-05</h2> <ul> <li>Amend <code>backup-solr.sh</code> script so it backs up the entire Solr folder</li> <li>We <em>really</em> only need <code>statistics</code> and <code>authority</code> but meh</li> <li>Fix metadata for species on DSpace Test:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu' </code></pre> <ul> <li>Will run later on CGSpace</li> <li>A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is &ldquo;ungraded&rdquo;</li> <li>I tested the <a href="https://jira.duraspace.org/browse/DS-2740">patch for DS-2740</a> that I had found last month and it seems to work</li> <li>I will merge it to <code>5_x-prod</code></li> </ul> <h2 id="2016-07-06">2016-07-06</h2> <ul> <li>Delete 23 blank metadata values from CGSpace:</li> </ul> <pre><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value=''; DELETE 23 </code></pre> <ul> <li>Complete phase three of metadata migration, for the following fields: <ul> <li>dc.title.jtitle → dc.source</li> <li>dc.crsubject.crpsubject → cg.contributor.crp</li> <li>dc.contributor.affiliation → cg.contributor.affiliation</li> <li>dc.Species → cg.species</li> <li>dc.srplace.subregion → cg.coverage.subregion</li> <li>dc.contributor.corporate → 
dc.contributor.author</li> <li>dc.identifier.url → cg.identifier.url</li> <li>dc.identifier.doi → cg.identifier.doi</li> <li>dc.identifier.googleurl → cg.identifier.googleurl</li> <li>dc.identifier.dataurl → cg.identifier.dataurl</li> </ul></li> <li>Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)</li> </ul> <pre><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu' $ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu' $ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu' </code></pre> <ul> <li>I then ran all server updates and rebooted the server</li> </ul> <h2 id="2016-07-11">2016-07-11</h2> <ul> <li>Doing some author cleanups from Peter and Abenet:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu </code></pre> <h2 id="2016-07-13">2016-07-13</h2> <ul> <li>Run the author cleanups on CGSpace and start a full Discovery re-index</li> </ul> <h2 id="2016-07-14">2016-07-14</h2> <ul> <li>Test LDAP settings for new root LDAP</li> <li>Seems to work when binding as a top-level user</li> </ul> <h2 id="2016-07-18">2016-07-18</h2> <ul> <li>Adjust identifiers in XMLUI item display to be more prominent</li> <li>Add species and breed to the XMLUI item display</li> <li>CGSpace crashed late at night and the DSpace logs were showing:</li> </ul> <pre><code>2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object ... 
</code></pre> <ul> <li>I suspect it&rsquo;s someone hitting REST too much:</li> </ul> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3 710 66.249.78.38 1781 181.118.144.29 24904 70.32.99.142 </code></pre> <ul> <li>I just blocked access to <code>/rest</code> for that last IP for now:</li> </ul> <pre><code> # log rest requests location /rest { access_log /var/log/nginx/rest.log; proxy_pass http://127.0.0.1:8443; deny 70.32.99.142; } </code></pre> <h2 id="2016-07-21">2016-07-21</h2> <ul> <li>Mitigate the <a href="https://httpoxy.org">HTTPoxy</a> vulnerability for Tomcat etc in nginx: <a href="https://github.com/ilri/rmg-ansible-public/pull/38">https://github.com/ilri/rmg-ansible-public/pull/38</a></li> <li>Unblock 70.32.99.142 from <code>/rest</code> as it has been blocked for a few days</li> </ul> <h2 id="2016-07-22">2016-07-22</h2> <ul> <li>Help Paola from CCAFS with thumbnails for batch uploads</li> <li>She has been struggling to get the dimensions right, and manually enlarging smaller thumbnails, renaming PNGs to JPG, etc</li> <li>Altmetric reports having an issue with some of our authors being doubled&hellip;</li> <li>This is related to authority and confidence!</li> <li>We might need to use <code>index.authority.ignore-prefered=true</code> to tell the Discovery index to prefer the variation that exists in the metadatavalue rather than what it finds in the authority cache.</li> <li>Trying these on DSpace Test after a discussion by Daniel Scharon on the dspace-tech mailing list:</li> </ul> <pre><code>index.authority.ignore-prefered.dc.contributor.author=true index.authority.ignore-variants.dc.contributor.author=false </code></pre> <ul> <li>After reindexing I don&rsquo;t see any change in Discovery&rsquo;s display of authors, and still have entries like:</li> </ul> <pre><code>Grace, D. (464) Grace, D. 
(62) </code></pre> <ul> <li>I asked for clarification of the following options on the DSpace mailing list:</li> </ul> <pre><code>index.authority.ignore index.authority.ignore-prefered index.authority.ignore-variants </code></pre> <ul> <li>In the meantime, I will try these on DSpace Test (plus a reindex):</li> </ul> <pre><code>index.authority.ignore=true index.authority.ignore-prefered=true index.authority.ignore-variants=true </code></pre> <ul> <li>Enabled usage of <code>X-Forwarded-For</code> in DSpace admin control panel (<a href="https://github.com/ilri/DSpace/pull/255">#255</a>)</li> <li>It was misconfigured and disabled, but already working for some reason <em>sigh</em></li> <li>&hellip; no luck. Trying with just:</li> </ul> <pre><code>index.authority.ignore=true </code></pre> <ul> <li>After re-indexing and clearing the XMLUI cache nothing has changed</li> </ul> <h2 id="2016-07-25">2016-07-25</h2> <ul> <li>Trying a few more settings (plus reindex) for Discovery on DSpace Test:</li> </ul> <pre><code>index.authority.ignore-prefered.dc.contributor.author=true index.authority.ignore-variants=true </code></pre> <ul> <li>Run all OS updates and reboot DSpace Test server</li> <li>No changes to Discovery after reindexing&hellip; hmm.</li> <li>Integrate and massively clean up About page (<a href="https://github.com/ilri/DSpace/pull/256">#256</a>)</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/07/cgspace-about-page.png" alt="About page" /></p> <ul> <li>The DSpace source code mentions the configuration key <code>discovery.index.authority.ignore-prefered.*</code> (with prefix of discovery, despite the docs saying otherwise), so I&rsquo;m trying the following on DSpace Test:</li> </ul> <pre><code>discovery.index.authority.ignore-prefered.dc.contributor.author=true discovery.index.authority.ignore-variants=true </code></pre> <ul> <li>Still no change!</li> <li>Deploy species, breed, and identifier changes to CGSpace, as well as About 
page</li> <li>Run Linode RAM upgrade (8→12GB)</li> <li>Re-sync DSpace Test with CGSpace</li> <li>I noticed that our backup scripts don&rsquo;t send Solr cores to S3 so I amended the script</li> </ul> <h2 id="2016-07-31">2016-07-31</h2> <ul> <li>Work on removing Dryland Systems and Humidtropics subjects from Discovery sidebar and Browse by</li> <li>Also change &ldquo;Subjects&rdquo; to &ldquo;AGROVOC keywords&rdquo; in Discovery sidebar/search and Browse by (<a href="https://github.com/ilri/DSpace/issues/257">#257</a>)</li> </ul> June, 2016 https://alanorth.github.io/cgspace-notes/2016-06/ Wed, 01 Jun 2016 10:53:00 +0300 https://alanorth.github.io/cgspace-notes/2016-06/ <h2 id="2016-06-01">2016-06-01</h2> <ul> <li>Experimenting with IFPRI OAI (we want to harvest their publications)</li> <li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI&rsquo;s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li> <li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li> <li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc</a></li> <li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li> <li>Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in <code>dc.identifier.fund</code> to <code>cg.identifier.cpwfproject</code> and then the rest 
to <code>dc.description.sponsorship</code></li> </ul> <p></p> <pre><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA'); UPDATE 497 dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75; UPDATE 14 </code></pre> <ul> <li>Fix a few minor miscellaneous issues in <code>dspace.cfg</code> (<a href="https://github.com/ilri/DSpace/pull/227">#227</a>)</li> </ul> <h2 id="2016-06-02">2016-06-02</h2> <ul> <li>Testing the configuration and theme changes for the upcoming metadata migration, I found some issues with <code>cg.coverage.admin-unit</code></li> <li>Seems that the Browse configuration in <code>dspace.cfg</code> can&rsquo;t handle the &lsquo;-&rsquo; in the field name:</li> </ul> <pre><code>webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text </code></pre> <ul> <li>But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error</li> <li>I&rsquo;ve sent a message to the DSpace mailing list to ask about the Browse index definition</li> <li>A user was having problems with submission and from the stacktrace it looks like a Sherpa/Romeo issue</li> <li>I found a thread on the mailing list talking about it and there is a bug report and a patch: <a href="https://jira.duraspace.org/browse/DS-2740">https://jira.duraspace.org/browse/DS-2740</a></li> <li>The patch applies successfully on DSpace 5.1 so I will try it later</li> </ul> <h2 id="2016-06-03">2016-06-03</h2> <ul> <li>Investigating the CCAFS authority issue, I exported the metadata for the Videos collection</li> <li>The top two authors are:</li> </ul> <pre><code>CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500 CGIAR Research Program on Climate Change, Agriculture and Food 
Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600 </code></pre> <ul> <li>So the only difference is the &ldquo;confidence&rdquo;</li> <li>Ok, well THAT is interesting:</li> </ul> <pre><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %'; text_value | authority | confidence ------------+--------------------------------------+------------ Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, Alan | | -1 Orth, Alan | | -1 Orth, Alan | | -1 Orth, Alan | | -1 Orth, A. | 05c2c622-d252-4efb-b9ed-95a07d3adf11 | -1 Orth, A. | 05c2c622-d252-4efb-b9ed-95a07d3adf11 | -1 Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1 Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 | 600 Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 | 600 (13 rows) </code></pre> <ul> <li>And now an actually relevant example:</li> </ul> <pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500; count ------- 707 (1 row) dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500; count ------- 253 (1 row) </code></pre> <ul> <li>Trying something experimental:</li> </ul> <pre><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security'; UPDATE 960 </code></pre> <ul> <li>And then re-indexing authority and Discovery&hellip;?</li> <li>After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet</li> <li>The docs for the ORCiD and Authority stuff for 
DSpace 5 mention changing the browse indexes to use the Authority as well:</li> </ul> <pre><code>webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority </code></pre> <ul> <li>That would only be for the &ldquo;Browse by&rdquo; function&hellip; so we&rsquo;ll have to see what effect that has later</li> </ul> <h2 id="2016-06-04">2016-06-04</h2> <ul> <li>Re-sync DSpace Test with CGSpace and perform test of metadata migration again</li> <li>Run phase two of metadata migrations on CGSpace (see the <a href="https://gist.github.com/alanorth/1a730bec5ac9457a8fb0e3e72c98d09c">migration notes</a>)</li> <li>Run all system updates and reboot CGSpace server</li> </ul> <h2 id="2016-06-07">2016-06-07</h2> <ul> <li>Figured out how to export a list of the unique values from a metadata field ordered by count:</li> </ul> <pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv; </code></pre> <ul> <li><p>Identified the next round of fields to migrate:</p> <ul> <li>dc.title.jtitle → dc.source</li> <li>dc.crsubject.crpsubject → cg.contributor.crp</li> <li>dc.contributor.affiliation → cg.contributor.affiliation</li> <li>dc.Species → cg.species</li> <li>dc.contributor.corporate → dc.contributor</li> <li>dc.identifier.url → cg.identifier.url</li> <li>dc.identifier.doi → cg.identifier.doi</li> <li>dc.identifier.googleurl → cg.identifier.googleurl</li> <li>dc.identifier.dataurl → cg.identifier.dataurl</li> </ul></li> <li><p>Discuss pulling data from IFPRI&rsquo;s ContentDM with Ryan Miller</p></li> <li><p>Looks like OAI is kinda obtuse for this, and if we use ContentDM&rsquo;s API we&rsquo;ll be able to access their internal field names (rather than trying to figure out how they stuffed them into various, repeated Dublin Core fields)</p></li> </ul> <h2 id="2016-06-08">2016-06-08</h2> <ul> <li>Discuss controlled vocabularies 
for ~28 fields</li> <li>Looks like this is all we need: <a href="https://wiki.duraspace.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies">https://wiki.duraspace.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies</a></li> <li>I wrote an XPath expression to extract the ILRI subjects from <code>input-forms.xml</code> (uses xmlstarlet):</li> </ul> <pre><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml </code></pre> <ul> <li>Write to Atmire about the use of <code>atmire.orcid.id</code> to see if we can change it</li> <li>Seems to be a virtual field that is queried from the authority cache&hellip; hmm</li> <li>In other news, I found out that the About page that we haven&rsquo;t been using lives in <code>dspace/config/about.xml</code>, so now we can update the text</li> <li>File bug about <code>closed=&quot;true&quot;</code> attribute of controlled vocabularies not working: <a href="https://jira.duraspace.org/browse/DS-3238">https://jira.duraspace.org/browse/DS-3238</a></li> </ul> <h2 id="2016-06-09">2016-06-09</h2> <ul> <li>Atmire explained that the <code>atmire.orcid.id</code> field doesn&rsquo;t exist in the schema, as it actually comes from the authority cache during XMLUI run time</li> <li>This means we don&rsquo;t see it when harvesting via OAI or REST, for example</li> <li>They opened a feature ticket on the DSpace tracker to ask for support of this: <a href="https://jira.duraspace.org/browse/DS-3239">https://jira.duraspace.org/browse/DS-3239</a></li> </ul> <h2 id="2016-06-10">2016-06-10</h2> <ul> <li>Investigating authority confidences</li> <li>It looks like the values are documented in <code>Choices.java</code></li> <li>Experiment with setting all 960 CCAFS author values to be 500:</li> </ul> <pre><code>dspacetest=# SELECT authority, confidence FROM 
metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security'; dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security'; UPDATE 960 </code></pre> <ul> <li>After the database edit, I did a full Discovery re-index</li> <li>And now there are exactly 960 items in the authors facet for &lsquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rsquo;</li> <li>Now I ran the same on CGSpace</li> <li>Merge controlled vocabulary functionality for animal breeds to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/236">#236</a>)</li> <li>Write python script to update metadata values in batch via PostgreSQL: <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a></li> <li>We need to use this to correct some pretty ugly values in fields like <code>dc.description.sponsorship</code></li> <li>Merge item display tweaks from earlier this week (<a href="https://github.com/ilri/DSpace/pull/231">#231</a>)</li> <li>Merge controlled vocabulary functionality for subregions (<a href="https://github.com/ilri/DSpace/pull/238">#238</a>)</li> </ul> <h2 id="2016-06-11">2016-06-11</h2> <ul> <li>Merge controlled vocabulary for sponsorship field (<a href="https://github.com/ilri/DSpace/pull/239">#239</a>)</li> <li>Fix character encoding issues for animal breed lookup that I merged yesterday</li> </ul> <h2 id="2016-06-17">2016-06-17</h2> <ul> <li>Linode has free RAM upgrades for their 13th birthday so I migrated DSpace Test (4→8GB of RAM)</li> </ul> <h2 id="2016-06-18">2016-06-18</h2> <ul> <li>Clean up titles and hints in <code>input-forms.xml</code> to use title/sentence case and a few more consistency things (<a href="https://github.com/ilri/DSpace/pull/241">#241</a>)</li> <li><p>The 
final list of fields to migrate in the third phase of metadata migrations is:</p> <ul> <li>dc.title.jtitle → dc.source</li> <li>dc.crsubject.crpsubject → cg.contributor.crp</li> <li>dc.contributor.affiliation → cg.contributor.affiliation</li> <li>dc.srplace.subregion → cg.coverage.subregion</li> <li>dc.Species → cg.species</li> <li>dc.contributor.corporate → dc.contributor</li> <li>dc.identifier.url → cg.identifier.url</li> <li>dc.identifier.doi → cg.identifier.doi</li> <li>dc.identifier.googleurl → cg.identifier.googleurl</li> <li>dc.identifier.dataurl → cg.identifier.dataurl</li> </ul></li> <li><p>Interesting &ldquo;Sunburst&rdquo; visualization on a Digital Commons page: <a href="http://www.repository.law.indiana.edu/sunburst.html">http://www.repository.law.indiana.edu/sunburst.html</a></p></li> <li><p>Final testing on metadata fix/delete for <code>dc.description.sponsorship</code> cleanup</p></li> <li><p>Need to run <code>fix-metadata-values.py</code> and then <code>delete-metadata-values.py</code></p></li> </ul> <h2 id="2016-06-20">2016-06-20</h2> <ul> <li>CGSpace&rsquo;s HTTPS certificate expired last night and I didn&rsquo;t notice, had to renew:</li> </ul> <pre><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &quot;/usr/bin/service nginx stop&quot; --post-hook &quot;/usr/bin/service nginx start&quot; </code></pre> <ul> <li>I really need to fix that cron job&hellip;</li> </ul> <h2 id="2016-06-24">2016-06-24</h2> <ul> <li>Run the replacements/deletes for <code>dc.description.sponsorship</code> (investors) on CGSpace:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace </code></pre> <ul> <li>The scripts for this are here: <ul> <li><a 
href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a></li> <li><a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a></li> </ul></li> <li>Add new sponsors to controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/244">#244</a>)</li> <li>Refine submission form labels and hints</li> </ul> <h2 id="2016-06-28">2016-06-28</h2> <ul> <li>Testing the cleanup of <code>dc.contributor.corporate</code> with 13 deletions and 121 replacements</li> <li>There are still ~97 fields that weren&rsquo;t indicated to do anything</li> <li>After the above deletions and replacements I regenerated a CSV and sent it to Peter <em>et al</em> to have a look</li> </ul> <pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv; </code></pre> <ul> <li>Re-evaluate <code>dc.contributor.corporate</code> and it seems we will move it to <code>dc.contributor.author</code> as this is more in line with how editors are actually using it</li> </ul> <h2 id="2016-06-29">2016-06-29</h2> <ul> <li>Test run of <code>migrate-fields.sh</code> with the following re-mappings:</li> </ul> <pre><code>72 55 #dc.source 86 230 #cg.contributor.crp 91 211 #cg.contributor.affiliation 94 212 #cg.species 107 231 #cg.coverage.subregion 126 3 #dc.contributor.author 73 219 #cg.identifier.url 74 220 #cg.identifier.doi 79 222 #cg.identifier.googleurl 89 223 #cg.identifier.dataurl </code></pre> <ul> <li>Run all cleanups and deletions of <code>dc.contributor.corporate</code> on CGSpace:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu' $ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u 
cgspace -p 'fuuu' $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu' </code></pre> <ul> <li>Re-deploy CGSpace and DSpace Test with latest June changes</li> <li>Now the sharing and Altmetric bits are more prominent:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/06/xmlui-altmetric-sharing.png" alt="DSpace 5.1 XMLUI With Altmetric Badge" /></p> <ul> <li>Run all system updates on the servers and reboot</li> <li>Start working on config changes for phase three of the metadata migrations</li> </ul> <h2 id="2016-06-30">2016-06-30</h2> <ul> <li>Wow, there are 95 authors in the database who have &lsquo;,&rsquo; at the end of their name:</li> </ul> <pre><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,'; </code></pre> <ul> <li>We need to use something like this to fix them, need to write a proper regex later:</li> </ul> <pre><code># update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,'; </code></pre> May, 2016 https://alanorth.github.io/cgspace-notes/2016-05/ Sun, 01 May 2016 23:06:00 +0300 https://alanorth.github.io/cgspace-notes/2016-05/ <h2 id="2016-05-01">2016-05-01</h2> <ul> <li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li> <li>I have blocked access to the API now</li> <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li> </ul> <pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168 </code></pre> <p></p> <ul> <li>The two most frequent requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29</li> <li>100% of the requests coming from Ethiopia are like this and result in an HTTP 500:</li> </ul> <pre><code>GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1 </code></pre> <ul> <li>For now I&rsquo;ll block just the 
Ethiopian IP</li> <li>The owner of that application has said that the <code>NaN</code> (not a number) is an error in his code and he&rsquo;ll fix it</li> </ul> <h2 id="2016-05-03">2016-05-03</h2> <ul> <li>Update nginx to 1.10.x branch on CGSpace</li> <li>Fix a reference to <code>dc.type.output</code> in Discovery that I had missed when we migrated to <code>dc.type</code> last month (<a href="https://github.com/ilri/DSpace/pull/223">#223</a>)</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/05/discovery-types.png" alt="Item type in Discovery results" /></p> <h2 id="2016-05-06">2016-05-06</h2> <ul> <li>DSpace Test is down, <code>catalina.out</code> has lots of messages about heap space from some time yesterday (!)</li> <li>It looks like Sisay was doing some batch imports</li> <li>Hmm, also disk space is full</li> <li>I decided to blow away the solr indexes, since they are 50GB and we don&rsquo;t really need all the Atmire stuff there right now</li> <li>I will re-generate the Discovery indexes after re-deploying</li> <li>Testing <code>renew-letsencrypt.sh</code> script for nginx</li> </ul> <pre><code>#!/usr/bin/env bash readonly SERVICE_BIN=/usr/sbin/service readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto # stop nginx so LE can listen on port 443 $SERVICE_BIN nginx stop $LETSENCRYPT_BIN renew -nvv --standalone --standalone-supported-challenges tls-sni-01 &gt; /var/log/letsencrypt/renew.log 2&gt;&amp;1 LE_RESULT=$? 
$SERVICE_BIN nginx start if [[ &quot;$LE_RESULT&quot; != 0 ]]; then echo 'Automated renewal failed:' cat /var/log/letsencrypt/renew.log exit 1 fi </code></pre> <ul> <li>Seems to work well</li> </ul> <h2 id="2016-05-10">2016-05-10</h2> <ul> <li>Start looking at more metadata migrations</li> <li>There are lots of fields in <code>dcterms</code> namespace that look interesting, like: <ul> <li>dcterms.type</li> <li>dcterms.spatial</li> </ul></li> <li>Not sure what <code>dcterms</code> is&hellip;</li> <li>Looks like these were <a href="https://wiki.duraspace.org/display/DSDOC5x/Metadata+and+Bitstream+Format+Registries#MetadataandBitstreamFormatRegistries-DublinCoreTermsRegistry(DCTERMS)">added in DSpace 4</a> to allow for future work to make DSpace more flexible</li> <li>CGSpace&rsquo;s <code>dc</code> registry has 96 items, and the default DSpace one has 73.</li> </ul> <h2 id="2016-05-11">2016-05-11</h2> <ul> <li><p>Identify and propose the next phase of CGSpace fields to migrate:</p> <ul> <li>dc.title.jtitle → cg.title.journal</li> <li>dc.identifier.status → cg.identifier.status</li> <li>dc.river.basin → cg.river.basin</li> <li>dc.Species → cg.species</li> <li>dc.targetaudience → cg.targetaudience</li> <li>dc.fulltextstatus → cg.fulltextstatus</li> <li>dc.editon → cg.edition</li> <li>dc.isijournal → cg.isijournal</li> </ul></li> <li><p>Start a test rebase of the <code>5_x-prod</code> branch on top of the <code>dspace-5.5</code> tag</p></li> <li><p>There were a handful of conflicts that I didn&rsquo;t understand</p></li> <li><p>After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:</p></li> </ul> <pre><code>[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) 
-&gt; [Help 1] </code></pre> <ul> <li>I&rsquo;ve sent them a question about it</li> <li>A user mentioned having problems with uploading a 33 MB PDF</li> <li>I told her I would increase the limit temporarily tomorrow morning</li> <li>Turns out she was able to decrease the size of the PDF so we didn&rsquo;t have to do anything</li> </ul> <h2 id="2016-05-12">2016-05-12</h2> <ul> <li>Looks like the issue that Abenet was having a few days ago with &ldquo;Connection Reset&rdquo; in Firefox might be due to a Firefox 46 issue: <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1268775">https://bugzilla.mozilla.org/show_bug.cgi?id=1268775</a></li> <li>I finally found a copy of the latest CG Core metadata guidelines and it looks like we can add a few more fields to our next migration: <ul> <li>dc.rplace.region → cg.coverage.region</li> <li>dc.cplace.country → cg.coverage.country</li> </ul></li> <li>Questions for CG people: <ul> <li>Our <code>dc.place</code> and <code>dc.srplace.subregion</code> could both map to <code>cg.coverage.admin-unit</code>?</li> <li>Should we use <code>dc.contributor.crp</code> or <code>cg.contributor.crp</code> for the CRP (ours is <code>dc.crsubject.crpsubject</code>)?</li> <li>Our <code>dc.contributor.affiliation</code> and <code>dc.contributor.corporate</code> could both map to <code>dc.contributor</code> and possibly <code>dc.contributor.center</code> depending on if it&rsquo;s a CG center or not</li> <li><code>dc.title.jtitle</code> could either map to <code>dc.publisher</code> or <code>dc.source</code> depending on how you read things</li> </ul></li> <li>Found ~200 messed up CIAT values in <code>dc.publisher</code>:</li> </ul> <pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &quot;% %&quot;; </code></pre> <h2 id="2016-05-13">2016-05-13</h2> <ul> <li>More theorizing about CGcore</li> <li>Add two new fields: <ul> <li>dc.srplace.subregion → cg.coverage.admin-unit</li> 
<li>dc.place → cg.place</li> </ul></li> <li><code>dc.place</code> is our own field, so it&rsquo;s easy to move</li> <li>I&rsquo;ve removed <code>dc.title.jtitle</code> from the list for now because there&rsquo;s no use moving it out of DC until we know where it will go (see discussion yesterday)</li> </ul> <h2 id="2016-05-18">2016-05-18</h2> <ul> <li>Work on 707 CCAFS records</li> <li>They have thumbnails on Flickr and elsewhere</li> <li>In OpenRefine I created a new <code>filename</code> column based on the <code>thumbnail</code> column with the following GREL:</li> </ul> <pre><code>if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1]) </code></pre> <ul> <li>Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL</li> <li>So for the <code>hqdefault.jpg</code> ones I just take the UUID (-2) and use it as the filename</li> <li>Before importing with SAFBuilder I tested adding &ldquo;__bundle:THUMBNAIL&rdquo; to the <code>filename</code> column and it works fine</li> </ul> <h2 id="2016-05-19">2016-05-19</h2> <ul> <li>More quality control on <code>filename</code> field of CCAFS records to make processing in shell and SAFBuilder more reliable:</li> </ul> <pre><code>value.replace('_','').replace('-','') </code></pre> <ul> <li>We need to hold off on moving <code>dc.Species</code> to <code>cg.species</code> because it is only used for plants, and might be better to move it to something like <code>cg.species.plant</code></li> <li>And <code>dc.identifier.fund</code> is MOSTLY used for CPWF project identifier but has some other sponsorship things <ul> <li>We should move PN*, SG*, CBA, IA, and PHASE* values to <code>cg.identifier.cpwfproject</code></li> <li>The rest, like BMGF and USAID etc, might have to go to either <code>dc.description.sponsorship</code> or <code>cg.identifier.fund</code> (not sure yet)</li> <li>There are also some 
mistakes in CPWF&rsquo;s things, like &ldquo;PN 47&rdquo;</li> <li>This ought to catch all the CPWF values (there don&rsquo;t appear to be any SG* values):</li> </ul></li> </ul> <pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA'); </code></pre> <h2 id="2016-05-20">2016-05-20</h2> <ul> <li>More work on CCAFS Video and Images records</li> <li>For SAFBuilder we need to modify filename column to have the thumbnail bundle: <br /></li> </ul> <pre><code>value + &quot;__bundle:THUMBNAIL&quot; </code></pre> <ul> <li>Also, I fixed some weird characters using OpenRefine&rsquo;s transform with the following GREL:</li> </ul> <pre><code>value.replace(/\u0081/,'') </code></pre> <ul> <li>Write shell script to resize thumbnails with height larger than 400: <a href="https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256">https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256</a></li> <li>Upload 707 CCAFS records to DSpace Test</li> <li>A few miscellaneous fixes for XMLUI display niggles (spaces in item lists and link target <code>_black</code>): <a href="https://github.com/ilri/DSpace/pull/224">#224</a></li> <li>Work on configuration changes for Phase 2 metadata migrations</li> </ul> <h2 id="2016-05-23">2016-05-23</h2> <ul> <li>Try to import the CCAFS Images and Videos to CGSpace but had some issues with LibreOffice and OpenRefine</li> <li>LibreOffice excludes empty cells when it exports and all the fields shift over to the left and cause URLs to go to Subjects, etc.</li> <li>Google Docs does this better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don&rsquo;t match!</li> <li>I will have to try later</li> </ul> <h2 id="2016-05-30">2016-05-30</h2> <ul> <li>Export CCAFS video and image records from DSpace Test using the migrate option (<code>-m</code>):</li> </ul> <pre><code>$ 
mkdir ~/ccafs-images $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m </code></pre> <ul> <li>And then import to CGSpace:</li> </ul> <pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log </code></pre> <ul> <li>But now we have double authors for &ldquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rdquo; in the authority</li> <li>I&rsquo;m trying to do a Discovery index before messing with the authority index</li> <li>Looks like we are missing the <code>index-authority</code> cron job, so who knows what&rsquo;s up with our authority index</li> <li>Run system updates on DSpace Test, re-deploy code, and reboot the server</li> <li>Clean up and import ~200 CTA records to CGSpace via CSV like:</li> </ul> <pre><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &amp;&gt; ~/CTA-May30/CTA-42229.log </code></pre> <ul> <li>Discovery indexing took a few hours for some reason, and after that I started the <code>index-authority</code> script</li> </ul> <pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace index-authority </code></pre> <h2 id="2016-05-31">2016-05-31</h2> <ul> <li>The <code>index-authority</code> script ran over night and was finished in the morning</li> <li>Hopefully this was because we haven&rsquo;t been running it regularly and it will speed up next time</li> <li>I am running it again with a timer to see:</li> </ul> <pre><code>$ time /home/cgspace.cgiar.org/bin/dspace index-authority Retrieving all data Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer Cleaning the old index Writing new data All done 
! real 37m26.538s user 2m24.627s sys 0m20.540s </code></pre> <ul> <li>Update <code>tomcat7</code> crontab on CGSpace and DSpace Test to have the <code>index-authority</code> script that we were missing</li> <li>Add new ILRI subject and CCAFS project tags to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/226">#226</a>, <a href="https://github.com/ilri/DSpace/pull/225">#225</a>)</li> <li>Manually mapped the authors of a few old CCAFS records to the new CCAFS authority UUID and re-indexed authority indexes to see if it helps correct those items.</li> <li>Re-sync DSpace Test data with CGSpace</li> <li>Clean up and import ~65 more CTA items into CGSpace</li> </ul> April, 2016 https://alanorth.github.io/cgspace-notes/2016-04/ Mon, 04 Apr 2016 11:06:00 +0300 https://alanorth.github.io/cgspace-notes/2016-04/ <h2 id="2016-04-04">2016-04-04</h2> <ul> <li>Looking at log file use on CGSpace and noticed that we need to work on our cron setup a bit</li> <li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li> <li>After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, let alone one from last year!</li> <li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li> <li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li> </ul> <p></p> <pre><code>Run start time: 03/06/2016 04:00:22 Error retrieving bitstream ID 71274 from asset store. 
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.&lt;init&gt;(FileInputStream.java:146) at edu.sdsc.grid.io.local.LocalFileInputStream.open(LocalFileInputStream.java:171) at edu.sdsc.grid.io.GeneralFileInputStream.&lt;init&gt;(GeneralFileInputStream.java:145) at edu.sdsc.grid.io.local.LocalFileInputStream.&lt;init&gt;(LocalFileInputStream.java:139) at edu.sdsc.grid.io.FileFactory.newFileInputStream(FileFactory.java:630) at org.dspace.storage.bitstore.BitstreamStorageManager.retrieve(BitstreamStorageManager.java:525) at org.dspace.checker.BitstreamDAO.getBitstream(BitstreamDAO.java:60) at org.dspace.checker.CheckerCommand.processBitstream(CheckerCommand.java:303) at org.dspace.checker.CheckerCommand.checkBitstream(CheckerCommand.java:171) at org.dspace.checker.CheckerCommand.process(CheckerCommand.java:120) at org.dspace.app.checker.ChecksumChecker.main(ChecksumChecker.java:236) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225) at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77) ****************************************************** </code></pre> <ul> <li>So this would be the <code>tomcat7</code> Unix user, who seems to have a default limit of 1024 files in its shell</li> <li>For what it&rsquo;s worth, we have been setting the actual Tomcat 7 process&rsquo; limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li> <li>Looks like cron will read limits from <code>/etc/security/limits.*</code> so we can do something for the tomcat7 user there</li> <li>Submit pull 
request for Tomcat 7 limits in Ansible dspace role (<a href="https://github.com/ilri/rmg-ansible-public/pull/30">#30</a>)</li> </ul> <h2 id="2016-04-05">2016-04-05</h2> <ul> <li>Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don&rsquo;t need!</li> </ul> <pre><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt # grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del # grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del # grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del # grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del </code></pre> <ul> <li>Also, adjust the cron jobs for backups so they only back up <code>dspace.log</code> and some stats files (.dat)</li> <li>Try to do some metadata field migrations using the Atmire batch UI (<code>dc.Species</code> → <code>cg.species</code>) but it took several hours and even missed a few records</li> </ul> <h2 id="2016-04-06">2016-04-06</h2> <ul> <li>A better way to move metadata on this scale is via SQL, for example <code>dc.type.output</code> → <code>dc.type</code> (their IDs in the metadatafieldregistry are 66 and 109, respectively):</li> </ul> <pre><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66; UPDATE 40852 </code></pre> <ul> <li>After that an <code>index-discovery -bf</code> is required</li> <li>Start working on metadata migrations, add 25 or so new metadata fields to CGSpace</li> </ul> <h2 id="2016-04-07">2016-04-07</h2> <ul> <li>Write shell script to do the migration of fields: <a href="https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b">https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b</a></li> <li>Testing with a few fields it seems to work well:</li> </ul> <pre><code>$ ./migrate-fields.sh UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66 UPDATE 40883 UPDATE metadatavalue SET 
metadata_field_id=202 WHERE metadata_field_id=72 UPDATE 21420 UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76 UPDATE 51258 </code></pre> <h2 id="2016-04-08">2016-04-08</h2> <ul> <li>Discuss metadata renaming with Abenet, we decided it&rsquo;s better to start with the center-specific subjects like ILRI, CIFOR, CCAFS, IWMI, and CPWF</li> <li>I&rsquo;ve e-mailed CCAFS and CPWF people to ask them how much time it will take for them to update their systems to cope with this change</li> </ul> <h2 id="2016-04-10">2016-04-10</h2> <ul> <li>Looking at the DOI issue <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860">reported by Leroy from CIAT a few weeks ago</a></li> <li>It seems the <code>dx.doi.org</code> URLs are much more common in our repository!</li> </ul> <pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%'; count ------- 5638 (1 row) dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%'; count ------- 3 </code></pre> <ul> <li>I will manually edit the <code>dc.identifier.doi</code> in <a href="https://cgspace.cgiar.org/handle/10568/72509?show=full"><sup>10568</sup>&frasl;<sub>72509</sub></a> and tweet the link, then check back in a week to see if the donut gets updated</li> </ul> <h2 id="2016-04-11">2016-04-11</h2> <ul> <li>The donut is already updated and shows the correct number now</li> <li>CCAFS people say it will only take them an hour to update their code for the metadata renames, so I proposed we&rsquo;d do it tentatively on Monday the 18th.</li> </ul> <h2 id="2016-04-12">2016-04-12</h2> <ul> <li>Looking at quality of WLE data (<code>cg.subject.iwmi</code>) in SQL:</li> </ul> <pre><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc; </code></pre> <ul> <li>Listings and Reports is still not 
returning reliable data for <code>dc.type</code></li> <li>I think we need to ask Atmire, as their documentation isn&rsquo;t too clear on the format of the filter configs</li> <li>Alternatively, I want to see if I move all the data from <code>dc.type.output</code> to <code>dc.type</code> and then re-index, if it behaves better</li> <li>Looking at our <code>input-forms.xml</code> I see we have two sets of ILRI subjects, but one has a few extra subjects</li> <li>Remove one set of ILRI subjects and remove duplicate <code>VALUE CHAINS</code> from existing list (<a href="https://github.com/ilri/DSpace/pull/216">#216</a>)</li> <li>I decided to keep the set of subjects that had <code>FMD</code> and <code>RANGELANDS</code> added, as those appear to have been added on request, so it might be the newer list</li> <li>I found 226 blank metadatavalues:</li> </ul> <pre><code>dspacetest=# select * from metadatavalue where resource_type_id=2 and text_value=''; </code></pre> <ul> <li>I think we should delete them and do a full re-index:</li> </ul> <pre><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value=''; DELETE 226 </code></pre> <ul> <li>I deleted them on CGSpace but I&rsquo;ll wait to do the re-index as we&rsquo;re going to be doing one in a few days for the metadata changes anyways</li> <li>In other news, moving the <code>dc.type.output</code> to <code>dc.type</code> and re-indexing seems to have fixed the Listings and Reports issue from above</li> <li>Unfortunately this isn&rsquo;t a very good solution, because Listings and Reports config should allow us to filter on <code>dc.type.*</code> but the documentation isn&rsquo;t very clear and I couldn&rsquo;t reach Atmire today</li> <li>We want to do the <code>dc.type.output</code> move on CGSpace anyways, but we should wait as it might affect other external people!</li> </ul> <h2 id="2016-04-14">2016-04-14</h2> <ul> <li>Communicate with Macaroni Bros again about <code>dc.type</code></li> 
<li>Help Sisay with some rsync and Linux stuff</li> <li>Notify CIAT people of metadata changes (I had forgotten them last week)</li> </ul> <h2 id="2016-04-15">2016-04-15</h2> <ul> <li>DSpace Test had crashed, so I ran all system updates, rebooted, and re-deployed DSpace code</li> </ul> <h2 id="2016-04-18">2016-04-18</h2> <ul> <li>Talk to CIAT people about their portal again</li> <li>Start looking more at the fields we want to delete</li> <li>The following metadata fields have 0 items using them, so we can just remove them from the registry and any references in XMLUI, input forms, etc: <ul> <li>dc.description.abstractother</li> <li>dc.whatwasknown</li> <li>dc.whatisnew</li> <li>dc.description.nationalpartners</li> <li>dc.peerreviewprocess</li> <li>cg.species.animal</li> </ul></li> <li>Deleted!</li> <li>The following fields have some items using them and I have to decide what to do with them (delete or move): <ul> <li>dc.icsubject.icrafsubject: 6 items, mostly in CPWF collections</li> <li>dc.type.journal: 11 items, mostly in ILRI collections</li> <li>dc.publicationcategory: 1 item, in CPWF</li> <li>dc.GRP: 2 items, CPWF</li> <li>dc.Species.animal: 6 items, in ILRI and AnGR</li> <li>cg.livestock.agegroup: 9 items, in ILRI collections</li> <li>cg.livestock.function: 20 items, mostly in EADD</li> </ul></li> <li>Test metadata migration on local instance again:</li> </ul> <pre><code>$ ./migrate-fields.sh UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66 UPDATE 40885 UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76 UPDATE 51330 UPDATE metadatavalue SET metadata_field_id=208 WHERE metadata_field_id=82 UPDATE 5986 UPDATE metadatavalue SET metadata_field_id=210 WHERE metadata_field_id=88 UPDATE 2456 UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106 UPDATE 3872 UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108 UPDATE 46075 $ JAVA_OPTS=&quot;-Xms512m -Xmx512m 
-Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace index-discovery -bf </code></pre> <ul> <li>CGSpace was down but I&rsquo;m not sure why, this was in <code>catalina.out</code>:</li> </ul> <pre><code>Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at org.dspace.rest.Resource.processFinally(Resource.java:163) at org.dspace.rest.HandleResource.getObject(HandleResource.java:81) at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391) at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) ... </code></pre> <ul> <li>Everything else in the system looked normal (50GB disk space available, nothing weird in dmesg, etc)</li> <li>After restarting Tomcat a few more of these errors were logged but the application was up</li> </ul> <h2 id="2016-04-19">2016-04-19</h2> <ul> <li>Get handles for items that are using a given metadata field, ie <code>dc.Species.animal</code> (105):</li> </ul> <pre><code># select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105); handle ------------- 10568/10298 10568/16413 10568/16774 10568/34487 </code></pre> <ul> <li>Delete metadata values for <code>dc.GRP</code> and <code>dc.icsubject.icrafsubject</code>:</li> </ul> <pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=96; # delete from metadatavalue where resource_type_id=2 and metadata_field_id=83; </code></pre> <ul> <li>They are old ICRAF fields and we haven&rsquo;t used them since 2011 or so</li> <li>Also delete them from the metadata registry</li> <li>CGSpace went down again, <code>dspace.log</code> had this:</li> </ul> <pre><code>2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object </code></pre> <ul> <li>I restarted Tomcat and PostgreSQL and now it&rsquo;s back up</li> <li>I bet this is the same crash as yesterday, but I only saw the errors in <code>catalina.out</code></li> <li>Looks to be related to this, from <code>dspace.log</code>:</li> </ul> <pre><code>2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement. 
</code></pre> <ul> <li>We have 18,000 of these errors right now&hellip;</li> <li>Delete a few more old metadata values: <code>dc.Species.animal</code>, <code>dc.type.journal</code>, and <code>dc.publicationcategory</code>:</li> </ul> <pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=105; # delete from metadatavalue where resource_type_id=2 and metadata_field_id=85; # delete from metadatavalue where resource_type_id=2 and metadata_field_id=95; </code></pre> <ul> <li>And then remove them from the metadata registry</li> </ul> <h2 id="2016-04-20">2016-04-20</h2> <ul> <li>Re-deploy DSpace Test with the new subject and type fields, run all system updates, and reboot the server</li> <li>Migrate fields and re-deploy CGSpace with the new subject and type fields, run all system updates, and reboot the server</li> <li>Field migration went well:</li> </ul> <pre><code>$ ./migrate-fields.sh UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66 UPDATE 40909 UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76 UPDATE 51419 UPDATE metadatavalue SET metadata_field_id=208 WHERE metadata_field_id=82 UPDATE 5986 UPDATE metadatavalue SET metadata_field_id=210 WHERE metadata_field_id=88 UPDATE 2458 UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106 UPDATE 3872 UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108 UPDATE 46075 </code></pre> <ul> <li>Also, I migrated CGSpace to using the PGDG PostgreSQL repo as the infrastructure playbooks had been using it for a while and it seemed to be working well</li> <li>Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)</li> <li>Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:</li> </ul> <pre><code>$ grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-20 21252 </code></pre> <ul> <li>I 
found a recent discussion on the DSpace mailing list and I&rsquo;ve asked for advice there</li> <li>Looks like this issue was noted and fixed in DSpace 5.5 (we&rsquo;re on 5.1): <a href="https://jira.duraspace.org/browse/DS-2936">https://jira.duraspace.org/browse/DS-2936</a></li> <li>I&rsquo;ve sent a message to Atmire asking about compatibility with DSpace 5.5</li> </ul> <h2 id="2016-04-21">2016-04-21</h2> <ul> <li>Fix a bunch of metadata consistency issues with IITA Journal Articles (Peer review, Formally published, messed up DOIs, etc)</li> <li>Atmire responded with DSpace 5.5 compatible versions for their modules, so I&rsquo;ll start testing those in a few weeks</li> </ul> <h2 id="2016-04-22">2016-04-22</h2> <ul> <li>Import 95 records into <a href="https://cgspace.cgiar.org/handle/10568/42219">CTA&rsquo;s Agrodok collection</a></li> </ul> <h2 id="2016-04-26">2016-04-26</h2> <ul> <li>Test embargo during item upload</li> <li>Seems to be working but the help text is misleading as to the date format</li> <li>It turns out the <code>robots.txt</code> issue we thought we solved last month isn&rsquo;t solved because you can&rsquo;t use wildcards in URL patterns: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> <li>Write some nginx rules to add <code>X-Robots-Tag</code> HTTP headers to the dynamic requests from <code>robots.txt</code> instead</li> <li>A few URLs to test with: <ul> <li><a href="https://dspacetest.cgiar.org/handle/10568/440/browse?type=bioversity">https://dspacetest.cgiar.org/handle/10568/440/browse?type=bioversity</a></li> <li><a href="https://dspacetest.cgiar.org/handle/10568/913/discover">https://dspacetest.cgiar.org/handle/10568/913/discover</a></li> <li><a 
href="https://dspacetest.cgiar.org/handle/10568/1/search-filter?filtertype_0=country&amp;filter_0=VIETNAM&amp;filter_relational_operator_0=equals&amp;field=country">https://dspacetest.cgiar.org/handle/10568/1/search-filter?filtertype_0=country&amp;filter_0=VIETNAM&amp;filter_relational_operator_0=equals&amp;field=country</a></li> </ul></li> </ul> <h2 id="2016-04-27">2016-04-27</h2> <ul> <li>I woke up to ten or fifteen &ldquo;up&rdquo; and &ldquo;down&rdquo; emails from the monitoring website</li> <li>Looks like the last one was &ldquo;down&rdquo; from about four hours ago</li> <li>I think there must be something with this REST stuff:</li> </ul> <pre><code># grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-* dspace.log.2016-04-01:0 dspace.log.2016-04-02:0 dspace.log.2016-04-03:0 dspace.log.2016-04-04:0 dspace.log.2016-04-05:0 dspace.log.2016-04-06:0 dspace.log.2016-04-07:0 dspace.log.2016-04-08:0 dspace.log.2016-04-09:0 dspace.log.2016-04-10:0 dspace.log.2016-04-11:0 dspace.log.2016-04-12:235 dspace.log.2016-04-13:44 dspace.log.2016-04-14:0 dspace.log.2016-04-15:35 dspace.log.2016-04-16:0 dspace.log.2016-04-17:0 dspace.log.2016-04-18:11942 dspace.log.2016-04-19:28496 dspace.log.2016-04-20:28474 dspace.log.2016-04-21:28654 dspace.log.2016-04-22:28763 dspace.log.2016-04-23:28773 dspace.log.2016-04-24:28775 dspace.log.2016-04-25:28626 dspace.log.2016-04-26:28655 dspace.log.2016-04-27:7271 </code></pre> <ul> <li>I restarted tomcat and it is back up</li> <li>Add Spanish XMLUI strings so those users see &ldquo;CGSpace&rdquo; instead of &ldquo;DSpace&rdquo; in the user interface (<a href="https://github.com/ilri/DSpace/pull/222">#222</a>)</li> <li>Submit patch to upstream DSpace for the misleading help text in the embargo step of the item submission: <a href="https://jira.duraspace.org/browse/DS-3172">https://jira.duraspace.org/browse/DS-3172</a></li> <li>Update infrastructure playbooks for nginx 1.10.x (stable) release: <a 
href="https://github.com/ilri/rmg-ansible-public/issues/32">https://github.com/ilri/rmg-ansible-public/issues/32</a></li> <li>Currently running on DSpace Test, we&rsquo;ll give it a few days before we adjust CGSpace</li> <li>CGSpace down, restarted tomcat and it&rsquo;s back up</li> </ul> <h2 id="2016-04-28">2016-04-28</h2> <ul> <li>Problems with stability again. I&rsquo;ve blocked access to <code>/rest</code> for now to see if the number of errors in the log files drops</li> <li>Later we could maybe start logging access to <code>/rest</code> and perhaps whitelist some IPs&hellip;</li> </ul> <h2 id="2016-04-30">2016-04-30</h2> <ul> <li>Logs for today and yesterday have zero references to this REST error, so I&rsquo;m going to open back up the REST API but log all requests</li> </ul> <pre><code>location /rest { access_log /var/log/nginx/rest.log; proxy_pass http://127.0.0.1:8443; } </code></pre> <ul> <li>I will check the logs again in a few days to look for patterns, see who is accessing it, etc</li> </ul> March, 2016 https://alanorth.github.io/cgspace-notes/2016-03/ Wed, 02 Mar 2016 16:50:00 +0300 https://alanorth.github.io/cgspace-notes/2016-03/ <h2 id="2016-03-02">2016-03-02</h2> <ul> <li>Looking at issues with author authorities on CGSpace</li> <li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li> <li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match the environment on the CGSpace server</li> </ul> <p></p> <h2 id="2016-03-07">2016-03-07</h2> <ul> <li>Troubleshooting the issues with the slew of commits for Atmire modules in <a href="https://github.com/ilri/DSpace/pull/182">#182</a></li> <li>Their changes on <code>5_x-dev</code> branch work, but it is messy as hell with merge commits and old branch base</li> <li>When I rebase their branch on the latest 
<code>5_x-prod</code> I get blank white pages</li> <li>I identified one commit that causes the issue and let them know</li> <li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li> </ul> <pre><code>Exception in thread &quot;Lucene Merge Thread #19&quot; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device </code></pre> <h2 id="2016-03-08">2016-03-08</h2> <ul> <li>Add a few new filters to Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li> <li>We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were</li> </ul> <h2 id="2016-03-10">2016-03-10</h2> <ul> <li>Disable the lucene cron job on CGSpace as it shouldn&rsquo;t be needed anymore</li> <li>Discuss ORCiD and duplicate authors on Yammer</li> <li>Request new documentation for Atmire CUA and L&amp;R modules, as ours are from 2013</li> <li>Walk Sisay through some data cleaning workflows in OpenRefine</li> <li>Start cleaning up the configuration for Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/issues/185">#184</a>)</li> <li>It is very messed up because some labels are incorrect, fields are missing, etc</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/03/cua-label-mixup.png" alt="Mixed up label in Atmire CUA" /></p> <ul> <li>Update documentation for Atmire modules</li> </ul> <h2 id="2016-03-11">2016-03-11</h2> <ul> <li>As I was looking at the CUA config I realized our Discovery config is all messed up and confusing</li> <li>I&rsquo;ve opened an issue to track some of that work (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li> <li>I did some major cleanup work on Discovery and XMLUI stuff related to the <code>dc.type</code> indexes (<a href="https://github.com/ilri/DSpace/pull/187">#187</a>)</li> <li>We had been confusing 
<code>dc.type</code> (a Dublin Core value) with <code>dc.type.output</code> (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.</li> <li>There is still some more work to be done to remove references to old <code>outputtype</code> and <code>output</code></li> </ul> <h2 id="2016-03-14">2016-03-14</h2> <ul> <li>Fix some items that had invalid dates (I noticed them in the log during a re-indexing)</li> <li>Reset <code>search.index.*</code> to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): <a href="https://github.com/ilri/DSpace/pull/188">#188</a></li> <li>Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li> <li>Also four or so center-specific subject strings were missing for Discovery</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/03/missing-xmlui-string.png" alt="Missing XMLUI string" /></p> <h2 id="2016-03-15">2016-03-15</h2> <ul> <li>Create simple theme for new AVCD community just for a unique Google Tracking ID (<a href="https://github.com/ilri/DSpace/pull/191">#191</a>)</li> </ul> <h2 id="2016-03-16">2016-03-16</h2> <ul> <li>Still having problems deploying Atmire&rsquo;s CUA updates and fixes from January!</li> <li>More discussion on the GitHub issue here: <a href="https://github.com/ilri/DSpace/pull/182">https://github.com/ilri/DSpace/pull/182</a></li> <li>Clean up Atmire CUA config (<a href="https://github.com/ilri/DSpace/pull/193">#193</a>)</li> <li>Help Sisay with some PostgreSQL queries to clean up the incorrect <code>dc.contributor.corporateauthor</code> field</li> <li>I noticed that we have some weird values in <code>dc.language</code>:</li> </ul> <pre><code># select * from metadatavalue where metadata_field_id=37; metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | 
resource_type_id -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------ 1942571 | 35342 | 37 | hi | | 1 | | -1 | 2 1942468 | 35345 | 37 | hi | | 1 | | -1 | 2 1942479 | 35337 | 37 | hi | | 1 | | -1 | 2 1942505 | 35336 | 37 | hi | | 1 | | -1 | 2 1942519 | 35338 | 37 | hi | | 1 | | -1 | 2 1942535 | 35340 | 37 | hi | | 1 | | -1 | 2 1942555 | 35341 | 37 | hi | | 1 | | -1 | 2 1942588 | 35343 | 37 | hi | | 1 | | -1 | 2 1942610 | 35346 | 37 | hi | | 1 | | -1 | 2 1942624 | 35347 | 37 | hi | | 1 | | -1 | 2 1942639 | 35339 | 37 | hi | | 1 | | -1 | 2 </code></pre> <ul> <li>It seems this <code>dc.language</code> field isn&rsquo;t really used, but we should delete these values</li> <li>Also, <code>dc.language.iso</code> has some weird values, like &ldquo;En&rdquo; and &ldquo;English&rdquo;</li> </ul> <h2 id="2016-03-17">2016-03-17</h2> <ul> <li>It turns out <code>hi</code> is the ISO 639 language code for Hindi, but these should be in <code>dc.language.iso</code> instead of <code>dc.language</code></li> <li>I fixed the eleven items with <code>hi</code> as well as some using the incorrect <code>vn</code> for Vietnamese</li> <li>Start discussing CG core with Abenet and Sisay</li> <li>Re-sync CGSpace database to DSpace Test for Atmire to do some tests about the problematic CUA patches</li> <li>The patches work fine with a clean database, so the error was caused by some mismatch in CUA versions and the database during my testing</li> </ul> <h2 id="2016-03-18">2016-03-18</h2> <ul> <li>Merge Atmire fixes into <code>5_x-prod</code></li> <li>Discuss thumbnails with Francesca from Bioversity</li> <li>Some of their items end up with thumbnails that have a big white border around them:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/03/bioversity-thumbnail-bad.jpg" alt="Excessive whitespace in thumbnail" /></p> <ul> <li>Turns out we can add <code>-trim</code> to the 
GraphicsMagick options to trim the whitespace</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/03/bioversity-thumbnail-good.jpg" alt="Trimmed thumbnail" /></p> <ul> <li>Command used:</li> </ul> <pre><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg </code></pre> <ul> <li>Also, it looks like adding <code>-sharpen 0x1.0</code> really improves the quality of the image for only a few KB</li> </ul> <h2 id="2016-03-21">2016-03-21</h2> <ul> <li>Fix 66 site errors in Google&rsquo;s webmaster tools</li> <li>I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc, so I just marked them all as fixed</li> <li>We also have 1,300 &ldquo;soft 404&rdquo; errors for URLs like: <a href="https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity">https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity</a></li> <li>I&rsquo;ve marked them as fixed as well since the ones I tested were working fine</li> <li>This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem&hellip;</li> <li>Results pages like this give items that Google already knows from the sitemap: <a href="https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A">https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A</a>.</li> <li>There are some access denied errors on JSPUI links (of course! 
we forbid them!), but I&rsquo;m not sure why Google is trying to index them&hellip;</li> <li>For example: <ul> <li>This: <a href="https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf">https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf</a></li> <li>Linked from: <a href="https://cgspace.cgiar.org/jspui/handle/10568/809">https://cgspace.cgiar.org/jspui/handle/10568/809</a></li> </ul></li> <li>I will mark these errors as resolved because they are returning HTTP 403 on purpose, for a long time!</li> <li>Google says the first time it saw this particular error was September 29, 2015&hellip; so maybe it accidentally saw it somehow&hellip;</li> <li>On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/03/google-index.png" alt="CGSpace pages in Google index" /></p> <ul> <li>Turns out this is a problem with DSpace&rsquo;s <code>robots.txt</code>, and there&rsquo;s a Jira ticket since December, 2015: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li> <li>I am not sure if I want to apply it yet</li> <li>For now I&rsquo;ve just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/03/url-parameters.png" alt="URL parameters cause millions of dynamic pages" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/03/url-parameters2.png" alt="Setting pages with the filter_0 param not to show in search results" /></p> <ul> <li>Move AVCD collection to new community and update <code>move_collection.sh</code> script: <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">https://gist.github.com/alanorth/392c4660e8b022d99dfa</a></li> <li>It seems 
Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs</li> <li>Re-deploy CGSpace with the latest <code>5_x-prod</code> branch</li> <li>Run updates on CGSpace and reboot server (new kernel, <code>4.5.0</code>)</li> <li>Deploy Let&rsquo;s Encrypt certificate for cgspace.cgiar.org, but still need to work it into the ansible playbooks</li> </ul> <h2 id="2016-03-22">2016-03-22</h2> <ul> <li>Merge robots.txt patch and disallow indexing of browse pages as our sitemap is consumed correctly (<a href="https://github.com/ilri/DSpace/issues/198">#198</a>)</li> </ul> <h2 id="2016-03-23">2016-03-23</h2> <ul> <li>Abenet is having problems saving group memberships, and she gets this error: <a href="https://gist.github.com/alanorth/87281c061c2de57b773e">https://gist.github.com/alanorth/87281c061c2de57b773e</a></li> </ul> <pre><code>Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). 
(resource://aspects/Administrative/administrative.js#967) </code></pre> <ul> <li>I can reproduce the same error on DSpace Test and on my Mac</li> <li>Looks to be an issue with the Atmire modules, I&rsquo;ve submitted a ticket to their tracker.</li> </ul> <h2 id="2016-03-24">2016-03-24</h2> <ul> <li>Atmire sent a patch for the group saving issue: <a href="https://github.com/ilri/DSpace/pull/201">https://github.com/ilri/DSpace/pull/201</a></li> <li>I tested it locally and it works, so I merged it to <code>5_x-prod</code> and will deploy on CGSpace this week</li> </ul> <h2 id="2016-03-25">2016-03-25</h2> <ul> <li>Having problems with Listings and Reports, seems to be caused by a rogue reference to <code>dc.type.output</code></li> <li>This is the error we get when we proceed to the second page of Listings and Reports: <a href="https://gist.github.com/alanorth/b2d7fb5b82f94898caaf">https://gist.github.com/alanorth/b2d7fb5b82f94898caaf</a></li> <li>Commenting out the line works, but I haven&rsquo;t figured out the proper syntax for referring to <code>dc.type.*</code></li> </ul> <h2 id="2016-03-28">2016-03-28</h2> <ul> <li>Look into enabling the embargo during item submission, see: <a href="https://wiki.duraspace.org/display/DSDOC5x/Embargo#Embargo-SubmissionProcess">https://wiki.duraspace.org/display/DSDOC5x/Embargo#Embargo-SubmissionProcess</a></li> <li>Seems we only want <code>AccessStep</code> because <code>UploadWithEmbargoStep</code> disables the ability to edit embargos at the item level</li> <li>This pull request enables the ability to set an item-level embargo during submission: <a href="https://github.com/ilri/DSpace/pull/203">https://github.com/ilri/DSpace/pull/203</a></li> <li>I figured out that the problem with Listings and Reports was because I disabled the <code>search.index.*</code> last week, and they are still used by JSPUI apparently</li> <li>This pull request re-enables them: <a 
href="https://github.com/ilri/DSpace/pull/202">https://github.com/ilri/DSpace/pull/202</a></li> <li>Re-deploy DSpace Test, run all system updates, and restart the server</li> <li>Looks like the Listings and Reports problem was NOT due to the search indexes (which are actually not used), but rather due to the filter configuration in the Listings and Reports config</li> <li>This pull request simply updates the config for the dc.type.output → dc.type change that was made last week: <a href="https://github.com/ilri/DSpace/pull/204">https://github.com/ilri/DSpace/pull/204</a></li> <li>Deploy robots.txt fix, embargo for item submissions, and listings and reports fix on CGSpace</li> </ul> <h2 id="2016-03-29">2016-03-29</h2> <ul> <li>Skype meeting with Peter and Addis team to discuss metadata changes for Dublin Core, CGcore, and CGSpace-specific fields</li> <li>We decided to proceed with some deletes first, then identify CGSpace-specific fields to clean/move to <code>cg.*</code>, and then worry about broader changes to DC</li> <li>Before we move or rename any fields we need to circulate a list of fields we intend to change to CCAFS, CWPF, etc, who might be harvesting the fields</li> <li>After all of this we need to start implementing controlled vocabularies for fields, either with the Javascript lookup or like existing ILRI subjects</li> </ul> February, 2016 https://alanorth.github.io/cgspace-notes/2016-02/ Fri, 05 Feb 2016 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2016-02/ <h2 id="2016-02-05">2016-02-05</h2> <ul> <li>Looking at some DAGRIS data for Abenet Yabowork</li> <li>Lots of issues with spaces, newlines, etc causing the import to fail</li> <li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p> <ul> <li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li> <li>Also, lots 
of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li> </ul> <p></p> <h2 id="2016-02-06">2016-02-06</h2> <ul> <li>Found a way to get items with null/empty metadata values from SQL</li> <li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li> </ul> <pre><code>dspacetest=# select * from metadatafieldregistry; </code></pre> <ul> <li>In this case our country field is 78</li> <li>Now find all resources with type 2 (item) that have null/empty values for that field:</li> </ul> <pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL); </code></pre> <ul> <li>Then you can find the handle that owns it from its <code>resource_id</code>:</li> </ul> <pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678'; </code></pre> <ul> <li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li> </ul> <pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value=''; DELETE 25 </code></pre> <ul> <li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li> <li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 &ldquo;|||&rdquo; countries are still there</li> <li>Maybe I need to do a full re-index&hellip;</li> <li>Yep! 
The full re-index seems to work.</li> <li>Process the empty countries on CGSpace</li> </ul> <h2 id="2016-02-07">2016-02-07</h2> <ul> <li>Working on cleaning up Abenet&rsquo;s DAGRIS data with OpenRefine</li> <li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape(&quot;javascript&quot;)</code> which shows whitespace characters like <code>\r\n</code>!</li> <li>For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV</li> <li>I re-import the resulting CSV and run a GREL on the date issued column: <code>value.replace(&quot;\.0&quot;, &quot;&quot;)</code></li> <li>I need to start running DSpace in Mac OS X instead of a Linux VM</li> <li>Install PostgreSQL from homebrew, then configure and import CGSpace database dump:</li> </ul> <pre><code>$ postgres -D /opt/brew/var/postgres $ createuser --superuser postgres $ createuser --pwprompt dspacetest $ createdb -O dspacetest --encoding=UNICODE dspacetest $ psql postgres postgres=# alter user dspacetest createuser; postgres=# \q $ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup $ psql postgres postgres=# alter user dspacetest nocreateuser; postgres=# \q $ vacuumdb dspacetest $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost </code></pre> <ul> <li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li> </ul> <pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig $ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT $ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest $ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui $ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai $ ln -sfv 
~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start </code></pre> <ul> <li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li> <li>For example:</li> </ul> <pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot; </code></pre> <ul> <li>After verifying that the site is working, start a full index:</li> </ul> <pre><code>$ ~/dspace/bin/dspace index-discovery -b </code></pre> <h2 id="2016-02-08">2016-02-08</h2> <ul> <li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li> <li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme&rsquo;s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/02/submit-button-ilri.png" alt="ILRI submission buttons" /> <img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/02/submit-button-drylands.png" alt="Drylands submission buttons" /></p> <h2 id="2016-02-09">2016-02-09</h2> <ul> <li>Re-sync DSpace Test with CGSpace</li> <li>Help Sisay with OpenRefine</li> <li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li> </ul> <pre><code>$ cd ~/src/git $ git clone https://github.com/letsencrypt/letsencrypt $ cd letsencrypt $ sudo service nginx stop # add port 443 to firewall rules $ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org $ sudo service nginx start $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass </code></pre> <ul> <li>We should install it in /opt/letsencrypt and then script the renewal script, but first we have to wire up some variables and template stuff based on the script here: <a 
href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li> <li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs&hellip;</li> <li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li> <li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li> <li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li> </ul> <pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space </code></pre> <ul> <li>or</li> </ul> <pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object </code></pre> <ul> <li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li> </ul> <pre><code># free -m total used free shared buffers cached Mem: 3950 3902 48 9 37 1311 -/+ buffers/cache: 2552 1397 Swap: 255 57 198 </code></pre> <ul> <li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li> </ul> <h2 id="2016-02-11">2016-02-11</h2> <ul> <li>Massaging some CIAT data in OpenRefine</li> <li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li> <li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li> </ul> <pre><code>value.split('/')[-1] </code></pre> <ul> <li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li> </ul> <pre><code>$ ./generate-thumbnails.py ciat-reports.csv Processing 64661.pdf &gt; Downloading 64661.pdf &gt; Creating thumbnail for 
64661.pdf Processing 64195.pdf &gt; Downloading 64195.pdf &gt; Creating thumbnail for 64195.pdf </code></pre> <h2 id="2016-02-12">2016-02-12</h2> <ul> <li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li> <li>A few items are using the exact same PDF</li> <li>A few items are using HTM or DOC files</li> <li>A few items link to PDFs on IFPRI&rsquo;s e-Library or Research Gate</li> <li>A few items have no item</li> <li>Also, I&rsquo;m not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li> </ul> <h2 id="2016-02-12-1">2016-02-12</h2> <ul> <li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li> <li>265 items have dirty, URL-encoded filenames:</li> </ul> <pre><code>$ ls | grep -c -E &quot;%&quot; 265 </code></pre> <ul> <li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li> <li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li> </ul> <pre><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf </code></pre> <ul> <li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li> <li>They will be deployed on CGSpace the next time I re-deploy</li> </ul> <h2 id="2016-02-16">2016-02-16</h2> <ul> <li>Turns out OpenRefine has an unescape function!</li> </ul> <pre><code>value.unescape(&quot;url&quot;) </code></pre> <ul> <li>This turns the URLs into human-readable 
versions that we can use as proper filenames</li> <li>Run web server and system updates on DSpace Test and reboot</li> <li>To merge <code>dc.identifier.url</code> and <code>dc.identifier.url[]</code>, rename the second column so it doesn&rsquo;t have the brackets, like <code>dc.identifier.url2</code></li> <li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between</li> <li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li> <li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li> <li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li> </ul> <h2 id="2016-02-17">2016-02-17</h2> <ul> <li>Re-deploy CGSpace, run all system updates, and reboot</li> <li>More work on CIAT data, cleaning and doing a last metadata-only import into DSpace Test</li> <li>SAFBuilder has a bug preventing it from processing filenames containing more than one underscore</li> <li>Need to re-process the filename column to replace multiple underscores with one: <code>value.replace(/_{2,}/, &quot;_&quot;)</code></li> </ul> <h2 id="2016-02-20">2016-02-20</h2> <ul> <li>Turns out the &ldquo;bug&rdquo; in SAFBuilder isn&rsquo;t a bug, it&rsquo;s a feature that allows you to encode extra information like the destination bundle in the filename</li> <li>Also, it seems DSpace&rsquo;s SAF import tool doesn&rsquo;t like importing filenames that have accents in them:</li> </ul> <pre><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory) </code></pre> <ul> <li>Need to rename files to have no accents or umlauts, etc&hellip;</li> 
<li>Useful custom text facet for URLs ending with &ldquo;.pdf&rdquo;: <code>value.endsWith(&quot;.pdf&quot;)</code></li>
</ul>
<h2 id="2016-02-22">2016-02-22</h2>
<ul>
<li>To change Spanish accents to ASCII in OpenRefine:</li>
</ul>
<pre><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
</code></pre>
<ul>
<li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li>
<li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li>
</ul>
<pre><code>Bitstream: tést.pdf
Bitstream: tést señora.pdf
Bitstream: tést señora alimentación.pdf
</code></pre>
<ul>
<li>Seems it could be something with the HFS+ filesystem actually, as it&rsquo;s not UTF-8 (<a href="http://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-probably-the-worst-file-system-ever.html">it&rsquo;s something like UCS-2</a>)</li>
<li>HFS+ stores filenames as a string, and filenames with accents get stored as <a href="https://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/">character+accent</a> whereas Linux&rsquo;s ext4 stores them as an array of bytes</li>
<li>Running the SAFBuilder on Mac OS X works if you&rsquo;re going to import the resulting bundle on Mac OS X, but if your DSpace is running on Linux you need to run the SAFBuilder there where the filesystem&rsquo;s encoding matches</li>
</ul>
<h2 id="2016-02-29">2016-02-29</h2>
<ul>
<li>Got notified by some CIFOR colleagues that the Google Scholar team had contacted them about CGSpace&rsquo;s incorrect ordering of authors in Google Scholar metadata</li>
<li>Turns out there is a patch, and it was merged in DSpace 5.4: <a href="https://jira.duraspace.org/browse/DS-2679">https://jira.duraspace.org/browse/DS-2679</a></li>
<li>I&rsquo;ve merged it into our <code>5_x-prod</code> branch that is currently based on DSpace 5.1</li>
<li>We found a bug when a user searches from the homepage, sorts
the results, and then tries to click &ldquo;View More&rdquo; in a sidebar facet</li>
<li>I am not sure what causes it yet, but I opened an issue for it: <a href="https://github.com/ilri/DSpace/issues/179">https://github.com/ilri/DSpace/issues/179</a></li>
<li>Having more problems with SAFBuilder on Mac OS X</li>
<li>Now it doesn&rsquo;t recognize description hints in the filename column, like: <code>test.pdf__description:Blah</code></li>
<li>But on Linux it works fine</li>
<li>Trying to test Atmire&rsquo;s series of stats and CUA fixes from January and February, but their branch history is really messy and it&rsquo;s hard to see what&rsquo;s going on</li>
<li>Rebasing their branch on top of our production branch results in a broken Tomcat, so I&rsquo;m going to tell them to fix their history and make a proper pull request</li>
<li>Looking at the filenames for the CIAT Reports, some have really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li>
<li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li>
</ul>
<pre><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
</code></pre>
<ul>
<li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li>
<li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting on the Atmire fixes for now, as the branch history is ugly</li>
</ul>
January, 2016 https://alanorth.github.io/cgspace-notes/2016-01/ Wed, 13 Jan 2016 13:18:00 +0300 https://alanorth.github.io/cgspace-notes/2016-01/
<h2 id="2016-01-13">2016-01-13</h2>
<ul>
<li>Move ILRI collection
<code>10568/12503</code> from <code>10568/27869</code> to <code>10568/27629</code> using the <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">move_collections.sh</a> script I wrote last year.</li>
<li>I realized it is only necessary to clear the Cocoon cache after moving collections (rather than reindexing), as no metadata has changed, and therefore no search or browse indexes need to be updated.</li>
<li>Update GitHub wiki for documentation of <a href="https://github.com/ilri/DSpace/wiki/Maintenance-Tasks">maintenance tasks</a>.</li>
</ul>
<p></p>
<h2 id="2016-01-14">2016-01-14</h2>
<ul>
<li>Update CCAFS project identifiers in input-forms.xml</li>
<li>Run system updates and restart the server</li>
</ul>
<h2 id="2016-01-18">2016-01-18</h2>
<ul>
<li>Change &ldquo;Extension material&rdquo; to &ldquo;Extension Material&rdquo; in input-forms.xml (a mistake that fell through the cracks when we fixed the others in the DSpace 4 era)</li>
</ul>
<h2 id="2016-01-19">2016-01-19</h2>
<ul>
<li>Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme&rsquo;s primary color (<a href="https://github.com/ilri/DSpace/issues/157">#157</a>)</li>
<li>Tweak date-based facets to show more values in drill-down ranges (<a href="https://github.com/ilri/DSpace/issues/162">#162</a>)</li>
<li>Need to remember to clear the Cocoon cache after deployment or else you don&rsquo;t see the new ranges immediately</li>
<li>Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my Twitter account</li>
<li>Altmetric&rsquo;s support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc. first.</li>
</ul>
<h2 id="2016-01-21">2016-01-21</h2>
<ul>
<li>Still waiting for my IFTTT recipe to fire, two days later</li>
<li>It looks like the Atom feed on CGSpace hasn&rsquo;t changed in two days, but there
have definitely been new items</li>
<li>The RSS feed is nearly as old, but has different old items there</li>
<li>On a hunch I cleared the Cocoon cache and now the feeds are fresh</li>
<li>Looks like there is a configuration option related to this, <code>webui.feed.cache.age</code>, which defaults to 48 hours, though I&rsquo;m not sure what relation it has to the Cocoon cache</li>
<li>In any case, we should change this cache to be something more like 6 hours, as we publish new items several times per day.</li>
<li>Work around a CSS issue with long URLs in the item view (<a href="https://github.com/ilri/DSpace/issues/172">#172</a>)</li>
</ul>
<h2 id="2016-01-25">2016-01-25</h2>
<ul>
<li>Re-deploy CGSpace and DSpace Test with latest <code>5_x-prod</code> branch</li>
<li>This included the social icon fixes/updates, date-based facet tweaks, reducing the feed cache age, and fixing a layout issue in XMLUI item view when an item had long URLs</li>
</ul>
<h2 id="2016-01-26">2016-01-26</h2>
<ul>
<li>Run nginx updates on CGSpace and DSpace Test (<a href="http://mailman.nginx.org/pipermail/nginx/2016-January/049700.html">1.8.1 and 1.9.10, respectively</a>)</li>
<li>Run updates on DSpace Test and reboot for new Linode kernel <code>Linux 4.4.0-x86_64-linode63</code> (first update in months)</li>
</ul>
<h2 id="2016-01-28">2016-01-28</h2>
<ul>
<li>Start looking at importing some Bioversity data that had been prepared earlier this week</li>
<li><p>While checking the data I noticed something strange: there are 79 items but only 8 unique PDFs:</p>
<pre><code>$ ls SimpleArchiveForBio/ | wc -l
79
$ find SimpleArchiveForBio/ -iname &quot;*.pdf&quot; -exec basename {} \; | sort -u | wc -l
8
</code></pre></li>
</ul>
<h2 id="2016-01-29">2016-01-29</h2>
<ul>
<li>Add five missing center-specific subjects to XMLUI item view (<a href="https://github.com/ilri/DSpace/issues/174">#174</a>)</li>
<li>This <a href="https://cgspace.cgiar.org/handle/10568/67062">CCAFS item</a>, before:</li>
</ul>
<p><img
src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/01/xmlui-subjects-before.png" alt="XMLUI subjects before" /></p> <ul> <li>After:</li> </ul> <p><img src="https://alanorth.github.io/cgspace-notes/cgspace-notes/2016/01/xmlui-subjects-after.png" alt="XMLUI subjects after" /></p>