<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="February, 2017" />
<meta property="og:description" content="2017-02-07
An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------&#43;---------------&#43;---------
92551 | 313 | 80278
92550 | 313 | 80278
90774 | 1051 | 80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-02/" />
<meta property="article:published_time" content="2017-02-07T07:04:52-08:00"/>
<meta property="article:modified_time" content="2017-02-07T07:04:52-08:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2017"/>
<meta name="twitter:description" content="2017-02-07
An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------&#43;---------------&#43;---------
92551 | 313 | 80278
92550 | 313 | 80278
90774 | 1051 | 80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.18.1" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "February, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-02/",
"wordCount": "1595",
"datePublished": "2017-02-07T07:04:52-08:00",
"dateModified": "2017-02-07T07:04:52-08:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
}
,
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-02/">
<title>February, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-qRVpIj9hSzsBhmO8Y7YEKF2UFra2sJQtl9V/uFKKDvy&#43;Wjh9zgTku6VRgT8YdPoD" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2017-02/">February, 2017</a></h2>
<p class="blog-post-meta"><time datetime="2017-02-07T07:04:52-08:00">Tue Feb 07, 2017</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-02-07">2017-02-07</h2>
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre><code>dspace=# select * from collection2item where item_id = '80278';
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
92550 | 313 | 80278
90774 | 1051 | 80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
</code></pre>
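<p>A minimal sketch of how the duplicate mapping above could be spotted programmatically. The rows are copied from the query output; the function name and the choice to keep the lowest <code>id</code> of each pair are assumptions for illustration, not something DSpace does itself.</p>

```python
from collections import defaultdict

# Rows copied from the psql output above: (id, collection_id, item_id)
rows = [
    (92551, 313, 80278),
    (92550, 313, 80278),
    (90774, 1051, 80278),
]

def find_duplicate_mappings(rows):
    """Return ids of redundant mappings, keeping the lowest id per (collection, item) pair."""
    ids_by_pair = defaultdict(list)
    for mapping_id, collection_id, item_id in rows:
        ids_by_pair[(collection_id, item_id)].append(mapping_id)
    redundant = []
    for ids in ids_by_pair.values():
        # Assumption: keep the oldest (lowest) id and flag the rest for deletion
        redundant.extend(sorted(ids)[1:])
    return sorted(redundant)

print(find_duplicate_mappings(rows))
```

<p>With the three rows above this flags only <code>92551</code>, the same mapping that was deleted by hand.</p>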
<ul>
<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li>
<li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
</ul>
<p></p>
<h2 id="2017-02-08">2017-02-08</h2>
<ul>
<li>We also need to rename some of the CCAFS Phase I flagships:
<ul>
<li>CLIMATE-SMART AGRICULTURAL PRACTICES → CLIMATE-SMART TECHNOLOGIES AND PRACTICES</li>
<li>CLIMATE RISK MANAGEMENT → CLIMATE SERVICES AND SAFETY NETS</li>
<li>LOW EMISSIONS AGRICULTURE → LOW EMISSIONS DEVELOPMENT</li>
<li>POLICIES AND INSTITUTIONS → PRIORITIES AND POLICIES FOR CSA</li>
</ul></li>
<li>The climate risk management one doesn&rsquo;t exist, so I will have to ask Magdalena if they want me to add it to the input forms</li>
<li>Start testing the nearly 500 author corrections that CCAFS sent me:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
</code></pre>
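<p>A rough sketch of the kind of CSV such a corrections run works from, assuming <code>-f</code> names the column holding the existing value and <code>-t</code> the column holding the corrected one. The sample row and the <code>load_corrections</code> helper are hypothetical; this is not the code of <code>fix-metadata-values.py</code> itself.</p>

```python
import csv
import io

# Hypothetical CSV contents; the real file sent by CCAFS is not shown in these notes
CSV_TEXT = """dc.contributor.author,correct name
"Doe, J.","Doe, Jane"
"Roe, Richard","Roe, Richard"
"""

def load_corrections(csv_text, from_column, to_column):
    """Map each original value to its corrected value, skipping rows that change nothing."""
    corrections = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[from_column] != row[to_column]:
            corrections[row[from_column]] = row[to_column]
    return corrections

corrections = load_corrections(CSV_TEXT, "dc.contributor.author", "correct name")
print(corrections)
```

<p>Reviewing the resulting mapping before touching the database is a cheap way to catch rows where the two columns were swapped or left identical.</p>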
<h2 id="2017-02-09">2017-02-09</h2>
<ul>
<li>More work on CCAFS Phase II stuff</li>
<li>Looks like simply adding a new metadata field to <code>dspace/config/registries/cgiar-types.xml</code> and restarting DSpace causes the field to get added to the registry</li>
<li>It requires a restart but at least it allows you to manage the registry programmatically</li>
<li>It&rsquo;s not a very good way to manage the registry, though: removing a field there doesn&rsquo;t cause it to be removed from the registry, and because we always restore from database backups there would never be a scenario where we needed these to be created</li>
<li>Testing some corrections on CCAFS Phase II flagships (<code>cg.subject.ccafs</code>):</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre>
<h2 id="2017-02-10">2017-02-10</h2>
<ul>
<li>CCAFS said they want to wait on the flagship updates (<code>cg.subject.ccafs</code>) on CGSpace, perhaps for a month or so</li>
<li>Help Marianne Gadeberg (WLE) with some user permissions as it seems she had previously been using a personal email account, and is now on a CGIAR one</li>
<li>I manually added her new account to ~25 authorizations that her old user was on</li>
</ul>
<h2 id="2017-02-14">2017-02-14</h2>
<ul>
<li>Add <code>SCALING</code> to ILRI subjects (<a href="https://github.com/ilri/DSpace/pull/304">#304</a>), as Sisay&rsquo;s attempts were all sloppy</li>
<li>Cherry pick some patches from the DSpace 5.7 branch:
<ul>
<li>DS-3363 CSV import error says &ldquo;row&rdquo;, means &ldquo;column&rdquo;: f7b6c83e991db099003ee4e28ca33d3c7bab48c0</li>
<li>DS-3479 avoid adding empty metadata values during import: 329f3b48a6de7fad074d825fd12118f7e181e151</li>
<li>[DS-3456] 5x Clarify command line options for statisics import/export tools (#1623): 567ec083c8a94eb2bcc1189816eb4f767745b278</li>
<li>[DS-3458]5x Allow Shard Process to Append to an existing repo: 3c8ecb5d1fd69a1dcfee01feed259e80abbb7749</li>
</ul></li>
<li>I still need to test these, especially the last two, which change some stuff with Solr maintenance</li>
</ul>
<h2 id="2017-02-15">2017-02-15</h2>
<ul>
<li>Update rvm on DSpace Test and CGSpace as there was a <a href="https://github.com/justinsteven/advisories/blob/master/2017_rvm_cd_command_execution.md">security disclosure about versions less than 1.28.0</a></li>
</ul>
<h2 id="2017-02-16">2017-02-16</h2>
<ul>
<li>Looking at memory info from munin on CGSpace:</li>
</ul>
<p><img src="/cgspace-notes/2017/02/meminfo_phisical-week.png" alt="CGSpace meminfo" /></p>
<ul>
<li>We are using only ~8GB of RAM for applications, and 16GB for caches!</li>
<li>The Linode machine we&rsquo;re on has 24GB of RAM but only because that&rsquo;s the only instance that had enough disk space for us (384GB)&hellip;</li>
<li>We should probably look into Google Compute Engine or Digital Ocean where we can get more storage without having to follow a linear increase in instance pricing for CPU/memory as well</li>
<li>Especially because we only use 2 out of 8 CPUs basically:</li>
</ul>
<p><img src="/cgspace-notes/2017/02/cpu-week.png" alt="CGSpace CPU" /></p>
<ul>
<li>Fix issue with a duplicate declaration in atmire-dspace-xmlui <code>pom.xml</code> (causing non-fatal warnings during the maven build)</li>
<li>Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site&rsquo;s properties file:</li>
</ul>
<pre><code>handle.canonical.prefix = https://hdl.handle.net/
</code></pre>
<ul>
<li>And then a SQL command to update existing records:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
UPDATE 58193
</code></pre>
<ul>
<li>Seems to work fine!</li>
<li>I noticed a few items that have incorrect DOI links (<code>dc.identifier.doi</code>), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:</li>
</ul>
<pre><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
</code></pre>
<ul>
<li>This will replace any that begin with <code>10.</code> and change them to <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
</code></pre>
<ul>
<li>This will get any that begin with <code>doi:10.</code> and change them to <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
</code></pre>
<ul>
<li>Fix DOIs like <code>dx.doi.org/10.</code> to be <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^(dx\.doi\.org/.+)$', 'https://\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
</code></pre>
<ul>
<li>Fix DOIs like <code>http//</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx\.doi\.org/.+)$', 'https://\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
</code></pre>
<ul>
<li>Fix DOIs like <code>dx.doi.org./</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^dx\.doi\.org\./(.+)$', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%';
</code></pre>
<ul>
<li>Delete some invalid DOIs:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
</code></pre>
<ul>
<li>Fix some other random outliers:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052';
</code></pre>
<ul>
<li>And do another round of <code>http://</code> → <code>https://</code> cleanups:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
</code></pre>
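<p>The pattern-based DOI corrections above can be summarized in one small function. This is only a sketch under stated assumptions: the rule list mirrors the prefixes handled by the SQL statements, the <code>normalize_doi</code> name is made up here, and <code>10.1000/example</code> is a placeholder DOI rather than a value from the database.</p>

```python
import re

# Ordered (pattern, replacement) rules mirroring the SQL corrections above
DOI_RULES = [
    (r"^(10\..+)$", r"https://dx.doi.org/\1"),                # bare DOI
    (r"^doi:(10\..+)$", r"https://dx.doi.org/\1"),            # doi: prefix
    (r"^dx\.doi\.org\.?/(.+)$", r"https://dx.doi.org/\1"),    # missing scheme, optional stray dot
    (r"^http//dx\.doi\.org/(.+)$", r"https://dx.doi.org/\1"), # missing colon
    (r"^http://dx\.doi\.org/(.+)$", r"https://dx.doi.org/\1"),# plain HTTP
]

def normalize_doi(value):
    """Return value rewritten by the first matching rule, or unchanged if none match."""
    for pattern, replacement in DOI_RULES:
        if re.match(pattern, value):
            return re.sub(pattern, replacement, value)
    return value

for raw in ["10.1000/example", "doi:10.1000/example", "dx.doi.org./10.1000/example"]:
    print(f"{raw} becomes {normalize_doi(raw)}")
```

<p>Values that match no rule, such as the one-off outliers fixed by hand above, pass through unchanged and would still need manual review.</p>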
<ul>
<li>Run all DOI corrections on CGSpace</li>
<li>Something to think about here is to write a <a href="https://wiki.duraspace.org/display/DSDOC5x/Curation+System#CurationSystem-ScriptedTasks">Curation Task</a> in Java to do these sanity checks / corrections every night</li>
<li>Then we could add a cron job for them and run them from the command line like:</li>
</ul>
<pre><code>[dspace]/bin/dspace curate -t noop -i 10568/79891
</code></pre>
<h2 id="2017-02-20">2017-02-20</h2>
<ul>
<li>Run all system updates on DSpace Test and reboot the server</li>
<li>Run CCAFS author corrections on DSpace Test and CGSpace and force a full discovery reindex</li>
<li>Fix label of CCAFS subjects in Atmire Listings and Reports module</li>
<li>Help Sisay with SQL commands</li>
<li>Help Paola from CCAFS with the Atmire Listings and Reports module</li>
<li>Testing the <code>fix-metadata-values.py</code> script on macOS and it seems like we don&rsquo;t need to use <code>.encode('utf-8')</code> anymore when printing strings to the screen</li>
<li>It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string &ldquo;Entwicklung &amp; Ländlicher Raum&rdquo; without the <code>encode()</code> call, and print it as a bytes literal when it <em>is</em> used:</li>
</ul>
<pre><code>$ python
Python 3.6.0 (default, Dec 25 2016, 17:30:53)
&gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum')
Entwicklung &amp; Ländlicher Raum
&gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum'.encode())
b'Entwicklung &amp; L\xc3\xa4ndlicher Raum'
</code></pre>
<ul>
<li>So for now I will remove the encode call from the script. It was never used in the versions on the Linux hosts, which leads me to believe it really <em>was</em> a temporary problem, perhaps due to macOS or the Python build I was using.</li>
</ul>
<h2 id="2017-02-21">2017-02-21</h2>
<ul>
<li>Testing regenerating PDF thumbnails, like I started in 2016-11</li>
<li>It seems there is a bug in <code>filter-media</code> that causes it to process formats that aren&rsquo;t part of its configuration:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &quot;ImageMagick PDF Thumbnail&quot;
File: earlywinproposal_esa_postharvest.pdf.jpg
FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
File: postHarvest.jpg.jpg
FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
</code></pre>
<ul>
<li>According to <code>dspace.cfg</code> the ImageMagick PDF Thumbnail plugin should only process PDFs:</li>
</ul>
<pre><code>filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
</code></pre>
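<p>To make the expected behavior concrete, here is a small sketch that parses the two property lines above and checks whether a given format is listed for a filter. The <code>parse_input_formats</code> and <code>should_process</code> helpers are hypothetical and only model what the configuration implies; they are not how <code>filter-media</code> is implemented.</p>

```python
# The two property lines quoted above from dspace.cfg
CONFIG = """
filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
"""

def parse_input_formats(config_text):
    """Map each filter class name to the set of format names configured for it."""
    formats = {}
    for line in config_text.strip().splitlines():
        key, _, value = line.partition("=")
        key = key.strip()
        if not key.endswith(".inputFormats"):
            continue
        # Strip the "filter." prefix and ".inputFormats" suffix to get the class name
        class_name = key[len("filter."):-len(".inputFormats")]
        formats[class_name] = {name.strip() for name in value.split(",")}
    return formats

def should_process(formats, class_name, bitstream_format):
    """Hypothetical check: a filter should only handle formats listed for it."""
    return bitstream_format in formats.get(class_name, set())

formats = parse_input_formats(CONFIG)
pdf_filter = "org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter"
print(should_process(formats, pdf_filter, "Adobe PDF"))
print(should_process(formats, pdf_filter, "JPEG"))
```

<p>Under this reading of the configuration, a JPEG bitstream such as <code>postHarvest.jpg</code> should not be handled by the PDF thumbnail filter.</p>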
<ul>
<li>I&rsquo;ve sent a message to the mailing list and might file a Jira issue</li>
<li>Ask Atmire about the failed interpolation of the <code>dspace.internalUrl</code> variable in <code>atmire-cua.cfg</code></li>
</ul>
<h2 id="2017-02-22">2017-02-22</h2>
<ul>
<li>Atmire said I can add <code>dspace.internalUrl</code> to my build properties and the error will go away</li>
<li>It should be the local URL for accessing Tomcat from the server&rsquo;s own perspective, i.e. <a href="http://localhost:8080">http://localhost:8080</a></li>
</ul>
<h2 id="2017-02-26">2017-02-26</h2>
<ul>
<li>Find all fields with &ldquo;<a href="http://hdl.handle.net">http://hdl.handle.net</a>&rdquo; values (most are in <code>dc.identifier.uri</code>, but some are in other URL-related fields like <code>cg.link.reference</code>, <code>cg.identifier.dataurl</code>, and <code>cg.identifier.url</code>):</li>
</ul>
<pre><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
UPDATE 58633
</code></pre>
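<p>For previewing the effect of the replacement above outside the database, the same prefix rewrite can be sketched in a few lines. The <code>upgrade_handle_url</code> name is an assumption for illustration; the handle used is the one from the curation example earlier in these notes.</p>

```python
HTTP_PREFIX = "http://hdl.handle.net"
HTTPS_PREFIX = "https://hdl.handle.net"

def upgrade_handle_url(value):
    """Rewrite a leading http://hdl.handle.net to https://, leaving other values alone."""
    if value.startswith(HTTP_PREFIX):
        return HTTPS_PREFIX + value[len(HTTP_PREFIX):]
    return value

print(upgrade_handle_url("http://hdl.handle.net/10568/79891"))
print(upgrade_handle_url("https://hdl.handle.net/10568/79891"))
```

<p>Only values that start with the HTTP prefix are touched, which matches the <code>like 'http://hdl.handle.net%'</code> condition in the SQL.</p>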
<ul>
<li>This works but I&rsquo;m thinking I&rsquo;ll wait on the replacement as there are perhaps some other places that rely on <code>http://hdl.handle.net</code> (grep the code, it&rsquo;s scary how many things are hard coded)</li>
<li>Send message to dspace-tech mailing list with concerns about this</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 offset-sm-1 blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2017-02/">February, 2017</a></li>
<li><a href="/cgspace-notes/2017-01/">January, 2017</a></li>
<li><a href="/cgspace-notes/2016-12/">December, 2016</a></li>
<li><a href="/cgspace-notes/2016-11/">November, 2016</a></li>
<li><a href="/cgspace-notes/2016-10/">October, 2016</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>