cgspace-notes/docs/2017-06/index.html

350 lines
16 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="June, 2017" />
<meta property="og:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-06/" />
<meta property="article:published_time" content="2017-06-01T10:14:52&#43;03:00"/>
<meta property="article:modified_time" content="2018-03-09T22:10:33&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.55.5" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "June, 2017",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2017-06\/",
"wordCount": "1261",
"datePublished": "2017-06-01T10:14:52\x2b03:00",
"dateModified": "2018-03-09T22:10:33\x2b02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-06/">
<title>June, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2017-06/">June, 2017</a></h2>
<p class="blog-post-meta"><time datetime="2017-06-01T10:14:52&#43;03:00">Thu Jun 01, 2017</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-06-01">2017-06-01</h2>
<ul>
<li>After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes</li>
<li>The <code>cg.identifier.wletheme</code> field will be used for both Phase I and Phase II Research Themes</li>
<li>Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there</li>
<li>The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo;</li>
<li>Tagged all items in the current Phase I collections with their appropriate themes</li>
<li>Create pull request to add Phase II research themes to the submission form: <a href="https://github.com/ilri/DSpace/pull/328">#328</a></li>
<li>Add <code>cg.subject.system</code> to CGSpace metadata registry, for subject from the upcoming CGIAR Library migration</li>
</ul>
<h2 id="2017-06-04">2017-06-04</h2>
<ul>
<li>After adding <code>cg.identifier.wletheme</code> to 1106 WLE items I can see the field on XMLUI but not in REST!</li>
<li>Strangely it happens on DSpace Test AND on CGSpace!</li>
<li>I tried to re-index Discovery but it didn&rsquo;t fix it</li>
<li>Run all system updates on DSpace Test and reboot the server</li>
<li>After rebooting the server (and therefore restarting Tomcat) the new metadata field is available</li>
<li>I&rsquo;ve sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket</li>
</ul>
<h2 id="2016-06-05">2016-06-05</h2>
<ul>
<li>Rename WLE&rsquo;s &ldquo;Research Themes&rdquo; sub-community to &ldquo;WLE Phase I Research Themes&rdquo; on DSpace Test so Macaroni Bros can continue their testing</li>
<li>Macaroni Bros tested it and said it&rsquo;s fine, so I renamed it on CGSpace as well</li>
<li>Working on how to automate the extraction of the CIAT Book chapters, doing some magic in OpenRefine to extract page fromto from cg.identifier.url and dc.format.extent, respectively:
<ul>
<li>cg.identifier.url: <code>value.split(&quot;page=&quot;, &quot;&quot;)[1]</code></li>
<li>dc.format.extent: <code>value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[1].toNumber() - value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[0].toNumber()</code></li>
</ul></li>
<li>Finally, after some filtering to see which small outliers there were (based on dc.format.extent using &ldquo;p. 1-14&rdquo; vs &ldquo;29 p.&rdquo;), create a new column with last page number:
<ul>
<li><code>cells[&quot;dc.page.from&quot;].value.toNumber() + cells[&quot;dc.format.pages&quot;].value.toNumber()</code></li>
</ul></li>
<li>Then create a new, unique file name to be used in the output, based on a SHA1 of the dc.title and with a description:
<ul>
<li>dc.page.to: <code>value.split(&quot; &quot;)[0].replace(&quot;,&quot;,&quot;&quot;).toLowercase() + &quot;-&quot; + sha1(value).get(1,9) + &quot;.pdf__description:&quot; + cells[&quot;dc.type&quot;].value</code></li>
</ul></li>
<li>Start processing 769 records after filtering the following (there are another 159 records that have some other format, or for example they have their own PDF which I will process later), using a modified <code>generate-thumbnails.py</code> script to read certain fields and then pass to GhostScript:
<ul>
<li>cg.identifier.url: <code>value.contains(&quot;page=&quot;)</code></li>
<li>dc.format.extent: <code>or(value.contains(&quot;p. &quot;),value.contains(&quot; p.&quot;))</code></li>
<li>Command like: <code>$ gs -dNOPAUSE -dBATCH -dFirstPage=14 -dLastPage=27 -sDEVICE=pdfwrite -sOutputFile=beans.pdf -f 12605-1.pdf</code></li>
</ul></li>
<li>17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF</li>
<li><p>I&rsquo;ve flagged them and proceeded without them (752 total) on DSpace Test:</p>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
</code></pre></li>
<li><p>I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)</p></li>
<li><p>Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT</p></li>
<li><p>Restart Tomcat on CGSpace so that the <code>cg.identifier.wletheme</code> field is available on REST API for Macaroni Bros</p></li>
</ul>
<h2 id="2017-06-07">2017-06-07</h2>
<ul>
<li>Testing <a href="https://github.com/ilri/DSpace/pull/319">Atmire&rsquo;s patch for the CUA Workflow Statistics again</a></li>
<li>Still doesn&rsquo;t seem to give results I&rsquo;d expect, like there are no results for Maria Garruccio, or for the ILRI community!</li>
<li>Then I&rsquo;ll file an update to the issue on Atmire&rsquo;s tracker</li>
<li>Created a new branch with just the relevant changes, so I can send it to them</li>
<li><p>One thing I noticed is that there is a failed database migration related to CUA:</p>
<pre><code>+----------------+----------------------------+---------------------+---------+
| Version | Description | Installed on | State |
+----------------+----------------------------+---------------------+---------+
| 1.1 | Initial DSpace 1.1 databas | | PreInit |
| 1.2 | Upgrade to DSpace 1.2 sche | | PreInit |
| 1.3 | Upgrade to DSpace 1.3 sche | | PreInit |
| 1.3.9 | Drop constraint for DSpace | | PreInit |
| 1.4 | Upgrade to DSpace 1.4 sche | | PreInit |
| 1.5 | Upgrade to DSpace 1.5 sche | | PreInit |
| 1.5.9 | Drop constraint for DSpace | | PreInit |
| 1.6 | Upgrade to DSpace 1.6 sche | | PreInit |
| 1.7 | Upgrade to DSpace 1.7 sche | | PreInit |
| 1.8 | Upgrade to DSpace 1.8 sche | | PreInit |
| 3.0 | Upgrade to DSpace 3.x sche | | PreInit |
| 4.0 | Initializing from DSpace 4 | 2015-11-20 12:42:52 | Success |
| 5.0.2014.08.08 | DS-1945 Helpdesk Request a | 2015-11-20 12:42:53 | Success |
| 5.0.2014.09.25 | DS 1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2014.09.26 | DS-1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2015.01.27 | MigrateAtmireExtraMetadata | 2015-11-20 12:43:29 | Success |
| 5.0.2017.04.28 | CUA eperson metadata migra | 2017-06-07 11:07:28 | OutOrde |
| 5.5.2015.12.03 | Atmire CUA 4 migration | 2016-11-27 06:39:05 | OutOrde |
| 5.5.2015.12.03 | Atmire MQM migration | 2016-11-27 06:39:06 | OutOrde |
| 5.6.2016.08.08 | CUA emailreport migration | 2017-01-29 11:18:56 | OutOrde |
+----------------+----------------------------+---------------------+---------+
</code></pre></li>
<li><p>Merge the pull request for <a href="https://github.com/ilri/DSpace/pull/328">WLE Phase II themes</a></p></li>
</ul>
<h2 id="2017-06-18">2017-06-18</h2>
<ul>
<li>Redeploy CGSpace with latest changes from <code>5_x-prod</code>, run system updates, and reboot the server</li>
<li>Continue working on ansible infrastructure changes for CGIAR Library</li>
</ul>
<h2 id="2017-06-20">2017-06-20</h2>
<ul>
<li>Import Abenet and Peter&rsquo;s changes to the CGIAR Library CRP community</li>
<li>Due to them using Windows and renaming some columns there were formatting, encoding, and duplicate metadata value issues</li>
<li>I had to remove some fields from the CSV and rename some back to, ie, <code>dc.subject[en_US]</code> just so DSpace would detect changes properly</li>
<li>Now it looks much better: <a href="https://dspacetest.cgiar.org/handle/10947/2517">https://dspacetest.cgiar.org/handle/10947/2517</a></li>
<li>Removing the HTML tags and HTML/XML entities using the following GREL:
<ul>
<li><code>replace(value,/&lt;\/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;/,'')</code></li>
<li><code>value.unescape(&quot;html&quot;).unescape(&quot;xml&quot;)</code></li>
</ul></li>
<li><p>Finally import 914 CIAT Book Chapters to CGSpace in two batches:</p>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &amp;&gt; /tmp/ciat-books2.log
</code></pre></li>
</ul>
<h2 id="2017-06-25">2017-06-25</h2>
<ul>
<li>WLE has said that one of their Phase II research themes is being renamed from <code>Regenerating Degraded Landscapes</code> to <code>Restoring Degraded Landscapes</code></li>
<li>Pull request with the changes to <code>input-forms.xml</code>: <a href="https://github.com/ilri/DSpace/pull/329">#329</a></li>
<li><p>As of now it doesn&rsquo;t look like there are any items using this research theme so we don&rsquo;t need to do any updates:</p>
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
text_value
------------
(0 rows)
</code></pre></li>
<li><p>Marianne from WLE asked if they can have both Phase I and II research themes together in the item submission form</p></li>
<li><p>Perhaps we can add them together in the same question for <code>cg.identifier.wletheme</code></p></li>
</ul>
<h2 id="2017-06-30">2017-06-30</h2>
<ul>
<li><p>CGSpace went down briefly, I see lots of these errors in the dspace logs:</p>
<pre><code>Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre></li>
<li><p>After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load</p></li>
<li><p>Might be a good time to adjust DSpace&rsquo;s database connection settings, like I first mentioned in April, 2017 after reading the <a href="https://wiki.duraspace.org/display/cmtygp/DCAT+Meeting+April+2017">2017-04 DCAT comments</a></p></li>
<li><p>I&rsquo;ve adjusted the following in CGSpace&rsquo;s config:</p>
<ul>
<li><code>db.maxconnections</code> 30→70 (the default PostgreSQL config allows 100 connections, so DSpace&rsquo;s default of 30 is quite low)</li>
<li><code>db.maxwait</code> 5000→10000</li>
<li><code>db.maxidle</code> 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)</li>
</ul></li>
<li><p>We will need to adjust this again (as well as the <code>pg_hba.conf</code> settings) when we deploy tsega&rsquo;s REST API</p></li>
<li><p>Whip up a test for Marianne of WLE to be able to show both their Phase I and II research themes in the CGSpace item submission form:</p></li>
</ul>
<p><img src="/cgspace-notes/2017/06/wle-theme-test-a.png" alt="Test A for displaying the Phase I and II research themes" />
<img src="/cgspace-notes/2017/06/wle-theme-test-b.png" alt="Test B for displaying the Phase I and II research themes" /></p>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2019-05/">May, 2019</a></li>
<li><a href="/cgspace-notes/posts/">Posts</a></li>
<li><a href="/cgspace-notes/2019-05/">May, 2019</a></li>
<li><a href="/cgspace-notes/2019-04/">April, 2019</a></li>
<li><a href="/cgspace-notes/2019-03/">March, 2019</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>