<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2018" />
<meta property="og:description" content="2018-05-01
I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface:
http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E
http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144m back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-05/" />
<meta property="article:published_time" content="2018-05-01T16:43:54&#43;03:00"/>
<meta property="article:modified_time" content="2018-05-06T17:13:31&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2018"/>
<meta name="twitter:description" content="2018-05-01
I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface:
http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E
http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144m back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.40.2" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-05/",
"wordCount": "841",
"datePublished": "2018-05-01T16:43:54&#43;03:00",
"dateModified": "2018-05-06T17:13:31&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-05/">
<title>May, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-ZwlQQbzhEPf3PrXZ3h/XKT/4UHafQ/TYI72AL&#43;7WOJ8D6JmpGC8JMMse6xX7cyeI" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-05/">May, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-05-01T16:43:54&#43;03:00">Tue May 01, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2018-05-01">2018-05-01</h2>
<ul>
<li>I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface:
<ul>
<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</a></li>
<li><a href="http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E">http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</a></li>
</ul></li>
<li>Then I reduced the JVM heap size from 6144m back to 5120m</li>
<li>Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to support hosts choosing which distribution they want to use</li>
</ul>
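<p>For reference, the <code>stream.body</code> parameter in those two URLs is just percent-encoded Solr update XML: a delete-by-query matching <code>*:*</code> (every document), followed by a commit. A minimal Python sketch, using only the standard library, that decodes the parameter to show what was sent (the helper name <code>decode_stream_body</code> is made up for illustration):</p>
<pre><code>from urllib.parse import parse_qs, urlsplit

# The two URLs from the notes above, copied verbatim
SOLR_URLS = [
    "http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E",
    "http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E",
]


def decode_stream_body(url):
    """Return the percent-decoded value of the stream.body query parameter."""
    query = urlsplit(url).query
    return parse_qs(query)["stream.body"][0]


if __name__ == "__main__":
    for url in SOLR_URLS:
        # Prints the delete-by-query first, then the commit
        print(decode_stream_body(url))
</code></pre>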
<p></p>
<h2 id="2018-05-02">2018-05-02</h2>
<ul>
<li>Advise Fabio Fidanza about integrating CGSpace content in the new CGIAR corporate website</li>
<li>I think they can mostly rely on using the <code>cg.contributor.crp</code> field</li>
<li>Looking over some IITA records for Sisay
<ul>
<li>Other than trimming and collapsing consecutive whitespace, I made some other corrections</li>
<li>I need to check the correct formatting of COTE D&rsquo;IVOIRE vs COTE DIVOIRE</li>
<li>I replaced all DOIs with HTTPS</li>
<li>I checked a few DOIs and found at least one that was missing, so I Googled the title of the paper and found the correct DOI</li>
<li>Also, I found an <a href="https://www.doi.org/factsheets/DOI_PURL.html">FAQ for DOI that says the <code>dx.doi.org</code> syntax is older</a>, so I will replace all the DOIs with <code>doi.org</code> instead</li>
<li>I found five records with &ldquo;ISI Jounal&rdquo; instead of &ldquo;ISI Journal&rdquo;</li>
<li>I found one item with IITA subject &ldquo;.&rdquo;</li>
<li>Need to remember to check the facets for things like this in sponsorship:</li>
<li>Deutsche Gesellschaft für Internationale Zusammenarbeit</li>
<li>Deutsche Gesellschaft fur Internationale Zusammenarbeit</li>
<li>Eight records with language &ldquo;fn&rdquo; instead of &ldquo;fr&rdquo;</li>
<li>One incorrect type (lowercase &ldquo;proceedings&rdquo;): Conference proceedings</li>
<li>Found some capitalized CRPs in <code>cg.contributor.crp</code></li>
<li>Found some incorrect author affiliations, i.e. &ldquo;Institut de Recherche pour le Developpement Agricolc&rdquo; should be &ldquo;Institut de Recherche pour le Developpement <em>Agricole</em>&rdquo;</li>
<li>Wow, and for sponsors there are the following:</li>
<li>Incorrect: Flemish Agency for Development Cooperation and Technical Assistance</li>
<li>Incorrect: Flemish Organization for Development Cooperation and Technical Assistance</li>
<li>Correct: Flemish <em>Association</em> for Development Cooperation and Technical Assistance</li>
<li>One item had region &ldquo;WEST&rdquo; (I corrected it to &ldquo;WEST AFRICA&rdquo;)</li>
</ul></li>
</ul>
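<p>The whitespace cleanup and the switch from <code>dx.doi.org</code> to HTTPS <code>doi.org</code> links described above could be sketched in Python roughly like this. It is only an illustration, not the OpenRefine steps I actually used; the function name and the sample DOI are made up:</p>
<pre><code>import re

# Matches an optional scheme plus the legacy or current DOI resolver host
DOI_PREFIX = re.compile(r"^(?:https?://)?(?:dx\.)?doi\.org/", re.IGNORECASE)


def normalize_doi_url(value):
    """Trim whitespace and rewrite a DOI link to the https://doi.org/ form."""
    value = " ".join(value.split())  # trim and collapse consecutive whitespace
    if DOI_PREFIX.match(value):
        return DOI_PREFIX.sub("https://doi.org/", value, count=1)
    return value  # leave anything unrecognized untouched for manual review


if __name__ == "__main__":
    # 10.1234/example is a made-up DOI, for illustration only
    print(normalize_doi_url(" http://dx.doi.org/10.1234/example "))
</code></pre>
<p>Values that do not start with a recognizable resolver prefix are returned unchanged (apart from whitespace), so they can still be checked by hand.</p>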
<h2 id="2018-05-03">2018-05-03</h2>
<ul>
<li>It turns out that the IITA records that I was helping Sisay with in March were imported in 2018-04 without a final check by Abenet or me</li>
<li>There are lots of errors on language, CRP, and even some encoding errors on abstract fields</li>
<li>I exported them, including the hidden metadata fields like <code>dc.date.accessioned</code>, so I can filter the ones from 2018-04 and correct them in OpenRefine:</li>
</ul>
<pre><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
</code></pre>
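<p>The filtering itself happens in OpenRefine, but the same idea can be sketched in Python for a quick look at the export. This assumes the CSV has a column named exactly <code>dc.date.accessioned</code> holding an ISO 8601 date (the real export may use a different header, for example with a language suffix); the function name is made up:</p>
<pre><code>import csv


def rows_accessioned_in(path, month, column="dc.date.accessioned"):
    """Return CSV rows whose accession date starts with the given YYYY-MM."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # A missing or empty column value simply never matches
        return [row for row in reader if (row.get(column) or "").startswith(month)]


if __name__ == "__main__":
    for row in rows_accessioned_in("/tmp/iita.csv", "2018-04"):
        print(row.get("id"), row.get("dc.date.accessioned"))
</code></pre>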
<ul>
<li>Abenet sent a list of 46 ORCID identifiers for ILRI authors so I need to get their names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script and merge them into our controlled vocabulary</li>
<li>On the messed up IITA records from 2018-04 I see sixty DOIs in an incorrect format in the <code>cg.identifier.doi</code> field</li>
</ul>
<h2 id="2018-05-06">2018-05-06</h2>
<ul>
<li>While fixing the IITA records from Sisay, I found that sixty DOIs have a completely invalid format like <code>http:dx.doi.org10.1016j.cropro.2008.07.003</code></li>
<li>I corrected all the DOIs and then checked them for validity with a quick bash loop:</li>
</ul>
<pre><code>$ for line in $(&lt; /tmp/links.txt); do echo "$line"; http --print h "$line"; done
</code>
</code></pre>
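<p>The loop above checks that each link actually resolves. A cheaper offline pre-check for the format alone might look like the following Python sketch. The pattern is my assumption about what a well-formed link looks like (<code>https://doi.org/</code>, a <code>10.</code> prefix, a slash, and a suffix), not an official DOI grammar, and the function name is made up:</p>
<pre><code>import re

# Assumed shape: https://doi.org/ followed by a "10." prefix, a slash, and a suffix
DOI_URL = re.compile(r"^https://doi\.org/10\.\d{4,9}/\S+$")


def invalid_doi_urls(values):
    """Return the values that do not look like https://doi.org/10.NNNN/suffix."""
    return [value for value in values if not DOI_URL.match(value.strip())]


if __name__ == "__main__":
    samples = [
        "http:dx.doi.org10.1016j.cropro.2008.07.003",  # broken form quoted above
        "https://doi.org/10.1234/example",  # made-up DOI, for illustration only
    ]
    print(invalid_doi_urls(samples))
</code></pre>
<p>Passing this check says nothing about whether the DOI exists, so it complements the HTTP check rather than replacing it.</p>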
<ul>
<li>Most of the links are good, though one is a duplicate and one even seems to be incorrect on the publisher&rsquo;s site so&hellip;</li>
<li>Also, there are some duplicates:
<ul>
<li><code>10568/92241</code> and <code>10568/92230</code> (same DOI)</li>
<li><code>10568/92151</code> and <code>10568/92150</code> (same ISBN)</li>
<li><code>10568/92291</code> and <code>10568/92286</code> (same citation, title, authors, year)</li>
</ul></li>
<li>Messed up abstracts:
<ul>
<li><code>10568/92309</code></li>
</ul></li>
<li>Fixed some issues in regions, countries, sponsors, ISSN, and cleaned whitespace errors from citation, abstract, author, and titles</li>
<li>Fixed all issues with CRPs</li>
<li>A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: <code>’</code> (0x2019), <code>·</code> (0x00b7), and <code>€</code> (0x20ac)</li>
<li>A custom text facet in OpenRefine with this GREL expression could be a good way to find invalid characters or encoding errors in authors, abstracts, etc:</li>
</ul>
<pre><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b7.*/)),
isNotNull(value.match(/.*\u20ac.*/))
)
</code></pre>
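<p>Outside of OpenRefine, roughly the same check can be sketched in Python with a single character class. This is an approximation of the GREL facet rather than an exact port (for example, it also matches inside multi-line values), and the function name is made up:</p>
<pre><code>import re

# Same characters as the GREL facet: ( | ) plus U+FFFD, U+00A0, U+200A,
# U+2019, U+00B7 and U+20AC
SUSPICIOUS = re.compile("[(|)\uFFFD\u00A0\u200A\u2019\u00B7\u20AC]")


def has_suspicious_chars(value):
    """Return True if the text contains any character the facet would flag."""
    return SUSPICIOUS.search(value) is not None


if __name__ == "__main__":
    # The second sample contains a non-breaking space (U+00A0)
    for sample in ["plain text", "trailing\u00a0space"]:
        print(repr(sample), has_suspicious_chars(sample))
</code></pre>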
<ul>
<li>I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!</li>
<li>Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-05-06-combined.txt
$ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre>
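<p>The <code>grep | sort | uniq</code> step above can also be sketched in Python, reusing the same regular expression. The function name and the sample identifiers are made up for illustration, and the sort order may differ slightly from what <code>sort</code> gives under some locales:</p>
<pre><code>import re

# Same pattern as the grep above: four groups of four characters from A-Z and 0-9
ORCID = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts):
    """Return the sorted, de-duplicated ORCID identifiers found in the texts."""
    found = set()
    for text in texts:
        found.update(ORCID.findall(text))
    return sorted(found)


if __name__ == "__main__":
    # Made-up identifiers, for illustration only
    vocabulary = "Jane Doe: 0000-0000-0000-0001"
    new_list = "0000-0000-0000-0002\n0000-0000-0000-0001\n"
    print(unique_orcids(vocabulary, new_list))
</code></pre>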
<ul>
<li>I made a pull request (<a href="https://github.com/ilri/DSpace/pull/373">#373</a>) for this that I&rsquo;ll merge some time next week (I&rsquo;m expecting Atmire to get back to us about DSpace 5.8 soon)</li>
<li>After testing quickly I just decided to merge it, and I noticed that I don&rsquo;t even need to restart Tomcat for the changes to get loaded</li>
</ul>
<h2 id="2018-05-07">2018-05-07</h2>
<ul>
<li>I spent a bit of time playing with <a href="https://github.com/codeforkjeff/conciliator">conciliator</a> and Solr, trying to figure out how to reconcile columns in OpenRefine with data in our existing Solr cores (like CRP subjects)</li>
<li>The documentation regarding the Solr stuff is limited, and I cannot figure out what all the fields in <code>conciliator.properties</code> are supposed to be</li>
<li>But then I found <a href="https://github.com/okfn/reconcile-csv">reconcile-csv</a>, which allows you to reconcile against values in a CSV file!</li>
<li>That, combined with splitting our multi-value fields on &ldquo;||&rdquo; in OpenRefine is amaaaaazing, because after reconciliation you can just join them again</li>
</ul>
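<p>The split, reconcile, and join idea can be illustrated with a small Python sketch. Here a plain dictionary stands in for the reconciliation result; reconcile-csv itself does fuzzy matching inside OpenRefine, which this does not attempt. The function name and the sample values are made up:</p>
<pre><code>def reconcile_multi_value(cell, corrections, separator="||"):
    """Split a multi-value cell, map each value through corrections, and rejoin."""
    values = [value.strip() for value in cell.split(separator)]
    # Values without a known correction pass through unchanged
    return separator.join(corrections.get(value, value) for value in values)


if __name__ == "__main__":
    # Hypothetical mapping from a messy value to its canonical form
    corrections = {"crp one": "CRP One"}
    print(reconcile_multi_value("crp one||CRP Two", corrections))
</code></pre>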
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-05/">May, 2018</a></li>
<li><a href="/cgspace-notes/2018-04/">April, 2018</a></li>
<li><a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li><a href="/cgspace-notes/2018-02/">February, 2018</a></li>
<li><a href="/cgspace-notes/2018-01/">January, 2018</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>