<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2019" />
<meta property="og:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace
I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unnecessary Unicode&rdquo; fix:" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-10/" />
<meta property="article:published_time" content="2019-10-01T13:20:51+03:00" />
<meta property="article:modified_time" content="2019-10-08T19:33:09+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace
I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unnecessary Unicode&rdquo; fix:"/>
<meta name="generator" content="Hugo 0.58.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2019",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-10\/",
"wordCount": "592",
"datePublished": "2019-10-01T13:20:51\x2b03:00",
"dateModified": "2019-10-08T19:33:09\x2b03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-10/">
<title>October, 2019 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-10/">October, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-10-01T13:20:51&#43;03:00">Tue Oct 01, 2019</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2019-10-01">2019-10-01</h2>
<ul>
<li><p>Udana from IWMI asked me for a CSV export of their community on CGSpace</p>
<ul>
<li>I exported it, but a quick run through the <code>csv-metadata-quality</code> tool shows that there is some low-hanging fruit we can fix before I send him the data</li>
<li><p>For now I will limit the scope to the titles, regions, subregions, and river basins, and manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unnecessary Unicode&rdquo; fix:</p>
<pre><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
</code></pre></li>
<li><p>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can&rsquo;t figure out the correct sed syntax to do it directly from the pipe above</p></li>
<li><p>I uploaded those to CGSpace and then re-exported the metadata</p></li>
<li><p>Now that I think about it, I shouldn&rsquo;t be removing non-breaking spaces (U+00A0); I should be replacing them with normal spaces!</p></li>
<li><p>I modified the script so it replaces the non-breaking spaces instead of removing them</p></li>
<li><p>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because they affected too many thousands of rows):</p>
<pre><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
</code></pre></li>
<li><p>That fixed 153 items (unnecessary Unicode, duplicates, commaspace fixes, etc.)</p></li>
</ul></li>
<li><p>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</p></li>
</ul>
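<p>The non-breaking space replacement described above could also be done directly in a pipeline. This is only a minimal sketch, assuming a POSIX shell and a UTF-8 encoded CSV; <code>replace_nbsp</code> is a hypothetical helper name and is not part of csv-metadata-quality:</p>

```shell
# Hypothetical helper: replace UTF-8 non-breaking spaces (U+00A0, bytes 0xC2 0xA0)
# with regular spaces, reading from stdin and writing to stdout.
replace_nbsp() {
  # Build the two-byte sequence with printf octal escapes, which are portable.
  nbsp=$(printf '\302\240')
  # LC_ALL=C makes sed match raw bytes regardless of the current locale.
  LC_ALL=C sed "s/${nbsp}/ /g"
}

# Example: one non-breaking space sits between "Nile" and "Basin".
printf 'Nile\302\240Basin\n' | replace_nbsp
```

<p>Under those assumptions, the helper could sit between the <code>csvcut</code> command and the redirect, for example <code>csvcut ... | replace_nbsp &gt; /tmp/iwmi-title-region-subregion-river.csv</code>.</p>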
<h2 id="2019-10-03">2019-10-03</h2>
<ul>
<li>Upload the 117 IITA records that we had been working on last month (aka 20196th.xls aka Sept 6) to CGSpace</li>
</ul>
<h2 id="2019-10-04">2019-10-04</h2>
<ul>
<li><p>Create an account for Bioversity&rsquo;s ICT consultant Francesco on DSpace Test:</p>
<pre><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
</code></pre></li>
<li><p>Email Francesca and Carol to ask for a follow-up on the test upload I did on 2019-09-21</p>
<ul>
<li>I suggested that if they still want to add value to those records (like adding countries, regions, etc.), they could do it after we migrate the records to CGSpace</li>
<li>Carol responded to tell me where to map the items with type Brochure, Journal Item, and Thesis, so I applied them to the <a href="https://dspacetest.cgiar.org/handle/10568/103688">collection on DSpace Test</a></li>
</ul></li>
</ul>
<h2 id="2019-10-06">2019-10-06</h2>
<ul>
<li>Hector from CCAFS responded to my feedback on their CLARISA API
<ul>
<li>He made some fixes to the metadata values they are using based on my feedback and said they would be happy for us to use it</li>
</ul></li>
<li>Gabriela from CIP asked me if it was possible to generate an RSS feed of items that have the CIP subject &ldquo;POTATO AGRI-FOOD SYSTEMS&rdquo;
<ul>
<li>I noticed that there is a similar term, &ldquo;SWEETPOTATO AGRI-FOOD SYSTEMS&rdquo;, so I had to exclude it using the boolean &ldquo;AND NOT&rdquo; operator in the <a href="https://cgspace.cgiar.org/open-search/discover?query=cipsubject:POTATO%20AGRI%E2%80%90FOOD%20SYSTEMS%20AND%20NOT%20cipsubject:SWEETPOTATO%20AGRI%E2%80%90FOOD%20SYSTEMS&amp;scope=10568/51671&amp;sort_by=3&amp;order=DESC">OpenSearch query</a></li>
<li>Again, the <code>sort_by=3</code> parameter sorts by the accession date, as configured in <code>dspace.cfg</code></li>
</ul></li>
</ul>
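<p>The OpenSearch URL above can be assembled from its parts. This is a rough sketch, assuming a POSIX shell; the variable names are illustrative only, and the hyphen in the subject values is kept as <code>%E2%80%90</code> (U+2010) exactly as it appears in the link:</p>

```shell
# Base endpoint and parameters are copied from the link above.
base='https://cgspace.cgiar.org/open-search/discover'

# Subject to include and subject to exclude; the hyphen stays percent-encoded
# as %E2%80%90, matching the original link.
include='cipsubject:POTATO AGRI%E2%80%90FOOD SYSTEMS'
exclude='cipsubject:SWEETPOTATO AGRI%E2%80%90FOOD SYSTEMS'

# Combine the two terms with the boolean AND NOT operator.
query="${include} AND NOT ${exclude}"

# Percent-encode the remaining spaces (nothing else in this query needs encoding).
encoded=$(printf '%s' "$query" | sed 's/ /%20/g')

# sort_by=3 is the accession date; scope is the collection handle from the link.
url="${base}?query=${encoded}&scope=10568/51671&sort_by=3&order=DESC"
printf '%s\n' "$url"
```

<p>Keeping the include and exclude terms in separate variables makes it easier to adapt the query if another similar subject needs to be excluded later.</p>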
<h2 id="2019-10-08">2019-10-08</h2>
<ul>
<li>Fix 108 more issues with authors in the ongoing Bioversity migration on DSpace Test, for example:
<ul>
<li>Europeanooperative Programme for Plant Genetic Resources</li>
<li>Bioversity International. Capacity Development Unit</li>
<li>W.M. van der Heide, W.M., Tripp, R.</li>
<li>Internationallant Genetic Resources Institute</li>
</ul></li>
<li>Start looking at duplicates in the Bioversity migration data on DSpace Test
<ul>
<li>I&rsquo;m keeping track of the originals and duplicates in a Google Docs spreadsheet that I will share with Bioversity</li>
</ul></li>
</ul>
<h2 id="2019-10-09">2019-10-09</h2>
<ul>
<li>Continue working on identifying duplicates in the Bioversity migration
<ul>
<li>I have been recording the originals and duplicates in a spreadsheet so I can map them later</li>
<li>For now I am just reconciling any incorrect or missing metadata in the original items on CGSpace, deleting the duplicate from DSpace Test, and mapping the original to the correct place on CGSpace</li>
<li>So far I have deleted thirty duplicates and mapped fourteen</li>
</ul></li>
<li>Run all system updates on DSpace Test (linode19) and reboot the server</li>
</ul>
<!-- vim: set sw=2 ts=2: -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2019-10/">October, 2019</a></li>
<li><a href="/cgspace-notes/posts/">Posts</a></li>
<li><a href="/cgspace-notes/2019-09/">September, 2019</a></li>
<li><a href="/cgspace-notes/2019-08/">August, 2019</a></li>
<li><a href="/cgspace-notes/2019-07/">July, 2019</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>