mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-23 05:32:20 +01:00
207 lines
6.8 KiB
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2019" />
<meta property="og:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace
I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix:" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-10/" />
<meta property="article:published_time" content="2019-10-01T13:20:51+03:00" />
<meta property="article:modified_time" content="2019-10-01T13:20:51+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace
I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix:"/>
<meta name="generator" content="Hugo 0.58.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2019",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-10\/",
"wordCount": "214",
"datePublished": "2019-10-01T13:20:51\x2b03:00",
"dateModified": "2019-10-01T13:20:51\x2b03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-10/">
<title>October, 2019 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-10/">October, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-10-01T13:20:51+03:00">Tue Oct 01, 2019</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2019-10-01">2019-10-01</h2>
<ul>
<li><p>Udana from IWMI asked me for a CSV export of their community on CGSpace</p>
<ul>
<li>I exported it, but a quick run through the <code>csv-metadata-quality</code> tool shows that there is some low-hanging fruit we can fix before I send him the data</li>
<li><p>I will limit the scope to the titles, regions, subregions, and river basins for now, so that I can manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the <code>csv-metadata-quality</code> script’s “unnecessary Unicode” fix:</p>
<pre><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
</code></pre></li>
<li><p>Then I replaced them in vim with <code>:% s/\%u00a0/ /g</code> because I couldn’t figure out the correct sed syntax to do it directly from the pipe above</p></li>
<li><p>I uploaded those to CGSpace and then re-exported the metadata</p></li>
<li><p>Now that I think about it, I shouldn’t be removing non-breaking spaces (U+00A0); I should be replacing them with normal spaces!</p></li>
<li><p>I modified the <code>csv-metadata-quality</code> script so it replaces the non-breaking spaces instead of removing them</p></li>
<li><p>Then I ran the <code>csv-metadata-quality</code> script to do some general cleanups (though I temporarily commented out the whitespace fixes because they affected too many thousands of rows):</p>
<pre><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
</code></pre></li>
<li><p>That fixed 153 items (unnecessary Unicode, duplicates, comma–space fixes, etc.)</p></li>
</ul></li>
<li><p>Released <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</p></li>
</ul>
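<p>The manual vim step above could also be scripted. The following is only a minimal Python sketch, assuming the export fits in memory; the helper names <code>replace_nbsp</code> and <code>clean_columns</code> and the sample row are made up for illustration and are not taken from csv-metadata-quality. It replaces each non-breaking space (U+00A0) with a normal space, and only in the columns that are named:</p>

```python
import csv
import io

NBSP = "\u00a0"  # non-breaking space (U+00A0)


def replace_nbsp(text):
    """Replace each non-breaking space with a normal space (not remove it)."""
    return text.replace(NBSP, " ")


def clean_columns(csv_text, columns):
    """Return CSV text with non-breaking spaces replaced in the given columns only."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for name in columns:
            # Skip columns that are missing or empty in this row
            if row.get(name):
                row[name] = replace_nbsp(row[name])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample data, not from the real IWMI export
sample = "id,dc.title[en_US],cg.river.basin[en_US]\n1,Water\u00a0use,NILE\u00a0BASIN\n"
cleaned = clean_columns(sample, ["dc.title[en_US]"])
print(cleaned)
```

<p>Here only the title column is cleaned, so the non-breaking space in the river basin column is left untouched. Passing more column names widens the scope in the same way as the <code>csvcut -c</code> selection above.</p>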
<!-- vim: set sw=2 ts=2: -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2019-10/">October, 2019</a></li>
<li><a href="/cgspace-notes/posts/">Posts</a></li>
<li><a href="/cgspace-notes/2019-09/">September, 2019</a></li>
<li><a href="/cgspace-notes/2019-08/">August, 2019</a></li>
<li><a href="/cgspace-notes/2019-07/">July, 2019</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>