mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-27
@ -6,7 +6,7 @@
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
||||
<meta property="og:title" content="October, 2019" />
|
||||
<meta property="og:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unnecessary Unicode” fix: $ csvcut -c 'id,dc." />
|
||||
<meta property="og:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix: $ csvcut -c 'id,dc." />
|
||||
<meta property="og:type" content="article" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-10/" />
|
||||
<meta property="article:published_time" content="2019-10-01T13:20:51+03:00" />
|
||||
@ -14,8 +14,8 @@
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="October, 2019"/>
|
||||
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unnecessary Unicode” fix: $ csvcut -c 'id,dc."/>
|
||||
<meta name="generator" content="Hugo 0.62.2" />
|
||||
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix: $ csvcut -c 'id,dc."/>
|
||||
<meta name="generator" content="Hugo 0.63.1" />
|
||||
|
||||
@ -45,7 +45,7 @@
|
||||
|
||||
<!-- combined, minified CSS -->
|
||||
|
||||
<link href="https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel="stylesheet" integrity="sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy+piAwENoVPTw=" crossorigin="anonymous">
|
||||
<link href="https://alanorth.github.io/cgspace-notes/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css" rel="stylesheet" integrity="sha256-I+LDKYvMjBE2wZq6MwwhHslMNvfERU6hXPTTVINwBCo=" crossorigin="anonymous">
|
||||
|
||||
|
||||
<!-- RSS 2.0 feed -->
|
||||
@ -92,7 +92,7 @@
|
||||
<header>
|
||||
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-10/">October, 2019</a></h2>
|
||||
<p class="blog-post-meta"><time datetime="2019-10-01T13:20:51+03:00">Tue Oct 01, 2019</time> by Alan Orth in
|
||||
<i class="fa fa-folder" aria-hidden="true"></i> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
|
||||
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
|
||||
</p>
|
||||
@ -102,15 +102,15 @@
|
||||
<li>Udana from IWMI asked me for a CSV export of their community on CGSpace
|
||||
<ul>
|
||||
<li>I exported it, but a quick run through the <code>csv-metadata-quality</code> tool shows that there is some low-hanging fruit we can fix before I send him the data</li>
|
||||
<li>I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's “unnecessary Unicode” fix:</li>
|
||||
<li>I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
|
||||
</code></pre><ul>
|
||||
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can't figure out the correct sed syntax to do it directly from the pipe above</li>
|
||||
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can’t figure out the correct sed syntax to do it directly from the pipe above</li>
|
||||
<li>I uploaded those to CGSpace and then re-exported the metadata</li>
|
||||
<li>Now that I think about it, I shouldn't be removing non-breaking spaces (U+00A0), I should be replacing them with normal spaces!</li>
|
||||
<li>Now that I think about it, I shouldn’t be removing non-breaking spaces (U+00A0), I should be replacing them with normal spaces!</li>
|
||||
<li>I modified the script so it replaces the non-breaking spaces instead of removing them</li>
|
||||
<li>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):</li>
|
||||
</ul>
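<p>The non-breaking space replacement described above could also be done outside of vim. The following is only a minimal Python sketch, not the method used in these notes: the function name <code>replace_nbsp</code> and the sample row are hypothetical, and it assumes the CSV is UTF-8 text that the standard <code>csv</code> module can parse.</p>

```python
import csv
import io

NBSP = "\u00a0"  # non-breaking space


def replace_nbsp(src, dst):
    """Copy CSV rows from src to dst, replacing U+00A0 with a normal space."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # Replace (rather than remove) every non-breaking space in each cell
        writer.writerow([cell.replace(NBSP, " ") for cell in row])


# Hypothetical sample data standing in for the exported metadata
sample = io.StringIO('id,dc.title[en_US]\n1,"Water\u00a0management in the Nile"\n')
cleaned = io.StringIO()
replace_nbsp(sample, cleaned)
print(cleaned.getvalue())
```

<p>With real files, <code>src</code> and <code>dst</code> would be file objects opened with <code>newline=""</code> and <code>encoding="utf-8"</code>, as the <code>csv</code> module documentation recommends.</p>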
|
||||
@ -125,7 +125,7 @@
|
||||
</ul>
|
||||
<h2 id="2019-10-04">2019-10-04</h2>
|
||||
<ul>
|
||||
<li>Create an account for Bioversity's ICT consultant Francesco on DSpace Test:</li>
|
||||
<li>Create an account for Bioversity’s ICT consultant Francesco on DSpace Test:</li>
|
||||
</ul>
|
||||
<pre><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
|
||||
</code></pre><ul>
|
||||
@ -162,7 +162,7 @@
|
||||
</li>
|
||||
<li>Start looking at duplicates in the Bioversity migration data on DSpace Test
|
||||
<ul>
|
||||
<li>I'm keeping track of the originals and duplicates in a Google Docs spreadsheet that I will share with Bioversity</li>
|
||||
<li>I’m keeping track of the originals and duplicates in a Google Docs spreadsheet that I will share with Bioversity</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
@ -181,7 +181,7 @@
|
||||
<ul>
|
||||
<li>Felix Shaw from Earlham emailed me to ask about his admin account on DSpace Test
|
||||
<ul>
|
||||
<li>His old one got lost when I re-sync'd DSpace Test with CGSpace a few weeks ago</li>
|
||||
<li>His old one got lost when I re-sync’d DSpace Test with CGSpace a few weeks ago</li>
|
||||
<li>I added a new account for him and added it to the Administrators group:</li>
|
||||
</ul>
|
||||
</li>
|
||||
@ -206,7 +206,7 @@ UPDATE 1
|
||||
<li>More work on identifying duplicates in the Bioversity migration data on DSpace Test
|
||||
<ul>
|
||||
<li>I mapped twenty-five more items on CGSpace and deleted them from the migration test collection on DSpace Test</li>
|
||||
<li>After a few hours I think I finished all the duplicates that were identified by Atmire's Duplicate Checker module</li>
|
||||
<li>After a few hours I think I finished all the duplicates that were identified by Atmire’s Duplicate Checker module</li>
|
||||
<li>According to my spreadsheet there were fifty-two in total</li>
|
||||
</ul>
|
||||
</li>
|
||||
@ -234,8 +234,8 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
|
||||
<li>I would still like to perhaps (re)move institutional authors from <code>dc.contributor.author</code> to <code>cg.contributor.affiliation</code>, but I will have to run that by Francesca, Carol, and Abenet</li>
|
||||
<li>I could use a custom text facet like this in OpenRefine to find authors that likely match the “Last, F.” pattern: <code>isNotNull(value.match(/^.*, \p{Lu}\.?.*$/))</code></li>
|
||||
<li>The <code>\p{Lu}</code> is a cool <a href="https://www.regular-expressions.info/unicode.html">regex character class</a> to make sure this works for letters with accents</li>
|
||||
<li>As cool as that is, it's actually more effective to just search for authors that have “.” in them!</li>
|
||||
<li>I've decided to add a <code>cg.contributor.affiliation</code> column to 1,025 items based on the logic above where the author name is not an actual person</li>
|
||||
<li>As cool as that is, it’s actually more effective to just search for authors that have “.” in them!</li>
|
||||
<li>I’ve decided to add a <code>cg.contributor.affiliation</code> column to 1,025 items based on the logic above where the author name is not an actual person</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
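<p>For illustration, the “Last, F.” check above can be approximated outside of OpenRefine. This is a rough Python sketch under stated assumptions: Python's built-in <code>re</code> module has no <code>\p{Lu}</code> class, so the sketch uses <code>unicodedata</code> to test for an uppercase letter instead, and the function names and sample author strings are hypothetical.</p>

```python
import unicodedata


def looks_like_last_initial(name):
    # Approximates the OpenRefine pattern: a comma and a space followed by
    # an uppercase letter (Unicode category "Lu") somewhere in the string
    for i in range(len(name) - 2):
        if name[i] == "," and name[i + 1] == " ":
            if unicodedata.category(name[i + 2]) == "Lu":
                return True
    return False


def has_period(name):
    # The simpler heuristic: the author string contains a "."
    return "." in name


# Hypothetical sample values, including an accented initial
authors = [
    "Orth, A.",
    "\u00c9vora, \u00c9.",
    "International Maize and Wheat Improvement Centre",
]
for author in authors:
    print(author, looks_like_last_initial(author), has_period(author))
```

<p>Names for which both checks are false, like the institutional author in the sample, are the candidates for <code>cg.contributor.affiliation</code>, though any such list would still need manual review.</p>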
|
||||
@ -279,7 +279,7 @@ real 82m35.993s
|
||||
10568/129
|
||||
(1 row)
|
||||
</code></pre><ul>
|
||||
<li>So I'm still not sure where these weird authors in the “Top Author” stats are coming from</li>
|
||||
<li>So I’m still not sure where these weird authors in the “Top Author” stats are coming from</li>
|
||||
</ul>
|
||||
<h2 id="2019-10-14">2019-10-14</h2>
|
||||
<ul>
|
||||
@ -302,12 +302,12 @@ $ mkdir 2019-10-15-Bioversity
|
||||
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
|
||||
$ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
|
||||
</code></pre><ul>
|
||||
<li>It's really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
|
||||
<li>It’s really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
|
||||
<li>Then I imported a test subset of them in my local test environment:</li>
|
||||
</ul>
|
||||
<pre><code>$ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
|
||||
</code></pre><ul>
|
||||
<li>I had forgotten (again) that the <code>dspace export</code> command doesn't preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…</li>
|
||||
<li>I had forgotten (again) that the <code>dspace export</code> command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…</li>
|
||||
<li>On CGSpace I will increase the RAM of the command line Java process for good luck before import…</li>
|
||||
</ul>
|
||||
<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
|
||||
@ -338,8 +338,8 @@ $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map
|
||||
<li>Move the CGSpace CG Core v2 notes from a <a href="https://gist.github.com/alanorth/2db39e91f48d116e00a4edffd6ba6409">GitHub Gist</a> to a <a href="/cgspace-notes/cgspace-cgcorev2-migration/">page</a> on this site for the sake of archiving and searchability</li>
|
||||
<li>Work on the CG Core v2 implementation testing
|
||||
<ul>
|
||||
<li>I noticed that the page title is messed up on the item view, and after over an hour of troubleshooting it I couldn't figure out why</li>
|
||||
<li>It seems to be because the <code>dc.title</code>→<code>dcterms.title</code> modifications cause the title metadata to disappear from DRI's <code>&lt;pageMeta&gt;</code> and therefore the title is not accessible to the XSL transformation</li>
|
||||
<li>I noticed that the page title is messed up on the item view, and after over an hour of troubleshooting it I couldn’t figure out why</li>
|
||||
<li>It seems to be because the <code>dc.title</code>→<code>dcterms.title</code> modifications cause the title metadata to disappear from DRI’s <code>&lt;pageMeta&gt;</code> and therefore the title is not accessible to the XSL transformation</li>
|
||||
<li>Also, I noticed a few places in the Java code where <code>dc.title</code> is hard-coded, so I think this might be one of the fields that we should just assume DSpace relies on internally</li>
|
||||
<li>I will revert all changes to <code>dc.title</code> and <code>dc.title.alternative</code></li>
|
||||
<li>TODO: there are similar issues with the <code>citation_author</code> metadata element missing from DRI, so I might have to revert those changes too</li>
|
||||