mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-18 12:47:04 +01:00
346 lines
11 KiB
HTML
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="March, 2018" />
<meta property="og:description" content="2018-03-02


Export a CSV of the IITA community metadata for Martin Mueller


" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-03/" />

<meta property="article:published_time" content="2018-03-02T16:07:54+02:00"/>

<meta property="article:modified_time" content="2018-03-08T21:29:37+02:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2018"/>
<meta name="twitter:description" content="2018-03-02


Export a CSV of the IITA community metadata for Martin Mueller


"/>
<meta name="generator" content="Hugo 0.37.1" />

<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-03/",
"wordCount": "780",
"datePublished": "2018-03-02T16:07:54+02:00",
"dateModified": "2018-03-08T21:29:37+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-03/">

<title>March, 2018 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-CoMzlF7G4xk3ftqRr7leobnWP85AuISUJljMFjtTG/UHyP/+bBwWAvBlXkB4VQQk" crossorigin="anonymous">

</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>

</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-03/">March, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-03-02T16:07:54+02:00">Fri Mar 02, 2018</time> by Alan Orth in

<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
</header>
<h2 id="2018-03-02">2018-03-02</h2>

<ul>
<li>Export a CSV of the IITA community metadata for Martin Mueller</li>
</ul>

<p></p>

<h2 id="2018-03-06">2018-03-06</h2>

<ul>
<li>Add three new CCAFS project tags to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/357">#357</a>)</li>
<li>Andrea from Macaroni Bros had emailed me to say that CCAFS needs them</li>
<li>Give Udana more feedback on his WLE records from last month</li>
<li>There were some records using a non-breaking space in their AGROVOC subject field</li>
<li>I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
</code></pre>

<ul>
<li>This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character</li>
<li>Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/358">#358</a>)</li>
<li>Merge the ORCID integration stuff into <code>5_x-prod</code> for deployment on CGSpace soon (<a href="https://github.com/ilri/DSpace/pull/359">#359</a>)</li>
<li>Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server</li>
<li>Run all system updates on DSpace Test and reboot server</li>
<li>I ran the <a href="https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c">orcid-authority-to-item.py</a> script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata</li>
</ul>

<pre><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
</code></pre>

<ul>
<li>I ran the DSpace cleanup script on CGSpace and it threw an error (as always):</li>
</ul>

<pre><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
</code></pre>

<ul>
<li><p>The solution is, as always:</p>

<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
UPDATE 1
</code></pre></li>

<li><p>Apply the proposed PostgreSQL indexes from DS-3636 (pull request <a href="https://github.com/DSpace/DSpace/pull/1791/">#1791</a>) on CGSpace (linode18)</p></li>
</ul>
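
<p>Because the cleanup error and its fix follow the same pattern each time, the step from error text to SQL could be scripted. The following is a minimal Python sketch and not part of DSpace: the helper names <code>bitstream_ids</code> and <code>build_fix_sql</code> are hypothetical, and the script only prints the statement so it can be reviewed before being passed to <code>psql</code>.</p>

```python
import re

# Matches the "Detail:" line PostgreSQL prints for the foreign key violation.
KEY_PATTERN = re.compile(r"Key \(bitstream_id\)=\((\d+)\)")


def bitstream_ids(error_text):
    """Return the bitstream IDs named in the cleanup error output, in order."""
    return [int(match) for match in KEY_PATTERN.findall(error_text)]


def build_fix_sql(ids):
    """Build the UPDATE that clears primary_bitstream_id for the given IDs."""
    if not ids:
        raise ValueError("no bitstream IDs found in the error output")
    id_list = ",".join(str(i) for i in ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


# Error text copied from the cleanup run above.
error = (
    'Error: ERROR: update or delete on table "bitstream" violates foreign key '
    'constraint "bundle_primary_bitstream_id_fkey" on table "bundle"\n'
    'Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".'
)

print(build_fix_sql(bitstream_ids(error)))
```

<p>The IDs are converted to integers before being placed in the statement, so nothing other than digits from the error message can end up in the SQL.</p>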

<h2 id="2018-03-07">2018-03-07</h2>

<ul>
<li>Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (<a href="https://github.com/ilri/DSpace/pull/360">#360</a>)</li>
<li>Help Sisay proof 200 IITA records on DSpace Test</li>
<li>Finally import Udana’s 24 items to <a href="https://cgspace.cgiar.org/handle/10568/36185">IWMI Journal Articles</a> on CGSpace</li>
<li>Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc</li>
</ul>

<h2 id="2018-03-08">2018-03-08</h2>

<ul>
<li>Looking at a CSV dump of the CIAT community I see there are tons of inconsistent text languages that people have added to their metadata</li>
<li>This makes the CSV have tons of columns, for example <code>dc.title</code>, <code>dc.title[]</code>, <code>dc.title[en]</code>, <code>dc.title[eng]</code>, <code>dc.title[en_US]</code> and so on!</li>
<li>I think I can fix, or at least normalize, them in the database:</li>
</ul>

<pre><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------

ethnob
en
spa
EN
En
en_
en_US
E.

EN_US
en_U
eng
fr
es_ES
es
(16 rows)

dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------

ethnob
en_US
spa
E.

fr
es_ES
es
(9 rows)
</code></pre>

<ul>
<li>On second inspection it looks like <code>dc.description.provenance</code> fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…</li>
<li>If I skip that, there are about 2,000, which seems like a more reasonable number of fields for users to have edited manually, or broken during CSV import, etc:</li>
</ul>

<pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
UPDATE 2309
</code></pre>
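
<p>The normalization rule in the second <code>UPDATE</code> can also be written as a small lookup, which is handy for checking a CSV export before touching the database. This is only a sketch in Python: the function name <code>normalize_text_lang</code> is hypothetical, and the two blank rows in the psql output are assumed to be one NULL and one empty string.</p>

```python
# Variants copied from the second UPDATE statement. Plain 'en' is deliberately
# absent because dc.description.provenance fields use it.
EN_VARIANTS = frozenset({"EN", "En", "en_", "EN_US", "en_U", "eng"})


def normalize_text_lang(text_lang):
    """Map known English variants to en_US and leave everything else untouched."""
    return "en_US" if text_lang in EN_VARIANTS else text_lang


# The 16 distinct values from the first query. Assumption: the two blank rows
# are NULL (None here) and the empty string.
observed = [
    None, "ethnob", "en", "spa", "EN", "En", "en_", "en_US",
    "E.", "", "EN_US", "en_U", "eng", "fr", "es_ES", "es",
]

distinct_after = {normalize_text_lang(value) for value in observed}
print(len(observed), "distinct values before,", len(distinct_after), "after")
```

<p>Values such as <code>E.</code>, <code>ethnob</code>, and <code>spa</code> pass through unchanged here, just as they do in the SQL, so they would still need a separate decision.</p>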

<ul>
<li>I will apply this on CGSpace right now</li>
<li>In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine</li>
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
</ul>

<pre><code>or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
</code></pre>
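
<p>For anyone working outside OpenRefine, the same facet logic is a plain substring test. The sketch below is a Python equivalent of the GREL expression above; the function name <code>matches_author</code> and the sample cell values are made up for illustration.</p>

```python
def matches_author(value):
    """Mirror or(value.contains(...), value.contains(...)) from the GREL facet."""
    if value is None:
        # Empty cells cannot match either name variation.
        return False
    return "Ceballos, Hern" in value or "Hernández Ceballos" in value


# Hypothetical dc.contributor.author[en_US] cells, multi-valued with "||".
rows = [
    "Ceballos, Hernán||Orth, Alan",
    "Hernández Ceballos, A.",
    "Orth, Alan",
    None,
]

flagged = [row for row in rows if matches_author(row)]
print(flagged)
```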

<ul>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>

<pre><code>if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
</code></pre>
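
<p>The set-or-append conditional above can be sketched in Python as follows. The function name <code>add_creator_id</code> is hypothetical, and treating only <code>None</code> and the empty string as blank is an approximation of GREL's <code>isBlank</code>.</p>

```python
ORCID_VALUE = "Hernan Ceballos: 0000-0002-8744-7918"


def add_creator_id(value, new_id=ORCID_VALUE, separator="||"):
    """Set cg.creator.id directly if the cell is blank, otherwise append to it."""
    # Assumption: blank means None or an empty string.
    if value is None or value == "":
        return new_id
    # "||" is the separator used for multiple values in the CSV export.
    return value + separator + new_id


print(add_creator_id(""))
print(add_creator_id("Alan S. Orth: 0000-0002-1735-7458"))
```

<p>Like the GREL version, this always appends at the end of the existing value, regardless of where the author appears in the author list.</p>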

<ul>
<li>One thing that bothers me is that this won’t honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code></li>
<li>I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching <code>cg.creator.id</code> fields: <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a></li>
<li>The CSV should have two columns: author name and ORCID identifier:</li>
</ul>

<pre><code>dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
</code></pre>

<ul>
<li>I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors</li>
<li>I added ORCID identifiers for 187 items by CIAT’s Hernan Ceballos, because that is what Elizabeth was trying to do manually!</li>
<li>Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well</li>
</ul>
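
<p>To illustrate the idea of a two-column CSV plus the <code>place</code> column, here is a minimal Python sketch. It is not the code from the linked gist: the function names, the in-memory CSV, and the sample item authors are assumptions, and it does not touch the database.</p>

```python
import csv
import io

# Same two-column layout as the CSV shown above.
CSV_TEXT = '''dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
'''


def load_author_ids(csv_file):
    """Read author name variations and map each one to its ORCID identifier."""
    reader = csv.DictReader(csv_file)
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


def creator_ids_for_item(authors_with_place, author_ids):
    """Return cg.creator.id values in author order, skipping duplicates.

    authors_with_place is a list of (place, author name) tuples, standing in
    for the place and text_value columns of an item's author metadata.
    """
    seen = set()
    result = []
    for _place, name in sorted(authors_with_place):
        creator_id = author_ids.get(name)
        if creator_id and creator_id not in seen:
            seen.add(creator_id)
            result.append(creator_id)
    return result


author_ids = load_author_ids(io.StringIO(CSV_TEXT))

# Hypothetical item with two authors; only one of them is in the CSV.
item_authors = [(2, "Orth, A."), (1, "Ceballos, Hernán")]
print(creator_ids_for_item(item_authors, author_ids))
```

<p>Sorting by <code>place</code> before looking up identifiers keeps the resulting values in the same order as the authors, which is the concern raised about the OpenRefine approach.</p>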

</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">

<li><a href="/cgspace-notes/2018-03/">March, 2018</a></li>

<li><a href="/cgspace-notes/2018-02/">February, 2018</a></li>

<li><a href="/cgspace-notes/2018-01/">January, 2018</a></li>

<li><a href="/cgspace-notes/2017-12/">December, 2017</a></li>

<li><a href="/cgspace-notes/2017-11/">November, 2017</a></li>

</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">

<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>

<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>

<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>

</ol>
</section>

</aside>


</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>

Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.

</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>