Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

@@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
-<meta name="generator" content="Hugo 0.92.2" />
+<meta name="generator" content="Hugo 0.93.1" />
@@ -113,7 +113,7 @@
</ul>
</li>
</ul>
-<pre tabindex="0"><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
+<pre tabindex="0"><code>$ csvcut -c &#39;id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]&#39; ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
</code></pre><ul>
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can&rsquo;t figure out the correct sed syntax to do it directly from the pipe above</li>
<li>I uploaded those to CGSpace and then re-exported the metadata</li>
@@ -121,7 +121,7 @@
<li>I modified the script so it replaces the non-breaking spaces instead of removing them</li>
<li>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):</li>
</ul>
-<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
+<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x &#39;dc.date.issued,dc.date.issued[],dc.date.issued[en_US]&#39; -u
</code></pre><ul>
<li>That fixed 153 items (unnecessary Unicode, duplicates, commaspace fixes, etc)</li>
<li>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</li>
@@ -134,7 +134,7 @@
<ul>
<li>Create an account for Bioversity&rsquo;s ICT consultant Francesco on DSpace Test:</li>
</ul>
-<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
+<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p &#39;fffff&#39;
</code></pre><ul>
<li>Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21
<ul>
@@ -193,20 +193,20 @@
</ul>
</li>
</ul>
-<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
+<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p &#39;fuananaaa&#39;
</code></pre><h2 id="2019-10-11">2019-10-11</h2>
<ul>
<li>I ran the DSpace cleanup function on CGSpace and it found some errors:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
...
-Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
-Detail: Key (bitstream_id)=(171221) is still referenced from table &quot;bundle&quot;.
+Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
+Detail: Key (bitstream_id)=(171221) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution, as always, is (repeat as many times as needed):</li>
</ul>
<pre tabindex="0"><code># su - postgres
-$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
+$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);&#39;
UPDATE 1
</code></pre><h2 id="2019-10-12">2019-10-12</h2>
<ul>
@@ -229,12 +229,12 @@ International Centre for Tropical Agriculture,International Center for Tropical
International Maize and Wheat Improvement Center (CIMMYT),International Maize and Wheat Improvement Center
International Centre for Agricultural Research in the Dry Areas,International Center for Agricultural Research in the Dry Areas
International Maize and Wheat Improvement Centre,International Maize and Wheat Improvement Center
-&quot;Agricultural Information Resource Centre, Kenya.&quot;,&quot;Agricultural Information Resource Centre, Kenya&quot;
-&quot;Centre for Livestock and Agricultural Development, Cambodia&quot;,&quot;Centre for Livestock and Agriculture Development, Cambodia&quot;
+&#34;Agricultural Information Resource Centre, Kenya.&#34;,&#34;Agricultural Information Resource Centre, Kenya&#34;
+&#34;Centre for Livestock and Agricultural Development, Cambodia&#34;,&#34;Centre for Livestock and Agriculture Development, Cambodia&#34;
</code></pre><ul>
<li>Then I applied it with my <code>fix-metadata-values.py</code> script on CGSpace:</li>
</ul>
-<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
+<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f from -m 211 -t to
</code></pre><ul>
<li>I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready
<ul>
@@ -270,7 +270,7 @@ real 82m35.993s
</code></pre><ul>
<li>I looked in the database to find authors that had &ldquo;|&rdquo; in them:</li>
</ul>
-<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
+<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE &#39;%|%&#39;;
text_value | resource_id
----------------------------------+-------------
Anandajayasekeram, P.|Puskur, R. | 157
@@ -280,7 +280,7 @@ real 82m35.993s
</code></pre><ul>
<li>Then I found their handles and corrected them, for example:</li>
</ul>
-<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
+<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;157&#39; and handle.resource_type_id=2;
handle
-----------
10568/129
@@ -304,10 +304,10 @@ real 82m35.993s
</ul>
</li>
</ul>
-<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
+<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx512m&#39;
$ mkdir 2019-10-15-Bioversity
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
-$ sed -i '/&lt;dcvalue element=&quot;identifier&quot; qualifier=&quot;uri&quot;&gt;/d' 2019-10-15-Bioversity/*/dublin_core.xml
+$ sed -i &#39;/&lt;dcvalue element=&#34;identifier&#34; qualifier=&#34;uri&#34;&gt;/d&#39; 2019-10-15-Bioversity/*/dublin_core.xml
</code></pre><ul>
<li>It&rsquo;s really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
<li>Then I imported a test subset of them in my local test environment:</li>
@@ -317,7 +317,7 @@ $ sed -i '/&lt;dcvalue element=&quot;identifier&quot; qualifier=&quot;uri&quot;&
<li>I had forgotten (again) that the <code>dspace export</code> command doesn&rsquo;t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import&hellip;</li>
<li>On CGSpace I will increase the RAM of the command line Java process for good luck before import&hellip;</li>
</ul>
-<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
+<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
</code></pre><ul>
<li>After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them</li>