mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-03-04
@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unneccesary Unicode” fix: $ csvcut -c 'id,dc."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -113,7 +113,7 @@
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
<pre tabindex="0"><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
</code></pre><ul>
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can’t figure out the correct sed syntax to do it directly from the pipe above</li>
<li>I uploaded those to CGSpace and then re-exported the metadata</li>
@ -121,7 +121,7 @@
<li>I modified the script so it replaces the non-breaking spaces instead of removing them</li>
<li>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):</li>
</ul>
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
</code></pre><ul>
<li>That fixed 153 items (unnecessary Unicode, duplicates, comma–space fixes, etc)</li>
<li>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</li>
@ -134,7 +134,7 @@
<ul>
<li>Create an account for Bioversity’s ICT consultant Francesco on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
</code></pre><ul>
<li>Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21
<ul>
@ -193,20 +193,20 @@
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
</code></pre><h2 id="2019-10-11">2019-10-11</h2>
<ul>
<li>I ran the DSpace cleanup function on CGSpace and it found some errors:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle".
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle".
</code></pre><ul>
<li>The solution, as always, is (repeat as many times as needed):</li>
</ul>
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
UPDATE 1
</code></pre><h2 id="2019-10-12">2019-10-12</h2>
<ul>
@ -229,12 +229,12 @@ International Centre for Tropical Agriculture,International Center for Tropical
International Maize and Wheat Improvement Center (CIMMYT),International Maize and Wheat Improvement Center
International Centre for Agricultural Research in the Dry Areas,International Center for Agricultural Research in the Dry Areas
International Maize and Wheat Improvement Centre,International Maize and Wheat Improvement Center
"Agricultural Information Resource Centre, Kenya.","Agricultural Information Resource Centre, Kenya"
"Centre for Livestock and Agricultural Development, Cambodia","Centre for Livestock and Agriculture Development, Cambodia"
"Agricultural Information Resource Centre, Kenya.","Agricultural Information Resource Centre, Kenya"
"Centre for Livestock and Agricultural Development, Cambodia","Centre for Livestock and Agriculture Development, Cambodia"
</code></pre><ul>
<li>Then I applied it with my <code>fix-metadata-values.py</code> script on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
</code></pre><ul>
<li>I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready
<ul>
@ -270,7 +270,7 @@ real 82m35.993s
</code></pre><ul>
<li>I looked in the database to find authors that had “|” in them:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
text_value | resource_id
----------------------------------+-------------
Anandajayasekeram, P.|Puskur, R. | 157
@ -280,7 +280,7 @@ real 82m35.993s
</code></pre><ul>
<li>Then I found their handles and corrected them, for example:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
handle
-----------
10568/129
@ -304,10 +304,10 @@ real 82m35.993s
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
$ mkdir 2019-10-15-Bioversity
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
$ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
$ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
</code></pre><ul>
<li>It’s really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
<li>Then I imported a test subset of them in my local test environment:</li>
@ -317,7 +317,7 @@ $ sed -i '/<dcvalue element="identifier" qualifier="uri"&
<li>I had forgotten (again) that the <code>dspace export</code> command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…</li>
<li>On CGSpace I will increase the RAM of the command line Java process for good luck before import…</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
</code></pre><ul>
<li>After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them</li>