Add notes for 2021-10-08

This commit is contained in:
Alan Orth 2021-10-08 17:15:17 +03:00
parent b55e6c9efe
commit 23d6a808fc
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
26 changed files with 192 additions and 31 deletions

@ -168,4 +168,85 @@ $ xsv split -s 2000 /tmp /tmp/iwmi.csv
- Shit, so that's worth looking into...
## 2021-10-07
- I decided to upload the cleaned IWMI community metadata by temporarily moving the cleaned values from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]`, uploading them, then moving them back and uploading again
- I started by copying just a handful of fields from the iwmi.csv community export:
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
```
- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...
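- A rough Python equivalent of the `xsv split -s 2000` batching step, as a minimal sketch only: it assumes the whole CSV fits in memory, and the `split_rows` and `split_csv_text` helper names are hypothetical (they are not part of xsv or csvkit):

```python
import csv
import io


def split_rows(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch
        yield batch


def split_csv_text(text, size=2000):
    """Split CSV text into chunks of `size` data rows, repeating the header in each chunk."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    chunks = []
    for batch in split_rows(reader, size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(batch)
        chunks.append(out.getvalue())
    return chunks
```

- Each chunk carries its own header row, so it can be uploaded as a standalone batch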
## 2021-10-08
- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
 en_US     | 2603711
 en_Fu     |  115568
 en        |    8818
           |    5286
 fr        |       2
 vn        |       2
           |       0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```
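- The same normalization pattern can be tried safely outside of production. This is a sketch under loud assumptions: it uses an in-memory SQLite table as a stand-in for PostgreSQL, the table has only a hypothetical subset of the real `metadatavalue` columns, and `normalize_text_lang` is an illustrative helper name:

```python
import sqlite3

# In-memory stand-in for the real database (assumption: SQLite instead of PostgreSQL)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (id INTEGER PRIMARY KEY, text_lang TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue (text_lang) VALUES (?)",
    [("en_US",), ("en_Fu",), ("en",), ("",), ("fr",), (None,)],
)


def normalize_text_lang(conn, old_values, new_value="en_US"):
    """Rewrite every text_lang in `old_values` to `new_value` inside one transaction."""
    placeholders = ",".join("?" for _ in old_values)
    with conn:  # commits on success, rolls back if the UPDATE raises
        cur = conn.execute(
            f"UPDATE metadatavalue SET text_lang = ? WHERE text_lang IN ({placeholders})",
            [new_value, *old_values],
        )
    return cur.rowcount


updated = normalize_text_lang(conn, ["en_Fu", "en", ""])
counts = dict(
    conn.execute("SELECT text_lang, count(*) FROM metadatavalue GROUP BY text_lang")
)
```

- Note that `IN (...)` never matches `NULL`, so rows with a `NULL` `text_lang` are left untouched by this kind of update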
- So all this effort was to remove ~400 duplicate metadata values in the IWMI community, hmmm:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```
- I tried to export ILRI's community, but ran into the export bug (DS-4211)
- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | sort -u | wc -l
19315
```
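- The two counts above (all rows versus unique ids) can also be computed in one pass. A minimal sketch using only the standard library; the `id_counts` helper name is hypothetical and it assumes the CSV has an `id` column as in the export:

```python
import csv
import io
from collections import Counter


def id_counts(text, column="id"):
    """Return (total rows, unique ids, {id: occurrences} for ids seen more than once)."""
    reader = csv.DictReader(io.StringIO(text))
    counts = Counter(row[column] for row in reader)
    repeated = {item_id: n for item_id, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), repeated
```

- Unlike the `wc -l` pipeline, this parses the CSV properly, so the header is never counted and the repeated ids are listed by name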
- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```
- I found a cool way to select only the items with corrections
- First, extract a handful of fields from the CSV with csvcut
- Second, clean the CSV with csv-metadata-quality
- Third, rename the columns to something obvious in the cleaned CSV
- Fourth, use csvjoin to merge the cleaned file with the original
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
```
- Then I imported the file into OpenRefine and used a custom text facet with a GREL expression like this to identify the rows with changes:
```
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
```
- I starred these rows and then blanked out the original field so that DSpace would see it as a removal and add the values from the new column
- After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
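- The join-and-compare idea above (csvjoin on `id`, then a same/different facet) can be sketched in plain Python. This is an illustration rather than the workflow actually used: `changed_rows` is a hypothetical helper, both files are assumed to fit in memory, and the `||` multi-value separator in the sample data is an assumption about the export format:

```python
import csv
import io


def changed_rows(original_text, cleaned_text, field, key="id"):
    """Return (id, original value, cleaned value) for rows whose `field` differs."""
    cleaned = {
        row[key]: row[field] for row in csv.DictReader(io.StringIO(cleaned_text))
    }
    changed = []
    for row in csv.DictReader(io.StringIO(original_text)):
        item_id = row[key]
        # Only compare ids present in both files, like an inner join on `key`
        if item_id in cleaned and row[field] != cleaned[item_id]:
            changed.append((item_id, row[field], cleaned[item_id]))
    return changed


# Hypothetical sample data with one duplicated subject value
original = "id,dcterms.subject[en_US]\n1,WATER||WATER||SOIL\n2,MAIZE\n"
cleaned = "id,dcterms.subject[en_US]\n1,WATER||SOIL\n2,MAIZE\n"
corrections = changed_rows(original, cleaned, "dcterms.subject[en_US]")
```

- Only the rows returned here would need to be uploaded, which is the point of the facet in OpenRefine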
<!-- vim: set sw=2 ts=2: -->

@ -25,7 +25,7 @@ So we have 1879/7100 (26.46%) matching already
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-10/" />
<meta property="article:published_time" content="2021-10-01T11:14:07+03:00" />
<meta property="article:modified_time" content="2021-10-05T18:54:39+03:00" />
<meta property="article:modified_time" content="2021-10-07T08:27:39+03:00" />
@ -56,9 +56,9 @@ So we have 1879/7100 (26.46%) matching already
"@type": "BlogPosting",
"headline": "October, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-10/",
"wordCount": "1307",
"wordCount": "1754",
"datePublished": "2021-10-01T11:14:07+03:00",
"dateModified": "2021-10-05T18:54:39+03:00",
"dateModified": "2021-10-07T08:27:39+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -311,6 +311,86 @@ Try doing it in two imports. In first import, remove all authors. In second impo
<ul>
<li>Shit, so that&rsquo;s worth looking into&hellip;</li>
</ul>
<h2 id="2021-10-07">2021-10-07</h2>
<ul>
<li>I decided to upload the cleaned IWMI community metadata by temporarily moving the cleaned values from <code>dcterms.subject[en_US]</code> to <code>dcterms.subject[en_Fu]</code>, uploading them, then moving them back and uploading again
<ul>
<li>I started by copying just a handful of fields from the iwmi.csv community export:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv &gt; /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
</code></pre><ul>
<li>It takes a few hours per 2,000 items because DSpace processes them so slowly&hellip; sigh&hellip;</li>
</ul>
<h2 id="2021-10-08">2021-10-08</h2>
<ul>
<li>I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
 en_US     | 2603711
 en_Fu     |  115568
 en        |    8818
           |    5286
 fr        |       2
 vn        |       2
           |       0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
</code></pre><ul>
<li>So all this effort was to remove ~400 duplicate metadata values in the IWMI community, hmmm:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c 'Removing duplicate value' /tmp/out.log
391
</code></pre><ul>
<li>I tried to export ILRI&rsquo;s community, but ran into the export bug (DS-4211)
<ul>
<li>After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | sort -u | wc -l
19315
</code></pre><ul>
<li>It seems there are only about 200 duplicate values in this subset of fields in ILRI&rsquo;s community:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c 'Removing duplicate value' /tmp/out.log
220
</code></pre><ul>
<li>I found a cool way to select only the items with corrections
<ul>
<li>First, extract a handful of fields from the CSV with csvcut</li>
<li>Second, clean the CSV with csv-metadata-quality</li>
<li>Third, rename the columns to something obvious in the cleaned CSV</li>
<li>Fourth, use csvjoin to merge the cleaned file with the original</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq &gt; /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv &gt; /tmp/ilri-deduplicated-items-cleaned-joined.csv
</code></pre><ul>
<li>Then I imported the file into OpenRefine and used a custom text facet with a GREL expression like this to identify the rows with changes:</li>
</ul>
<pre tabindex="0"><code>if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,&quot;same&quot;,&quot;different&quot;)
</code></pre><ul>
<li>I starred these rows and then blanked out the original field so that DSpace would see it as a removal and add the values from the new column
<ul>
<li>After these are uploaded I will normalize the <code>text_lang</code> fields in PostgreSQL again</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-10-05T18:54:39+03:00" />
<meta property="og:updated_time" content="2021-10-07T08:27:39+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-10-05T18:54:39+03:00" />
<meta property="og:updated_time" content="2021-10-07T08:27:39+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-10-05T18:54:39+03:00" />
<meta property="og:updated_time" content="2021-10-07T08:27:39+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-10-05T18:54:39+03:00" />
<meta property="og:updated_time" content="2021-10-07T08:27:39+03:00" />

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2021-10-05T18:54:39+03:00</lastmod>
<lastmod>2021-10-07T08:27:39+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2021-10-05T18:54:39+03:00</lastmod>
<lastmod>2021-10-07T08:27:39+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2021-10-05T18:54:39+03:00</lastmod>
<lastmod>2021-10-07T08:27:39+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-10/</loc>
<lastmod>2021-10-05T18:54:39+03:00</lastmod>
<lastmod>2021-10-07T08:27:39+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2021-10-05T18:54:39+03:00</lastmod>
<lastmod>2021-10-07T08:27:39+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-09/</loc>
<lastmod>2021-10-04T11:10:54+03:00</lastmod>