From 23d6a808fc7334c64b207005a77739940acb9a88 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Fri, 8 Oct 2021 17:15:17 +0300
Subject: [PATCH] Add notes for 2021-10-08

---
 content/posts/2021-10.md                | 81 +++++++++++++++++++++++
 docs/2021-10/index.html                 | 86 ++++++++++++++++++++++++-
 docs/categories/index.html              |  2 +-
 docs/categories/notes/index.html        |  2 +-
 docs/categories/notes/page/2/index.html |  2 +-
 docs/categories/notes/page/3/index.html |  2 +-
 docs/categories/notes/page/4/index.html |  2 +-
 docs/categories/notes/page/5/index.html |  2 +-
 docs/categories/notes/page/6/index.html |  2 +-
 docs/index.html                         |  2 +-
 docs/page/2/index.html                  |  2 +-
 docs/page/3/index.html                  |  2 +-
 docs/page/4/index.html                  |  2 +-
 docs/page/5/index.html                  |  2 +-
 docs/page/6/index.html                  |  2 +-
 docs/page/7/index.html                  |  2 +-
 docs/page/8/index.html                  |  2 +-
 docs/posts/index.html                   |  2 +-
 docs/posts/page/2/index.html            |  2 +-
 docs/posts/page/3/index.html            |  2 +-
 docs/posts/page/4/index.html            |  2 +-
 docs/posts/page/5/index.html            |  2 +-
 docs/posts/page/6/index.html            |  2 +-
 docs/posts/page/7/index.html            |  2 +-
 docs/posts/page/8/index.html            |  2 +-
 docs/sitemap.xml                        | 10 +--
 26 files changed, 192 insertions(+), 31 deletions(-)

diff --git a/content/posts/2021-10.md b/content/posts/2021-10.md
index a9890993f..0097172b4 100644
--- a/content/posts/2021-10.md
+++ b/content/posts/2021-10.md
@@ -168,4 +168,85 @@ $ xsv split -s 2000 /tmp /tmp/iwmi.csv
 - Shit, so that's worth looking into...
+## 2021-10-07
+
+- I decided to upload the cleaned IWMI community by moving the cleaned metadata field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading them, then moving them back, and uploading again
+  - I started by copying just a handful of fields from the iwmi.csv community export:
+
+```console
+$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
+# Copy and blank columns in OpenRefine
+$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
+$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
+```
+- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...
+
+## 2021-10-08
+
+- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
+
+```console
+cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
+ text_lang |  count  
+-----------+---------
+ en_US     | 2603711
+ en_Fu     |  115568
+ en        |    8818
+           |    5286
+ fr        |       2
+ vn        |       2
+           |       0
+(7 rows)
+cgspace=# BEGIN;
+cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
+UPDATE 129673
+cgspace=# COMMIT;
+```
+
+- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:
+
+```console
+$ grep -c 'Removing duplicate value' /tmp/out.log
+391
+```
+
+- I tried to export ILRI's community, but ran into the export bug (DS-4211)
+  - After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
+
+```console
+$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
+32070
+$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
+19315
+```
+
+- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
+
+```console
+$ grep -c 'Removing duplicate value' /tmp/out.log
+220
+```
+
+- I found a cool way to select only the items with corrections
+  - First, extract a handful of fields from the CSV with csvcut
+  - Second, clean the CSV with csv-metadata-quality
+  - Third, rename the columns to something obvious in the cleaned CSV
+  - Fourth, use csvjoin to merge the cleaned file with the original
+
+```console
+$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
+$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
+$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
+$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
+```
+
+- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:
+
+```
+if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
+```
+
+- For these rows I starred them and then blanked out the original field so DSpace would see it as a removal, and add the new column
+  - After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
+
diff --git a/docs/2021-10/index.html b/docs/2021-10/index.html
index b8122cd1f..fb6d7ded5 100644
--- a/docs/2021-10/index.html
+++ b/docs/2021-10/index.html
@@ -25,7 +25,7 @@ So we have 1879/7100 (26.46%) matching already
-
+
@@ -56,9 +56,9 @@ So we have 1879/7100 (26.46%) matching already
 "@type": "BlogPosting",
 "headline": "October, 2021",
 "url": "https://alanorth.github.io/cgspace-notes/2021-10/",
- "wordCount": "1307",
+ "wordCount": "1754",
 "datePublished": "2021-10-01T11:14:07+03:00",
- "dateModified": "2021-10-05T18:54:39+03:00",
+ "dateModified": "2021-10-07T08:27:39+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -311,6 +311,86 @@ Try doing it in two imports. In first import, remove all authors. In second impo
+

2021-10-07

+ +
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
+# Copy and blank columns in OpenRefine
+$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
+$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
+
+

2021-10-08

+ +
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
+ text_lang |  count  
+-----------+---------
+ en_US     | 2603711
+ en_Fu     |  115568
+ en        |    8818
+           |    5286
+ fr        |       2
+ vn        |       2
+           |       0
+(7 rows)
+cgspace=# BEGIN;
+cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
+UPDATE 129673
+cgspace=# COMMIT;
+
+
$ grep -c 'Removing duplicate value' /tmp/out.log
+391
+
+
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l 
+32070
+$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
+19315
+
+
$ grep -c 'Removing duplicate value' /tmp/out.log
+220
+
+
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
+$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
+$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
+$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
+
+
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 0a5afd7b1..89ba2d206 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index bb5e862de..1dbc47de6 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 84a0bdebb..67b6e16c2 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 19fe4f689..9796727f6 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 08955786b..8dd297991 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index fe78c451e..fb5c4666c 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index c0fc6766d..2c36b6ecf 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 0beaba364..e85b0987d 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index ec4ba8b3b..1d5392725 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 4d9597bf8..04506761a 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 249fb3dec..8d6be6044 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 566099b0e..78288e46e 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index b48e8ad95..0f4f01b87 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index ad25196b7..6675623e8 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 196f57098..cbb211dc3 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 9957c861e..2a81e7cb5 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index aef10e644..347d35c00 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index d6f99805b..312ed9ea6 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 325bf4321..587752e36 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 3d029af2d..b4f2f2d19 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index f905a6c68..b26714806 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index b73e9c2f2..c247c6ec4 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 60e03915d..2f884d788 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index a8fe466de..b30504859 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
 xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
- 2021-10-05T18:54:39+03:00
+ 2021-10-07T08:27:39+03:00
 https://alanorth.github.io/cgspace-notes/
- 2021-10-05T18:54:39+03:00
+ 2021-10-07T08:27:39+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2021-10-05T18:54:39+03:00
+ 2021-10-07T08:27:39+03:00
 https://alanorth.github.io/cgspace-notes/2021-10/
- 2021-10-05T18:54:39+03:00
+ 2021-10-07T08:27:39+03:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2021-10-05T18:54:39+03:00
+ 2021-10-07T08:27:39+03:00
 https://alanorth.github.io/cgspace-notes/2021-09/
 2021-10-04T11:10:54+03:00