Add notes for 2021-10-08

This commit is contained in:
2021-10-08 17:15:17 +03:00
parent b55e6c9efe
commit 23d6a808fc
26 changed files with 192 additions and 31 deletions

View File

@ -168,4 +168,85 @@ $ xsv split -s 2000 /tmp /tmp/iwmi.csv
- Shit, so that's worth looking into...
## 2021-10-07
- I decided to upload the cleaned IWMI community by moving the cleaned metadata field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading them, then moving them back, and uploading again
- I started by copying just a handful of fields from the iwmi.csv community export:
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
```
- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...
## 2021-10-08
- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
en_Fu | 115568
en | 8818
| 5286
fr | 2
vn | 2
| 0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```
- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```
- I tried to export ILRI's community, but ran into the export bug (DS-4211)
- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
```
- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```
- I found a cool way to select only the items with corrections
- First, extract a handful of fields from the CSV with csvcut
- Second, clean the CSV with csv-metadata-quality
- Third, rename the columns to something obvious in the cleaned CSV
- Fourth, use csvjoin to merge the cleaned file with the original
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
```
- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:
```
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
```
- For these rows I starred them and then blanked out the original field so DSpace would see it as a removal, and add the new column
- After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
<!-- vim: set sw=2 ts=2: -->