- Shit, so that's worth looking into...
## 2021-10-07
- I decided to upload the cleaned IWMI community metadata by moving the cleaned field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading it, then moving it back and uploading again (the rename-and-import steps are sketched after the code block below)
- I started by copying just a handful of fields from the iwmi.csv community export:
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
```
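
- The remaining rename-and-import steps for each 2,000-item chunk would look roughly like this (a sketch, not the exact commands I ran; the chunk filename and eperson email are placeholders, and `dspace metadata-import` is DSpace's standard CLI for batch metadata edits):

```console
# Move the cleaned values to a temporary field so DSpace imports them as additions
$ sed -i -e '1s/dcterms\.subject\[en_US\]/dcterms.subject[en_Fu]/' /tmp/chunk.csv
$ dspace metadata-import -f /tmp/chunk.csv -e user@example.com
# Rename the field back and import again
$ sed -i -e '1s/dcterms\.subject\[en_Fu\]/dcterms.subject[en_US]/' /tmp/chunk.csv
$ dspace metadata-import -f /tmp/chunk.csv -e user@example.com
```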
- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...
## 2021-10-08
- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several other `text_lang` variants to normalize too:
```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
 en_US     | 2603711
 en_Fu     |  115568
 en        |    8818
           |    5286
 fr        |       2
 vn        |       2
           |       0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```
- So all this effort was just to remove ~400 duplicate metadata values in the IWMI community... hmmm:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```
- I tried to export ILRI's community, but ran into the export bug (DS-4211)
- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
```
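
- Listing the ids that actually occur more than once should just be a matter of `sort | uniq -d` (a quick sketch):

```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | sort | uniq -d
```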
- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```
- I found a cool way to select only the items with corrections
  - First, extract a handful of fields from the CSV with csvcut
  - Second, clean the CSV with csv-metadata-quality
  - Third, rename the columns to something obvious in the cleaned CSV
  - Fourth, use csvjoin to merge the cleaned file with the original
```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
```
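
- As a sanity check, `csvcut -n` on the joined file should list both the original `[en_US]` columns and the renamed `[en_Fu]` copies (a sketch):

```console
$ csvcut -n /tmp/ilri-deduplicated-items-cleaned-joined.csv
```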
- Then I imported the file into OpenRefine and used a custom text facet with a GREL expression like this to identify the rows with changes:
```
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
```
- I starred these rows and blanked out the original field so DSpace would see it as a removal and add the new column (a hypothetical row is sketched below)
- After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
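
- For illustration, a row in the final upload CSV would have the original column blanked and the cleaned values under the renamed column, something like this (the id and values are hypothetical):

```
id,dcterms.subject[en_US],dcterms.subject[en_Fu]
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,,livestock||cattle
```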
<!-- vim: set sw=2 ts=2: -->