- Export all affiliations on CGSpace and run them against the latest RoR data dump:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
- Start looking at the last month of Solr statistics on CGSpace
- I see a number of IPs with "normal" user agents who clearly behave like bots
- 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
- 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
- 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
- 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
- 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:
- Thinking about how we could check for duplicates before importing
- I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):
```console
localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
```
- I noticed each CSV only had 10 or 20 corrections, mostly that none of the duplicate metadata values were removed in the CSVs...
- I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
- The duplicates are definitely removed from the CSV, but DSpace doesn't detect them
- I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
- I found a comment on thread on the dspace-tech mailing list from helix84 in 2015 ("No changes were detected" when importing metadata via XMLUI") where he says:
> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
- I decided to upload the cleaned IWMI community by moving the cleaned metadata field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading them, then moving them back, and uploading again
- I started by copying just a handful of fields from the iwmi.csv community export:
- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...
## 2021-10-08
- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
en_Fu | 115568
en | 8818
| 5286
fr | 2
vn | 2
| 0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```
- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```
- I tried to export ILRI's community, but ran into the export bug (DS-4211)
- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
```
- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```
- I found a cool way to select only the items with corrections
- First, extract a handful of fields from the CSV with csvcut
- Second, clean the CSV with csv-metadata-quality
- Third, rename the columns to something obvious in the cleaned CSV
- Fourth, use csvjoin to merge the cleaned file with the original