Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
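A minimal sketch of what running the exported affiliations against a ROR dump could look like, assuming the dump is a JSON array of records with "name", "aliases", and "labels" keys (the ROR v1 schema as I remember it). The helper names and the sample records are hypothetical, and this only does exact case-insensitive matching:

```python
def load_ror_names(ror_records):
    """Collect lowercase names, aliases, and labels from ROR-style records."""
    names = set()
    for record in ror_records:
        names.add(record["name"].lower())
        for alias in record.get("aliases", []):
            names.add(alias.lower())
        # Assumption: each label is a dict with a "label" key
        for label in record.get("labels", []):
            names.add(label["label"].lower())
    return names


def match_affiliations(affiliations, ror_names):
    """Map each non-empty affiliation to True if it matches a ROR name."""
    results = {}
    for affiliation in affiliations:
        cleaned = affiliation.strip()
        if cleaned:
            results[cleaned] = cleaned.lower() in ror_names
    return results


# Made-up records for illustration; a real run would json.load() the dump
# and read the affiliations from /tmp/2021-10-01-affiliations.txt
sample_records = [
    {
        "name": "Example Research Institute",
        "aliases": ["Example Institute"],
        "labels": [{"label": "Institut de recherche exemple", "iso639": "fr"}],
    }
]
sample_affiliations = ["Example Research Institute", "example institute", "Unknown Org", ""]
matches = match_affiliations(sample_affiliations, load_ror_names(sample_records))
```

Affiliations that map to False are the ones worth reviewing by hand, since they may be misspellings or organizations missing from ROR.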
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
<li>Start looking at the last month of Solr statistics on CGSpace
<ul>
<li>I see a number of IPs with “normal” user agents that clearly behave like bots
<ul>
<li>198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)</li>
<li>93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)</li>
<li>193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)</li>
<li>3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)</li>
<li>34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
<li>18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
<li>3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
<li>3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
</ul>
</li>
<li>Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:</li>
<li>Thinking about how we could check for duplicates before importing
<ul>
<li>I found out that <a href="https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/">PostgreSQL has a built-in similarity function</a>:</li>
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
</code></pre><ul>
<li>I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)</li>
<li>I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
<ul>
<li>I think I will check for similar titles, and if I find them I will print out the handles for verification</li>
<li>I could also proceed to check other metadata like type because those shouldn’t vary too much</li>
</ul>
</li>
<li>I ran my new <code>check-duplicates.py</code> script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
<ul>
<li>Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!</li>
<li>This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives</li>
<li>I re-ran it with higher thresholds, which eliminated all the false positives, but it still took 24 minutes to run for 292 items!
<ul>
<li>0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total</li>
<li>0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total</li>
<li>0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total</li>
</ul>
</li>
</ul>
</li>
<li>Some minor updates to csv-metadata-quality
<ul>
<li>Fix two issues with regular expressions in the duplicate items and experimental language checks</li>
<li>Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field</li>
</ul>
</li>
<li>Then I ran this new version of csv-metadata-quality on an export of IWMI’s community, minus some fields I don’t want to check:</li>
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
</code></pre><ul>
<li>I noticed each CSV only had 10 or 20 corrections; notably, none of the duplicate metadata values from the CSVs were being removed…
<ul>
<li>I cut a subset of the fields from the main CSV and tried again, but DSpace said “no changes detected”</li>
<li>The duplicates are definitely removed from the CSV, but DSpace doesn’t detect them</li>
<li>I realized this is an issue I’ve had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!</li>
<li>I found a comment in a thread on the dspace-tech mailing list from helix84 in 2015 (“‘No changes were detected’ when importing metadata via XMLUI”) where he says:</li>
</ul>
</li>
</ul>
<blockquote>
<p>It’s very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.</p>
</blockquote>
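<p>To illustrate helix84's hypothesis, here is a minimal sketch of how an unordered-set comparison would miss both a reordering and the removal of a duplicate value. I have not checked how DSpace actually compares values, so the function names and sample subjects are hypothetical:</p>

```python
def changed_as_set(old_values, new_values):
    """Compare values as unordered sets: order and duplicates are ignored."""
    return set(old_values) != set(new_values)


def changed_as_list(old_values, new_values):
    """Compare values as ordered lists: order and duplicates both matter."""
    return list(old_values) != list(new_values)


# A field with a duplicated value, and the cleaned version without it
before = ["WATER MANAGEMENT", "IRRIGATION", "WATER MANAGEMENT"]
after = ["WATER MANAGEMENT", "IRRIGATION"]

# The set comparison sees no difference, so an importer using it would
# report that no changes were detected
set_sees_change = changed_as_set(before, after)

# The list comparison notices that one value was dropped
list_sees_change = changed_as_list(before, after)
```

<p>If the comparison really works like the first function, removing a duplicate looks like "no change", which would match what I saw with the IWMI CSVs.</p>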
<ul>
<li>Shit, so that’s worth looking into…</li>
<li>I decided to upload the cleaned IWMI community by moving the cleaned metadata field from <code>dcterms.subject[en_US]</code> to <code>dcterms.subject[en_Fu]</code> temporarily, uploading them, then moving them back, and uploading again
<ul>
<li>I started by copying just a handful of fields from the iwmi.csv community export:</li>
<li>It takes a few hours per 2,000 items because DSpace processes them so slowly… sigh…</li>
</ul>
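<p>The language-swap workaround can be sketched as renaming the column header in the cleaned CSV for the first upload and renaming it back for the second. This is only an illustration, not the commands I actually ran: <code>rename_column</code> is a hypothetical helper, the sample row is made up, and it does not address whether the original column also needs to be present and empty in the first upload:</p>

```python
import csv
import io


def rename_column(csv_text, old_name, new_name):
    """Return csv_text with one header column renamed; data rows are unchanged."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    header = [new_name if column == old_name else column for column in rows[0]]
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return output.getvalue()


# Made-up cleaned export with a single item
cleaned = "id,dcterms.subject[en_US]\n1,WATER MANAGEMENT||IRRIGATION\n"

# Pass 1: upload the cleaned values under the temporary language code
pass_one = rename_column(cleaned, "dcterms.subject[en_US]", "dcterms.subject[en_Fu]")

# Pass 2: move the values back to the real language code and upload again
pass_two = rename_column(pass_one, "dcterms.subject[en_Fu]", "dcterms.subject[en_US]")
```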
<h2 id="2021-10-08">2021-10-08</h2>
<ul>
<li>I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
 en_US     | 2603711
 en_Fu     |  115568
 en        |    8818
           |    5286
 fr        |       2
 vn        |       2
           |       0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
</code></pre><ul>
<li>So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:</li>