diff --git a/content/posts/2021-10.md b/content/posts/2021-10.md
index 7773fa6c4..a9890993f 100644
--- a/content/posts/2021-10.md
+++ b/content/posts/2021-10.md
@@ -120,4 +120,52 @@ $ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
 Total number of bot hits purged: 465119
 ```

+## 2021-10-06
+
+- Thinking about how we could check for duplicates before importing
+  - I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):
+
+```console
+localhost/dspace63= > CREATE EXTENSION pg_trgm;
+localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
+ metadata_value_id │                                         text_value                                         │           dspace_object_id
+───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+           3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
+           3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
+(2 rows)
+```
+
+- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
+- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
+  - I think I will check for similar titles, and if I find them I will print out the handles for verification
+  - I could also proceed to check other metadata like type because those shouldn't vary too much
+- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
+  - Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
+  - This is with the similarity threshold at 0.5. I wonder if raising it will make the script run faster and eliminate some false positives
+  - I re-ran it with higher thresholds, which eliminated all false positives, but it still took 24 minutes to run for 292 items!
+    - 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total
+    - 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total
+    - 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total
+- Some minor updates to csv-metadata-quality
+  - Fix two issues with regular expressions in the duplicate items and experimental language checks
+  - Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
+- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:
+
+```console
+$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
+$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
+$ xsv split -s 2000 /tmp /tmp/iwmi.csv
+```
+
+- I noticed each CSV only had 10 or 20 corrections, mostly that none of the duplicate metadata values were removed in the CSVs...
+  - I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
+  - The duplicates are definitely removed from the CSV, but DSpace doesn't detect them
+  - I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
+  - I found a comment in a thread on the dspace-tech mailing list from helix84 in 2015 ("No changes were detected" when importing metadata via XMLUI) where he says:
+
+> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
+> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
+
+- Shit, so that's worth looking into...
+
diff --git a/docs/2021-10/index.html b/docs/2021-10/index.html
index 259c15b42..b8122cd1f 100644
--- a/docs/2021-10/index.html
+++ b/docs/2021-10/index.html
@@ -25,7 +25,7 @@ So we have 1879/7100 (26.46%) matching already
-
+
@@ -56,9 +56,9 @@ So we have 1879/7100 (26.46%) matching already
 "@type": "BlogPosting",
 "headline": "October, 2021",
 "url": "https://alanorth.github.io/cgspace-notes/2021-10/",
- "wordCount": "771",
+ "wordCount": "1307",
 "datePublished": "2021-10-01T11:14:07+03:00",
- "dateModified": "2021-10-04T19:40:13+03:00",
+ "dateModified": "2021-10-05T18:54:39+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -247,7 +247,71 @@ $ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 &
 ...
 Total number of bot hits purged: 465119
-
+

2021-10-06

+ +
localhost/dspace63= > CREATE EXTENSION pg_trgm;
+localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
+ metadata_value_id │                                         text_value                                         │           dspace_object_id
+───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+           3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
+           3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
+(2 rows)
+
+
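The `SIMILARITY()` call in the query above compares trigram sets. As a rough illustration of what that score means, here is a minimal pure-Python sketch. It assumes pg_trgm's documented behaviour (lowercase the text, ignore non-alphanumeric characters, pad each word with two leading spaces and one trailing space, then divide shared trigrams by total distinct trigrams). The function names `trigrams`, `similarity`, and `find_similar` are hypothetical and are not taken from `check-duplicates.py`, and the scores may not match PostgreSQL exactly in every locale.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of trigrams for text, approximating pg_trgm's rules."""
    grams = set()
    # Assumption: words are runs of alphanumeric characters, compared lowercase.
    for word in re.findall(r"[^\W_]+", text.lower()):
        # Assumption: each word is padded with two spaces before and one after.
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i : i + 3])
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_similar(title: str, candidates: list[str], threshold: float = 0.5) -> list[str]:
    """Return the candidates whose similarity to title exceeds threshold."""
    return [c for c in candidates if similarity(title, c) > threshold]
```

With this sketch, two titles that differ only in case or punctuation score 1.0, which is why a threshold of 0.5 also catches near-duplicates with small wording differences.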
$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
+$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
+$ xsv split -s 2000 /tmp /tmp/iwmi.csv
+
+
+

It’s very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list. +Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.

+
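The hypothesis in the quoted reply can be illustrated with a small sketch. This is not DSpace's actual comparison code; the helper names and the author values are made up for illustration. It only shows why removing a duplicate value, or reordering values, would look like "no changes" if the old and new values were compared as sets.

```python
def differs_as_set(old_values, new_values):
    """Compare as unordered sets: duplicates and ordering are invisible."""
    return set(old_values) != set(new_values)


def differs_as_list(old_values, new_values):
    """Compare as ordered lists: duplicates and ordering both count."""
    return list(old_values) != list(new_values)


def two_step_plan(new_values):
    """Sketch of the suggested workaround: first import clears the field,
    second import adds the values back in the desired form."""
    return [[], list(new_values)]


# Hypothetical field values for one item
old = ["Author A", "Author A", "Author B"]
deduplicated = ["Author A", "Author B"]

print(differs_as_set(old, deduplicated))   # False: the change goes undetected
print(differs_as_list(old, deduplicated))  # True: the change is detected
print(two_step_plan(deduplicated))         # [[], ['Author A', 'Author B']]
```

If the set-comparison hypothesis holds, the first step of the plan is always detected because an empty set differs from any non-empty one, and the second step is detected for the same reason.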
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index b9ffd53f1..0a5afd7b1 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index e1f8309e6..bb5e862de 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 7f414a96d..84a0bdebb 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 19e82d031..19fe4f689 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 8d7b784f9..08955786b 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 856fd2e25..fe78c451e 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index 891cb0853..c0fc6766d 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index d6eecce46..0beaba364 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 5bad837e1..ec4ba8b3b 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 053cc7900..4d9597bf8 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 8f3c08e83..249fb3dec 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index c8aa8f95f..566099b0e 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index a4e5d204f..b48e8ad95 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 66288e6c3..ad25196b7 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index ead11d5f5..196f57098 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 90bf70758..9957c861e 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index ea2b27f1b..aef10e644 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 22e21b841..d6f99805b 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index a2d377914..325bf4321 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 72732ad07..3d029af2d 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index dc1f3d3bc..f905a6c68 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 364136673..b73e9c2f2 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 9d7454065..60e03915d 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 07b567c3f..a8fe466de 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
 https://alanorth.github.io/cgspace-notes/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
 https://alanorth.github.io/cgspace-notes/2021-10/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
 https://alanorth.github.io/cgspace-notes/2021-09/
 2021-10-04T11:10:54+03:00