diff --git a/content/posts/2021-10.md b/content/posts/2021-10.md
index 7773fa6c4..a9890993f 100644
--- a/content/posts/2021-10.md
+++ b/content/posts/2021-10.md
@@ -120,4 +120,52 @@ $ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
Total number of bot hits purged: 465119
```
+## 2021-10-06
+
+- Thinking about how we could check for duplicates before importing
+ - I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):
+
+```console
+localhost/dspace63= > CREATE EXTENSION pg_trgm;
+localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
+ metadata_value_id │ text_value │ dspace_object_id
+───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+ 3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
+ 3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
+(2 rows)
+```
+
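For intuition about what the `SIMILARITY()` score in the query above measures, here is a rough Python sketch of trigram similarity. It is an approximation of how `pg_trgm` is documented to behave (lowercase the text, split it into alphanumeric words, pad each word, then compare the sets of three-character substrings), not the extension's actual code. The function names `trigrams` and `similarity` are made up for this illustration.

```python
import re


def trigrams(text):
    """Build a set of trigrams from text, approximately the way pg_trgm does."""
    grams = set()
    # Lowercase, then treat every run of non-alphanumeric characters as a word break
    for word in re.findall(r"[^\W_]+", text.lower()):
        # Assumption: each word is padded with two spaces in front and one behind
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # 4 of the 11 distinct trigrams are shared, so the score is about 0.36
    print(similarity("word", "two words"))
```

Identical titles score 1.0 under this definition, and a threshold such as 0.5 keeps only pairs that share a large fraction of their trigrams.
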
+- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
+- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
+ - I think I will check for similar titles, and if I find them I will print out the handles for verification
+ - I could also proceed to check other metadata like type because those shouldn't vary too much
+- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
+ - Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
+ - This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives
+ - I re-ran it with higher thresholds, which eliminated all false positives, but it still took 24 minutes to run for 292 items!
+ - 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total
+ - 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total
+ - 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total
+- Some minor updates to csv-metadata-quality
+ - Fix two issues with regular expressions in the duplicate items and experimental language checks
+ - Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
+- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:
+
+```console
+$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
+$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
+$ xsv split -s 2000 /tmp /tmp/iwmi.csv
+```
+
+- I noticed each CSV only had 10 or 20 corrections, and in particular none of the duplicate metadata value removals were among them...
+ - I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
+ - The duplicates are definitely removed from the CSV, but DSpace doesn't detect them
+ - I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
+ - I found a comment in a thread on the dspace-tech mailing list from helix84 in 2015 ("No changes were detected" when importing metadata via XMLUI) where he says:
+
+> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
+> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
+
+- Shit, so that's worth looking into...
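
To illustrate the hypothesis in the quote above, here is a small Python sketch of how an ordered comparison and an unordered comparison of the same multi-valued field can disagree. This is not DSpace's actual comparison code. The field values and the helper names `changed_as_list` and `changed_as_set` are hypothetical, and whether DSpace really compares values this way is exactly what still needs checking.

```python
# Hypothetical values of one multi-valued field before and after cleanup
before = ["Author A", "Author A", "Author B"]
deduplicated = ["Author A", "Author B"]  # duplicate value removed
reordered = ["Author B", "Author A", "Author A"]  # same values, new order


def changed_as_list(old, new):
    """Ordered comparison: position and repetition both matter."""
    return old != new


def changed_as_set(old, new):
    """Unordered comparison: only the distinct values matter."""
    return set(old) != set(new)


if __name__ == "__main__":
    for label, new in (("deduplicated", deduplicated), ("reordered", reordered)):
        print(label, changed_as_list(before, new), changed_as_set(before, new))
```

Under the set comparison, neither removing a duplicate value nor reordering the values counts as a change, which would be consistent with a "no changes detected" message. The two-import workaround from the quote sidesteps this, because removing all values and then adding them back differs under either comparison.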
+
diff --git a/docs/2021-10/index.html b/docs/2021-10/index.html
index 259c15b42..b8122cd1f 100644
--- a/docs/2021-10/index.html
+++ b/docs/2021-10/index.html
@@ -25,7 +25,7 @@ So we have 1879/7100 (26.46%) matching already
-
+
@@ -56,9 +56,9 @@ So we have 1879/7100 (26.46%) matching already
"@type": "BlogPosting",
"headline": "October, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-10/",
- "wordCount": "771",
+ "wordCount": "1307",
"datePublished": "2021-10-01T11:14:07+03:00",
- "dateModified": "2021-10-04T19:40:13+03:00",
+ "dateModified": "2021-10-05T18:54:39+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -247,7 +247,71 @@ $ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 &
...
Total number of bot hits purged: 465119
-
+
2021-10-06
+
+- Thinking about how we could check for duplicates before importing
+
+
+
+localhost/dspace63= > CREATE EXTENSION pg_trgm;
+localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
+ metadata_value_id │ text_value │ dspace_object_id
+───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+ 3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
+ 3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
+(2 rows)
+
+- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
+- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
+
+- I think I will check for similar titles, and if I find them I will print out the handles for verification
+- I could also proceed to check other metadata like type because those shouldn’t vary too much
+
+
+- I ran my new check-duplicates.py script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
+
+- Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
+- This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives
+- I re-ran it with higher thresholds, which eliminated all false positives, but it still took 24 minutes to run for 292 items!
+
+- 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total
+- 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total
+- 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total
+
+
+
+
+- Some minor updates to csv-metadata-quality
+
+- Fix two issues with regular expressions in the duplicate items and experimental language checks
+- Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
+
+
+- Then I ran this new version of csv-metadata-quality on an export of IWMI’s community, minus some fields I don’t want to check:
+
+$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
+$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
+$ xsv split -s 2000 /tmp /tmp/iwmi.csv
+
+- I noticed each CSV only had 10 or 20 corrections, and in particular none of the duplicate metadata value removals were among them…
+
+- I cut a subset of the fields from the main CSV and tried again, but DSpace said “no changes detected”
+- The duplicates are definitely removed from the CSV, but DSpace doesn’t detect them
+- I realized this is an issue I’ve had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
+- I found a comment in a thread on the dspace-tech mailing list from helix84 in 2015 (“No changes were detected” when importing metadata via XMLUI) where he says:
+
+
+
+
+It’s very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
+Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
+
+
+- Shit, so that’s worth looking into…
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index b9ffd53f1..0a5afd7b1 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index e1f8309e6..bb5e862de 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 7f414a96d..84a0bdebb 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 19e82d031..19fe4f689 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 8d7b784f9..08955786b 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 856fd2e25..fe78c451e 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index 891cb0853..c0fc6766d 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index d6eecce46..0beaba364 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 5bad837e1..ec4ba8b3b 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 053cc7900..4d9597bf8 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 8f3c08e83..249fb3dec 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index c8aa8f95f..566099b0e 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index a4e5d204f..b48e8ad95 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 66288e6c3..ad25196b7 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index ead11d5f5..196f57098 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 90bf70758..9957c861e 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index ea2b27f1b..aef10e644 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 22e21b841..d6f99805b 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index a2d377914..325bf4321 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 72732ad07..3d029af2d 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index dc1f3d3bc..f905a6c68 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 364136673..b73e9c2f2 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 9d7454065..60e03915d 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 07b567c3f..a8fe466de 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/categories/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
https://alanorth.github.io/cgspace-notes/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
https://alanorth.github.io/cgspace-notes/2021-10/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2021-10-04T19:40:13+03:00
+ 2021-10-05T18:54:39+03:00
https://alanorth.github.io/cgspace-notes/2021-09/
2021-10-04T11:10:54+03:00