diff --git a/content/posts/2022-11.md b/content/posts/2022-11.md
index 541cf0b81..b33395df1 100644
--- a/content/posts/2022-11.md
+++ b/content/posts/2022-11.md
@@ -462,5 +462,45 @@ Western Europe
- Peter wrote to ask me to change the PIM CRP's full name from `Policies, Institutions and Markets` to `Policies, Institutions, and Markets`
- It's apparently the only CRP with an Oxford comma...?
- I updated them all on CGSpace
+- Also, I ran `index-discovery` without `-b`, since my metadata update scripts now update the `last_modified` timestamp as well; it finished in fifteen minutes, and I see the changes in the Discovery search and facets
+
+## 2022-11-29
+
+- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about `dcterms.type` for CG Core
+ - We discussed some of the feedback from Peter
+- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
+ - I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
+ - These are UN M.49 regions
+
+## 2022-11-30
+
+- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
+ - It fixed some whitespace issues and added missing regions to about 1,200 items
+- I thought of a way to delete duplicate metadata values, since the CSV upload method can't detect them correctly
+ - First, I wrote a [SQL query](https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/) to identify metadata values with the same `text_value`, `metadata_field_id`, and `dspace_object_id`:
+
+```console
+\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
+ FROM metadatavalue a
+ JOIN (
+ SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
+ ) b
+ ON a.dspace_object_id = b.dspace_object_id
+ AND a.text_value = b.text_value
+ AND a.metadata_field_id = b.metadata_field_id
+ ORDER BY a.text_value) TO /tmp/duplicates.txt
+```
+
+- (This query excludes metadata for accession and available dates, provenance, format, etc)
+- Then, I sorted the file by fields four and one (`dspace_object_id` and `text_value`) so that the duplicate metadata for each item were next to each other, used awk to print the second field (`metadata_value_id`) from every _other_ line, and created a SQL script to delete the metadata
+
+```console
+$ sort -t$'\t' -k4,4 -k1,1 /tmp/duplicates.txt | \
+ awk -F'\t' 'NR%2==0 {print $2}' | \
+ sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
+```
+
+- This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
+ - I just ran it again two more times to find the last duplicates, and now we have none!
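
The repeated runs for tripled or quadrupled values could, as a rough sketch, be folded into a single pass by keeping only the first row of each duplicate group instead of skipping every other line. The sample rows, IDs, and item identifiers below are hypothetical stand-ins for `/tmp/duplicates.txt`, and the extra sort keys on fields three and two are an assumption of this sketch, not part of the commands that were actually run:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical tab-separated sample in the same column order as the export:
# text_value, metadata_value_id, metadata_field_id, dspace_object_id
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    'Kenya'    101 228 'aaaa' \
    'Kenya'    102 228 'aaaa' \
    'Kenya'    103 228 'aaaa' \
    'Kenya'    201 228 'bbbb' \
    'Kenya'    202 228 'bbbb' \
    'Ethiopia' 301 228 'aaaa' \
    'Ethiopia' 302 228 'aaaa' > "$workdir/duplicates.txt"

# Sort by item, field, and value so each duplicate group is contiguous,
# then by metadata_value_id so the lowest ID in a group comes first.
# awk then emits a DELETE for every row after the first in its group,
# which also covers values that appear three or more times.
sort -t "$(printf '\t')" -k4,4 -k3,3n -k1,1 -k2,2n "$workdir/duplicates.txt" |
    awk -F'\t' '{
        key = $4 FS $3 FS $1
        if (key == prev) {
            print "DELETE FROM metadatavalue WHERE metadata_value_id=" $2 ";"
        }
        prev = key
    }' > "$workdir/delete-duplicates.sql"

cat "$workdir/delete-duplicates.sql"
```

With the sample rows, the three `Kenya` values on item `aaaa` produce two DELETE statements and every other group produces one, so each item keeps exactly one copy of each value. Reviewing the generated SQL before running it against a real database would still be prudent.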
diff --git a/docs/2022-11/index.html b/docs/2022-11/index.html
index d024d8be3..760762571 100644
--- a/docs/2022-11/index.html
+++ b/docs/2022-11/index.html
@@ -24,7 +24,7 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
-
+
@@ -54,9 +54,9 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
"@type": "BlogPosting",
"headline": "November, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-11/",
- "wordCount": "2853",
+ "wordCount": "3206",
"datePublished": "2022-11-01T09:11:36+03:00",
- "dateModified": "2022-11-28T17:42:46+03:00",
+ "dateModified": "2022-11-28T23:19:19+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -656,6 +656,57 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
I updated them all on CGSpace
+Also, I ran index-discovery without -b, since my metadata update scripts now update the last_modified timestamp as well; it finished in fifteen minutes, and I see the changes in the Discovery search and facets
+
+2022-11-29
+
+- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about dcterms.type for CG Core
+
+- We discussed some of the feedback from Peter
+
+
+- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
+
+- I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
+- These are UN M.49 regions
+
+
+
+2022-11-30
+
+- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
+
+- It fixed some whitespace issues and added missing regions to about 1,200 items
+
+
+- I thought of a way to delete duplicate metadata values, since the CSV upload method can’t detect them correctly
+
+- First, I wrote a SQL query to identify metadata values with the same text_value, metadata_field_id, and dspace_object_id:
+
+
+
+\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
+ FROM metadatavalue a
+ JOIN (
+ SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
+ ) b
+ ON a.dspace_object_id = b.dspace_object_id
+ AND a.text_value = b.text_value
+ AND a.metadata_field_id = b.metadata_field_id
+ ORDER BY a.text_value) TO /tmp/duplicates.txt
+
+- (This query excludes metadata for accession and available dates, provenance, format, etc)
+- Then, I sorted the file by fields four and one (dspace_object_id and text_value) so that the duplicate metadata for each item were next to each other, used awk to print the second field (metadata_value_id) from every other line, and created a SQL script to delete the metadata
+
+$ sort -t$'\t' -k4,4 -k1,1 /tmp/duplicates.txt | \
+ awk -F'\t' 'NR%2==0 {print $2}' | \
+ sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
+
+- This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
+
+- I just ran it again two more times to find the last duplicates, and now we have none!
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 6ccbdbe60..22e797200 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index bce535b37..f8e5cee28 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 686e4f526..49cdec66b 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index a058eba8c..9be03b9fc 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 32072315a..87a6d56d9 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 107b1f9bc..36b0fbf71 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index ba6b81ad6..62f9b51e7 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index 1f80fb5fb..9a0975dc3 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 873b23efa..b32357b79 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index c4066911e..33d09a9ac 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 7c3101362..7cbafaa0c 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index e78bc3e64..a2e23b1ca 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 87b8208e7..46cde0f30 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 3cd62ec33..bd430664f 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 0e3577aca..73bea5c9d 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index a9b44e027..acd0ef2aa 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index a252694d1..00d2e4900 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index e5ffdabc6..5b6fbbd93 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index f227e00fc..d1cac4a9b 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index b4f1b1552..006837b81 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index e3242129f..7c41f8367 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 24285b4ea..8cf54bcdb 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 4b7c5bf11..100a02916 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 899a3f633..da0cdd183 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index e40481f12..927f67323 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index 10bc43094..d0a5a14be 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index c2bc88703..0cafaa05e 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/categories/
- 2022-11-28T17:42:46+03:00
+ 2022-11-28T23:19:19+03:00
https://alanorth.github.io/cgspace-notes/
- 2022-11-28T17:42:46+03:00
+ 2022-11-28T23:19:19+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2022-11-28T17:42:46+03:00
+ 2022-11-28T23:19:19+03:00
https://alanorth.github.io/cgspace-notes/2022-11/
- 2022-11-28T17:42:46+03:00
+ 2022-11-28T23:19:19+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2022-11-28T17:42:46+03:00
+ 2022-11-28T23:19:19+03:00
https://alanorth.github.io/cgspace-notes/2022-10/
2022-10-31T16:59:47+03:00