Add notes for 2022-11-30

- Peter wrote to ask me to change the PIM CRP's full name from `Policies, Institutions and Markets` to `Policies, Institutions, and Markets`
- It's apparently the only CRP with an Oxford comma...?
- I updated them all on CGSpace
- Also, I ran `index-discovery` without the `-b`, since my metadata update scripts now update the `last_modified` timestamp as well; it finished in fifteen minutes and I can see the changes in the Discovery search and facets
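- For reference, the difference between the two indexing modes (a sketch; the command location depends on the DSpace installation):
```console
$ dspace index-discovery     # incremental: only reindexes objects whose last_modified changed
$ dspace index-discovery -b  # full (re)build of the Discovery index, much slower
```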
## 2022-11-29
- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about `dcterms.type` for CG Core
- We discussed some of the feedback from Peter
- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
- I updated Pacific to Oceania and Central Africa to Middle Africa, and removed the old ones from the submission form
- These are UN M.49 regions
## 2022-11-30
- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
- It fixed some whitespace issues and added missing regions to about 1,200 items
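- For reference, a typical invocation looks something like this (paths are hypothetical, and some fixes may require extra flags depending on the csv-metadata-quality version):
```console
$ csv-metadata-quality -i /tmp/ilri.csv -o /tmp/ilri-cleaned.csv
```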
- I thought of a way to delete duplicate metadata values, since the CSV upload method can't detect them correctly
- First, I wrote a [SQL query](https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/) to identify metadata values with the same `text_value`, `metadata_field_id`, and `dspace_object_id`:
```console
\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
FROM metadatavalue a
JOIN (
    SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*)
    FROM metadatavalue
    WHERE dspace_object_id IN (SELECT uuid FROM item)
      AND metadata_field_id NOT IN (11, 12, 28, 136, 159)
    GROUP BY dspace_object_id, text_value, metadata_field_id
    HAVING COUNT(*) > 1
) b
ON a.dspace_object_id = b.dspace_object_id
AND a.text_value = b.text_value
AND a.metadata_field_id = b.metadata_field_id
ORDER BY a.text_value) TO /tmp/duplicates.txt
```
- (This query excludes metadata for accession and available dates, provenance, format, etc.)
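- The field IDs are instance specific, so to double check which fields they map to one can consult the metadata field registry, for example:
```console
dspace=# SELECT metadata_field_id, element, qualifier FROM metadatafieldregistry WHERE metadata_field_id IN (11, 12, 28, 136, 159);
```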
- Then, I sorted the file by fields four and one (`dspace_object_id` and `text_value`) so that the duplicate metadata values for each item were next to each other, used awk to print the second field (`metadata_value_id`) from every _other_ line, and used sed to build a SQL script that deletes those duplicates:
```console
$ sort -t $'\t' -k4,4 -k1,1 /tmp/duplicates.txt | \
awk -F'\t' 'NR%2==0 {print $2}' | \
sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
```
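- The resulting script can then be applied with psql (assuming the database is named `dspace`; adjust host, user, and database as needed):
```console
$ psql -h localhost -U dspace -d dspace -f /tmp/delete-duplicates.sql
```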
- This worked very well, but since the awk invocation only prints every other line, each pass removes one copy from each pair, and some metadata values were tripled or quadrupled
- I ran it two more times to catch the remaining duplicates, and now we have none!
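- Note: a one-pass variant that would handle triples and quadruples in a single run is to have awk compare each row's key (`text_value`, `metadata_field_id`, `dspace_object_id`) with the previous row's and print the `metadata_value_id` of every repeated row, for example:
```console
$ sort -t $'\t' -k4,4 -k3,3n -k1,1 /tmp/duplicates.txt | \
  awk -F'\t' '{key=$1 FS $3 FS $4} key==prev {print $2} {prev=key}' | \
  sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
```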
<!-- vim: set sw=2 ts=2: -->