From 0599df9bed6a7f9f94dfdc405400fdca9624189d Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Wed, 30 Nov 2022 12:35:31 +0300 Subject: [PATCH] Add notes for 2022-11-30 --- content/posts/2022-11.md | 40 +++++++++++++++++ docs/2022-11/index.html | 57 +++++++++++++++++++++++-- docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/categories/notes/page/5/index.html | 2 +- docs/categories/notes/page/6/index.html | 2 +- docs/categories/notes/page/7/index.html | 2 +- docs/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/page/7/index.html | 2 +- docs/page/8/index.html | 2 +- docs/page/9/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/posts/page/7/index.html | 2 +- docs/posts/page/8/index.html | 2 +- docs/posts/page/9/index.html | 2 +- docs/sitemap.xml | 10 ++--- 29 files changed, 125 insertions(+), 34 deletions(-) diff --git a/content/posts/2022-11.md b/content/posts/2022-11.md index 541cf0b81..b33395df1 100644 --- a/content/posts/2022-11.md +++ b/content/posts/2022-11.md @@ -462,5 +462,45 @@ Western Europe - Peter wrote to ask me to change the PIM CRP's full name from `Policies, Institutions and Markets` to `Policies, Institutions, and Markets` - It's apparently the only CRP with an Oxford comma...? 
- I updated them all on CGSpace
+- Also, I ran an `index-discovery` without the `-b`, since my metadata update scripts now update the `last_modified` timestamp as well; it finished in fifteen minutes, and I see the changes in the Discovery search and facets
+
+## 2022-11-29
+
+- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about `dcterms.type` for CG Core
+  - We discussed some of the feedback from Peter
+- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
+  - I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
+  - These are UN M.49 regions
+
+## 2022-11-30
+
+- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
+  - It fixed some whitespace issues and added missing regions to about 1,200 items
+- I thought of a way to delete duplicate metadata values, since the CSV upload method can't detect them correctly
+  - First, I wrote a [SQL query](https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/) to identify metadata values with the same `text_value`, `metadata_field_id`, and `dspace_object_id`:
+
+```console
+\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
+    FROM metadatavalue a
+    JOIN (
+        SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
+    ) b
+    ON a.dspace_object_id = b.dspace_object_id
+    AND a.text_value = b.text_value
+    AND a.metadata_field_id = b.metadata_field_id
+    ORDER BY a.text_value) TO /tmp/duplicates.txt
+```
+
+- (This query excludes metadata for accession and available dates, provenance, format, etc.)
+- Then, I sorted the file by fields four and one (`dspace_object_id` and `text_value`) so that the duplicate metadata values for each item were next to each other, used awk to print the second field (`metadata_value_id`) from every _other_ line, and created a SQL script to delete the metadata
+
+```console
+$ sort -t $'\t' -k4,4 -k1,1 /tmp/duplicates.txt | \
+    awk -F'\t' 'NR%2==0 {print $2}' | \
+    sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
+```
+
+- This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
+  - I just ran it again two more times to find the last duplicates, and now we have none!
diff --git a/docs/2022-11/index.html b/docs/2022-11/index.html
index d024d8be3..760762571 100644
--- a/docs/2022-11/index.html
+++ b/docs/2022-11/index.html
@@ -24,7 +24,7 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
-
+
@@ -54,9 +54,9 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
 "@type": "BlogPosting",
 "headline": "November, 2022",
 "url": "https://alanorth.github.io/cgspace-notes/2022-11/",
- "wordCount": "2853",
+ "wordCount": "3206",
 "datePublished": "2022-11-01T09:11:36+03:00",
- "dateModified": "2022-11-28T17:42:46+03:00",
+ "dateModified": "2022-11-28T23:19:19+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -656,6 +656,57 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
  • I updated them all on CGSpace
  • +
  • Also, I ran an index-discovery without the -b, since my metadata update scripts now update the last_modified timestamp as well; it finished in fifteen minutes, and I see the changes in the Discovery search and facets
  • + +

    2022-11-29

    + +

    2022-11-30

    + +
    \COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id 
    +    FROM metadatavalue a
    +    JOIN (
    +        SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
    +    ) b
    +    ON a.dspace_object_id = b.dspace_object_id
    +    AND a.text_value = b.text_value
    +    AND a.metadata_field_id = b.metadata_field_id
    +    ORDER BY a.text_value) TO /tmp/duplicates.txt
    +
    +
    $ sort -t $'\t' -k4,4 -k1,1 /tmp/duplicates.txt | \
    +    awk -F'\t' 'NR%2==0 {print $2}' | \
    +    sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
    +
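The pipeline above removes only one copy per pass, which is why tripled or quadrupled values needed repeated runs. A minimal single-pass sketch follows, assuming the same tab-separated `/tmp/duplicates.txt` produced by the `\COPY` query (columns `text_value`, `metadata_value_id`, `metadata_field_id`, `dspace_object_id`). The function name `generate_duplicate_deletes` is hypothetical and not part of the original notes.

```shell
# Hypothetical helper (an assumption, not from the original notes): read the
# tab-separated rows written by the \COPY query and emit one DELETE statement
# for every row after the first in each group of duplicates.
generate_duplicate_deletes() {
    awk -F'\t' '
        # Key on item, field, and value. The counter is 0 the first time a
        # key is seen, so only the second and later copies print a DELETE.
        seen[$4 FS $3 FS $1]++ {
            printf "DELETE FROM metadatavalue WHERE metadata_value_id=%s;\n", $2
        }
    ' "$1"
}

# Example usage, with the paths used in the notes:
# generate_duplicate_deletes /tmp/duplicates.txt > /tmp/delete-duplicates.sql
```

Because the key is held in an awk array, the input does not need to be sorted first. The row that survives is whichever copy appears first in the file, which may not be the one with the lowest `metadata_value_id`. It would still be prudent to review the generated SQL before running it.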
    diff --git a/docs/categories/index.html b/docs/categories/index.html index 6ccbdbe60..22e797200 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index bce535b37..f8e5cee28 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 686e4f526..49cdec66b 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index a058eba8c..9be03b9fc 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 32072315a..87a6d56d9 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 107b1f9bc..36b0fbf71 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index ba6b81ad6..62f9b51e7 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html index 1f80fb5fb..9a0975dc3 100644 --- a/docs/categories/notes/page/7/index.html +++ b/docs/categories/notes/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index 873b23efa..b32357b79 100644 --- a/docs/index.html +++ b/docs/index.html @@ 
-10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index c4066911e..33d09a9ac 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 7c3101362..7cbafaa0c 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index e78bc3e64..a2e23b1ca 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 87b8208e7..46cde0f30 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 3cd62ec33..bd430664f 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 0e3577aca..73bea5c9d 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index a9b44e027..acd0ef2aa 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index a252694d1..00d2e4900 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index e5ffdabc6..5b6fbbd93 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index f227e00fc..d1cac4a9b 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index b4f1b1552..006837b81 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git 
a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index e3242129f..7c41f8367 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 24285b4ea..8cf54bcdb 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 4b7c5bf11..100a02916 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index 899a3f633..da0cdd183 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index e40481f12..927f67323 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index 10bc43094..d0a5a14be 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index c2bc88703..0cafaa05e 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2022-11-28T17:42:46+03:00 + 2022-11-28T23:19:19+03:00 https://alanorth.github.io/cgspace-notes/ - 2022-11-28T17:42:46+03:00 + 2022-11-28T23:19:19+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2022-11-28T17:42:46+03:00 + 2022-11-28T23:19:19+03:00 https://alanorth.github.io/cgspace-notes/2022-11/ - 2022-11-28T17:42:46+03:00 + 2022-11-28T23:19:19+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2022-11-28T17:42:46+03:00 + 2022-11-28T23:19:19+03:00 https://alanorth.github.io/cgspace-notes/2022-10/ 2022-10-31T16:59:47+03:00