Add notes for 2022-11-30

- Peter wrote to ask me to change the PIM CRP's full name from `Policies, Institutions and Markets` to `Policies, Institutions, and Markets`
- It's apparently the only CRP with an Oxford comma...?
- I updated them all on CGSpace
- Also, I ran `index-discovery` without the `-b`, since my metadata update scripts now update the `last_modified` timestamp as well; it finished in fifteen minutes and I can see the changes in the Discovery search and facets
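- For reference, the difference between the two indexing modes (a sketch; the command location depends on the DSpace installation):
```console
$ dspace index-discovery     # incremental: only reindexes objects whose last_modified changed
$ dspace index-discovery -b  # full (re)build of the Discovery index, much slower
```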
## 2022-11-29
- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about `dcterms.type` for CG Core
- We discussed some of the feedback from Peter
- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
- I updated Pacific to Oceania and Central Africa to Middle Africa, and removed the old ones from the submission form
- These are UN M.49 regions
## 2022-11-30
- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
- It fixed some whitespace issues and added missing regions to about 1,200 items
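- For reference, a typical invocation looks something like this (paths are hypothetical, and some fixes may require extra flags depending on the csv-metadata-quality version):
```console
$ csv-metadata-quality -i /tmp/ilri.csv -o /tmp/ilri-cleaned.csv
```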
- I thought of a way to delete duplicate metadata values, since the CSV upload method can't detect them correctly
- First, I wrote a [SQL query](https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/) to identify metadata values with the same `text_value`, `metadata_field_id`, and `dspace_object_id`:
```console
\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
FROM metadatavalue a
JOIN (
    SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*)
    FROM metadatavalue
    WHERE dspace_object_id IN (SELECT uuid FROM item)
      AND metadata_field_id NOT IN (11, 12, 28, 136, 159)
    GROUP BY dspace_object_id, text_value, metadata_field_id
    HAVING COUNT(*) > 1
) b
ON a.dspace_object_id = b.dspace_object_id
AND a.text_value = b.text_value
AND a.metadata_field_id = b.metadata_field_id
ORDER BY a.text_value) TO /tmp/duplicates.txt
```
- (This query excludes metadata for accession and available dates, provenance, format, etc.)
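- The field IDs are instance specific, so to double check which fields they map to one can consult the metadata field registry, for example:
```console
dspace=# SELECT metadata_field_id, element, qualifier FROM metadatafieldregistry WHERE metadata_field_id IN (11, 12, 28, 136, 159);
```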
- Then, I sorted the file by fields four and one (`dspace_object_id` and `text_value`) so that the duplicate metadata values for each item were next to each other, used awk to print the second field (`metadata_value_id`) from every _other_ line, and used sed to build a SQL script that deletes those duplicates:
```console
$ sort -t $'\t' -k4,4 -k1,1 /tmp/duplicates.txt | \
awk -F'\t' 'NR%2==0 {print $2}' | \
sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
```
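- The resulting script can then be applied with psql (assuming the database is named `dspace`; adjust host, user, and database as needed):
```console
$ psql -h localhost -U dspace -d dspace -f /tmp/delete-duplicates.sql
```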
- This worked very well, but since the awk invocation only prints every other line, each pass removes one copy from each pair, and some metadata values were tripled or quadrupled
- I ran it two more times to catch the remaining duplicates, and now we have none!
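- Note: a one-pass variant that would handle triples and quadruples in a single run is to have awk compare each row's key (`text_value`, `metadata_field_id`, `dspace_object_id`) with the previous row's and print the `metadata_value_id` of every repeated row, for example:
```console
$ sort -t $'\t' -k4,4 -k3,3n -k1,1 /tmp/duplicates.txt | \
  awk -F'\t' '{key=$1 FS $3 FS $4} key==prev {print $2} {prev=key}' | \
  sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
```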
<!-- vim: set sw=2 ts=2: -->