diff --git a/content/posts/2022-08.md b/content/posts/2022-08.md
index fbc51756e..11dbb0ea9 100644
--- a/content/posts/2022-08.md
+++ b/content/posts/2022-08.md
@@ -122,4 +122,38 @@ $ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM
 - Dedupe value pairs and controlled vocabularies before writing them
 - Sort the controlled vocabularies before writing them (we don't do this for value pairs because some are added in specific order, like CRPs)
+## 2022-08-19
+
+- Peter Ballantyne sent me metadata for 311 Gender items that need to be duplicate checked on CGSpace before uploading
+  - I spent a half an hour in OpenRefine to fix the dates because they only had YYYY, but most abstracts and titles had more specific information about the date
+  - Then I checked for duplicates:
+
+```console
+$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
+```
+
+- I sent the list of ~130 possible duplicates to Peter to check
+- Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
+  - The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
+  - I asked them why they don't just use the original links in the first place in case tinyurl.com disappears
+- I continued working on the MARLO MELIA v2 UTF-8 metadata
+  - I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
+  - It helps to replace some characters with spaces first with this GREL: `value.replace(/[.\/;(),]/, " ")`
+  - This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
+  - Then I checked for existing items on CGSpace matching these MELIA using my duplicate checker:
+
+```console
+$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
+```
+
+- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
+
+```console
+$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
+```
+
+- I had to use `xsv` because `csvcut` was throwing an error detecting the dialect of the input CSVs (?)
+- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
+- I found thirteen items on CGSpace with dates in format "DD/MM/YYYY" so I fixed those
+
diff --git a/docs/2022-08/index.html b/docs/2022-08/index.html
index edeadb661..915918467 100644
--- a/docs/2022-08/index.html
+++ b/docs/2022-08/index.html
@@ -14,7 +14,7 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
-
+
@@ -34,9 +34,9 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
   "@type": "BlogPosting",
   "headline": "August, 2022",
   "url": "https://alanorth.github.io/cgspace-notes/2022-08/",
-  "wordCount": "1122",
+  "wordCount": "1446",
   "datePublished": "2022-08-01T10:22:36+03:00",
-  "dateModified": "2022-08-18T13:45:48-07:00",
+  "dateModified": "2022-08-18T22:43:37-07:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -257,6 +257,43 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
+

2022-08-19

+ +
$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
+
+
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
+
+
$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
+
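
Reviewer note on the left join in this hunk: for readers without `xsv`, the sketch below shows roughly what a left join on `id` does, using only Python's standard library. It is not the author's method and it does not reproduce `xsv`'s exact output layout; the function name `left_join` and the sample columns `title` and `match` are hypothetical, chosen only for illustration.

```python
import csv
import io
from collections import defaultdict


def left_join(left_rows, right_rows, key="id"):
    """Keep every left row; attach the columns of each right row with the same key.

    Left rows without a match are kept, with empty strings in the right columns.
    A left row with several matches is emitted once per match.
    """
    right_rows = list(right_rows)
    # Right-hand columns other than the join key (taken from the first row).
    right_columns = [c for c in (right_rows[0] if right_rows else {}) if c != key]

    by_key = defaultdict(list)
    for row in right_rows:
        by_key[row[key]].append(row)

    joined = []
    for row in left_rows:
        # Fall back to a single empty match so unmatched left rows survive.
        matches = by_key.get(row[key]) or [{}]
        for match in matches:
            merged = dict(row)
            for column in right_columns:
                merged[column] = match.get(column, "")
            joined.append(merged)
    return joined


# Hypothetical sample data standing in for the two CSV files.
left = list(csv.DictReader(io.StringIO("id,title\n1,Alpha\n2,Beta\n")))
right = list(csv.DictReader(io.StringIO("id,match\n1,10568/1\n")))
rows = left_join(left, right)
```

Note that a right-hand column with the same name as a left-hand column would overwrite it in this sketch, so duplicated titles in both files would still need checking by hand, as the notes describe.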
diff --git a/docs/categories/index.html b/docs/categories/index.html index 9c0746622..de2bd60f1 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index a2335ba75..03b55892d 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index df49067ee..ab779e046 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 2b692bf2c..6b5f905b0 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 0de658482..81016c70b 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 7bb6515ba..9e74f5921 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index b299bc6d3..17c7f89d0 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html index ebfd92975..9335cab8d 100644 --- a/docs/categories/notes/page/7/index.html +++ b/docs/categories/notes/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index 84bea4267..73bd226d8 100644 --- a/docs/index.html +++ b/docs/index.html @@ -10,7 
+10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index fc52a958b..a7fdf026b 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 3bf6358dd..bd711a0c5 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index a758065a3..8473b0cf8 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index d42837416..188ba74fd 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 2be6524dc..a4f564d00 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 67f4abacc..7de054cd5 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index 719ff012a..ba71b452b 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index 49170e008..981e2e119 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index e8bef1b45..584f3ed47 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 048b2362f..dc87ade39 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 9f42da1df..cbe3091d0 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git 
a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 0705466aa..0f80c0ff0 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 49b3efb61..f62a16be1 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index dfd1aaccf..1c15eebaf 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index 22395588b..4e5fcc00a 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index e558356d5..2f0bc5d0b 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index 417933eb9..76e0a0094 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 6f07d5118..65f8257cf 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/2022-08/ - 2022-08-18T13:45:48-07:00 + 2022-08-18T22:43:37-07:00 https://alanorth.github.io/cgspace-notes/categories/ - 2022-08-18T13:45:48-07:00 + 2022-08-18T22:43:37-07:00 https://alanorth.github.io/cgspace-notes/ - 2022-08-18T13:45:48-07:00 + 2022-08-18T22:43:37-07:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2022-08-18T13:45:48-07:00 + 2022-08-18T22:43:37-07:00 https://alanorth.github.io/cgspace-notes/posts/ - 2022-08-18T13:45:48-07:00 + 2022-08-18T22:43:37-07:00 https://alanorth.github.io/cgspace-notes/2022-07/ 2022-07-31T15:49:35+03:00
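
Reviewer note on the abstract enrichment step in the notes above: the following is a minimal sketch, assuming a plain Python set as the vocabulary, of why replacing punctuation with spaces helps and why only single-word terms are found. It mirrors the GREL expression `value.replace(/[.\/;(),]/, " ")` from the notes, but the function `match_terms` and the sample vocabulary are hypothetical, not the Jython expression actually used in OpenRefine.

```python
import re

# Same character class as the GREL expression in the notes.
PUNCTUATION = re.compile(r"[./;(),]")


def match_terms(abstract, vocabulary):
    """Return vocabulary terms that appear as single words in the abstract."""
    # Replace punctuation with spaces so "Kenya," and "(maize)" split cleanly.
    cleaned = PUNCTUATION.sub(" ", abstract)
    words = {word.lower() for word in cleaned.split()}
    # Set intersection can only ever match one-word terms.
    return sorted(words & vocabulary)


# Hypothetical vocabulary; a real run would load the AGROVOC terms and country names.
vocabulary = {"maize", "kenya", "climate change"}
found = match_terms("Maize yields in Kenya (2019); climate change impacts.", vocabulary)
```

Here `found` contains `kenya` and `maize` but not `climate change`, because the abstract is split into individual words before matching. Catching multi-word terms would need a different approach, such as searching the cleaned text for each term as a phrase.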