diff --git a/content/posts/2022-08.md b/content/posts/2022-08.md
index fbc51756e..11dbb0ea9 100644
--- a/content/posts/2022-08.md
+++ b/content/posts/2022-08.md
@@ -122,4 +122,38 @@ $ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM
- Dedupe value pairs and controlled vocabularies before writing them
- Sort the controlled vocabularies before writing them (we don't do this for value pairs because some are added in specific order, like CRPs)
+## 2022-08-19
+
+- Peter Ballantyne sent me metadata for 311 Gender items that need to be checked for duplicates on CGSpace before uploading
+ - I spent half an hour in OpenRefine fixing the dates because they only had YYYY, even though most abstracts and titles had more specific information about the date
+ - Then I checked for duplicates:
+
+```console
+$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
+```
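As a rough illustration of what a duplicate check on titles involves, here is a minimal Python sketch. It is not how `check-duplicates.py` works (its internals are not shown in these notes); it swaps in the standard library's `difflib` for the comparison, and the function names and the 0.9 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def possible_duplicates(new_titles, existing_titles, threshold=0.9):
    """Return (new, existing, ratio) tuples whose similarity meets the threshold.

    The threshold of 0.9 is an arbitrary assumption, not a value from the notes.
    """
    matches = []
    for new in new_titles:
        for existing in existing_titles:
            ratio = SequenceMatcher(None, normalize(new), normalize(existing)).ratio()
            if ratio >= threshold:
                matches.append((new, existing, round(ratio, 2)))
    return matches
```

Comparing every new title against every existing one is quadratic, so this only suits small batches like the 311 items here, not a whole repository.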
+
+- I sent the list of ~130 possible duplicates to Peter to check
+- Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
+ - The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
+ - I asked them why they don't just use the original links in the first place in case tinyurl.com disappears
+- I continued working on the MARLO MELIA v2 UTF-8 metadata
+ - I repeated the metadata enrichment exercise from earlier this month, extracting countries and AGROVOC subjects from the abstract field with a Jython expression that matches terms in copies of the abstract field
+ - It helps to first replace some punctuation characters with spaces using this GREL: `value.replace(/[.\/;(),]/, " ")`
+ - This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
+ - Then I checked for existing items on CGSpace matching these MELIAs using my duplicate checker:
+
+```console
+$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
+```
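The enrichment step above (replace punctuation with spaces, then match single-word terms) can be sketched in plain Python. This is an illustration only, not the Jython expression actually used in OpenRefine; `match_terms` and the sample vocabulary are hypothetical.

```python
import re

# Same character class as the GREL expression: . / ; ( ) ,
PUNCTUATION = re.compile(r"[.\/;(),]")


def match_terms(abstract, vocabulary):
    """Return the vocabulary terms that appear as whole words in the abstract.

    Matching is case-insensitive and word by word, so multi-word terms
    such as "climate change" can never match.
    """
    cleaned = PUNCTUATION.sub(" ", abstract)
    words = {word.lower() for word in cleaned.split()}
    terms = {term.lower() for term in vocabulary}
    return sorted(words & terms)
```

Without the punctuation replacement, a token like `gender/livestock.` would stay glued together and neither term would match, which is why that step comes first.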
+
+- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
+
+```console
+$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
+```
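For reference, a left join on `id` keeps every row of the first file and fills in columns from the second file only where an `id` matches. A minimal sketch with Python's `csv` module, using made-up column names and sample rows rather than the real MELIA files:

```python
import csv
import io


def left_join(left_rows, right_rows, key="id"):
    """Keep every left row; add the columns of any right row sharing the key."""
    right_fields = [name for name in (right_rows[0] if right_rows else {}) if name != key]
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        # Unmatched left rows still appear once, with empty right-hand columns.
        matches = right_by_key.get(row[key], [dict.fromkeys(right_fields, "")])
        for match in matches:
            joined.append({**row, **{name: match[name] for name in right_fields}})
    return joined


# Made-up sample data; "matched_title" is a hypothetical column name.
left = list(csv.DictReader(io.StringIO("id,title\n1,First title\n2,Second title\n")))
right = list(csv.DictReader(io.StringIO("id,matched_title\n1,First title\n")))
joined = left_join(left, right)
```

Unlike `xsv join`, which outputs all columns from both inputs, this sketch drops the second copy of the key column, and it assumes the two files share no other column names.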
+
+- I had to use `xsv` because `csvcut` was throwing an error detecting the dialect of the input CSVs (?)
+- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
+- I found thirteen items on CGSpace with dates in format "DD/MM/YYYY" so I fixed those
+
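A date fix like the one above can be sketched as follows, assuming the target is ISO 8601 `YYYY-MM-DD` (the notes do not state the target format) and that values already in another form should be left alone. `to_iso_date` is a hypothetical helper, not part of any CGSpace script.

```python
from datetime import datetime


def to_iso_date(value):
    """Convert DD/MM/YYYY to YYYY-MM-DD; return anything else unchanged."""
    try:
        return datetime.strptime(value, "%d/%m/%Y").strftime("%Y-%m-%d")
    except ValueError:
        # Not in DD/MM/YYYY form (for example "2022" or "2022-08"), so leave it.
        return value
```

Parsing with `strptime` also rejects impossible dates such as `31/02/2022`, which pass through unchanged for manual review.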
diff --git a/docs/2022-08/index.html b/docs/2022-08/index.html
index edeadb661..915918467 100644
--- a/docs/2022-08/index.html
+++ b/docs/2022-08/index.html
@@ -14,7 +14,7 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
-
+
@@ -34,9 +34,9 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
"@type": "BlogPosting",
"headline": "August, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-08/",
- "wordCount": "1122",
+ "wordCount": "1446",
"datePublished": "2022-08-01T10:22:36+03:00",
- "dateModified": "2022-08-18T13:45:48-07:00",
+ "dateModified": "2022-08-18T22:43:37-07:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -257,6 +257,43 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
+
2022-08-19
+
+- Peter Ballantyne sent me metadata for 311 Gender items that need to be checked for duplicates on CGSpace before uploading
+
+- I spent half an hour in OpenRefine fixing the dates because they only had YYYY, even though most abstracts and titles had more specific information about the date
+- Then I checked for duplicates:
+
+
+
+$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
+
+- I sent the list of ~130 possible duplicates to Peter to check
+- Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
+
+- The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
+- I asked them why they don’t just use the original links in the first place in case tinyurl.com disappears
+
+
+- I continued working on the MARLO MELIA v2 UTF-8 metadata
+
+- I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
+- It helps to replace some characters with spaces first with this GREL: value.replace(/[.\/;(),]/, " ")
+- This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
+- Then I checked for existing items on CGSpace matching these MELIA using my duplicate checker:
+
+
+
+$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
+
+- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
+
+$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
+
+- I had to use xsv because csvcut was throwing an error detecting the dialect of the input CSVs (?)
+- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
+- I found thirteen items on CGSpace with dates in format “DD/MM/YYYY” so I fixed those
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 9c0746622..de2bd60f1 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index a2335ba75..03b55892d 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index df49067ee..ab779e046 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 2b692bf2c..6b5f905b0 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 0de658482..81016c70b 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 7bb6515ba..9e74f5921 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index b299bc6d3..17c7f89d0 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index ebfd92975..9335cab8d 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 84bea4267..73bd226d8 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index fc52a958b..a7fdf026b 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 3bf6358dd..bd711a0c5 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index a758065a3..8473b0cf8 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index d42837416..188ba74fd 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 2be6524dc..a4f564d00 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 67f4abacc..7de054cd5 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 719ff012a..ba71b452b 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index 49170e008..981e2e119 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index e8bef1b45..584f3ed47 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 048b2362f..dc87ade39 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 9f42da1df..cbe3091d0 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 0705466aa..0f80c0ff0 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 49b3efb61..f62a16be1 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index dfd1aaccf..1c15eebaf 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 22395588b..4e5fcc00a 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index e558356d5..2f0bc5d0b 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index 417933eb9..76e0a0094 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 6f07d5118..65f8257cf 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/2022-08/
- 2022-08-18T13:45:48-07:00
+ 2022-08-18T22:43:37-07:00
https://alanorth.github.io/cgspace-notes/categories/
- 2022-08-18T13:45:48-07:00
+ 2022-08-18T22:43:37-07:00
https://alanorth.github.io/cgspace-notes/
- 2022-08-18T13:45:48-07:00
+ 2022-08-18T22:43:37-07:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2022-08-18T13:45:48-07:00
+ 2022-08-18T22:43:37-07:00
https://alanorth.github.io/cgspace-notes/posts/
- 2022-08-18T13:45:48-07:00
+ 2022-08-18T22:43:37-07:00
https://alanorth.github.io/cgspace-notes/2022-07/
2022-07-31T15:49:35+03:00