diff --git a/content/posts/2021-05.md b/content/posts/2021-05.md
index 4f8b97f1a..82013d217 100644
--- a/content/posts/2021-05.md
+++ b/content/posts/2021-05.md
@@ -184,4 +184,34 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
- I checked their community using the DSpace Statistics API and found very accurate numbers for 2020 and 2019 for them
- I think they had been using AReS, which actually doesn't even give stats for a time period...
+## 2021-05-11
+
+- The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
+- I also spent some time looking at IWMI's reports again
+ - On AReS we don't have a way to group by peer reviewed or item type other than doing "if type is Journal Article"
+ - Also, we don't have a way to check the IWMI Strategic Priorities because those are communities, not metadata...
+ - We can get the collections an item is in from the `parentCollectionList` metadata, but it is saved in Elasticsearch as a string instead of a list...
+ - I told them it won't be possible to replicate their reports exactly
+- I decided to look at the CLARISA controlled vocabularies again
+ - They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
+ - They have updated their Swagger interface but it still requires an API key if you want to use it from curl
+ - They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like "Russian Federation (the)", which is not in ISO 3166 as far as I can see
+ - I exported a list of the institutions to look closer
+ - I found twelve items with whitespace issues
+ - There are some weird entries like `Research Institute for Aquaculture No1` and `Research Institute for Aquaculture No2`
+ - A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
+ - I found 100+ items with multiple languages in their name like `Ministère de l’Agriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries`
+ - Over 600 institutions have the country in their name like `Ministry of Coordination of Environmental Affairs (Mozambique)`
+ - For URLs they have `null` in some places... which is weird... why not just leave it blank?
+- I checked the CLARISA list against ROR's April, 2021 release ("Version 9", on figshare, though it is version 8 in the dump):
+
+```console
+$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
+$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
+1770
+```
+
+- With 1770 out of 6230 matched, that's 28.4%...
+- Meeting with GARDIAN developers about CG Core and how GARDIAN works
+
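The checks described in the CLARISA notes above (stray whitespace, the invisible characters U+00AD, U+200B, and U+00A0, and literal `null` values) can be sketched roughly as follows. This is only an illustration, not the script actually used for the notes: the function names `audit_name` and `match_rate` are hypothetical, and it assumes each institution name is available as a plain string.

```python
# Hypothetical helpers for auditing institution names; not part of ror-lookup.py.

# Characters mentioned in the notes: soft hyphen, zero-width space, no-break space
SUSPECT_CHARS = {"\u00ad", "\u200b", "\u00a0"}


def audit_name(name):
    """Return a list of issue labels found in one institution name."""
    issues = []
    # Leading/trailing whitespace or doubled ASCII spaces inside the name
    if name != name.strip() or "  " in name:
        issues.append("whitespace")
    # Invisible or non-standard characters, reported by code point
    for ch in sorted(set(name) & SUSPECT_CHARS):
        issues.append(f"U+{ord(ch):04X}")
    # The string "null" used where an empty value would be expected
    if name.strip().lower() == "null":
        issues.append("literal null")
    return issues


def match_rate(matched, total):
    """Percentage of matched names, rounded to one decimal place."""
    return round(matched / total * 100, 1)


if __name__ == "__main__":
    for sample in ["Foo\u200bBar", " Ministry  of X", "null"]:
        print(repr(sample), audit_name(sample))
    print(match_rate(1770, 6230))
```

Reporting code points rather than the characters themselves keeps the output readable, since all three characters are invisible or look like an ordinary space in most terminals.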
diff --git a/docs/2021-05/index.html b/docs/2021-05/index.html
index 80241614c..67ef2e11e 100644
--- a/docs/2021-05/index.html
+++ b/docs/2021-05/index.html
@@ -20,7 +20,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
-
+
@@ -46,9 +46,9 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
"@type": "BlogPosting",
"headline": "May, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-05/",
- "wordCount": "1200",
+ "wordCount": "1566",
"datePublished": "2021-05-02T09:50:54+03:00",
- "dateModified": "2021-05-09T19:11:51+03:00",
+ "dateModified": "2021-05-10T17:16:32+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -299,6 +299,43 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
+
2021-05-11
+
+- The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
+- I also spent some time looking at IWMI’s reports again
+
+- On AReS we don’t have a way to group by peer reviewed or item type other than doing “if type is Journal Article”
+- Also, we don’t have a way to check the IWMI Strategic Priorities because those are communities, not metadata…
+- We can get the collections an item is in from the
parentCollectionList
metadata, but it is saved in Elasticsearch as a string instead of a list…
+- I told them it won’t be possible to replicate their reports exactly
+
+
+- I decided to look at the CLARISA controlled vocabularies again
+
+- They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
+- They have updated their Swagger interface but it still requires an API key if you want to use it from curl
+- They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like “Russian Federation (the)”, which is not in ISO 3166 as far as I can see
+- I exported a list of the institutions to look closer
+
+- I found twelve items with whitespace issues
+- There are some weird entries like
Research Institute for Aquaculture No1
and Research Institute for Aquaculture No2
+- A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
+- I found 100+ items with multiple languages in their name like
Ministère de l’Agriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries
+- Over 600 institutions have the country in their name like
Ministry of Coordination of Environmental Affairs (Mozambique)
+- For URLs they have
null
in some places… which is weird… why not just leave it blank?
+
+
+
+
+- I checked the CLARISA list against ROR’s April, 2021 release (“Version 9”, on figshare, though it is version 8 in the dump):
+
+$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
+$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
+1770
+
+- With 1770 out of 6230 matched, that’s 28.4%…
+- Meeting with GARDIAN developers about CG Core and how GARDIAN works
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 03cb2f826..024059fa0 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index d4b50ad50..df8ea62c4 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 055f193a3..055120173 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index c6d165b65..a70329faa 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 7109c00f0..608e2d5ef 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index f76ceaf8b..d2289a0ed 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 24f8f39d6..0788ff08c 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index ab6cf27c8..79b0b91b8 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 6e869564e..63d71bc51 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 2cb638acc..91f4ebc46 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index b057ce8d7..5c02baa8d 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index b8493badf..8da33d914 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index bd25fa77f..9ec745488 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index eca942d17..e659053b3 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 26f57e012..ea48bed8d 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 3a2dade15..101785881 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index c7c2e2a70..baa4890e2 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 252d0756c..49e3f45e0 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index bde9d12df..911e4ce96 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 627773a14..126a2093c 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 113369bea..af4d4561a 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/categories/
- 2021-05-09T19:11:51+03:00
+ 2021-05-10T17:16:32+03:00
https://alanorth.github.io/cgspace-notes/
- 2021-05-09T19:11:51+03:00
+ 2021-05-10T17:16:32+03:00
https://alanorth.github.io/cgspace-notes/2021-05/
- 2021-05-09T19:11:51+03:00
+ 2021-05-10T17:16:32+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2021-05-09T19:11:51+03:00
+ 2021-05-10T17:16:32+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2021-05-09T19:11:51+03:00
+ 2021-05-10T17:16:32+03:00
https://alanorth.github.io/cgspace-notes/2021-04/
2021-04-28T18:57:48+03:00