diff --git a/content/posts/2021-05.md b/content/posts/2021-05.md
index 4f8b97f1a..82013d217 100644
--- a/content/posts/2021-05.md
+++ b/content/posts/2021-05.md
@@ -184,4 +184,34 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
 - I checked their community using the DSpace Statistics API and found very accurate numbers for 2020 and 2019 for them
 - I think they had been using AReS, which actually doesn't even give stats for a time period...
+## 2021-05-11
+
+- The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
+- I also spent some time looking at IWMI's reports again
+  - On AReS we don't have a way to group by peer reviewed or item type other than doing "if type is Journal Article"
+  - Also, we don't have a way to check the IWMI Strategic Priorities because those are communities, not metadata...
+  - We can get the collections an item is in from the `parentCollectionList` metadata, but it is saved in Elasticsearch as a string instead of a list...
+  - I told them it won't be possible to replicate their reports exactly
+- I decided to look at the CLARISA controlled vocabularies again
+  - They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
+  - They have updated their Swagger interface but it still requires an API key if you want to use it from curl
+  - They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like "Russian Federation (the)", which is not in ISO 3166 as far as I can see
+  - I exported a list of the institutions to look more closely
+  - I found twelve items with whitespace issues
+  - There are some weird entries like `Research Institute for Aquaculture No1` and `Research Institute for Aquaculture No2`
+  - A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
+  - I found 100+ items with multiple languages in their name like `Ministère de l’Agriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries`
+  - Over 600 institutions have the country in their name like `Ministry of Coordination of Environmental Affairs (Mozambique)`
+  - For URLs they have `null` in some places... which is weird... why not just leave it blank?
+- I checked the CLARISA list against ROR's April 2021 release ("Version 9", on figshare, though it is version 8 in the dump):
+
+```console
+$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
+$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
+1770
+```
+
+- With 1770 out of 6230 matched, that's about 28.4%...
+- Meeting with GARDIAN developers about CG Core and how GARDIAN works
+
diff --git a/docs/2021-05/index.html b/docs/2021-05/index.html
index 80241614c..67ef2e11e 100644
--- a/docs/2021-05/index.html
+++ b/docs/2021-05/index.html
@@ -20,7 +20,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
-
+
@@ -46,9 +46,9 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
  "@type": "BlogPosting",
  "headline": "May, 2021",
  "url": "https://alanorth.github.io/cgspace-notes/2021-05/",
- "wordCount": "1200",
+ "wordCount": "1566",
  "datePublished": "2021-05-02T09:50:54+03:00",
- "dateModified": "2021-05-09T19:11:51+03:00",
+ "dateModified": "2021-05-10T17:16:32+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -299,6 +299,43 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
+2021-05-11
+
+$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
+$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
+1770
+
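Note on the match count: the `csvgrep | sed | wc` pipeline above counts the rows whose `matched` column is `true`. A minimal Python sketch of the same count, plus the percentage, might look like the following. It assumes the CSV has a header row with a `matched` column (taken from the `csvgrep -c matched` call). The `name` column and the sample rows are made up for illustration, and the sketch uses exact equality where `csvgrep -m` does a substring match.

```python
import csv
import io


def count_matched(csv_file):
    """Count rows whose 'matched' column is 'true' (mirrors the csvgrep filter)."""
    reader = csv.DictReader(csv_file)
    # A short row yields None for missing columns, so fall back to "".
    return sum(
        1 for row in reader
        if (row.get("matched") or "").strip().lower() == "true"
    )


def match_rate(matched, total):
    """Return the matched share as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * matched / total, 1)


# Made-up sample standing in for /tmp/clarisa-ror-matches.csv
sample = io.StringIO(
    "name,matched\n"
    "Example Institute A,true\n"
    "Example Institute B,false\n"
    "Example Institute C,true\n"
)
print(count_matched(sample))   # 2
print(match_rate(1770, 6230))  # 28.4
```

With the figures from the notes, 1770 matches out of 6230 institutions works out to 28.4%.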
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 03cb2f826..024059fa0 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index d4b50ad50..df8ea62c4 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 055f193a3..055120173 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index c6d165b65..a70329faa 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 7109c00f0..608e2d5ef 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index f76ceaf8b..d2289a0ed 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 24f8f39d6..0788ff08c 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index ab6cf27c8..79b0b91b8 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 6e869564e..63d71bc51 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 2cb638acc..91f4ebc46 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index b057ce8d7..5c02baa8d 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index b8493badf..8da33d914 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index bd25fa77f..9ec745488 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index eca942d17..e659053b3 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 26f57e012..ea48bed8d 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 3a2dade15..101785881 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index c7c2e2a70..baa4890e2 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 252d0756c..49e3f45e0 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index bde9d12df..911e4ce96 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 627773a14..126a2093c 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 113369bea..af4d4561a 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
-2021-05-09T19:11:51+03:00
+2021-05-10T17:16:32+03:00
 https://alanorth.github.io/cgspace-notes/
-2021-05-09T19:11:51+03:00
+2021-05-10T17:16:32+03:00
 https://alanorth.github.io/cgspace-notes/2021-05/
-2021-05-09T19:11:51+03:00
+2021-05-10T17:16:32+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2021-05-09T19:11:51+03:00
+2021-05-10T17:16:32+03:00
 https://alanorth.github.io/cgspace-notes/posts/
-2021-05-09T19:11:51+03:00
+2021-05-10T17:16:32+03:00
 https://alanorth.github.io/cgspace-notes/2021-04/
 2021-04-28T18:57:48+03:00
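
Note on the institution names: the problems listed in the 2021-05-11 notes (stray whitespace, U+00AD, U+200B, U+00A0, two languages in one name, a country in parentheses) could probably be flagged with a few string checks. The following is a rough sketch, not the method used in the notes. `check_name` is a hypothetical helper, and the ` / ` and trailing-parenthesis tests are heuristics inferred from the examples quoted there, so they will produce false positives.

```python
import re

# Characters called out in the notes: soft hyphen, zero-width space, no-break space
SUSPECT_CHARS = {
    "\u00ad": "SOFT HYPHEN",
    "\u200b": "ZERO WIDTH SPACE",
    "\u00a0": "NO-BREAK SPACE",
}


def check_name(name):
    """Return a list of issue labels for one institution name (hypothetical helper)."""
    issues = []
    if name != name.strip():
        issues.append("leading/trailing whitespace")
    if re.search(r"[ \t]{2,}", name.strip()):
        issues.append("repeated whitespace")
    for char, label in SUSPECT_CHARS.items():
        if char in name:
            issues.append(f"contains U+{ord(char):04X} {label}")
    # Heuristic: the bilingual example in the notes separates languages with " / "
    if " / " in name:
        issues.append("possible multiple languages")
    # Heuristic: a trailing "(...)" often holds a country, as in "(Mozambique)"
    if re.search(r"\([^()]+\)\s*$", name):
        issues.append("trailing parenthetical (possibly a country)")
    return issues


for example in [
    "Ministry of Coordination of Environmental Affairs (Mozambique)",
    "Research Institute for Aquaculture No1",
    "Example\u200bInstitute",
]:
    print(repr(example), check_name(example))
```

Running it over each line of an exported list such as `/tmp/clarisa-institutions.txt` would give a per-name report to review by hand before sending corrections upstream.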