Add notes for 2021-05-11

2021-05-11 17:58:45 +03:00
parent bf80328223
commit 928a64b91b
23 changed files with 95 additions and 28 deletions

@ -184,4 +184,34 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
- I checked their community using the DSpace Statistics API and found very accurate numbers for 2020 and 2019 for them
- I think they had been using AReS, which actually doesn't even give stats for a time period...
## 2021-05-11
- The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
- I also spent some time looking at IWMI's reports again
- On AReS we don't have a way to group by peer reviewed or item type other than doing "if type is Journal Article"
- Also, we don't have a way to check the IWMI Strategic Priorities because those are communities, not metadata...
- We can get the collections an item is in from the `parentCollectionList` metadata, but it is saved in Elasticsearch as a string instead of a list...
- I told them it won't be possible to replicate their reports exactly
- I decided to look at the CLARISA controlled vocabularies again
- They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
- They have updated their Swagger interface but it still requires an API key if you want to use it from curl
- They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like "Russian Federation (the)", which is not in ISO 3166 as far as I can see
- I exported a list of the institutions to look closer
- I found twelve items with whitespace issues
- There are some weird entries like `Research Institute for Aquaculture No1` and `Research Institute for Aquaculture No2`
- A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
- I found 100+ items with multiple languages in their name like `Ministère de lAgriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries`
- Over 600 institutions have the country in their name like `Ministry of Coordination of Environmental Affairs (Mozambique)`
- For URLs they have `null` in some places... which is weird... why not just leave it blank?
- I checked the CLARISA list against ROR's April 2021 release ("Version 9" on figshare, though it is version 8 in the dump):
```console
$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
1770
```
- With 1770 out of 6230 matched, that's 28.4%...
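
The count and percentage above could also be computed in one step. This is only a rough sketch: it assumes the matches CSV has a `matched` column holding `true`/`false` (as the `csvgrep` call implies), and the `organization` column and sample rows are made up for illustration.

```python
import csv
import io


def match_rate(csv_file):
    """Return (matched, total, percent) for a CSV with a `matched` column."""
    matched = 0
    total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        # Assumption: matches are recorded as the literal string "true"
        if row["matched"].strip().lower() == "true":
            matched += 1
    percent = round(100 * matched / total, 1) if total else 0.0
    return matched, total, percent


# Hypothetical inline sample standing in for /tmp/clarisa-ror-matches.csv
sample = io.StringIO("organization,matched\nA,true\nB,false\nC,false\nD,true\n")
print(match_rate(sample))
```

For the real file, passing `open("/tmp/clarisa-ror-matches.csv", newline="")` instead of the sample should work, provided the column name assumption holds.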
- Meeting with GARDIAN developers about CG Core and how GARDIAN works
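
The whitespace and invisible-character problems noted in the CLARISA institution names above could be flagged with something like the following. It is a minimal sketch, not the check I actually ran: the function name `check_name` is hypothetical, and what counts as a "whitespace issue" here (leading/trailing whitespace, repeated spaces, tabs or newlines) is an assumption.

```python
import re

# The three code points mentioned in the notes
SUSPICIOUS = {
    "\u00ad": "U+00AD soft hyphen",
    "\u200b": "U+200B zero width space",
    "\u00a0": "U+00A0 no-break space",
}


def check_name(name):
    """Return a list of problems found in one institution name."""
    problems = []
    # Note: str.strip() also removes U+00A0 at the ends of the string
    if name != name.strip():
        problems.append("leading/trailing whitespace")
    if re.search(r" {2,}|[\t\r\n]", name):
        problems.append("repeated or non-space whitespace")
    for char, label in SUSPICIOUS.items():
        if char in name:
            problems.append(label)
    return problems


# Made-up names for illustration only
for name in ["Example Institute ", "Inter\u200bnational Centre", "Clean Name"]:
    print(repr(name), check_name(name))
```

Running this over each line of the exported institution list would give a starting point for a cleanup report to send back to CLARISA.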
<!-- vim: set sw=2 ts=2: -->