mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-05-11
@ -184,4 +184,34 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
- I checked their community using the DSpace Statistics API and found very accurate numbers for 2020 and 2019 for them
- I think they had been using AReS, which actually doesn't even give stats for a time period...
## 2021-05-11
- The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
- I also spent some time looking at IWMI's reports again
- On AReS we don't have a way to group by peer reviewed or item type other than doing "if type is Journal Article"
- Also, we don't have a way to check the IWMI Strategic Priorities because those are communities, not metadata...
- We can get the collections an item is in from the `parentCollectionList` metadata, but it is saved in Elasticsearch as a string instead of a list...
- I told them it won't be possible to replicate their reports exactly
- I decided to look at the CLARISA controlled vocabularies again
- They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
- They have updated their Swagger interface but it still requires an API key if you want to use it from curl
- They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like "Russian Federation (the)", which is not in ISO 3166 as far as I can see
- I exported a list of the institutions to look closer
- I found twelve items with whitespace issues
- There are some weird entries like `Research Institute for Aquaculture No1` and `Research Institute for Aquaculture No2`
- A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
- I found 100+ items with multiple languages in their name like `Ministère de l’Agriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries`
- Over 600 institutions have the country in their name like `Ministry of Coordination of Environmental Affairs (Mozambique)`
- For URLs they have `null` in some places... which is weird... why not just leave it blank?
- I checked the CLARISA list against ROR's April 2021 release ("Version 9" on figshare, though it is version 8 in the dump):
```console
$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
1770
```
- With 1770 out of 6230 matched, that's 28.4%...
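The match rate can also be computed directly from the CSV instead of counting lines by hand. This is only a sketch: it assumes the output of `ror-lookup.py` has a header row with a `matched` column holding `true` or `false` (the column name comes from the `csvgrep` call above; the `organization` column and the sample rows are made up for illustration). Note that it compares the whole cell case-insensitively, whereas `csvgrep -m` does a substring match.

```python
import csv
import io


def match_rate(csv_file, column="matched"):
    """Return (matched, total, percent) for rows whose `column` is 'true'."""
    reader = csv.DictReader(csv_file)
    total = 0
    matched = 0
    for row in reader:
        total += 1
        if row[column].strip().lower() == "true":
            matched += 1
    # Avoid dividing by zero on an empty file
    percent = 100 * matched / total if total else 0.0
    return matched, total, percent


# Hypothetical in-memory sample standing in for /tmp/clarisa-ror-matches.csv
sample = io.StringIO("organization,matched\nA,true\nB,false\nC,false\nD,true\n")
print(match_rate(sample))
```

For the real file, the same function could be called with `open("/tmp/clarisa-ror-matches.csv", newline="")` in place of the sample.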
- Meeting with GARDIAN developers about CG Core and how GARDIAN works
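
The problems noted above in the CLARISA institution export (whitespace issues, stray Unicode characters, country names in parentheses, multilingual names) could be flagged with a small script. This is a rough sketch and not what was actually used: the function name `check_name` and the heuristics are my own assumptions, and the parenthetical and slash checks will produce false positives.

```python
import re

# Characters mentioned in the notes that are easy to miss by eye
SUSPICIOUS = {
    "\u00ad": "U+00AD SOFT HYPHEN",
    "\u200b": "U+200B ZERO WIDTH SPACE",
    "\u00a0": "U+00A0 NO-BREAK SPACE",
}


def check_name(name):
    """Return a list of potential problems with an institution name (hypothetical helper)."""
    problems = []
    if name != name.strip():
        problems.append("leading/trailing whitespace")
    if re.search(r"[ \t]{2,}", name):
        problems.append("consecutive whitespace")
    for char, label in SUSPICIOUS.items():
        if char in name:
            problems.append(label)
    # Heuristic: a trailing "(...)" is often a country, but not always
    if re.search(r"\([^()]+\)\s*$", name):
        problems.append("trailing parenthetical (possibly a country)")
    # Heuristic: " / " often separates the same name in two languages
    if " / " in name:
        problems.append("slash-separated (possibly multilingual)")
    return problems


for name in [
    "Ministry of Coordination of Environmental Affairs (Mozambique)",
    "Research Institute for Aquaculture No1",
]:
    print(name, check_name(name))
```

Running this over each line of `/tmp/clarisa-institutions.txt` (assuming one name per line) would give a starting list to review by hand.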
<!-- vim: set sw=2 ts=2: -->