Add notes for 2021-05-11

Alan Orth 2021-05-11 17:58:45 +03:00
parent bf80328223
commit 928a64b91b
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
23 changed files with 95 additions and 28 deletions

View File

@ -184,4 +184,34 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
- I checked their community using the DSpace Statistics API and found very accurate numbers for 2020 and 2019 for them
- I think they had been using AReS, which actually doesn't even give stats for a time period...
## 2021-05-11
- The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
- I also spent some time looking at IWMI's reports again
- On AReS we don't have a way to group by peer reviewed or item type other than doing "if type is Journal Article"
- Also, we don't have a way to check the IWMI Strategic Priorities because those are communities, not metadata...
- We can get the collections an item is in from the `parentCollectionList` metadata, but it is saved in Elasticsearch as a string instead of a list...
- I told them it won't be possible to replicate their reports exactly
- I decided to look at the CLARISA controlled vocabularies again
- They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
- They have updated their Swagger interface but it still requires an API key if you want to use it from curl
- They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like "Russian Federation (the)", which is not in ISO 3166 as far as I can see
- I exported a list of the institutions to look closer
- I found twelve items with whitespace issues
- There are some weird entries like `Research Institute for Aquaculture No1` and `Research Institute for Aquaculture No2`
- A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
- I found 100+ items with multiple languages in their name like `Ministère de lAgriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries`
- Over 600 institutions have the country in their name like `Ministry of Coordination of Environmental Affairs (Mozambique)`
- For URLs they have `null` in some places... which is weird... why not just leave it blank?
- I checked the CLARISA list against ROR's April 2021 release ("Version 9" on figshare, though it is version 8 in the dump):
```console
$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
1770
```
- With 1770 out of 6230 matched, that's about 28.4%...
- Meeting with GARDIAN developers about CG Core and how GARDIAN works
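
The whitespace and invisible-character checks on the exported institution list could be scripted along these lines. This is only a minimal sketch, not the method actually used for the notes above: the `find_issues` helper and the sample names are hypothetical, and it only looks for the three code points mentioned (U+00AD, U+200B, U+00A0) plus leading, trailing, and repeated spaces.

```python
import re
import unicodedata

# Code points called out in the notes: soft hyphen, zero width space, no-break space
SUSPICIOUS = {"\u00ad", "\u200b", "\u00a0"}


def find_issues(name):
    """Return a list of issue labels for one institution name (hypothetical helper)."""
    issues = []
    # str.strip() removes Unicode whitespace, so any difference means padding
    if name != name.strip():
        issues.append("leading/trailing whitespace")
    # Two or more consecutive regular spaces inside the name
    if re.search(r" {2,}", name):
        issues.append("repeated whitespace")
    # Report each suspicious character once, with its official Unicode name
    for char in sorted(SUSPICIOUS & set(name)):
        issues.append(f"U+{ord(char):04X} {unicodedata.name(char)}")
    return issues


# Made-up sample names, only to show the call
for sample in ["Example Institute ", "Example\u200bInstitute"]:
    print(repr(sample), find_issues(sample))
```

In practice the names would be read one per line from the exported list, and only the names with a non-empty issue list would be reported.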
<!-- vim: set sw=2 ts=2: -->

View File

@ -20,7 +20,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-05/" />
<meta property="article:published_time" content="2021-05-02T09:50:54+03:00" />
<meta property="article:modified_time" content="2021-05-09T19:11:51+03:00" />
<meta property="article:modified_time" content="2021-05-10T17:16:32+03:00" />
@ -46,9 +46,9 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
"@type": "BlogPosting",
"headline": "May, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-05/",
"wordCount": "1200",
"wordCount": "1566",
"datePublished": "2021-05-02T09:50:54+03:00",
"dateModified": "2021-05-09T19:11:51+03:00",
"dateModified": "2021-05-10T17:16:32+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -299,6 +299,43 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
</ul>
</li>
</ul>
<h2 id="2021-05-11">2021-05-11</h2>
<ul>
<li>The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time</li>
<li>I also spent some time looking at IWMI&rsquo;s reports again
<ul>
<li>On AReS we don&rsquo;t have a way to group by peer reviewed or item type other than doing &ldquo;if type is Journal Article&rdquo;</li>
<li>Also, we don&rsquo;t have a way to check the IWMI Strategic Priorities because those are communities, not metadata&hellip;</li>
<li>We can get the collections an item is in from the <code>parentCollectionList</code> metadata, but it is saved in Elasticsearch as a string instead of a list&hellip;</li>
<li>I told them it won&rsquo;t be possible to replicate their reports exactly</li>
</ul>
</li>
<li>I decided to look at the CLARISA controlled vocabularies again
<ul>
<li>They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)</li>
<li>They have updated their Swagger interface but it still requires an API key if you want to use it from curl</li>
<li>They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like &ldquo;Russian Federation (the)&rdquo;, which is not in ISO 3166 as far as I can see</li>
<li>I exported a list of the institutions to look closer
<ul>
<li>I found twelve items with whitespace issues</li>
<li>There are some weird entries like <code>Research Institute for Aquaculture No1</code> and <code>Research Institute for Aquaculture No2</code></li>
<li>A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0</li>
<li>I found 100+ items with multiple languages in their name like <code>Ministère de lAgriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries</code></li>
<li>Over 600 institutions have the country in their name like <code>Ministry of Coordination of Environmental Affairs (Mozambique)</code></li>
<li>For URLs they have <code>null</code> in some places&hellip; which is weird&hellip; why not just leave it blank?</li>
</ul>
</li>
</ul>
</li>
<li>I checked the CLARISA list against ROR&rsquo;s April 2021 release (&ldquo;Version 9&rdquo; on figshare, though it is version 8 in the dump):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
1770
</code></pre><ul>
<li>With 1770 out of 6230 matched, that&rsquo;s about 28.4%&hellip;</li>
<li>Meeting with GARDIAN developers about CG Core and how GARDIAN works</li>
</ul>
<!-- raw HTML omitted -->

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-05-09T19:11:51+03:00" />
<meta property="og:updated_time" content="2021-05-10T17:16:32+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-05-09T19:11:51+03:00" />
<meta property="og:updated_time" content="2021-05-10T17:16:32+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-05-09T19:11:51+03:00" />
<meta property="og:updated_time" content="2021-05-10T17:16:32+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-05-09T19:11:51+03:00" />
<meta property="og:updated_time" content="2021-05-10T17:16:32+03:00" />

View File

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2021-05-09T19:11:51+03:00</lastmod>
<lastmod>2021-05-10T17:16:32+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2021-05-09T19:11:51+03:00</lastmod>
<lastmod>2021-05-10T17:16:32+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-05/</loc>
<lastmod>2021-05-09T19:11:51+03:00</lastmod>
<lastmod>2021-05-10T17:16:32+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2021-05-09T19:11:51+03:00</lastmod>
<lastmod>2021-05-10T17:16:32+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2021-05-09T19:11:51+03:00</lastmod>
<lastmod>2021-05-10T17:16:32+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-04/</loc>
<lastmod>2021-04-28T18:57:48+03:00</lastmod>