Add notes for 2022-12-03

Alan Orth 2022-12-04 03:19:49 +03:00
parent 1dd80f769a
commit 12b4f1660d
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
30 changed files with 105 additions and 37 deletions

@ -25,4 +25,42 @@ categories: ["Notes"]
- [Set the description when submitting bitstreams to CGSpace](https://github.com/CodeObia/MEL/issues/11067)
- [Some items have a Creative Commons license, but are Limited Access and bitstreams are locked](https://github.com/CodeObia/MEL/issues/11068)
## 2022-12-03
- I downloaded a fresh copy of CLARISA's institutions list as well as ROR's latest dump from 2022-12-01 to check how many are matching:
```console
$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
```
- Out of the box about 26.4% of them match, but many institution names have multiple languages in the text value or a country in parentheses, so I think the rate could be higher
- If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:
```console
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
```
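- A rough Python equivalent of that cleanup, as a sketch only: the `normalize_name` helper and the sample name are hypothetical, and unlike the `sed` expressions (which replace only the first match each) it splits on every slash:

```python
import re


def normalize_name(name: str) -> list[str]:
    """Split a multi-language name on slashes and drop a trailing "(Country)"."""
    variants = []
    for part in name.split("/"):
        # Remove one trailing parenthesized suffix, e.g. " (Peru)"
        cleaned = re.sub(r"\s*\([^()]*\)\s*$", "", part.strip())
        if cleaned:
            variants.append(cleaned)
    return variants


# Hypothetical input in the style described above
print(normalize_name("Universidad Nacional / National University (Peru)"))
```

- Each variant could then be written on its own line and fed to the lookup script, as with the `sed` output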
- I checked CGSpace's top 1,000 institutions too, first exporting from PostgreSQL:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
```
- Then cutting out the first column (tab is the default delimiter for both `\COPY` text output and `cut`):
```console
$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542
```
- So that's a 54% match for our top institutions
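- To sanity check percentages like these, a small Python sketch that counts rows in the lookup output. It assumes only that the CSV has a `matched` column holding `true` or `false`, which is what the `csvgrep` calls above rely on; the `match_rate` helper, the `name` column, and the sample rows are hypothetical. Note that `csvgrep` also prints the header row, so the `wc -l` figure is one higher than the number of matching rows:

```python
import csv
import io


def match_rate(csv_text: str, column: str = "matched") -> tuple[int, int, float]:
    """Return (matched rows, total rows, percentage matched) for CSV text."""
    matched = 0
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        if row[column].strip().lower() == "true":
            matched += 1
    percent = 100 * matched / total if total else 0.0
    return matched, total, percent


# Hypothetical sample standing in for the lookup script's output
sample = "name,matched\nInstitute A,true\nInstitute B (Kenya),false\n"
matched, total, percent = match_rate(sample)
print(f"{matched}/{total} matched ({percent:.1f}%)")
```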
<!-- vim: set sw=2 ts=2: -->

@ -24,7 +24,7 @@ I reverted the Cocoon autosave change because it was more of a nuisance that Pe
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-11/" />
<meta property="article:published_time" content="2022-11-01T09:11:36+03:00" />
<meta property="article:modified_time" content="2022-11-30T18:21:20+03:00" />
<meta property="article:modified_time" content="2022-12-03T10:46:29+03:00" />
@ -56,7 +56,7 @@ I reverted the Cocoon autosave change because it was more of a nuisance that Pe
"url": "https://alanorth.github.io/cgspace-notes/2022-11/",
"wordCount": "3414",
"datePublished": "2022-11-01T09:11:36+03:00",
"dateModified": "2022-11-30T18:21:20+03:00",
"dateModified": "2022-12-03T10:46:29+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

@ -20,7 +20,7 @@ Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpac
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-12/" />
<meta property="article:published_time" content="2022-12-01T08:52:36+03:00" />
<meta property="article:modified_time" content="2022-12-01T08:52:36+03:00" />
<meta property="article:modified_time" content="2022-12-03T10:46:29+03:00" />
@ -46,9 +46,9 @@ Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpac
"@type": "BlogPosting",
"headline": "December, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-12/",
"wordCount": "159",
"wordCount": "376",
"datePublished": "2022-12-01T08:52:36+03:00",
"dateModified": "2022-12-01T08:52:36+03:00",
"dateModified": "2022-12-03T10:46:29+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -147,6 +147,36 @@ Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpac
</ul>
</li>
</ul>
<h2 id="2022-12-03">2022-12-03</h2>
<ul>
<li>I downloaded a fresh copy of CLARISA&rsquo;s institutions list as well as ROR&rsquo;s latest dump from 2022-12-01 to check how many are matching:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp &gt; ~/Downloads/2022-12-03-CLARISA-institutions.json
</span></span><span style="display:flex;"><span>$ jq -r <span style="color:#e6db74">&#39;.[] | .name&#39;</span> ~/Downloads/2022-12-03-CLARISA-institutions.json &gt; ~/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
</span></span><span style="display:flex;"><span>1864
</span></span><span style="display:flex;"><span>$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span><span style="display:flex;"><span>7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span></code></pre></div><ul>
<li>Out of the box about 26.4% of them match, but many institution names have multiple languages in the text value or a country in parentheses, so I think the rate could be higher</li>
<li>If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sed -e <span style="color:#e6db74">&#39;s_ / _\n_&#39;</span> -e <span style="color:#e6db74">&#39;s_/_\n_&#39;</span> -e <span style="color:#e6db74">&#39;s/ \?(.*)$//&#39;</span> ~/Downloads/2022-12-03-CLARISA-institutions.txt &gt; ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
</span></span></code></pre></div><ul>
<li>I checked CGSpace&rsquo;s top 1,000 institutions too, first exporting from PostgreSQL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
</span></span></code></pre></div><ul>
<li>Then cutting out the first column (tab is the default delimiter for both <code>\COPY</code> text output and <code>cut</code>):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cut -f <span style="color:#ae81ff">1</span> /tmp/2022-11-22-affiliations.csv &gt; 2022-11-22-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
</span></span><span style="display:flex;"><span>542
</span></span></code></pre></div><ul>
<li>So that&rsquo;s a 54% match for our top institutions</li>
</ul>
<!-- raw HTML omitted -->

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-12-01T08:52:36+03:00" />
<meta property="og:updated_time" content="2022-12-03T10:46:29+03:00" />

@ -3,22 +3,22 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2022-12-01T08:52:36+03:00</lastmod>
<lastmod>2022-12-03T10:46:29+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2022-12-01T08:52:36+03:00</lastmod>
<lastmod>2022-12-03T10:46:29+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-12/</loc>
<lastmod>2022-12-01T08:52:36+03:00</lastmod>
<lastmod>2022-12-03T10:46:29+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2022-12-01T08:52:36+03:00</lastmod>
<lastmod>2022-12-03T10:46:29+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2022-12-01T08:52:36+03:00</lastmod>
<lastmod>2022-12-03T10:46:29+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-11/</loc>
<lastmod>2022-11-30T18:21:20+03:00</lastmod>
<lastmod>2022-12-03T10:46:29+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-10/</loc>
<lastmod>2022-10-31T16:59:47+03:00</lastmod>