Add notes for 2022-12-03

2025-01-27 05:49:12 +01:00 · 2022-12-04 03:19:49 +03:00
parent 1dd80f769a
commit 12b4f1660d
30 changed files with 105 additions and 37 deletions
--- a/docs/2022-12/index.html
+++ b/docs/2022-12/index.html
@@ -20,7 +20,7 @@ Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpac
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-12/" />
 <meta property="article:published_time" content="2022-12-01T08:52:36+03:00" />
-<meta property="article:modified_time" content="2022-12-01T08:52:36+03:00" />
+<meta property="article:modified_time" content="2022-12-03T10:46:29+03:00" />



@@ -46,9 +46,9 @@ Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpac
  "@type": "BlogPosting",
  "headline": "December, 2022",
  "url": "https://alanorth.github.io/cgspace-notes/2022-12/",
-  "wordCount": "159",
+  "wordCount": "376",
  "datePublished": "2022-12-01T08:52:36+03:00",
-  "dateModified": "2022-12-01T08:52:36+03:00",
+  "dateModified": "2022-12-03T10:46:29+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -147,6 +147,36 @@ Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpac
 </ul>
 </li>
 </ul>
+<h2 id="2022-12-03">2022-12-03</h2>
+<ul>
+<li>I downloaded a fresh copy of CLARISA&rsquo;s institutions list as well as ROR&rsquo;s latest dump from 2022-12-01 to check how many are matching:</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp &gt; ~/Downloads/2022-12-03-CLARISA-institutions.json
+</span></span><span style="display:flex;"><span>$ jq -r <span style="color:#e6db74">&#39;.[] | .name&#39;</span> ~/Downloads/2022-12-03-CLARISA-institutions.json &gt; ~/Downloads/2022-12-03-CLARISA-institutions.txt
+</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
+</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
+</span></span><span style="display:flex;"><span>1864
+</span></span><span style="display:flex;"><span>$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
+</span></span><span style="display:flex;"><span>7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
+</span></span></code></pre></div><ul>
+<li>Out of the box 26.4% of them match (1,864 of 7,060), but many institutions have names in multiple languages in the text value, or a country in parentheses, so I think the rate could be higher</li>
+<li>If I replace the slashes with newlines and remove the countries at the end there are slightly more matches, around 29%:</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sed -e <span style="color:#e6db74">&#39;s_ / _\n_&#39;</span> -e <span style="color:#e6db74">&#39;s_/_\n_&#39;</span> -e <span style="color:#e6db74">&#39;s/ \?(.*)$//&#39;</span> ~/Downloads/2022-12-03-CLARISA-institutions.txt &gt; ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
+</span></span></code></pre></div><ul>
+<li>I checked CGSpace&rsquo;s top 1,000 institutions too, first exporting from PostgreSQL:</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
+</span></span></code></pre></div><ul>
+<li>Then extracting the first column with cut (tab is the default delimiter):</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cut -f <span style="color:#ae81ff">1</span> /tmp/2022-11-22-affiliations.csv &gt; 2022-11-22-affiliations.txt
+</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
+</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
+</span></span><span style="display:flex;"><span>542
+</span></span></code></pre></div><ul>
+<li>So that&rsquo;s a 54% match for our top institutions</li>
+</ul>
 <!-- raw HTML omitted -->
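
The name clean-up and match percentages in the notes above can be sketched in Python. This is a minimal illustration, not the author's method and not part of `ror-lookup.py`: `normalize_name` and `match_rate` are hypothetical helpers, the sample institution name is invented, and unlike the `sed` command in the notes this version splits on every slash and strips a trailing parenthetical from each part.

```python
import re

# Trailing parenthetical such as " (Kenya)" at the end of a name.
TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def normalize_name(raw: str) -> list[str]:
    """Split a multi-language name on slashes and drop a trailing "(Country)".

    Hypothetical helper: an approximation of the sed clean-up, not a port of it.
    """
    parts = [part.strip() for part in raw.split("/")]
    cleaned = [TRAILING_PARENS.sub("", part).strip() for part in parts]
    return [name for name in cleaned if name]


def match_rate(matched: int, total: int) -> float:
    """Return matched/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * matched / total, 1)


if __name__ == "__main__":
    # Invented example name, not taken from the CLARISA list.
    print(normalize_name("Example University / Université Exemple (France)"))
    # Counts reported in the notes: CLARISA list, then CGSpace top 1,000.
    print(match_rate(1864, 7060))
    print(match_rate(542, 1000))
```

Splitting on every slash is a simplification that would also break names that legitimately contain a slash, so the output is best treated as candidate names to try against ROR rather than as authoritative values.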