mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-12-03
@ -20,7 +20,7 @@ Replace “East Asia” with “Eastern Asia” region on CGSpac
|
||||
<meta property="og:type" content="article" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-12/" />
|
||||
<meta property="article:published_time" content="2022-12-01T08:52:36+03:00" />
|
||||
<meta property="article:modified_time" content="2022-12-01T08:52:36+03:00" />
|
||||
<meta property="article:modified_time" content="2022-12-03T10:46:29+03:00" />
|
||||
|
||||
|
||||
|
||||
@ -46,9 +46,9 @@ Replace “East Asia” with “Eastern Asia” region on CGSpac
|
||||
"@type": "BlogPosting",
|
||||
"headline": "December, 2022",
|
||||
"url": "https://alanorth.github.io/cgspace-notes/2022-12/",
|
||||
"wordCount": "159",
|
||||
"wordCount": "376",
|
||||
"datePublished": "2022-12-01T08:52:36+03:00",
|
||||
"dateModified": "2022-12-01T08:52:36+03:00",
|
||||
"dateModified": "2022-12-03T10:46:29+03:00",
|
||||
"author": {
|
||||
"@type": "Person",
|
||||
"name": "Alan Orth"
|
||||
@ -147,6 +147,36 @@ Replace “East Asia” with “Eastern Asia” region on CGSpac
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<h2 id="2022-12-03">2022-12-03</h2>
|
||||
<ul>
|
||||
<li>I downloaded a fresh copy of CLARISA’s institutions list as well as ROR’s latest dump from 2022-12-01 to check how many are matching:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
</span></span><span style="display:flex;"><span>$ jq -r <span style="color:#e6db74">'.[] | .name'</span> ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
</span></span><span style="display:flex;"><span>1864
</span></span><span style="display:flex;"><span>$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span><span style="display:flex;"><span>7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span></code></pre></div><ul>
|
||||
<li>Out of the box 26.4% of them match (1864 of 7060), but many institutions have names in multiple languages in the text value, as well as countries in parentheses, so I think it could be higher</li>
|
||||
<li>If I split the names on slashes and remove the countries in parentheses at the end there are slightly more matches, around 29%:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sed -e <span style="color:#e6db74">'s_ / _\n_'</span> -e <span style="color:#e6db74">'s_/_\n_'</span> -e <span style="color:#e6db74">'s/ \?(.*)$//'</span> ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
</span></span></code></pre></div><ul>
|
||||
<li>I checked CGSpace’s top 1,000 institutions too, first exporting from PostgreSQL:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
</span></span></code></pre></div><ul>
|
||||
<li>Then cutting out the first column (tab is the default delimiter):</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cut -f <span style="color:#ae81ff">1</span> /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
</span></span><span style="display:flex;"><span>542
</span></span></code></pre></div><ul>
|
||||
<li>So that’s a 54% match for our top institutions</li>
|
||||
</ul>
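<p>The clean-up and match-rate arithmetic above can be sketched in Python. This is only a rough approximation of the sed command, not the method used in these notes: it splits on every slash rather than just the first, the helper names <code>normalize</code> and <code>match_rate</code> are hypothetical, and the sample institution name is made up for illustration.</p>

```python
import re


def normalize(name: str) -> list[str]:
    """Split a multilingual name on slashes and drop a trailing "(Country)"."""
    # Remove one trailing parenthesized group such as " (France)"
    name = re.sub(r" ?\([^()]*\)$", "", name.strip())
    # Treat each slash-separated variant as its own lookup candidate
    return [part.strip() for part in name.split("/") if part.strip()]


def match_rate(matched: int, total: int) -> float:
    """Percentage of matched names, rounded to one decimal place."""
    return round(100 * matched / total, 1)


# Hypothetical institution name, for illustration only
print(normalize("University of Example / Université d'Exemple (France)"))

# Counts taken from the console output in the notes above
print(match_rate(1864, 7060))
print(match_rate(542, 1000))
```

<p>One caveat, assuming csvgrep writes a header row as csvkit tools normally do: the counts piped through <code>wc -l</code> would then be one higher than the number of matched rows, which does not change the percentages at this precision.</p>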
|
||||
<!-- raw HTML omitted -->