Add notes for 2024-03-08

2024-03-08 17:31:19 +03:00
parent 5ff70af33b
commit 11f1935f85
148 changed files with 275 additions and 186 deletions


@ -19,7 +19,7 @@ It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-03/" />
<meta property="article:published_time" content="2024-03-01T09:55:00+03:00" />
<meta property="article:modified_time" content="2024-03-01T09:55:00+03:00" />
<meta property="article:modified_time" content="2024-03-04T10:02:14+03:00" />
@ -34,7 +34,7 @@ It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
"/>
<meta name="generator" content="Hugo 0.123.7">
<meta name="generator" content="Hugo 0.123.8">
@ -44,9 +44,9 @@ It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
"@type": "BlogPosting",
"headline": "March, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-03/",
"wordCount": "93",
"wordCount": "317",
"datePublished": "2024-03-01T09:55:00+03:00",
"dateModified": "2024-03-01T09:55:00+03:00",
"dateModified": "2024-03-04T10:02:14+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -141,6 +141,51 @@ It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808
<li>I did some cleanups on abstracts, licenses, and dates from Crossref</li>
<li>I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list</li>
</ul>
<h2 id="2024-03-05">2024-03-05</h2>
<ul>
<li>I tried a new technique to get some affiliations from Crossref using OpenRefine
<ul>
<li>First I split them and clustered, resolving a few hundred clusters out of 1500 (!)</li>
<li>Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work</li>
<li>Then I joined them with our affiliations, paying no attention to duplicates</li>
<li>Then I deduped them using the Jython technique I learned in 2023-02</li>
</ul>
</li>
</ul>
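<ul>
<li>The join-then-dedupe step above can be sketched outside OpenRefine in plain Python. This is only a minimal sketch, not the Jython expression from 2023-02: the function name <code>dedupe_affiliations</code> and the sample names are hypothetical, and it assumes exact, case-sensitive matching after trimming whitespace:</li>
</ul>

```python
def dedupe_affiliations(existing, new):
    """Join two affiliation lists and drop exact duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for name in existing + new:
        name = name.strip()
        # Skip blanks and anything we have already kept
        if name and name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


# Hypothetical sample data, for illustration only
ours = [
    "International Livestock Research Institute",
    "International Food Policy Research Institute",
]
crossref = [
    "International Food Policy Research Institute ",
    "University of Nairobi",
]
print(dedupe_affiliations(ours, crossref))
```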
<h2 id="2024-03-06">2024-03-06</h2>
<ul>
<li>Peter sent me some more corrections for the authors that I had sent him in 2023-12</li>
</ul>
<h2 id="2024-03-08">2024-03-08</h2>
<ul>
<li>IFPRI sent me their 2023 records from CONTENTdm so I started working on those
<ul>
<li>I found a way to match their ORCID identifiers against our list using Jython in OpenRefine:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load the list of known creator identifiers, one entry per line</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;/tmp/cg-creator-identifier.txt&#34;</span>, <span style="color:#e6db74">&#39;r&#39;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    orcid_ids <span style="color:#f92672">=</span> [orcid_id<span style="color:#f92672">.</span>strip() <span style="color:#66d9ef">for</span> orcid_id <span style="color:#f92672">in</span> f]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Look for the first entry containing &#34;: &#34; followed by this cell&#39;s value</span>
</span></span><span style="display:flex;"><span>matched <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> orcid_id <span style="color:#f92672">in</span> orcid_ids:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> re<span style="color:#f92672">.</span>search(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#39;.+: </span><span style="color:#e6db74">{}</span><span style="color:#e6db74">&#39;</span><span style="color:#f92672">.</span>format(re<span style="color:#f92672">.</span>escape(value)), orcid_id):
</span></span><span style="display:flex;"><span>        matched <span style="color:#f92672">=</span> <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> matched:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> orcid_id
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> value
</span></span></code></pre></div><ul>
<li>I realized that <a href="https://www.unicef.org/about-unicef/frequently-asked-questions#3">UNICEF was renamed to its current name in 1953</a> so I replaced all other variations in our vocabularies and metadata:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> text_value<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;United Nations Children&#39;&#39;s Fund&#39;</span> <span style="color:#66d9ef">WHERE</span> dspace_object_id <span style="color:#66d9ef">IN</span> (<span style="color:#66d9ef">SELECT</span> uuid <span style="color:#66d9ef">FROM</span> item) <span style="color:#66d9ef">AND</span> text_value <span style="color:#66d9ef">IN</span> (<span style="color:#e6db74">&#39;United Nations International Children&#39;&#39;s Emergency Fund&#39;</span>, <span style="color:#e6db74">&#39;United Nations International Children&#39;&#39;s Emergency Fund&#39;</span>, <span style="color:#e6db74">&#39;UNICEF&#39;</span>);
</span></span></code></pre></div><ul>
<li>Note the doubled single quote (<code>''</code>), which is how SQL escapes the apostrophe inside a quoted string</li>
</ul>
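<ul>
<li>When a replacement like the one above is scripted rather than typed into a SQL shell, bound parameters avoid the manual quote doubling. The following is a rough sketch only: it uses Python's built-in <code>sqlite3</code> with an in-memory, single-column stand-in for <code>metadatavalue</code> (an assumption for illustration; the real table has more columns and lives in the DSpace database), and the extra "World Bank" row is made up:</li>
</ul>

```python
import sqlite3

# In-memory stand-in for the real table; only the one column needed here
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (text_value TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue (text_value) VALUES (?)",
    [
        ("UNICEF",),
        ("United Nations International Children's Emergency Fund",),
        ("World Bank",),
    ],
)

new_name = "United Nations Children's Fund"
old_names = ["United Nations International Children's Emergency Fund", "UNICEF"]

# Placeholders let the driver handle the apostrophe, so no '' doubling is needed
placeholders = ", ".join("?" for _ in old_names)
cursor = conn.execute(
    "UPDATE metadatavalue SET text_value = ? WHERE text_value IN ({})".format(placeholders),
    [new_name] + old_names,
)
conn.commit()
updated = cursor.rowcount

rows = [row[0] for row in conn.execute("SELECT text_value FROM metadatavalue ORDER BY rowid")]
print(updated, rows)
```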
<!-- raw HTML omitted -->
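<ul>
<li>The ORCID matching from 2024-03-08 can also be tried outside OpenRefine. This is a hedged, standalone variant and not the expression I ran: <code>match_orcid</code> is a hypothetical helper, the sample entries are invented, and unlike the Jython version it anchors the pattern at the end of the line, which assumes each entry ends with the identifier:</li>
</ul>

```python
import re


def match_orcid(value, orcid_lines):
    """Return the first "Name: ORCID" line ending in the given identifier, else the value itself."""
    # Escape the value so it is matched literally, and anchor it at the end of the line
    pattern = re.compile(r".+: {}$".format(re.escape(value)))
    for line in orcid_lines:
        if pattern.search(line):
            return line
    return value


# Invented entries in the "Name: ORCID" shape that the regex expects
known = [
    "Example Person One: 0000-0000-0000-0001",
    "Example Person Two: 0000-0000-0000-0002",
]
print(match_orcid("0000-0000-0000-0002", known))
print(match_orcid("0000-0000-0000-0009", known))
```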