<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2024" />
<meta property="og:description" content="2024-04-04
Work on CGSpace duplicate DOIs more
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-04/" />
<meta property="article:published_time" content="2024-04-04T10:23:00+03:00" />
<meta property="article:modified_time" content="2024-04-27T11:22:58+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2024"/>
<meta name="twitter:description" content="2024-04-04
Work on CGSpace duplicate DOIs more
"/>
<meta name="generator" content="Hugo 0.125.4">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "April, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-04/",
"wordCount": "852",
"datePublished": "2024-04-04T10:23:00+03:00",
"dateModified": "2024-04-27T11:22:58+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2024-04/">
<title>April, 2024 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2024-04/">April, 2024</a></h2>
<p class="blog-post-meta">
<time datetime="2024-04-04T10:23:00+03:00">Thu Apr 04, 2024</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2024-04-04">2024-04-04</h2>
<ul>
<li>Work on CGSpace duplicate DOIs more</li>
</ul>
<h2 id="2024-04-08">2024-04-08</h2>
<ul>
<li>Start working on IFPRI&rsquo;s 2022 batch import
<ul>
<li>I ran the duplicate checker against CGSpace and started downloading all linked PDFs</li>
</ul>
</li>
</ul>
<h2 id="2024-04-09">2024-04-09</h2>
<ul>
<li>Continue working on IFPRI&rsquo;s 2022 batch import
<ul>
<li>I started validating the potential duplicates in OpenRefine</li>
</ul>
</li>
</ul>
<h2 id="2024-04-12">2024-04-12</h2>
<ul>
<li>Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then upload them
<ul>
<li>I need to merge the metadata for the remaining 212 that are already on CGSpace</li>
</ul>
</li>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-04-13">2024-04-13</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-04-14">2024-04-14</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-04-15">2024-04-15</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
<li>Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: <a href="https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418">https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418</a></li>
<li>Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:</li>
</ul>
<pre tabindex="0"><code>$ time curl -s -o /dev/null &#39;https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&amp;scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&amp;page=0&amp;size=100&amp;embed=thumbnail,bundles/bitstreams&amp;sort=dcterms.issued,desc&#39;
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null &#39;https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&amp;scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&amp;page=0&amp;size=100&amp;sort=dcterms.issued,desc&#39;
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
</code></pre><ul>
<li>Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
<ul>
<li>I merged metadata with the existing items</li>
<li>There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself</li>
</ul>
</li>
</ul>
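<ul>
<li>A rough Python sketch of the embed timing comparison earlier in this entry, assuming the same search endpoint and parameters as the curl commands; <code>build_search_url</code> and <code>time_request</code> are hypothetical helper names, not part of DSpace:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import time
import urllib.parse
import urllib.request

# Endpoint and parameters copied from the curl commands in this entry
BASE = "https://cgspace.cgiar.org/server/api/discover/search/objects"
PARAMS = {
    "query": "cg.identifier.project:IFPRI*",
    "scope": "8f1e9650-fe87-4e6e-889a-1cacfb747408",
    "page": "0",
    "size": "100",
    "sort": "dcterms.issued,desc",
}


def build_search_url(embed=None):
    """Build the search URL, optionally adding an embed parameter."""
    params = dict(PARAMS)
    if embed:
        params["embed"] = embed
    # urlencode percent-encodes ":", "*", "," and "/" in the values
    return BASE + "?" + urllib.parse.urlencode(params)


def time_request(url, opener=urllib.request.urlopen):
    """Return the seconds taken to fetch and read the whole response."""
    start = time.perf_counter()
    with opener(url) as response:
        response.read()
    return time.perf_counter() - start


# Example (makes a real request):
# time_request(build_search_url("thumbnail,bundles/bitstreams"))
# time_request(build_search_url())
</code></pre>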
<h2 id="2024-04-16">2024-04-16</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
<li>Assist Deborah with an advanced query on CGSpace for biodiversity and health:</li>
</ul>
<pre tabindex="0"><code>dcterms.issued:[2010 TO 2024] AND dcterms.type:&#34;Journal Article&#34; AND (dc.title:&#34;biodiversity&#34; OR dcterms.subject:&#34;biodiversity&#34; OR dc.title:&#34;health&#34; OR dcterms.subject:&#34;health&#34;)
</code></pre><ul>
<li>Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
<ul>
<li>I used this Jython expression in OpenRefine with <a href="https://citation.crosscite.org/docs.html">Crossref&rsquo;s content negotiation</a> to get citations for all DOIs:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> urllib2
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>doi <span style="color:#f92672">=</span> cells[<span style="color:#e6db74">&#39;cg.identifier.doi[en_US]&#39;</span>]<span style="color:#f92672">.</span>value
</span></span><span style="display:flex;"><span>url <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;https://api.crossref.org/works/&#34;</span> <span style="color:#f92672">+</span> doi <span style="color:#f92672">+</span> <span style="color:#e6db74">&#34;/transform/text/x-bibliography&#34;</span>
</span></span><span style="display:flex;"><span>useragent <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;Python (mailto:a.o@cgiar.org)&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>request <span style="color:#f92672">=</span> urllib2<span style="color:#f92672">.</span>Request(url<span style="color:#f92672">.</span>encode(<span style="color:#e6db74">&#34;utf-8&#34;</span>), headers<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;User-Agent&#34;</span> : useragent})
</span></span><span style="display:flex;"><span>get <span style="color:#f92672">=</span> urllib2<span style="color:#f92672">.</span>urlopen(request)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">return</span> get<span style="color:#f92672">.</span>read()<span style="color:#f92672">.</span>decode(<span style="color:#e6db74">&#39;utf-8&#39;</span>)
</span></span></code></pre></div><ul>
<li>It took ten or so minutes to finish (note that this is Python 2 inside OpenRefine, so I had to be careful with Unicode), but it worked well!</li>
</ul>
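<ul>
<li>For comparison, a Python 3 sketch of the same request outside OpenRefine, where <code>str</code> is already Unicode; <code>citation_url</code> and <code>fetch_citation</code> are hypothetical names, the user agent address is a placeholder, and percent-encoding the DOI is my own addition rather than something the expression above does:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import urllib.parse
import urllib.request

# Placeholder contact address, mirroring the mailto in the expression above
USER_AGENT = "Python (mailto:user@example.org)"


def citation_url(doi):
    """Build the Crossref URL that returns a formatted citation for a DOI."""
    # Keep the slash between prefix and suffix, percent-encode unsafe characters
    return (
        "https://api.crossref.org/works/"
        + urllib.parse.quote(doi, safe="/")
        + "/transform/text/x-bibliography"
    )


def fetch_citation(doi, opener=urllib.request.urlopen):
    """Fetch the citation text for a DOI and decode it as UTF-8."""
    request = urllib.request.Request(
        citation_url(doi), headers={"User-Agent": USER_AGENT}
    )
    with opener(request) as response:
        return response.read().decode("utf-8")


# Example (makes a real request; the DOI here is made up):
# print(fetch_citation("10.1234/example"))
</code></pre>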
<h2 id="2024-04-18">2024-04-18</h2>
<ul>
<li>Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">SELECT</span> m.text_value, h.handle <span style="color:#66d9ef">FROM</span> metadatavalue m <span style="color:#66d9ef">JOIN</span> handle h <span style="color:#66d9ef">on</span> m.dspace_object_id <span style="color:#f92672">=</span> h.resource_id <span style="color:#66d9ef">WHERE</span> m.metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">28</span> <span style="color:#66d9ef">AND</span> m.text_value <span style="color:#66d9ef">LIKE</span> <span style="color:#e6db74">&#39;Original URL%&#39;</span> <span style="color:#66d9ef">AND</span> h.resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>;
</span></span></code></pre></div><ul>
<li>Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for <code>dcterms.replaces</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">SELECT</span> m.text_value <span style="color:#66d9ef">AS</span> handle_from, h.handle <span style="color:#66d9ef">AS</span> handle_to <span style="color:#66d9ef">FROM</span> metadatavalue m <span style="color:#66d9ef">JOIN</span> handle h <span style="color:#66d9ef">on</span> m.dspace_object_id <span style="color:#f92672">=</span> h.resource_id <span style="color:#66d9ef">WHERE</span> m.metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">181</span> <span style="color:#66d9ef">AND</span> h.resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>;
</span></span></code></pre></div><ul>
<li>Then I can work that list into an nginx map with redirect, for example:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>server {
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> if ($new_uri) {
</span></span><span style="display:flex;"><span> return 301 $new_uri;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>map $request_uri $new_uri {
</span></span><span style="display:flex;"><span> /handle/10568/112821 /handle/10568/97605;
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="2024-04-19">2024-04-19</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
<li>Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary</li>
</ul>
<h2 id="2024-04-20">2024-04-20</h2>
<ul>
<li>I read an <a href="https://github.com/greenelab/scihub/issues/9">interesting thread about DOI casing</a>
<ul>
<li>Apparently the DOI specification says ASCII characters in DOIs are case insensitive</li>
<li>Indeed, <a href="https://www.crossref.org/documentation/member-setup/constructing-your-dois/">Crossref recommends lower case</a> for all DOIs</li>
<li>I was curious about the DOIs in our database so I checked before and after lower casing:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !=&#39;&#39;) TO /tmp/dois-sql-before.txt;
</span></span><span style="display:flex;"><span>COPY 25675
</span></span><span style="display:flex;"><span>localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !=&#39;&#39;) TO /tmp/dois-sql-after.txt;
</span></span><span style="display:flex;"><span>COPY 25666
</span></span></code></pre></div><ul>
<li>I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata&hellip;</li>
</ul>
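<ul>
<li>A small Python sketch of what those two counts measure, using made-up DOIs: the gap between the distinct counts before and after lower casing (25675 versus 25666, so 9, in the export above) is the number of values that duplicate another one except for case; <code>case_variant_groups</code> is a hypothetical helper:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def case_variant_groups(dois):
    """Group DOIs that are identical once lower cased."""
    groups = {}
    for doi in set(dois):
        groups.setdefault(doi.lower(), set()).add(doi)
    # Keep only groups with more than one spelling of the same DOI
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) != 1
    }


# Made-up sample values, not real CGSpace DOIs
sample = [
    "https://doi.org/10.1234/ABC.1",
    "https://doi.org/10.1234/abc.1",
    "https://doi.org/10.1234/xyz.2",
]

before = len(set(sample))
after = len({doi.lower() for doi in sample})
print(before, after, before - after)
print(case_variant_groups(sample))
</code></pre>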
<h2 id="2024-04-23">2024-04-23</h2>
<ul>
<li>Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
<ul>
<li>The workflow curation tasks are not documented very well but I got a basic configuration working</li>
<li>I found a bug in DSpace curation tasks and discussed on Slack</li>
<li>I finalized the <code>NormalizeDOIs</code> curation task and released v7.6.1.1 of the <a href="https://github.com/ilri/cgspace-java-helpers">cgspace-java-helpers</a> project</li>
</ul>
</li>
</ul>
<h2 id="2024-04-24">2024-04-24</h2>
<ul>
<li>A bit more testing of the curation tasks
<ul>
<li>I tested a patch by Mark Wood</li>
</ul>
</li>
<li>I added support for normalizing DOIs to this same format to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> project</li>
</ul>
<h2 id="2024-04-25">2024-04-25</h2>
<ul>
<li>I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters</li>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-04-26">2024-04-26</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-04-29">2024-04-29</h2>
<ul>
<li>Start working on the IFPRI 2020-2021 batch migration
<ul>
<li>I modified my <code>check_duplicates.py</code> script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact</li>
</ul>
</li>
<li>I noticed something in the Tomcat log:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename=&#34;Literature review on Womens Empowerment and their Resilience2.pdf&#34;] has been removed from the response because it is invalid
</span></span><span style="display:flex;"><span>tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [’] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
</span></span></code></pre></div><ul>
<li>I found the bitstream&rsquo;s ID and then used the <code>ds6_bitstream2itemhandle</code> <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6">SQL helper function</a> to find the item&rsquo;s handle
<ul>
<li>Then I replaced the curly quote with a regular quote in all bitstreams</li>
</ul>
</li>
</ul>
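<ul>
<li>A short Python sketch of why Tomcat rejected that header, assuming (as the log message suggests) that header values are limited to code points 0 to 255, which matches ISO-8859-1; the <code>header_safe</code> and <code>straighten_quotes</code> helpers and the sample filename are hypothetical:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def header_safe(value):
    """Return True if every character fits in ISO-8859-1 (code points 0 to 255)."""
    try:
        value.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def straighten_quotes(name):
    """Replace curly single quotes (U+2018, U+2019) with a plain apostrophe."""
    return name.replace("\u2018", "'").replace("\u2019", "'")


# Hypothetical filename containing a right single quotation mark (U+2019)
filename = "Women\u2019s report.pdf"

print(ord("\u2019"))  # 8217, the code point from the Tomcat log
print(header_safe(filename))
print(header_safe(straighten_quotes(filename)))
</code></pre>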
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-04/">April, 2024</a></li>
<li><a href="/cgspace-notes/2024-03/">March, 2024</a></li>
<li><a href="/cgspace-notes/2024-02/">February, 2024</a></li>
<li><a href="/cgspace-notes/2024-01/">January, 2024</a></li>
<li><a href="/cgspace-notes/2023-12/">December, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>