<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2024" />
<meta property="og:description" content="2024-05-01
I dumped all the CGSpace DOIs and resolved them with my crossref_doi_lookup.py script
Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-05/" />
<meta property="article:published_time" content="2024-05-01T10:39:00+03:00" />
<meta property="article:modified_time" content="2024-05-13T16:24:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2024"/>
<meta name="twitter:description" content="2024-05-01
I dumped all the CGSpace DOIs and resolved them with my crossref_doi_lookup.py script
Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
"/>
<meta name="generator" content="Hugo 0.126.1">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-05/",
"wordCount": "652",
"datePublished": "2024-05-01T10:39:00+03:00",
"dateModified": "2024-05-13T16:24:11+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2024-05/">
<title>May, 2024 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2024-05/">May, 2024</a></h2>
<p class="blog-post-meta">
<time datetime="2024-05-01T10:39:00+03:00">Wed May 01, 2024</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2024-05-01">2024-05-01</h2>
<ul>
<li>I dumped all the CGSpace DOIs and resolved them with my <code>crossref_doi_lookup.py</code> script
<ul>
<li>Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc</li>
</ul>
</li>
</ul>
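<p>A minimal sketch of that kind of lookup, not the actual <code>crossref_doi_lookup.py</code>. It assumes the public Crossref REST API at <code>api.crossref.org</code>, and the helper names <code>fetch_crossref_message</code> and <code>extract_fields</code> are hypothetical:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref_message(doi):
    """Fetch the "message" object for one DOI from the Crossref REST API."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["message"]


def extract_fields(message):
    """Pick out the fields of interest, using None when Crossref has no value."""
    licenses = [entry.get("URL") for entry in message.get("license", [])]
    return {
        "abstract": message.get("abstract"),
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "publisher": message.get("publisher"),
        "type": message.get("type"),
        "licenses": licenses or None,
    }
</code></pre>
<p>Keeping the extraction separate from the network call makes it possible to check the field mapping against saved responses without hitting the API again.</p>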
<h2 id="2024-05-05">2024-05-05</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-05-06">2024-05-06</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-05-07">2024-05-07</h2>
<ul>
<li>Discuss RSS feeds and OpenSearch with IWMI
<ul>
<li>It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch</li>
</ul>
</li>
<li>I saw a patch for an interesting issue on DSpace GitHub: <a href="https://github.com/DSpace/DSpace/issues/9544">Error submitting or deleting items - URI too long when user is in a large number of groups</a>
<ul>
<li>I hadn&rsquo;t realized it, but we have lots of those errors:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ zstdgrep -a <span style="color:#e6db74">&#39;URI Too Long&#39;</span> log/dspace.log-2024-04-* | wc -l
</span></span><span style="display:flex;"><span>1423
</span></span></code></pre></div><ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-05-08">2024-05-08</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;
<ul>
<li>I finally finished looking at the duplicate DOIs for journal articles</li>
<li>I updated the list of handle redirects and there are 386 of them!</li>
</ul>
</li>
</ul>
<h2 id="2024-05-09">2024-05-09</h2>
<ul>
<li>Spend some time working on the IFPRI 2020&ndash;2021 batch
<ul>
<li>I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date</li>
</ul>
</li>
</ul>
<h2 id="2024-05-12">2024-05-12</h2>
<ul>
<li>I couldn&rsquo;t figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields (titles, handles, and provenance) separately:</li>
</ul>
<pre tabindex="0"><code class="language-psql" data-lang="psql">dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE &#39;Submitted by%&#39;) TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
</code></pre><ul>
<li>Then joined them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv &gt; /tmp/withdrawn.csv
</span></span></code></pre></div><ul>
<li>This gives me some insight into who submitted 334 of the duplicates over the past few years&hellip;</li>
<li>I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ﬀ, ﬁ, ﬂ, ﬃ, and ﬄ</li>
</ul>
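<p>A rough sketch of that kind of title cleanup, assuming only the five Latin ligatures listed above need expanding. The function name <code>clean_title</code> is hypothetical and this is not necessarily how I fixed them:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import re

# Latin ligature code points mapped to their plain-letter spellings
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
LIGATURE_TABLE = str.maketrans(LIGATURES)


def clean_title(title):
    """Expand ligatures, then collapse newlines and whitespace runs to one space."""
    expanded = title.translate(LIGATURE_TABLE)
    return re.sub(r"\s+", " ", expanded).strip()
</code></pre>
<p>An explicit table is used here instead of Unicode NFKC normalization, which would also expand these ligatures but rewrites many other characters as well.</p>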
<h2 id="2024-05-13">2024-05-13</h2>
<ul>
<li>Export a list of IFPRI information products with handle links and CONTENTdm links:</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c &#39;dc.description.provenance[en_US]&#39; -m &#39;CONTENTdm&#39; cgspace.csv \
| csvcut -c &#39;id,dc.description.provenance[en_US],dc.identifier.uri[en_US]&#39; \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
2645
</code></pre><ul>
<li>I discovered the <code>/server/api/pid/find</code> endpoint today, which is much more direct and manageable than the <code>/server/api/discover/search/objects?query=</code> endpoint when trying to get metadata for a Handle (item, collection, or community)
<ul>
<li>The &ldquo;pid&rdquo; stands for permanent identifiers apparently, and we can use it like this:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
</code></pre><h2 id="2024-05-15">2024-05-15</h2>
<ul>
<li>I got journal titles from Crossref for 2,900 journal articles that were missing them</li>
</ul>
<h2 id="2024-05-16">2024-05-16</h2>
<p>Helping IFPRI with some DSpace 7 API support. These are two queries for items issued in 2024:</p>
<ul>
<li><a href="https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024">https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024</a></li>
<li><a href="https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D">https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D</a> (note that the Lucene range syntax here is the URL-encoded version of <code>:[2024-01-01T00:00:00Z TO *]</code>)</li>
</ul>
<p>Both of them return the same number of results and seem identical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax</p>
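<p>A small sketch of building the encoded form of the second query with Python&rsquo;s standard library. The helper names <code>discover_url</code> and <code>issued_since_query</code> are hypothetical; the base URL and field name come from the examples above:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from urllib.parse import quote

# Base URL taken from the example queries above
API_BASE = "https://dspace7test.ilri.org/server/api"


def discover_url(query):
    """Build a discover search URL, percent-encoding the whole Lucene query."""
    return API_BASE + "/discover/search/objects?query=" + quote(query, safe="")


def issued_since_query(year):
    """Open-ended date range on dcterms.issued_dt starting 1 January of year."""
    return f"dcterms.issued_dt:[{year}-01-01T00:00:00Z TO *]"


print(discover_url(issued_since_query(2024)))
</code></pre>
<p>Passing <code>safe=""</code> makes <code>quote</code> encode the colons, brackets, spaces, and asterisk, so the range expression survives as a single <code>query</code> parameter.</p>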
<p>I wrote a new version of the <code>check_duplicates.py</code> script to help identify duplicates with different types</p>
<ul>
<li>Initially I called it <code>check_duplicates_fast.py</code> but it&rsquo;s actually not faster</li>
<li>I need to find a way to deal with duplicates from IFPRI&rsquo;s repository because there are some mismatched types&hellip;</li>
</ul>
<h2 id="2024-05-20">2024-05-20</h2>
<p>Continue working through alternative duplicate matching for IFPRI</p>
<ul>
<li>Their item types are sometimes different than ours&hellip;</li>
<li>The default similarity factor in my script is 0.6, but I rarely see legitimate duplicates with similarity that low, so I might increase it to 0.7 to reduce the number of items I have to check</li>
<li>Also, the maximum difference in issue dates is currently 365 days, but I should reduce that a bit, perhaps to 270 days (9 months)</li>
</ul>
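<p>To illustrate how those two thresholds interact, here is a minimal sketch. It uses <code>difflib.SequenceMatcher</code> as a stand-in similarity measure, which is an assumption and probably not what <code>check_duplicates.py</code> uses, so the 0.6 and 0.7 values are not directly comparable. The function names are hypothetical:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from datetime import date
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two case-folded titles."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_candidate_duplicate(title_a, issued_a, title_b, issued_b,
                           min_similarity=0.7, max_days=270):
    """True when titles are similar enough and issue dates are close enough."""
    days_apart = abs((issued_a - issued_b).days)
    similar = title_similarity(title_a, title_b) >= min_similarity
    return similar and max_days >= days_apart
</code></pre>
<p>Raising <code>min_similarity</code> or lowering <code>max_days</code> can only shrink the set of candidate pairs, which is the point of tightening both.</p>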
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-05/">May, 2024</a></li>
<li><a href="/cgspace-notes/2024-04/">April, 2024</a></li>
<li><a href="/cgspace-notes/2024-03/">March, 2024</a></li>
<li><a href="/cgspace-notes/2024-02/">February, 2024</a></li>
<li><a href="/cgspace-notes/2024-01/">January, 2024</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>