<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2024" />
<meta property="og:description" content="2024-05-01
I dumped all the CGSpace DOIs and resolved them with my crossref_doi_lookup.py script
Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-05/" />
<meta property="article:published_time" content="2024-05-01T10:39:00+03:00" />
<meta property="article:modified_time" content="2024-05-13T08:21:17+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2024"/>
<meta name="twitter:description" content="2024-05-01
I dumped all the CGSpace DOIs and resolved them with my crossref_doi_lookup.py script
Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
"/>
<meta name="generator" content="Hugo 0.125.7">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-05/",
"wordCount": "438",
"datePublished": "2024-05-01T10:39:00+03:00",
"dateModified": "2024-05-13T08:21:17+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2024-05/">
<title>May, 2024 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2024-05/">May, 2024</a></h2>
<p class="blog-post-meta">
<time datetime="2024-05-01T10:39:00+03:00">Wed May 01, 2024</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2024-05-01">2024-05-01</h2>
<ul>
<li>I dumped all the CGSpace DOIs and resolved them with my <code>crossref_doi_lookup.py</code> script
<ul>
<li>Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc</li>
</ul>
</li>
</ul>
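<p>A rough sketch of this kind of lookup, assuming the public Crossref REST API at <code>https://api.crossref.org/works/{doi}</code> and its <code>message</code> response object. This is not the actual <code>crossref_doi_lookup.py</code>; the function names and the choice of fields are hypothetical and only illustrate where the missing metadata could come from.</p>

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"  # public Crossref REST API


def crossref_url(doi: str) -> str:
    """Build the Crossref works URL for a DOI (hypothetical helper)."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def extract_fields(message: dict) -> dict:
    """Pull the fields mentioned above out of a Crossref "message" object.

    Fields missing from the record come back as None.
    """
    licenses = [entry.get("URL") for entry in message.get("license", [])]
    return {
        "abstract": message.get("abstract"),
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "publisher": message.get("publisher"),
        "type": message.get("type"),
        "licenses": licenses or None,
    }


def lookup(doi: str) -> dict:
    """Fetch one DOI from Crossref (needs network access)."""
    with urllib.request.urlopen(crossref_url(doi), timeout=30) as response:
        return extract_fields(json.load(response)["message"])
```

<p>Keeping <code>extract_fields</code> separate from the network call makes it easy to check against saved responses before touching any items.</p>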
<h2 id="2024-05-05">2024-05-05</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-05-06">2024-05-06</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-05-07">2024-05-07</h2>
<ul>
<li>Discuss RSS feeds and OpenSearch with IWMI
<ul>
<li>It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch</li>
</ul>
</li>
<li>I saw a patch for an interesting issue on DSpace GitHub: <a href="https://github.com/DSpace/DSpace/issues/9544">Error submitting or deleting items - URI too long when user is in a large number of groups</a>
<ul>
<li>I hadn&rsquo;t realized it, but we have lots of those errors:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ zstdgrep -a <span style="color:#e6db74">&#39;URI Too Long&#39;</span> log/dspace.log-2024-04-* | wc -l
</span></span><span style="display:flex;"><span>1423
</span></span></code></pre></div><ul>
<li>Spend some time looking at duplicate DOIs again&hellip;</li>
</ul>
<h2 id="2024-05-08">2024-05-08</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again&hellip;
<ul>
<li>I finally finished looking at the duplicate DOIs for journal articles</li>
<li>I updated the list of handle redirects and there are 386 of them!</li>
</ul>
</li>
</ul>
<h2 id="2024-05-09">2024-05-09</h2>
<ul>
<li>Spend some time working on the IFPRI 2020&ndash;2021 batch
<ul>
<li>I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date</li>
</ul>
</li>
</ul>
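<p>The exact-match check can be sketched roughly as follows. This is only an illustration, not the tool I used: the column names <code>doi</code>, <code>type</code>, and <code>issue_date</code> are placeholders for whatever the real export uses, and rows without a DOI are skipped on the assumption that they need a fuzzier comparison.</p>

```python
from collections import defaultdict

# Column names are assumptions for illustration; adjust to the real export.
KEY_COLUMNS = ("doi", "type", "issue_date")


def normalize(value: str) -> str:
    """Trim and lowercase so trivially different values still compare equal."""
    return value.strip().lower()


def row_key(row: dict) -> tuple:
    """Build the comparison key (DOI, type, issue date) for one row."""
    return tuple(normalize(row[column]) for column in KEY_COLUMNS)


def exact_duplicates(existing: list, batch: list) -> list:
    """Return (batch_row, matching_existing_rows) pairs with identical keys."""
    index = defaultdict(list)
    for row in existing:
        index[row_key(row)].append(row)

    matches = []
    for row in batch:
        key = row_key(row)
        if not key[0]:
            continue  # no DOI, so an exact match on this key means little
        if key in index:
            matches.append((row, index[key]))
    return matches
```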
<h2 id="2024-05-12">2024-05-12</h2>
<ul>
<li>I couldn&rsquo;t figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields (titles, handles, and provenance) separately:</li>
</ul>
<pre tabindex="0"><code class="language-psql" data-lang="psql">dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE &#39;Submitted by%&#39;) TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
</code></pre><ul>
<li>Then joined them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv &gt; /tmp/withdrawn.csv
</span></span></code></pre></div><ul>
<li>This gives me some insight into who submitted 334 of the duplicates over the past few years&hellip;</li>
<li>I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligature characters like ﬀ, ﬁ, ﬂ, ﬃ, and ﬄ</li>
</ul>
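<p>One way to do this kind of title cleanup, as a sketch rather than the exact steps I took: Unicode NFKC normalization expands the ligature code points into plain letters, and a regular expression collapses the stray whitespace. The function name <code>clean_title</code> is hypothetical.</p>

```python
import re
import unicodedata


def clean_title(title: str) -> str:
    """Expand compatibility ligatures and collapse stray whitespace."""
    # NFKC maps ligature code points such as U+FB01 to plain letters ("fi").
    # It also folds other compatibility characters (superscripts, for
    # example), so review the changes before applying them in bulk.
    expanded = unicodedata.normalize("NFKC", title)
    # Newlines, tabs, and runs of spaces become a single space.
    return re.sub(r"\s+", " ", expanded).strip()
```

<p>For example, <code>clean_title("  E\ufb03cient \ufb01eld\ntrials ")</code> returns <code>"Efficient field trials"</code>.</p>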
<h2 id="2024-05-13">2024-05-13</h2>
<ul>
<li>Export a list of IFPRI information products with handle links and CONTENTdm links:</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c &#39;dc.description.provenance[en_US]&#39; -m &#39;CONTENTdm&#39; cgspace.csv \
| csvcut -c &#39;id,dc.description.provenance[en_US],dc.identifier.uri[en_US]&#39; \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
2645
</code></pre><ul>
<li>I discovered the <code>/server/api/pid/find</code> endpoint today, which is much more direct and manageable than the <code>/server/api/discover/search/objects?query=</code> endpoint when trying to get metadata for a Handle (item, collection, or community)
<ul>
<li>The &ldquo;pid&rdquo; stands for permanent identifiers apparently, and we can use it like this:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
</code></pre>
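<p>A small sketch for building such URLs in a script. The base URL is the test server from the example above and the helper name <code>pid_find_url</code> is hypothetical; only the <code>/pid/find?id=</code> path comes from the endpoint itself.</p>

```python
import urllib.parse

# Base URL taken from the example above; any other server is an assumption.
API_BASE = "https://dspace7test.ilri.org/server/api"


def pid_find_url(handle: str, base: str = API_BASE) -> str:
    """Build a /pid/find URL for a Handle such as "10568/118424"."""
    # Keep the slash in the Handle readable instead of encoding it as %2F.
    query = urllib.parse.urlencode({"id": handle}, safe="/")
    return f"{base}/pid/find?{query}"
```

<p>The result can be passed to any HTTP client to fetch the metadata for the item, collection, or community behind the Handle.</p>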
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-05/">May, 2024</a></li>
<li><a href="/cgspace-notes/2024-04/">April, 2024</a></li>
<li><a href="/cgspace-notes/2024-03/">March, 2024</a></li>
<li><a href="/cgspace-notes/2024-02/">February, 2024</a></li>
<li><a href="/cgspace-notes/2024-01/">January, 2024</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>