cgspace-notes/docs/2023-10/index.html
2024-11-19 10:40:23 +03:00

<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2023" />
<meta property="og:description" content="2023-10-02
Export CGSpace to check DOIs against Crossref
I found that Crossref&rsquo;s metadata is in the public domain under the CC0 license
One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-10/" />
<meta property="article:published_time" content="2023-10-02T09:05:36+03:00" />
<meta property="article:modified_time" content="2023-11-02T20:58:43+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2023"/>
<meta name="twitter:description" content="2023-10-02
Export CGSpace to check DOIs against Crossref
I found that Crossref&rsquo;s metadata is in the public domain under the CC0 license
One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive
We can be on the safe side by using only abstracts for items that are licensed under Creative Commons
"/>
<meta name="generator" content="Hugo 0.133.1">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-10/",
"wordCount": "1153",
"datePublished": "2023-10-02T09:05:36+03:00",
"dateModified": "2023-11-02T20:58:43+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2023-10/">
<title>October, 2023 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2023-10/">October, 2023</a></h2>
<p class="blog-post-meta">
<time datetime="2023-10-02T09:05:36+03:00">Mon Oct 02, 2023</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2023-10-02">2023-10-02</h2>
<ul>
<li>Export CGSpace to check DOIs against Crossref
<ul>
<li>I found that <a href="https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/">Crossref&rsquo;s metadata is in the public domain under the CC0 license</a></li>
<li>One interesting thing is the abstracts, which are copyrighted by the copyright owner, meaning Crossref cannot waive the copyright under the terms of the CC0 license, because it is not theirs to waive</li>
<li>We can be on the safe side by using only abstracts for items that are licensed under Creative Commons</li>
</ul>
</li>
</ul>
<ul>
<li>This GREL extracts the <em>text</em> content of the <code>&lt;jats:p&gt;</code> tags (i.e., without other JATS XML markup tags such as <code>&lt;jats:i&gt;</code>, <code>&lt;jats:sub&gt;</code>, etc):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>forEach(value.parseXml().select(&#34;jats|p&#34;),i,i.xmlText()).join(&#34;&#34;)
</span></span></code></pre></div><ul>
<li>Note that we need to use <code>select(&quot;jats|p&quot;)</code> instead of <code>select(&quot;jats:p&quot;)</code> for OpenRefine&rsquo;s parseXml, and we need to <code>join()</code> the results at the end</li>
<li>I updated metadata for about 3,000 items using Crossref metadata
<ul>
<li>I stripped trailing periods from our titles where the Crossref titles did not have them</li>
<li>I copied abstracts for about 600 Creative Commons licensed items that were missing them</li>
<li>I updated publishers for a few thousand more where ours and Crossref disagreed, checking a handful manually first</li>
</ul>
</li>
<li>I also added subjects to the <code>crossref_doi_lookup.py</code> script to see if they will be useful for us
<ul>
<li>When checking with csv-metadata-quality I can validate those subjects against AGROVOC and add them if they are valid</li>
</ul>
</li>
</ul>
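<ul>
<li>Outside OpenRefine, the same idea as the GREL expression above can be sketched in Python. This is only a sketch under assumptions: the abstract is taken to be a string of JATS markup, the lenient <code>html.parser</code> from the standard library stands in for a namespace-aware XML parser, and the names <code>JatsParagraphText</code> and <code>jats_abstract_text</code> are hypothetical:</li>
</ul>

```python
from html.parser import HTMLParser


class JatsParagraphText(HTMLParser):
    """Collect the text inside jats:p elements, dropping nested markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many jats:p elements we are currently inside
        self.paragraphs = []  # one string per outermost jats:p element

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and keeps the "jats:" prefix as-is
        if tag == "jats:p":
            if self.depth == 0:
                self.paragraphs.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "jats:p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text of nested tags (jats:i, jats:sub, ...) is kept; the tags are not
        if self.depth:
            self.paragraphs[-1] += data


def jats_abstract_text(abstract: str) -> str:
    """Return the joined text of all jats:p elements, like join("") in GREL."""
    parser = JatsParagraphText()
    parser.feed(abstract)
    parser.close()
    return "".join(parser.paragraphs)
```

<ul>
<li>Text outside the paragraphs, such as a <code>&lt;jats:title&gt;</code> heading, is ignored, which matches selecting only <code>jats|p</code> in the GREL</li>
</ul>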
<h2 id="2023-10-03">2023-10-03</h2>
<ul>
<li>I added the item type to the collection subscription email on DSpace 6
<ul>
<li>It&rsquo;s done differently on DSpace 7 so I&rsquo;ll have to see how to do it there&hellip;</li>
</ul>
</li>
<li>Test a patch that fixes a bug with item versioning disabled in DSpace 7
<ul>
<li>I hadn&rsquo;t realized that DSpace 7 defaulted to versioning being enabled, whereas we never used this in DSpace 6 (yet)</li>
</ul>
</li>
<li>Submit <a href="https://github.com/DSpace/DSpace/issues/9104">an issue regarding duplicate Discovery sort fields</a> in DSpace 7</li>
</ul>
<h2 id="2023-10-05">2023-10-05</h2>
<ul>
<li>Some discussion this week about issue and online dates for journal articles, with regard to PRMS
<ul>
<li>I looked more closely at the <a href="https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md">Crossref API docs</a> and realized (again) that their &ldquo;issue&rdquo; date is not the same as our issue date—they take the earlier of the print and online dates!</li>
<li>Also, <em>very many</em> items have no print date at all, perhaps due to delays, errors, or simply because the journal is &ldquo;online only&rdquo;!</li>
<li>I suggested again that PRMS should consider both dates, take the earlier of the two, and then check whether that date falls in the current reporting period</li>
<li>I managed to find 80 items with print publishing dates from 2023 and updated those from Crossref, but for the rest we will have to think about how we handle them</li>
</ul>
</li>
</ul>
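<ul>
<li>A minimal sketch of the &ldquo;earlier of the two dates&rdquo; rule suggested above, assuming a Crossref work record as a Python dict with the <code>published-print</code> and <code>published-online</code> fields and their <code>date-parts</code> lists. The function names are hypothetical, and treating the reporting period as a calendar year is an assumption:</li>
</ul>

```python
from datetime import date


def crossref_date(work: dict, field: str):
    """Return the date in a Crossref date field, or None if it is absent."""
    parts = (work.get(field) or {}).get("date-parts") or [[]]
    first = [p for p in parts[0] if p is not None]
    if not first:
        return None
    # Crossref dates may be partial (year only, or year and month), so pad
    # the missing month and day with 1 to make them comparable.
    year, month, day = (first + [1, 1])[:3]
    return date(year, month, day)


def earliest_publication_date(work: dict):
    """Return the earlier of the print and online dates, if either exists."""
    candidates = [
        crossref_date(work, "published-print"),
        crossref_date(work, "published-online"),
    ]
    known = [d for d in candidates if d is not None]
    return min(known) if known else None


def in_reporting_year(work: dict, year: int) -> bool:
    """Assumption: the reporting period is the calendar year `year`."""
    earliest = earliest_publication_date(work)
    return earliest is not None and earliest.year == year
```

<ul>
<li>Items with only one of the two dates still get an answer, and items with neither return <code>None</code> so they can be reviewed by hand</li>
</ul>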
<h2 id="2023-10-06">2023-10-06</h2>
<ul>
<li>More discussion about dates after looking closely at them yesterday and today
<ul>
<li>Crossref doesn&rsquo;t always have both issued and online dates—sometimes they have one, sometimes the other, and sometimes both, so we cannot rely on them 100% for that.</li>
<li>In some cases, the item is available online for months (or even a year!), but has not been included in an issue yet, and thus has no &ldquo;issue&rdquo; date, for example:
<ul>
<li><a href="https://doi.org/10.1002/csc2.20914">https://doi.org/10.1002/csc2.20914</a> &lt;&mdash; published online January 2023!</li>
<li><a href="https://doi.org/10.1111/mcn.13401">https://doi.org/10.1111/mcn.13401</a> &lt;&mdash; published online July 2022!</li>
</ul>
</li>
<li>Even journals make mistakes: this journal article was &ldquo;issued&rdquo; in 2022, but online in 2023! This is not Crossref&rsquo;s fault, but the journal&rsquo;s!
<ul>
<li><a href="https://doi.org/10.1186/s40066-022-00400-6">https://doi.org/10.1186/s40066-022-00400-6</a></li>
</ul>
</li>
<li>I found a bunch more strange cases regarding dates and recommended to the PRMS team that they use the earlier of the issued and online dates</li>
</ul>
</li>
<li>Meet with Aditi to start discussing the scope of knowledge products we can get for the CGIAR climate change synthesis</li>
</ul>
<h2 id="2023-10-07">2023-10-07</h2>
<ul>
<li>I spent a few hours (!) debugging an issue in Python when downloading PDFs
<ul>
<li>I think it ended up being due to <code>requests_cache</code>!!! Grrrr</li>
<li>On a positive note I&rsquo;ve greatly refactored my script for discovering and downloading PDFs from Unpaywall</li>
</ul>
</li>
<li>Export CGSpace to check for missing Initiative collection mappings</li>
<li>Start a harvest on AReS</li>
</ul>
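<ul>
<li>The discovery step for PDFs on Unpaywall could look roughly like the following. This is not the refactored script itself, only a sketch that assumes an Unpaywall record already parsed into a dict with the <code>best_oa_location</code>, <code>oa_locations</code>, and <code>url_for_pdf</code> fields; the function name is hypothetical, and downloading and caching are left out:</li>
</ul>

```python
def unpaywall_pdf_urls(record: dict) -> list:
    """Return the distinct PDF URLs in an Unpaywall record, best location first."""
    locations = [record.get("best_oa_location")]
    locations += list(record.get("oa_locations") or [])
    urls = []
    for location in locations:
        # Locations can be missing (None) or lack a direct PDF link
        url = (location or {}).get("url_for_pdf")
        if url and url not in urls:
            urls.append(url)
    return urls
```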
<h2 id="2023-10-08">2023-10-08</h2>
<ul>
<li>Starting to see some stuck locks on CGSpace this morning
<ul>
<li>I will give notice and restart CGSpace</li>
</ul>
</li>
<li>Work on Python script to harvest DSpace REST API and save to CSV</li>
</ul>
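<ul>
<li>A rough outline of harvesting a paginated REST API to CSV, assuming <code>limit</code> and <code>offset</code> style paging. The <code>fetch_page</code> callable is hypothetical and stands in for the actual HTTP request, so the paging and CSV logic can be exercised without a server:</li>
</ul>

```python
import csv
import io


def harvest(fetch_page, limit=100):
    """Yield items page by page until the API returns an empty page."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        if not page:
            break
        yield from page
        offset += limit


def items_to_csv(items, fieldnames):
    """Return CSV text with the chosen fields of each item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # skip item keys that are not requested columns
        lineterminator="\n",
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()
```

<ul>
<li>Stopping on the first empty page avoids needing to know the total number of items in advance</li>
</ul>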
<h2 id="2023-10-11">2023-10-11</h2>
<ul>
<li>File an issue on the DSpace issue tracker regarding the MaxMind JSON objects in our Solr statistics: <a href="https://github.com/DSpace/DSpace/issues/9118">https://github.com/DSpace/DSpace/issues/9118</a></li>
</ul>
<h2 id="2023-10-12">2023-10-12</h2>
<ul>
<li>Discuss MODS issues in CGSpace&rsquo;s OAI-PMH with Stefano and Valentina
<ul>
<li>AGRIS can currently only support MODS 3.7, so they need us to roll back our MODS 3.8 work from 2023-06 to 3.7, which requires some minor changes to the crosswalk</li>
</ul>
</li>
</ul>
<h2 id="2023-10-13">2023-10-13</h2>
<ul>
<li>I did some more minor work to get the MODS 3.7 changes ready for AGRIS on DSpace Test</li>
</ul>
<h2 id="2023-10-14">2023-10-14</h2>
<ul>
<li>Export CGSpace to check for missing Initiative collection mappings</li>
<li>Start a harvest on AReS</li>
<li>I deployed the AGRIS changes for OAI-PMH on CGSpace</li>
</ul>
<h2 id="2023-10-16">2023-10-16</h2>
<ul>
<li>Fix some typos in ILRI subjects on CGSpace
<ul>
<li>These were affecting the taxonomy on ilri.org</li>
<li>I exported CGSpace and did some validation and cleanup on ILRI subjects, moving some to AGROVOC subjects</li>
</ul>
</li>
<li>Port the MODS 3.7 crosswalk from DSpace 6 to DSpace 7
<ul>
<li>It works fine, we only need to take note that the OAI-PMH endpoint is now relative to the <code>/server</code> path instead of a dedicated OAI path</li>
</ul>
</li>
</ul>
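<ul>
<li>To illustrate the path change noted above, here is a small sketch that builds OAI-PMH request URLs for both versions. It assumes the default layouts (<code>/oai/request</code> on DSpace 6 and <code>/server/oai/request</code> on DSpace 7), and both the function name and the <code>mods</code> metadata prefix in the usage note are assumptions:</li>
</ul>

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, dspace_major_version: int, **params) -> str:
    """Build an OAI-PMH request URL for a DSpace 6 or DSpace 7 site."""
    # DSpace 6 used a dedicated /oai path, while DSpace 7 serves OAI-PMH
    # from the backend under /server (assumed default deployment layout).
    prefix = "/oai" if dspace_major_version == 6 else "/server/oai"
    return f"{base_url.rstrip('/')}{prefix}/request?{urlencode(params)}"
```

<ul>
<li>For example, <code>oai_request_url(&quot;https://example.org&quot;, 7, verb=&quot;ListRecords&quot;, metadataPrefix=&quot;mods&quot;)</code> would point a harvester at the new location</li>
</ul>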
<h2 id="2023-10-17">2023-10-17</h2>
<ul>
<li>Export CGSpace to clean up invalid metadata values across the repository
<ul>
<li>I found many metadata values in the wrong field, wrong format, etc</li>
<li>This ended up being cleanups for 694 items</li>
</ul>
</li>
</ul>
<h2 id="2023-10-20">2023-10-20</h2>
<ul>
<li>Export CGSpace to check for missing Initiative collection mappings</li>
<li>I also did a run of looking up all Initiative outputs with DOIs against Crossref to check for missing dates, publishers, etc
<ul>
<li>I found issued dates for a few, and online dates for over 100</li>
<li>I also fixed some incorrect licenses, access status, and abstracts</li>
</ul>
</li>
</ul>
<h2 id="2023-10-23">2023-10-23</h2>
<ul>
<li>Export a list of Internal Documents for Peter to review to see if we can re-classify some
<ul>
<li>Peter sent changes for 740 items so I applied them on CGSpace</li>
</ul>
</li>
<li>Testing the changes for OpenRXV DSpace 7 compatibility</li>
</ul>
<h2 id="2023-10-24">2023-10-24</h2>
<ul>
<li>Sync DSpace 7 Test with a fresh CGSpace snapshot</li>
<li>Meeting with FARA to discuss DSpace training and support</li>
<li>Meeting with IFPRI about migrating to CGSpace</li>
</ul>
<h2 id="2023-10-25">2023-10-25</h2>
<ul>
<li>Maria was asking about an error deleting an item in the Alliance community
<ul>
<li>The error was &ldquo;Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:&hellip;&rdquo;</li>
<li>According to my notes this error happened a few times in the past and is some kind of corner case regarding permissions</li>
<li>I deleted the item for her</li>
</ul>
</li>
<li>I deleted a handful of old CRP groups on CGSpace</li>
</ul>
<h2 id="2023-10-27">2023-10-27</h2>
<ul>
<li>Peter sent me a list of journal articles from Altmetric that have an ILRI affiliation, but no Handle
<ul>
<li>I used my <code>crossref_doi_lookup.py</code> script to fetch the metadata for them using their DOIs, then did a bunch of cleanup in OpenRefine</li>
</ul>
</li>
<li>Test some LDAP patches for DSpace 7</li>
</ul>
<h2 id="2023-10-30">2023-10-30</h2>
<ul>
<li>Some work on metadata for Aditi&rsquo;s review
<ul>
<li>I found more preprints grrrr</li>
</ul>
</li>
</ul>
<h2 id="2023-10-31">2023-10-31</h2>
<ul>
<li>Peter got back to me with the cleanups on ILRI journal articles from Altmetric that we didn&rsquo;t have on CGSpace
<ul>
<li>I did another duplicate check and found four more duplicates that had been uploaded yesterday</li>
<li>Then I did a quick sanity check and uploaded the remaining 19 items to CGSpace</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
<li><a href="/cgspace-notes/2024-08/">August, 2024</a></li>
<li><a href="/cgspace-notes/2024-07/">July, 2024</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>