cgspace-notes/docs/2024-06/index.html

319 lines
13 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="June, 2024" />
<meta property="og:description" content="2024-06-03
Working on IFPRI datasets
I noticed the licenses were missing from Nilam&rsquo;s original file so I found a way to check Dataverse&rsquo;s API for a persistent identifier
We have both Handles and DOIs for these datasets, both from Harvard&rsquo;s Dataverse
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-06/" />
<meta property="article:published_time" content="2024-06-03T14:14:00+03:00" />
<meta property="article:modified_time" content="2024-07-02T11:12:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2024"/>
<meta name="twitter:description" content="2024-06-03
Working on IFPRI datasets
I noticed the licenses were missing from Nilam&rsquo;s original file so I found a way to check Dataverse&rsquo;s API for a persistent identifier
We have both Handles and DOIs for these datasets, both from Harvard&rsquo;s Dataverse
"/>
<meta name="generator" content="Hugo 0.133.1">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "June, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-06/",
"wordCount": "607",
"datePublished": "2024-06-03T14:14:00+03:00",
"dateModified": "2024-07-02T11:12:03+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2024-06/">
<title>June, 2024 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2024-06/">June, 2024</a></h2>
<p class="blog-post-meta">
<time datetime="2024-06-03T14:14:00+03:00">Mon Jun 03, 2024</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2024-06-03">2024-06-03</h2>
<ul>
<li>Working on IFPRI datasets
<ul>
<li>I noticed the licenses were missing from Nilam&rsquo;s original file so I found a way to check <a href="https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats">Dataverse&rsquo;s API for a persistent identifier</a></li>
<li>We have both Handles and DOIs for these datasets, both from Harvard&rsquo;s Dataverse</li>
</ul>
</li>
</ul>
<ul>
<li>I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):</li>
</ul>
<pre tabindex="0"><code>&#34;https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&amp;persistentId=doi:&#34; + value.split(&#39;https://doi.org/&#39;)[-1].toUppercase()
</code></pre><ul>
<li>Then I was able to extract the license text from the JSON response using:</li>
</ul>
<pre tabindex="0"><code>value.parseJson()[&#39;datasetVersion&#39;][&#39;termsOfUse&#39;]
</code></pre><ul>
<li>Similar for the Handle&hellip;</li>
</ul>
<h2 id="2024-06-04">2024-06-04</h2>
<ul>
<li>Some Dataverse entries have the license in <code>['datasetVersion']['license']</code> instead&hellip;</li>
<li>I finalized cleaning the 722 IFPRI datasets and uploaded them to CGSpace</li>
</ul>
<h2 id="2024-06-14">2024-06-14</h2>
<ul>
<li>Minor cleanups on IFPRI&rsquo;s 20162019 batch migration file
<ul>
<li>I will start with duplicates on unique identifiers like DOIs</li>
</ul>
</li>
</ul>
<h2 id="2026-06-18">2026-06-18</h2>
<ul>
<li>Merge and upload metadata for duplicates in IFPRI&rsquo;s 20162019 set:
<ul>
<li>144 exact match on CGSpace via DOI, type, and date</li>
<li>32 with CGSpace handles</li>
<li>I also spent some time converting the <code>ilri/post_bitstreams.py</code> script to use the DSpace 7 REST API via dspace-rest-client</li>
<li>There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them</li>
</ul>
</li>
</ul>
<h2 id="2024-06-19">2024-06-19</h2>
<ul>
<li>Spent some time checking the remaining 3312 IFPRI 20162019 migration set for duplicates on CGSpace
<ul>
<li>There seem to be about 50 exact matches of title, type, and issue date</li>
</ul>
</li>
</ul>
<h2 id="2024-06-20">2024-06-20</h2>
<ul>
<li>Finalize merging and uploading metadata for 48 duplicates from the IFPRI 20162019 migration set</li>
<li>Heavy load on both CGSpace and DSpace 7 Test this afternoon
<ul>
<li>Took me a while to figure out it was due to someone / something hammering <code>/search</code> for a bunch of facets</li>
<li>The <code>pm2 logs</code> command was more useful than the nginx logs to see the requests at least, for example:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&amp;spc.page=1&amp;f.accessRights=Open%20Access,equals&amp;f.dateIssued.min=2023&amp;f.dateIssued.max=2024&amp;f.country=Colombia,equals&amp;f.subject=climate%20change,equals&amp;f.region=Latin%20America%20and%20the%20Caribbean,equals&amp;f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&amp;spc.page=1&amp;f.sponsorship=CGIAR%20Trust%20Fund,equals&amp;f.impactArea=Climate%20adaptation%20and%20mitigation,equals&amp;f.region=Eastern%20Africa,equals&amp;f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&amp;f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&amp;spc.page=1&amp;f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&amp;f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&amp;f.dateIssued.min=2020&amp;f.dateIssued.max=2021&amp;f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
</code></pre><ul>
<li>Still difficult to find the client, because the logs are all <a href="https://github.com/DSpace/dspace-angular/issues/2902">coming from Angular&rsquo;s user agent</a> and IP
<ul>
<li>I changed the nginx logging to use the <code>X-Forwarded-For</code> header, as the default <code>combined</code> log format uses <code>$remote_addr</code> by default, which is only accurate if the request doesn&rsquo;t come from Angular (ie directly to the API)</li>
<li>From what I can see now the IPs are all coming from Huawei Cloud and Tencent</li>
<li>The ASNs are AS136907 (Huawei) and AS132203 (Tencent)</li>
<li>For now I will just add those to the list of bot networks</li>
</ul>
</li>
</ul>
<h2 id="2024-06-21">2024-06-21</h2>
<ul>
<li>Update the nginx logging to use <a href="http://nginx.org/en/docs/http/ngx_http_realip_module.html">nginx&rsquo;s <code>real_ip</code> module</a> to log the correct client IP
<ul>
<li>I think this means we will start sending &lsquo;bot&rsquo; to the Angular / Express frontend because bot IPs will be properly classified now&hellip;</li>
<li>I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in <a href="https://github.com/DSpace/dspace-angular/issues/2902">https://github.com/DSpace/dspace-angular/issues/2902</a> is to pass on the client&rsquo;s user-agent</li>
</ul>
</li>
<li>Then I updated the list of bot networks:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/list/AS12876 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> https://asn.ipinfo.app/api/text/list/AS132203 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS13238 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS136907 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS14061 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS14618 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS16276 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS16509 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS203020 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS204287 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS21859 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS23576 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS24940 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS396982 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS45102 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS50245 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS55286 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS6939 \
</span></span><span style="display:flex;"><span> https://asn.ipinfo.app/api/text/list/AS8075
</span></span><span style="display:flex;"><span>$ cat AS* | ~/go/bin/mapcidr -a &gt; /tmp/networks.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/networks.txt
</span></span><span style="display:flex;"><span>8675 /tmp/networks.txt
</span></span></code></pre></div><ul>
<li>Update list of ORCID identifiers with new ones from Alliance and IFPRI</li>
<li>Finalize uploading the remaining 3,264 items from IFPRI&rsquo;s 20162019 batch migration to CGSpace</li>
</ul>
<h2 id="2024-06-24">2024-06-24</h2>
<ul>
<li>Minor updates to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> and <a href="https://github.com/ilri/cgspace-java-helpers">cgspace-java-helpers</a> to normalize a few more invalid DOI formats</li>
</ul>
<h2 id="2024-06-25">2024-06-25</h2>
<ul>
<li>Work on uploading some missing PDFs from the IFPRI 20162019 batch migration</li>
</ul>
<h2 id="2024-06-26">2024-06-26</h2>
<ul>
<li>Did a big cleanup of several thousand journal articles based on metadata from Crossref</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
<li><a href="/cgspace-notes/2024-08/">August, 2024</a></li>
<li><a href="/cgspace-notes/2024-07/">July, 2024</a></li>
<li><a href="/cgspace-notes/2024-06/">June, 2024</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>