<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="January, 2023" />
<meta property="og:description" content="2023-01-01
Apply some more ORCID identifiers to items on CGSpace using my 2022-09-22-add-orcids.csv file
I want to update all ORCID names and refresh them in the database
I see we have some new ones that aren&rsquo;t in our list if I combine with this file:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-01/" />
<meta property="article:published_time" content="2023-01-01T08:44:36+03:00" />
<meta property="article:modified_time" content="2023-01-12T23:11:42+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2023"/>
<meta name="twitter:description" content="2023-01-01
Apply some more ORCID identifiers to items on CGSpace using my 2022-09-22-add-orcids.csv file
I want to update all ORCID names and refresh them in the database
I see we have some new ones that aren&rsquo;t in our list if I combine with this file:
"/>
<meta name="generator" content="Hugo 0.109.0">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-01/",
"wordCount": "1103",
"datePublished": "2023-01-01T08:44:36+03:00",
"dateModified": "2023-01-12T23:11:42+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2023-01/">
<title>January, 2023 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2023-01/">January, 2023</a></h2>
<p class="blog-post-meta">
<time datetime="2023-01-01T08:44:36+03:00">Sun Jan 01, 2023</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2023-01-01">2023-01-01</h2>
<ul>
<li>Apply some more ORCID identifiers to items on CGSpace using my <code>2022-09-22-add-orcids.csv</code> file
<ul>
<li>I want to update all ORCID names and refresh them in the database</li>
<li>I see we have some new ones that aren&rsquo;t in our list if I combine with this file:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>1939
</span></span><span style="display:flex;"><span>$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>1973
</span></span></code></pre></div><ul>
<li>I will extract and process them with my <code>resolve-orcids.py</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml 2022-09-22-add-orcids.csv| grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u &gt; /tmp/2023-01-01-orcids.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2023-01-01-orcids.txt -o /tmp/2023-01-01-orcids-names.txt -d
</span></span></code></pre></div><ul>
<li>Then I update the names in the database with my <code>update-orcids.py</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/update-orcids.py -i /tmp/2023-01-01-orcids-names.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -m <span style="color:#ae81ff">247</span>
</span></span></code></pre></div><ul>
<li>Load on CGSpace is high around 9.x
<ul>
<li>I see there is a CIAT bot harvesting via the REST API with IP 45.5.186.2</li>
<li>Other than that I don&rsquo;t see any particular system stats as alarming</li>
<li>There has been a marked increase in load in the last few weeks, perhaps due to Initiative activity&hellip;</li>
<li>Perhaps there are some stuck PostgreSQL locks from CLI tools?</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | grep -o -E <span style="color:#e6db74">&#39;(dspaceWeb|dspaceApi|dspaceCli)&#39;</span> | sort | uniq -c
</span></span><span style="display:flex;"><span> 58 dspaceCli
</span></span><span style="display:flex;"><span> 46 dspaceWeb
</span></span></code></pre></div><ul>
<li>The current time on the server is 08:52 and I see the dspaceCli locks were started at 04:00 and 05:00&hellip; so I need to check which cron jobs those belong to as I think I noticed this last month too
<ul>
<li>I&rsquo;m going to wait and see if they finish, but by tomorrow I will kill them</li>
</ul>
</li>
</ul>
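<ul>
<li>To see which connections hold the oldest locks, rather than only counting them per pool, a query along these lines might help. This is a minimal sketch, not something from the original session: it assumes the pool names (<code>dspaceCli</code> and so on) show up in <code>application_name</code>, that <code>query_start</code> is the timestamp of interest, and that <code>psql</code> picks up its connection settings from the environment as in the command above. The <code>lock_query</code> variable name is only for illustration:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Group locks by the connection that owns them so the oldest ones stand out
lock_query='SELECT psa.application_name, psa.pid,
       min(psa.query_start) AS oldest_query_start, count(*) AS locks
FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
GROUP BY psa.application_name, psa.pid
ORDER BY oldest_query_start;'

# Run only where psql is installed (assumption: PG* environment variables
# or defaults point at the DSpace database)
if command -v psql >/dev/null 2>/dev/null; then
    psql -c "$lock_query" || echo "psql could not run the query"
fi
</code></pre></div>
<ul>
<li>The <code>pid</code> column is what would be needed later if the processes behind the locks have to be killed</li>
</ul>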
<h2 id="2023-01-02">2023-01-02</h2>
<ul>
<li>The load on the server is now very low and there are no more locks from dspaceCli
<ul>
<li>So there <em>was</em> some long-running process that had to finish!</li>
<li>That finally sheds some light on the &ldquo;high load on Sunday&rdquo; problem where I couldn&rsquo;t find any other distinct pattern in the nginx or Tomcat requests</li>
</ul>
</li>
</ul>
<h2 id="2023-01-03">2023-01-03</h2>
<ul>
<li>The load on the server on Sundays, which I have noticed for a long time, seems to be coming from the DSpace checker cron job
<ul>
<li>This checks the checksums of all bitstreams to see if they match the ones in the database</li>
</ul>
</li>
<li>I exported the entire CGSpace metadata to do country/region checks with <code>csv-metadata-quality</code>
<ul>
<li>I extracted only the items with countries (about 48,000) and split the file into parts of 10,000 items, but uploading the first part found 2,000 changes and took several hours to complete&hellip;</li>
</ul>
</li>
<li>IWMI sent me ORCID identifiers for new scientists, bringing our total to 2,010</li>
</ul>
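<ul>
<li>Totals like this come from combining the controlled vocabulary with the new file and counting unique identifiers, as on 2023-01-01. A small sketch of that count, which also lists the identifiers that are new. The sample files, the placeholder identifiers, and the variable names are hypothetical, and the pattern is an assumption: it is slightly stricter than the one used earlier because ORCID iDs are digits apart from the final check character, which may be <code>X</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Digits only, except the last character, which may also be X
pattern='[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]'

# Hypothetical sample data standing in for the vocabulary XML and the CSV
# (placeholder identifiers, not real ORCID iDs)
tmpdir=$(mktemp -d)
printf '%s\n' 'Example One: 0000-0000-0000-0001' 'Example Two: 0000-0000-0000-0002' > "$tmpdir/known.xml"
printf '%s\n' '0000-0000-0000-0002,item-a' '0000-0000-0000-000X,item-b' > "$tmpdir/add-orcids.csv"

# Identifiers we already have
grep -oE "$pattern" "$tmpdir/known.xml" | sort -u > "$tmpdir/known.txt"
# Identifiers after combining both sources
cat "$tmpdir/known.xml" "$tmpdir/add-orcids.csv" | grep -oE "$pattern" | sort -u > "$tmpdir/combined.txt"

# Count of unique identifiers, and the ones that only appear after combining
total=$(grep -c . "$tmpdir/combined.txt")
new_ids=$(comm -13 "$tmpdir/known.txt" "$tmpdir/combined.txt")
echo "total: $total"
echo "new: $new_ids"
rm -rf "$tmpdir"
</code></pre></div>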
<h2 id="2023-01-04">2023-01-04</h2>
<ul>
<li>I finally finished applying the region imports (in five batches of 10,000)
<ul>
<li>There were about 7,500 missing regions in total&hellip;</li>
</ul>
</li>
<li>Now I will move on to doing the Initiative mappings
<ul>
<li>I modified my <code>fix-initiative-mappings.py</code> script to only write out the items that have updated mappings</li>
<li>This makes it way easier to apply fixes to the entire CGSpace because we don&rsquo;t try to import 100,000 items with no changes in mappings</li>
</ul>
</li>
<li>More dspaceCli locks from 04:00 this morning (current time on server is 07:33) and today is a Wednesday
<ul>
<li>The checker cron job runs on <code>0,3</code>, which is Sunday and Wednesday, so this is from that&hellip;</li>
<li>Finally at 16:30 I decided to kill the PIDs associated with those locks&hellip;</li>
<li>I am going to disable that cron job for now and watch the server load for a few weeks</li>
</ul>
</li>
<li>Start a harvest on AReS</li>
</ul>
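<ul>
<li>The idea of writing out only the items whose mappings changed can be illustrated with a before/after comparison. This is a sketch of the technique only, not the logic of <code>fix-initiative-mappings.py</code>: the two-column files are hypothetical, and splitting on commas assumes no quoted commas in the data, which a real DSpace export would not guarantee:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash">tmpdir=$(mktemp -d)
# Hypothetical before/after exports: item id, then its collection mappings
printf '%s\n' 'id,collection' '1,A' '2,B' '3,C' > "$tmpdir/before.csv"
printf '%s\n' 'id,collection' '1,A' '2,B||D' '3,C' > "$tmpdir/after.csv"

# First file: remember the old mappings by id.
# Second file: print the header plus only the rows whose mappings differ.
awk -F, 'NR == FNR { old[$1] = $2; next } FNR == 1 || old[$1] != $2' \
    "$tmpdir/before.csv" "$tmpdir/after.csv" > "$tmpdir/changed.csv"

changed=$(cat "$tmpdir/changed.csv")
echo "$changed"
rm -rf "$tmpdir"
</code></pre></div>
<ul>
<li>Only the header and the row for item 2 end up in the output, so an import only has to touch the items that actually changed</li>
</ul>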
<h2 id="2023-01-08">2023-01-08</h2>
<ul>
<li>It&rsquo;s Sunday and I see some PostgreSQL locks belonging to dspaceCli that started at 05:00
<ul>
<li>That&rsquo;s strange because I disabled the <code>dspace checker</code> one last week, so I&rsquo;m not sure which this is&hellip;</li>
<li>It&rsquo;s currently 2:30PM on the server so these locks have been there for about nine and a half hours</li>
</ul>
</li>
<li>I exported the entire CGSpace to update the Initiative mappings
<ul>
<li>Items were mapped to ~58 new Initiative collections</li>
</ul>
</li>
<li>Then I ran the ORCID import to catch any new ones that might not have been tagged</li>
<li>Then I started a harvest on AReS</li>
</ul>
<h2 id="2023-01-09">2023-01-09</h2>
<ul>
<li>Fix some invalid Initiative names on CGSpace and then check for missing mappings</li>
<li>Check for missing regions in the Initiatives collection</li>
<li>Export a list of author affiliations from the Initiatives community for Peter to check
<ul>
<li>This was slightly hacky because I did it from a CSV export of the Initiatives community, then imported to OpenRefine to split multi-value fields, then did some sed nonsense to handle the quoting:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;cg.contributor.affiliation[en_US]&#39;</span> ~/Downloads/2023-01-09-initiatives.csv | <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> sed -e &#39;s/^&#34;//&#39; -e &#39;s/&#34;$//&#39; -e &#39;s/||/\n/g&#39; | \
</span></span><span style="display:flex;"><span> sort -u | \
</span></span><span style="display:flex;"><span> sed -e &#39;s/^\(.*\)/&#34;\1/&#39; -e &#39;s/\(.*\)$/\1&#34;/&#39; &gt; /tmp/2023-01-09-initiatives-affiliations.csv
</span></span></code></pre></div><h2 id="2023-01-10">2023-01-10</h2>
<ul>
<li>Export the CGSpace Initiatives collection to check for missing regions and collection mappings</li>
</ul>
<h2 id="2023-01-11">2023-01-11</h2>
<ul>
<li>I&rsquo;m trying the DSpace 7 REST API again
<ul>
<li>While following the <a href="https://github.com/DSpace/RestContract/blob/main/authentication.md">DSpace 7 REST API authentication docs</a> I still cannot log in via curl on the command line because I get an <code>Access is denied. Invalid CSRF token.</code> message</li>
<li>Logging in via the HAL Browser works&hellip;</li>
<li>Someone on the DSpace Slack mentioned that the <a href="https://github.com/DSpace/RestContract/issues/209">authentication documentation is out of date</a> and we need to specify the cookie too</li>
<li>I tried it and finally got it to work:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl --head https://dspace7test.ilri.org/server/api
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>set-cookie: DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519; Path=/server; Secure; HttpOnly; SameSite=None
</span></span><span style="display:flex;"><span>dspace-xsrf-token: 42c78c56-613d-464f-89ea-79142fc5b519
</span></span><span style="display:flex;"><span>$ curl -v -X POST https://dspace7test.ilri.org/server/api/authn/login --data <span style="color:#e6db74">&#34;user=alantest%40cgiar.org&amp;password=dspace&#34;</span> -H <span style="color:#e6db74">&#34;X-XSRF-TOKEN: 42c78c56-613d-464f-89ea-79142fc5b519&#34;</span> -b <span style="color:#e6db74">&#34;DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519&#34;</span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>authorization: Bearer eyJh...9-0
</span></span><span style="display:flex;"><span>$ curl -v <span style="color:#e6db74">&#34;https://dspace7test.ilri.org/server/api/core/items&#34;</span> -H <span style="color:#e6db74">&#34;Authorization: Bearer eyJh...9-0&#34;</span>
</span></span></code></pre></div><ul>
<li>I created <a href="https://github.com/DSpace/RestContract/pull/213">a pull request</a> to fix the docs</li>
<li>I did quite a lot of cleanup and updates on the IFPRI batch items for the Gender Equality batch upload
<ul>
<li>Then I uploaded them to CGSpace</li>
</ul>
</li>
<li>I added about twenty more ORCID identifiers to my list and tagged them on CGSpace</li>
</ul>
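<ul>
<li>The CSRF login flow above can be scripted so the token does not have to be copied by hand. This is a minimal sketch under a few assumptions: the function names <code>extract_xsrf_token</code> and <code>dspace_login</code> are made up for illustration, the server returns the token in a <code>dspace-xsrf-token</code> header as shown above, and the same value is accepted both as the <code>X-XSRF-TOKEN</code> header and as the <code>DSPACE-XSRF-COOKIE</code> cookie:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Pull the CSRF token out of response headers read from stdin.
# Header names are case-insensitive and lines end in CRLF, so handle both.
extract_xsrf_token() {
    tr -d '\r' | awk -F': ' 'tolower($1) == "dspace-xsrf-token" { print $2 }'
}

# Log in with the token sent both as a header and as a cookie, printing only
# the response headers (which carry the "authorization: Bearer ..." line).
# Arguments: API base URL, token, email, password.
dspace_login() {
    curl -s -o /dev/null -D - -X POST "$1/authn/login" \
        --data-urlencode "user=$3" --data-urlencode "password=$4" \
        -H "X-XSRF-TOKEN: $2" -b "DSPACE-XSRF-COOKIE=$2"
}

# Offline check against the headers shown above (no network needed)
token=$(printf '%s\r\n' \
    'set-cookie: DSPACE-XSRF-COOKIE=42c78c56-613d-464f-89ea-79142fc5b519; Path=/server; Secure; HttpOnly; SameSite=None' \
    'dspace-xsrf-token: 42c78c56-613d-464f-89ea-79142fc5b519' | extract_xsrf_token)
echo "$token"
</code></pre></div>
<ul>
<li>Against a live server the token would come from something like <code>curl -s --head https://dspace7test.ilri.org/server/api | extract_xsrf_token</code>, and would then be passed to <code>dspace_login</code> along with the base URL and credentials</li>
</ul>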
<h2 id="2023-01-12">2023-01-12</h2>
<ul>
<li>I exported the entire CGSpace and did some cleanups on all metadata in OpenRefine
<ul>
<li>I was primarily interested in normalizing the DOIs, but I also normalized a bunch of publishing places</li>
<li>After this import I will export it again to do the Initiative and region mappings</li>
<li>I ran the <code>fix-initiative-mappings.py</code> script and got forty-nine new mappings&hellip;</li>
</ul>
</li>
<li>I added several dozen new ORCID identifiers to my list and tagged ~500 on CGSpace</li>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2023-01-13">2023-01-13</h2>
<ul>
<li>Do a bit more cleanup on licenses, issue dates, and publishers
<ul>
<li>Then I started importing the large batch of 5,000 items that I changed yesterday</li>
</ul>
</li>
<li>Help Karen add abstracts to a bunch of SAPLING items that were missing them on CGSpace
<ul>
<li>For now I only did open access journal articles, but I should do the reports and others too</li>
</ul>
</li>
</ul>
<h2 id="2023-01-14">2023-01-14</h2>
<ul>
<li>Export CGSpace and check for missing Initiative mappings
<ul>
<li>There were a total of twenty-five</li>
<li>Then I exported the Initiatives community to check the countries and regions</li>
</ul>
</li>
</ul>
<h2 id="2023-01-15">2023-01-15</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
<li><a href="/cgspace-notes/2022-12/">December, 2022</a></li>
<li><a href="/cgspace-notes/2022-11/">November, 2022</a></li>
<li><a href="/cgspace-notes/2022-10/">October, 2022</a></li>
<li><a href="/cgspace-notes/2022-09/">September, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>