<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="March, 2021" />
<meta property="og:description" content="2021-03-01
Discuss some OpenRXV issues with Abdullah from CodeObia
He&rsquo;s trying to work on the DSpace 6&#43; metadata schema autoimport using the DSpace 6&#43; REST API
Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-03/" />
<meta property="article:published_time" content="2021-03-01T10:13:54+02:00" />
<meta property="article:modified_time" content="2021-04-13T21:13:08+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2021"/>
<meta name="twitter:description" content="2021-03-01
Discuss some OpenRXV issues with Abdullah from CodeObia
He&rsquo;s trying to work on the DSpace 6&#43; metadata schema autoimport using the DSpace 6&#43; REST API
Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies
"/>
<meta name="generator" content="Hugo 0.133.1">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-03/",
"wordCount": "4453",
"datePublished": "2021-03-01T10:13:54+02:00",
"dateModified": "2021-04-13T21:13:08+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-03/">
<title>March, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-03/">March, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-03-01T10:13:54+02:00">Mon Mar 01, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-03-01">2021-03-01</h2>
<ul>
<li>Discuss some OpenRXV issues with Abdullah from CodeObia
<ul>
<li>He&rsquo;s trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API</li>
<li>Also, we found some issues building and running OpenRXV currently due to an ecosystem shift in the Node.js dependencies</li>
</ul>
</li>
</ul>
<h2 id="2021-03-02">2021-03-02</h2>
<ul>
<li>I fixed three build and runtime issues in OpenRXV:
<ul>
<li><a href="https://github.com/ilri/OpenRXV/pull/80">fix highcharts-angular and ngx-tour-core build</a></li>
<li><a href="https://github.com/ilri/OpenRXV/pull/82">frontend/package.json: Pin @types/ramda at 0.27.34</a></li>
</ul>
</li>
<li>Then I merged a few fixes that Abdullah had worked on last week</li>
</ul>
<h2 id="2021-03-03">2021-03-03</h2>
<ul>
<li>I <a href="https://github.com/ilri/OpenRXV/issues/83">fixed another frontend build warning on OpenRXV</a></li>
<li>Then I <a href="https://github.com/ilri/OpenRXV/pull/84">updated the frontend container to use Node.js 12 and Ubuntu 20.04</a></li>
<li>Also, I <a href="https://github.com/ilri/OpenRXV/pull/85">added a GitHub Actions workflow to build the frontend</a></li>
<li>I did some testing of Abdullah&rsquo;s patch for the values mapping search on OpenRXV
<ul>
<li>It still doesn&rsquo;t work with multi-word values, so I recorded a video with wf-recorder and uploaded it to <a href="https://github.com/ilri/OpenRXV/issues/43">the issue</a> for him to investigate</li>
</ul>
</li>
</ul>
<h2 id="2021-03-04">2021-03-04</h2>
<ul>
<li>Peter has been having issues with the workflow since yesterday
<ul>
<li>I looked at the Munin stats and see that the number of database locks has been high since yesterday</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2021/03/postgres_locks_ALL-week.png" alt="PostgreSQL locks week">
<img src="/cgspace-notes/2021/03/postgres_connections_cgspace-week.png" alt="PostgreSQL connections week"></p>
<ul>
<li>I counted the locks held by PostgreSQL connections and the number is definitely high again:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>1020
</span></span></code></pre></div><ul>
<li>I reported it to Atmire so they can take a look, on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851">same issue</a> where we had been tracking this before</li>
<li>Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell</li>
<li>I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-03-04-add-zoe-campbell-orcid.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Campbell, Zoë&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
</span></span><span style="display:flex;"><span>&#34;Campbell, Zoe A.&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><ul>
<li>I still need to do cleanup on the journal articles metadata
<ul>
<li>Peter sent me some cleanups but I can&rsquo;t use them in the search/replace format he gave</li>
<li>I think it&rsquo;s better to export the metadata values with IDs and import cleaned up ones as CSV</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT dspace_object_id AS id, text_value as &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 32087
</span></span></code></pre></div><ul>
<li>I used OpenRefine to remove all journal values that didn&rsquo;t contain at least one of these characters: ; ( )
<ul>
<li>Then I cloned the <code>cg.journal</code> field to <code>cg.volume</code> and <code>cg.issue</code></li>
<li>I used some GREL expressions like these to extract the journal name, volume, and issue:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>value.partition(&#39;;&#39;)[0].trim() # to get journal names
</span></span><span style="display:flex;"><span>value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,&#34;$1&#34;) # to get journal volumes
</span></span><span style="display:flex;"><span>value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&#34;$1&#34;) # to get journal issues
</span></span></code></pre></div><ul>
<li>Then I uploaded the changes to CGSpace using <code>dspace metadata-import</code></li>
<li>Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
<ul>
<li>The error was &ldquo;Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e &hellip;&rdquo;</li>
<li>I searched the DSpace issue tracker and found several issues reporting this:
<ul>
<li><a href="https://jira.lyrasis.org/browse/DS-3985">DS-3985 Delete item fails</a></li>
<li><a href="https://jira.lyrasis.org/browse/DS-4004">DS-4004 Authorization denied Exception when trying to delete permanently an item, collection or community as a non-Admin user</a></li>
<li><a href="https://jira.lyrasis.org/browse/DS-4297">DS-4297 Authorization error when trying to delete item by submitter/administrator</a></li>
</ul>
</li>
<li>The issue is apparently with non-admin users who are in the admin and submit groups of the owning collection&hellip;</li>
<li>In this case the item was uploaded to the CCAFS Reports collection, and Margarita is a non-admin user who is a member of the collection&rsquo;s admin and submit groups, exactly as the issue described</li>
<li>I added a comment about our issue to <a href="https://jira.lyrasis.org/browse/DS-4297">DS-4297</a></li>
</ul>
</li>
<li>Yesterday Abenet added me to a WLE collection&rsquo;s approver/editor steps so we can try to figure out why Niroshini is having issues adding metadata to Udana&rsquo;s submissions
<ul>
<li>I edited Udana&rsquo;s submission to CGSpace:
<ul>
<li>corrected the title</li>
<li>added language English</li>
<li>changed the link to the external item page instead of PDF</li>
<li>added SDGs from the external item page</li>
<li>added AGROVOC subjects from the external item page</li>
<li>added pagination (extent)</li>
<li>changed the license to &ldquo;other&rdquo; because CC-BY-NC-ND is not printed anywhere in the PDF or external item page</li>
</ul>
</li>
</ul>
</li>
</ul>
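<ul>
<li>The GREL expressions above for splitting <code>cg.journal</code> values could also be approximated in Python, for example to spot-check the CSV before import. This is only a minimal sketch and not part of the OpenRefine workflow; the <code>split_journal</code> name and the sample value are made up for illustration:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re

# Matches "volume(issue)", for example "12(3)"
VOLUME_ISSUE = re.compile(r"(\d+)\((\d+)\)")


def split_journal(value):
    """Return (journal, volume, issue); volume and issue are "" if absent."""
    # Everything before the first semicolon is treated as the journal name
    journal = value.partition(";")[0].strip()
    match = VOLUME_ISSUE.search(value)
    if match is None:
        return journal, "", ""
    return journal, match.group(1), match.group(2)


# Hypothetical sample value, not taken from CGSpace
print(split_journal("Journal of Examples; 12(3): 45-67"))
</code></pre></div>
<ul>
<li>Like the GREL version, this only looks at the first <code>volume(issue)</code> pattern in the value and leaves the fields empty when there is none</li>
</ul>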
<h2 id="2021-03-05">2021-03-05</h2>
<ul>
<li>I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml down
</span></span><span style="display:flex;"><span>$ docker volume create docker_esData_7
</span></span><span style="display:flex;"><span>$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
</span></span><span style="display:flex;"><span>$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
</span></span><span style="display:flex;"><span>$ docker rm es_dummy
</span></span><span style="display:flex;"><span># edit docker/docker-compose.yml to switch from bind mount to volume
</span></span><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml up -d
</span></span></code></pre></div><ul>
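<li>The trick is that when you create a volume like &ldquo;myvolume&rdquo; from a <code>docker-compose.yml</code> file, Docker Compose prefixes the name with the project name, which defaults to the name of the directory containing the file, so here the volume is created as &ldquo;docker_myvolume&rdquo;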
<ul>
<li>If you create it manually on the command line with <code>docker volume create myvolume</code> then the name is literally &ldquo;myvolume&rdquo;</li>
</ul>
</li>
<li>I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit</li>
<li>Delete the <code>openrxv-items-temp</code> index to test a fresh harvest:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span></code></pre></div><h2 id="2021-03-05-1">2021-03-05</h2>
<ul>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 101761,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
</span></span></code></pre></div><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span></code></pre></div><ul>
<li>Then delete the temp and backup:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>{&#34;acknowledged&#34;:true}
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-05&#39;</span>
</span></span></code></pre></div><ul>
<li>I made some pull requests to OpenRXV:
<ul>
<li><a href="https://github.com/ilri/OpenRXV/pull/86">docker/docker-compose.yml: Use docker volumes</a></li>
<li><a href="https://github.com/ilri/OpenRXV/pull/87">docker/docker-compose.yml: Pin Redis to version 5</a></li>
</ul>
</li>
<li>I deployed the latest changes from the last few days on AReS production</li>
</ul>
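<ul>
<li>The order of the index swap above matters (block writes, clone to a backup, delete, clone the temp index into place, delete the leftovers), so it may help to write the sequence down as data before running anything. This is a minimal sketch that only lists the requests and does not send them; the <code>swap_plan</code> helper is hypothetical and simply mirrors the <code>curl</code> commands above:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import json


def swap_plan(source, target, backup):
    """List the (method, path, body) requests that replace target with source."""
    block_writes = json.dumps({"settings": {"index.blocks.write": True}})
    return [
        ("PUT", f"/{target}/_settings", block_writes),  # make target read only
        ("POST", f"/{target}/_clone/{backup}", None),   # back up target
        ("DELETE", f"/{target}", None),                 # drop target
        ("PUT", f"/{source}/_settings", block_writes),  # make source read only
        ("POST", f"/{source}/_clone/{target}", None),   # clone source into place
        ("DELETE", f"/{source}", None),                 # clean up source
        ("DELETE", f"/{backup}", None),                 # clean up backup
    ]


plan = swap_plan("openrxv-items-temp", "openrxv-items", "openrxv-items-2021-03-05")
for method, path, body in plan:
    print(method, path, body or "")
</code></pre></div>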
<h2 id="2021-03-07">2021-03-07</h2>
<ul>
<li>I realized there is something wrong with the Elasticsearch indexes on AReS
<ul>
<li>On a new test environment I see <code>openrxv-items</code> is correctly an alias of <code>openrxv-items-final</code>:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><ul>
<li>But on AReS production <code>openrxv-items</code> has somehow become a concrete index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><ul>
<li>I fixed the issue on production by cloning the <code>openrxv-items</code> index to <code>openrxv-items-final</code>, deleting <code>openrxv-items</code>, and then re-creating it as an alias:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</span></span></code></pre></div><ul>
<li>Delete backups and remove read-only mode on <code>openrxv-items</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-07&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><ul>
<li>Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
<ul>
<li>Looking in the logs I see a few IPs making heavy use of the REST API and XMLUI:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E <span style="color:#e6db74">&#39;0[56]/Mar/2021&#39;</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</span></span></code></pre></div><ul>
<li>I see the usual IPs for CCAFS and ILRI importer bots, but also <code>143.233.242.132</code> which appears to be for GARDIAN:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c Delphi
</span></span><span style="display:flex;"><span>6237
</span></span><span style="display:flex;"><span># zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c -v Delphi
</span></span><span style="display:flex;"><span>6418
</span></span></code></pre></div><ul>
<li>They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a &ldquo;normal&rdquo; user agent
<ul>
<li>Looking in Solr I see they have been using this IP for a while, as they have 100,000 hits going back into 2020</li>
<li>I will add this IP to the list of bots in nginx and purge it from Solr with my <code>check-spider-ip-hits.sh</code> script</li>
</ul>
</li>
<li>I made a few changes to OpenRXV:
<ul>
<li><a href="https://github.com/ilri/OpenRXV/issues/89">Migrated away from links to use networks</a></li>
<li><a href="https://github.com/ilri/OpenRXV/issues/68">Converted the backend container to use a custom image that includes <code>unoconv</code></a> so we don&rsquo;t have to manually install it anymore</li>
</ul>
</li>
</ul>
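<ul>
<li>The two <code>zgrep</code> counts above (requests from one IP with and without the Delphi user agent) could be done in one pass. This is a minimal sketch, assuming nginx combined-format lines where the client IP is the first field; the <code>count_by_agent</code> name and the sample lines are made up for illustration:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from collections import Counter


def count_by_agent(lines, ip, marker="Delphi"):
    """Count requests from ip, split by whether the line mentions marker."""
    counts = Counter()
    for line in lines:
        # In the combined log format the client IP is the first field
        if line.split(" ", 1)[0] != ip:
            continue
        counts["marked" if marker in line else "unmarked"] += 1
    return counts


# Hypothetical log lines, not real CGSpace traffic
sample = [
    '143.233.242.132 - - [05/Mar/2021:00:00:01 +0100] "GET /rest/items HTTP/1.1" 200 512 "-" "Delphi 2009"',
    '143.233.242.132 - - [05/Mar/2021:00:00:02 +0100] "GET /rest/items HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [05/Mar/2021:00:00:03 +0100] "GET / HTTP/1.1" 200 128 "-" "Mozilla/5.0"',
]
print(count_by_agent(sample, "143.233.242.132"))
</code></pre></div>
<ul>
<li>Unlike <code>zgrep</code>, which matches the IP anywhere in the line, this only compares the first field, so the numbers could differ slightly from the ones above</li>
</ul>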
<h2 id="2021-03-08">2021-03-08</h2>
<ul>
<li>I approved the WLE item that I edited last week, and all the metadata is there: <a href="https://hdl.handle.net/10568/111810">https://hdl.handle.net/10568/111810</a>
<ul>
<li>So I&rsquo;m not sure what Niroshini&rsquo;s issue with metadata is&hellip;</li>
</ul>
</li>
<li>Peter sent a message yesterday saying that his item finally got committed
<ul>
<li>I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>13
</span></span></code></pre></div><ul>
<li>On 2021-03-03 the PostgreSQL transactions started rising:</li>
</ul>
<p><img src="/cgspace-notes/2021/03/postgres_querylength_ALL-week.png" alt="PostgreSQL query length week"></p>
<ul>
<li>After that the connections and locks started going up, peaking on 2021-03-06:</li>
</ul>
<p><img src="/cgspace-notes/2021/03/postgres_locks_ALL-week.png" alt="PostgreSQL locks week">
<img src="/cgspace-notes/2021/03/postgres_connections_ALL-week.png" alt="PostgreSQL connections week"></p>
<ul>
<li>I sent another message to Atmire to ask if they have time to look into this</li>
<li>CIFOR is pressuring me to upload the batch items from last week
<ul>
<li>Vika sent me a final file in which some duplicates that Peter identified have been removed</li>
<li>I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through <code>csv-metadata-quality</code> checker and uploaded them to CGSpace</li>
<li>In total there are 1,088 items</li>
</ul>
</li>
<li>Udana from IWMI emailed to ask about CGSpace thumbnails</li>
<li>Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
<ul>
<li><a href="https://hdl.handle.net/10568/111794">The item</a> was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so this might be an issue of a missing item</li>
</ul>
</li>
<li>Abenet got a quote from Atmire to buy 125 credits for 3750€</li>
<li>Maria at Bioversity sent some feedback about duplicate items on AReS</li>
<li>I&rsquo;m wondering if the issue of the <code>openrxv-items-final</code> index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc items) has to do with the alias issue I fixed yesterday
<ul>
<li>I will start a fresh harvest on AReS now to check, but first back up the current index just in case:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
</span></span><span style="display:flex;"><span># start harvesting on AReS
</span></span></code></pre></div><ul>
<li>As I saw on my local test instance, even when you cancel a harvest, it automatically replaces the <code>openrxv-items-final</code> index with whatever is in <code>openrxv-items-temp</code>, so I assume it will do the same now</li>
</ul>
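<ul>
<li>Since the harvest seems to depend on <code>openrxv-items</code> being an alias rather than a concrete index, that state might be worth checking before and after each harvest. This is a minimal sketch, assuming the <code>_alias</code> response shape shown on 2021-03-07 (concrete index names as keys, each with an <code>aliases</code> object); the <code>classify</code> helper is hypothetical:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def classify(alias_response, name):
    """Return "index", "alias of X", or "missing" for name in an _alias response."""
    # Top-level keys of the response are concrete indexes
    if name in alias_response:
        return "index"
    # Otherwise look for name among each index's aliases
    for index, info in alias_response.items():
        if name in info.get("aliases", {}):
            return f"alias of {index}"
    return "missing"


# Shapes copied from the two outputs on 2021-03-07
healthy = {"openrxv-items-final": {"aliases": {"openrxv-items": {}}}}
broken = {
    "openrxv-items": {"aliases": {}},
    "openrxv-items-final": {"aliases": {}},
    "openrxv-items-temp": {"aliases": {}},
}
print(classify(healthy, "openrxv-items"))
print(classify(broken, "openrxv-items"))
</code></pre></div>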
<h2 id="2021-03-09">2021-03-09</h2>
<ul>
<li>The harvesting on AReS finished last night and everything worked as expected, with no manual intervention
<ul>
<li>This means that <a href="https://github.com/ilri/OpenRXV/issues/64">the issue</a> we were facing for a few months was due to the <code>openrxv-items</code> index being deleted and re-created as a standalone index instead of an alias of <code>openrxv-items-final</code></li>
</ul>
</li>
<li>Talk to Moayad about OpenRXV development
<ul>
<li>We realized that the missing/duplicate items issue is probably due to the long harvesting time on the REST API: between starting the harvest on page 0 and finishing it on page 900 (in the CGSpace example), some items will have been added to the repository, which causes the pages to shift</li>
<li>I proposed a solution in the <a href="https://github.com/ilri/OpenRXV/issues/67">GitHub issue</a>, where we consult the site&rsquo;s XML sitemap after harvesting to see if we missed any items, and then we harvest them individually</li>
</ul>
</li>
<li>Peter sent me a list of 356 DOIs from Altmetric that don&rsquo;t have our Handles, so we need to Tweet them
<ul>
<li>I used my <code>doi-to-handle.py</code> script to generate a list of handles and titles for him:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><h2 id="2021-03-10">2021-03-10</h2>
<ul>
<li>Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses <code>cg.isijournal</code> and MELSpace uses <code>mel.impact-factor</code>
<ul>
<li>I filed <a href="https://github.com/AgriculturalSemantics/cg-core/issues/39">an issue</a> on the cg-core project to ask colleagues for ideas</li>
</ul>
</li>
<li>Peter said he doesn&rsquo;t see &ldquo;Source Code&rdquo; or &ldquo;Software&rdquo; in the <a href="https://cgspace.cgiar.org/handle/10568/1/search-filter?field=type">output type facet on the ILRI community</a>, but I see it on the home page, so I will try to do a full Discovery re-index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 318m20.485s
</span></span><span style="display:flex;"><span>user 215m15.196s
</span></span><span style="display:flex;"><span>sys 2m51.529s
</span></span></code></pre></div><ul>
<li>Now I see ten items for &ldquo;Source Code&rdquo; in the facets&hellip;</li>
<li>Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code</li>
<li>Added the ability to check <code>dcterms.license</code> values against the SPDX licenses in the csv-metadata-quality tool
<ul>
<li>Also, I made some other minor fixes and released <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.6">version 0.4.6</a> on GitHub</li>
</ul>
</li>
<li>Proof and upload twenty-seven items to CGSpace for Peter Ballantyne
<ul>
<li>Mostly Ugandan outputs for CRP Livestock and Livestock and Fish</li>
</ul>
</li>
</ul>
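<ul>
<li>The idea behind checking <code>dcterms.license</code> values against SPDX can be illustrated in a few lines. This is a minimal sketch and not the code in csv-metadata-quality: the <code>invalid_licenses</code> name is made up, the set holds only a handful of SPDX identifiers rather than the full list, and it assumes multiple values are separated by <code>||</code> as in DSpace CSVs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># A small subset of SPDX identifiers, for illustration only
KNOWN_SPDX = {"CC-BY-4.0", "CC-BY-NC-ND-4.0", "GPL-3.0-only", "MIT"}


def invalid_licenses(field, known=KNOWN_SPDX):
    """Return the values in a "||"-separated field that are not in known."""
    values = [v.strip() for v in field.split("||") if v.strip()]
    return [v for v in values if v not in known]


print(invalid_licenses("MIT||GPLv3"))
</code></pre></div>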
<h2 id="2021-03-14">2021-03-14</h2>
<ul>
<li>Switch to linux-kvm kernel on linode20 and linode18:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt update <span style="color:#f92672">&amp;&amp;</span> apt full-upgrade
</span></span><span style="display:flex;"><span># apt install linux-kvm
</span></span><span style="display:flex;"><span># apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
</span></span><span style="display:flex;"><span># apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
</span></span><span style="display:flex;"><span># reboot
</span></span></code></pre></div><ul>
<li>Deploy latest changes from <code>6_x-prod</code> branch on CGSpace</li>
<li>Deploy latest changes from OpenRXV <code>master</code> branch on AReS</li>
<li>Last week Peter added OpenRXV to CGSpace: <a href="https://hdl.handle.net/10568/112982">https://hdl.handle.net/10568/112982</a></li>
<li>Back up the current <code>openrxv-items-final</code> index on AReS to start a new harvest:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><ul>
<li>After the harvesting finished it seems the indexes got messed up again, as <code>openrxv-items</code> is an alias of <code>openrxv-items-temp</code> instead of <code>openrxv-items-final</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><ul>
<li>Anyways, the number of items in <code>openrxv-items</code> seems OK and the AReS Explorer UI is working fine
<ul>
<li>I will have to manually fix the indexes before the next harvesting</li>
</ul>
</li>
<li>Publish the web version of the DSpace CSV Metadata Quality checker tool that I wrote this weekend on GitHub: <a href="https://github.com/ilri/csv-metadata-quality-web">https://github.com/ilri/csv-metadata-quality-web</a>
<ul>
<li>Also, it is deployed on Heroku: <a href="https://fierce-ocean-30836.herokuapp.com/">https://fierce-ocean-30836.herokuapp.com/</a></li>
<li>I was running it on Google App Engine originally, but they have <em>way</em> too aggressive caching of static assets</li>
</ul>
</li>
</ul>
<h2 id="2021-03-16">2021-03-16</h2>
<ul>
<li>Review ten items for Livestock and Fish and Dryland Systems from Peter
<ul>
<li>I told him to try the new web-based CSV Metadata Quality checker and he thought it was cool</li>
<li>I found one exact duplicate item and it gave me an idea to try to detect this in the tool</li>
</ul>
</li>
</ul>
<h2 id="2021-03-17">2021-03-17</h2>
<ul>
<li>I added the ability to check for duplicate items to csv-metadata-quality</li>
<li>I also made some minor optimizations in the Pandas code</li>
<li>I <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.7">tagged version 0.4.7 of csv-metadata-quality on GitHub</a></li>
</ul>
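<ul>
<li>A minimal sketch of a duplicate check, using only the standard library <code>csv</code> module rather than Pandas; treating rows as duplicates when title, type, and issue date all match is an assumption, and the column names and sample rows are made up for illustration:</li>
</ul>

```python
import csv
import io

# Assumed column names; adjust to the CSV being checked.
KEY_FIELDS = ("dc.title", "dcterms.type", "dcterms.issued")


def find_duplicates(csv_text, key_fields=KEY_FIELDS):
    """Return row numbers (1-based, header excluded) that repeat an earlier row's key."""
    seen = set()
    duplicates = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # Normalize whitespace and case so trivial differences do not hide a duplicate.
        key = tuple((row.get(field) or "").strip().lower() for field in key_fields)
        if key in seen:
            duplicates.append(row_number)
        else:
            seen.add(key)
    return duplicates


# Made-up sample data for illustration only.
SAMPLE = """dc.title,dcterms.type,dcterms.issued
Goat value chains,Report,2015
Pig feeding,Brief,2016
goat value chains,Report,2015
"""

if __name__ == "__main__":
    print(find_duplicates(SAMPLE))
```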
<h2 id="2021-03-18">2021-03-18</h2>
<ul>
<li>I added the ability to check for, and fix, &ldquo;mojibake&rdquo; characters in csv-metadata-quality</li>
</ul>
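<ul>
<li>For illustration, the most common kind of mojibake (UTF-8 text that was mis-decoded as Windows-1252) can be reversed with a round trip through the two encodings; this is a simplified stand-in and not necessarily how csv-metadata-quality does it, and <code>fix_mojibake</code> is a hypothetical name:</li>
</ul>

```python
def fix_mojibake(text):
    """Try to undo UTF-8 text that was mis-decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already clean): leave the text unchanged.
        return text


if __name__ == "__main__":
    # "\u00c3\u00a7" is how a UTF-8 c-cedilla looks after being read as Windows-1252.
    print(fix_mojibake("Publica\u00c3\u00a7ao"))
```

<ul>
<li>This only handles one encoding mix-up, and in rare cases a clean string could also survive the round trip and be changed, so it is better used to flag candidates than to fix values blindly</li>
</ul>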
<h2 id="2021-03-21">2021-03-21</h2>
<ul>
<li>Last week Atmire asked me which browser I was using to test the duplicate checker, which I had <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934">reported</a> as not loading
<ul>
<li>I tried to load it in Chrome and it works&hellip; hmmm</li>
</ul>
</li>
<li>Back up the current <code>openrxv-items-final</code> index to start a fresh AReS Harvest:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><ul>
<li>Then start harvesting in the AReS Explorer admin UI</li>
</ul>
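<ul>
<li>The three backup steps above (block writes, clone, unblock writes) could be scripted roughly like this with the Python standard library; <code>backup_steps</code> and <code>run</code> are hypothetical helpers, and the sketch assumes Elasticsearch is listening on <code>localhost:9200</code> as in the curl commands:</li>
</ul>

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # same host and port as the curl commands


def backup_steps(index, suffix, base_url=ES_URL):
    """Build the three requests used to clone an index: block writes, clone, unblock."""
    settings_url = f"{base_url}/{index}/_settings"
    clone_url = f"{base_url}/{index}/_clone/{index}-{suffix}"
    return [
        ("PUT", settings_url, {"settings": {"index.blocks.write": True}}),
        ("POST", clone_url, None),
        ("PUT", settings_url, {"settings": {"index.blocks.write": False}}),
    ]


def run(steps):
    """Send each request in order; urllib raises HTTPError on an error response."""
    for method, url, body in steps:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        request = urllib.request.Request(
            url, data=data, method=method, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as response:
            print(method, url, response.status)


if __name__ == "__main__":
    # Only print the plan here; call run(steps) against a live Elasticsearch to execute it.
    for step in backup_steps("openrxv-items-final", "2021-03-21"):
        print(step)
```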
<h2 id="2021-03-22">2021-03-22</h2>
<ul>
<li>The harvesting on AReS yesterday completed, but somehow I have twice the number of items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 206204,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>Hmmm and even my backup index has a strange number of items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 844,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>I deleted all indexes and re-created the openrxv-items alias:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span></code></pre></div><ul>
<li>Then I started a new harvesting</li>
<li>I switched the Node.js in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to v12 since v10 will cease to be supported soon
<ul>
<li>I re-deployed DSpace Test (linode26) with Node.js 12 and restarted the server</li>
</ul>
</li>
<li>The AReS harvest finally finished, with 1047 pages of items, but the <code>openrxv-items-final</code> index is empty and the <code>openrxv-items-temp</code> index has about 103,000 items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 103162,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>I tried to clone the temp index to the final, but got an error:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
</span></span><span style="display:flex;"><span>{&#34;error&#34;:{&#34;root_cause&#34;:[{&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;}],&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;},&#34;status&#34;:400}%
</span></span></code></pre></div><ul>
<li>I looked in the Docker logs for Elasticsearch and saw a few memory errors:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>java.lang.OutOfMemoryError: Java heap space
</span></span></code></pre></div><ul>
<li>According to <code>/usr/share/elasticsearch/config/jvm.options</code> in the Elasticsearch container the default JVM heap is 1g
<ul>
<li>I see the running Java process has <code>-Xms1g -Xmx1g</code> in its process invocation so I guess that it must indeed be using 1g</li>
<li>We can <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html">change the heap size with the ES_JAVA_OPTS environment variable</a></li>
<li>Or perhaps better, we should <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/jvm-options.html">use a jvm.options.d file</a> because if you use the environment variable it overrides all other JVM options from the default <code>jvm.options</code></li>
<li>I tried to set memory to 1536m by binding an options file and restarting the container, but it didn&rsquo;t seem to work</li>
<li>Nevertheless, after restarting I see 103,000 items in the Explorer&hellip;</li>
<li>But the indexes are still kinda messed up&hellip; <code>openrxv-items</code> is an alias of the wrong index!</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><h2 id="2021-03-23">2021-03-23</h2>
<ul>
<li>For reference you can also get the Elasticsearch JVM stats from the API:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_nodes/jvm?human&#39;</span> | python -m json.tool
</span></span></code></pre></div><ul>
<li>I re-deployed AReS with 1.5GB of heap using the <code>ES_JAVA_OPTS</code> environment variable
<ul>
<li>It turns out that this <em>is</em> the recommended way to set the heap: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html">https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html</a></li>
</ul>
</li>
<li>Then I fixed the aliases to make sure <code>openrxv-items</code> was an alias of <code>openrxv-items-final</code>, similar to how I did a few weeks ago</li>
<li>I re-created the temp index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span></code></pre></div><h2 id="2021-03-24">2021-03-24</h2>
<ul>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934">ticket about the Duplicate Checker</a>
<ul>
<li>He says it works for him in Firefox, so I checked and it seems to have been an issue with my LocalCDN addon</li>
</ul>
</li>
<li>I re-deployed DSpace Test (linode26) from the latest CGSpace (linode18) data
<ul>
<li>I want to try to finish up processing the duplicates in Solr that <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">Atmire advised on last month</a></li>
<li>The current statistics core is 57861236 kilobytes:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># du -s /home/dspacetest.cgiar.org/solr/statistics
</span></span><span style="display:flex;"><span>57861236 /home/dspacetest.cgiar.org/solr/statistics
</span></span></code></pre></div><ul>
<li>I applied their changes to <code>config/spring/api/atmire-cua-update.xml</code> and started the duplicate processor:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx4096m&#39;</span>
</span></span><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">1000</span> -c statistics -t <span style="color:#ae81ff">12</span>
</span></span></code></pre></div><ul>
<li>The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)</li>
<li>Hah, I still got a memory error after only a few minutes:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s
</span></span><span style="display:flex;"><span>Exception: GC overhead limit exceeded
</span></span><span style="display:flex;"><span>java.lang.OutOfMemoryError: GC overhead limit exceeded
</span></span></code></pre></div><ul>
<li>I guess we really do have to use <code>-r 100</code></li>
<li>Now the thing runs for a few minutes and &ldquo;finishes&rdquo;:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">100</span> -c statistics -t <span style="color:#ae81ff">12</span>
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>*************************
</span></span><span style="display:flex;"><span>* Update Script Started *
</span></span><span style="display:flex;"><span>*************************
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Run 1
</span></span><span style="display:flex;"><span>Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:42:41 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:46 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:54 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:58 CET 2021
</span></span><span style="display:flex;"><span>Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:59 CET 2021
</span></span><span style="display:flex;"><span>Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
</span></span><span style="display:flex;"><span>Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
</span></span><span style="display:flex;"><span>Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
</span></span><span style="display:flex;"><span>Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
</span></span><span style="display:flex;"><span>Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
</span></span><span style="display:flex;"><span>Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
</span></span><span style="display:flex;"><span>Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
</span></span><span style="display:flex;"><span>Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
</span></span><span style="display:flex;"><span>Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
</span></span><span style="display:flex;"><span>Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
</span></span><span style="display:flex;"><span>Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
</span></span><span style="display:flex;"><span>Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
</span></span><span style="display:flex;"><span>Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
</span></span><span style="display:flex;"><span>Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
</span></span><span style="display:flex;"><span>Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
</span></span><span style="display:flex;"><span>Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
</span></span><span style="display:flex;"><span>Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
</span></span><span style="display:flex;"><span>Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
</span></span><span style="display:flex;"><span>Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
</span></span><span style="display:flex;"><span>Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
</span></span><span style="display:flex;"><span>Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
</span></span><span style="display:flex;"><span>Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
</span></span><span style="display:flex;"><span>Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
</span></span><span style="display:flex;"><span>Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
</span></span><span style="display:flex;"><span>Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
</span></span><span style="display:flex;"><span>Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
</span></span><span style="display:flex;"><span>Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
</span></span><span style="display:flex;"><span>Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
</span></span><span style="display:flex;"><span>Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
</span></span><span style="display:flex;"><span>Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
</span></span><span style="display:flex;"><span>Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
</span></span><span style="display:flex;"><span>Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
</span></span><span style="display:flex;"><span>Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
</span></span><span style="display:flex;"><span>Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
</span></span><span style="display:flex;"><span>Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
</span></span><span style="display:flex;"><span>Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
</span></span><span style="display:flex;"><span>Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
</span></span><span style="display:flex;"><span>Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
</span></span><span style="display:flex;"><span>Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
</span></span><span style="display:flex;"><span>Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
</span></span><span style="display:flex;"><span>Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
</span></span><span style="display:flex;"><span>Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
</span></span><span style="display:flex;"><span>Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
</span></span><span style="display:flex;"><span>Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
</span></span><span style="display:flex;"><span>Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
</span></span><span style="display:flex;"><span>Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
</span></span><span style="display:flex;"><span>Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
</span></span><span style="display:flex;"><span>Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
</span></span><span style="display:flex;"><span>Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
</span></span><span style="display:flex;"><span>Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
</span></span><span style="display:flex;"><span>Run 1 took 5m 53s
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>**************************
</span></span><span style="display:flex;"><span>* Update Script Finished *
</span></span><span style="display:flex;"><span>**************************
</span></span></code></pre></div><ul>
<li>If I run it again it finds the same 4,824 docs and processes them&hellip;
<ul>
<li>I asked Atmire for feedback on this: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
</ul>
</li>
</ul>
<h2 id="2021-03-25">2021-03-25</h2>
<ul>
<li>Niroshini from IWMI is still having problems adding metadata during the edit step of the workflow on CGSpace
<ul>
<li>I told her to try to register using a private email account and we&rsquo;ll add her to the WLE group so she can try that way</li>
</ul>
</li>
</ul>
<h2 id="2021-03-28">2021-03-28</h2>
<ul>
<li>Make a backup of the <code>openrxv-items-final</code> index on AReS Explorer and start a new harvest</li>
</ul>
<h2 id="2021-03-29">2021-03-29</h2>
<ul>
<li>The AReS harvesting that I started yesterday finished successfully and all indexes look OK:
<ul>
<li><code>openrxv-items</code> is an alias of <code>openrxv-items-final</code> and has the correct number of items</li>
</ul>
</li>
<li>Last week Bosede from IITA said she was trying to move an item from one collection to another and the system was &ldquo;rolling&rdquo; and never finished
<ul>
<li>I looked in Munin and I don&rsquo;t see anything particularly wrong that day, so I told her to try again</li>
</ul>
</li>
<li>Marianne Gadeberg asked about mapping an item last week
<ul>
<li>Searched for <a href="https://hdl.handle.net/10568/110633">the item</a>&rsquo;s handle, the title, the title in quotes, the UUID, with pluses instead of spaces, etc in the item mapper&hellip; but I can never find it in the results</li>
<li>I see someone has reported this issue on Jira in DSpace 5.x&rsquo;s XMLUI item mapper: <a href="https://jira.lyrasis.org/browse/DS-2761">https://jira.lyrasis.org/browse/DS-2761</a></li>
<li>The Solr log shows that my query (with and without quotes, etc) has 143 results:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&amp;wt=javabin&amp;version=2} hits=143 status=0 QTime=0
</span></span></code></pre></div><ul>
<li>But the item mapper only displays ten items, with no pagination
<ul>
<li>There is no way to search by handle or ID</li>
<li>I mapped the item manually using a CSV</li>
</ul>
</li>
</ul>
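<ul>
<li>Since the Solr log reports 143 hits but the mapper shows only ten, one way to see the rest would be to page through Solr directly; a small sketch of building the per-page query strings with Solr&rsquo;s standard <code>start</code> and <code>rows</code> parameters, where <code>page_params</code> is a hypothetical helper and the filter queries from the log are left out for brevity:</li>
</ul>

```python
import math
import urllib.parse


def page_params(query, hits, rows=10):
    """Return one Solr query string per page needed to cover all hits."""
    pages = math.ceil(hits / rows)
    return [
        urllib.parse.urlencode({"q": query, "start": page * rows, "rows": rows})
        for page in range(pages)
    ]


if __name__ == "__main__":
    title = "Gender mainstreaming in local potato seed system in Georgia"
    for params in page_params(title, hits=143):
        print(params)
```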
<h2 id="2021-03-30">2021-03-30</h2>
<ul>
<li>I realized I never finished deleting all the old fields after our CG Core migration a few months ago
<ul>
<li>I found a few occurrences of old metadata so I had to move them where possible and delete them where not</li>
</ul>
</li>
<li>I updated the <a href="/cgspace-notes/cgspace-cgcorev2-migration/">CG Core v2 migration page</a></li>
<li>Marianne Gadeberg wrote to ask why the item she wanted to map a few days ago still doesn&rsquo;t appear in the mapped collection
<ul>
<li>I looked on the item page itself and it lists the collection, but the item doesn&rsquo;t appear in the collection&rsquo;s item list</li>
<li>I tried to forcibly reindex the collection and the item, but it didn&rsquo;t seem to work</li>
<li>Now I will try a complete Discovery re-index</li>
</ul>
</li>
</ul>
<h2 id="2021-03-31">2021-03-31</h2>
<ul>
<li>The Discovery re-index finished, but <a href="https://hdl.handle.net/10568/110633">the CIP item</a> still does not appear in the GENDER Platform grants collection
<ul>
<li>The item page itself DOES list the grants collection! WTF</li>
<li>I sent a message to the dspace-tech mailing list to see if someone can comment</li>
<li>I even tried unmapping and re-mapping, but it doesn&rsquo;t change anything: the item still doesn&rsquo;t appear in the collection, but I can see that it is mapped</li>
</ul>
</li>
<li>I signed up for a SHERPA API key so I can try to write something to get journal names from ISSN
<ul>
<li>This code seems to get a journal title, though I only tried it with a few ISSNs:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> requests
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>query_params <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#39;item-type&#39;</span>: <span style="color:#e6db74">&#39;publication&#39;</span>, <span style="color:#e6db74">&#39;format&#39;</span>: <span style="color:#e6db74">&#39;Json&#39;</span>, <span style="color:#e6db74">&#39;limit&#39;</span>: <span style="color:#ae81ff">10</span>, <span style="color:#e6db74">&#39;offset&#39;</span>: <span style="color:#ae81ff">0</span>, <span style="color:#e6db74">&#39;api-key&#39;</span>: <span style="color:#e6db74">&#39;blahhhahahah&#39;</span>, <span style="color:#e6db74">&#39;filter&#39;</span>: <span style="color:#e6db74">&#39;[[&#34;issn&#34;,&#34;equals&#34;,&#34;0011-183X&#34;]]&#39;</span>}
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Pass the query parameters, otherwise the API key and ISSN filter are never sent</span>
</span></span><span style="display:flex;"><span>r <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;https://v2.sherpa.ac.uk/cgi/retrieve&#39;</span>, params<span style="color:#f92672">=</span>query_params)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> r<span style="color:#f92672">.</span>status_code <span style="color:#f92672">==</span> <span style="color:#ae81ff">200</span> <span style="color:#f92672">and</span> len(r<span style="color:#f92672">.</span>json()[<span style="color:#e6db74">&#39;items&#39;</span>]) <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span>:
</span></span><span style="display:flex;"><span>    print(r<span style="color:#f92672">.</span>json()[<span style="color:#e6db74">&#39;items&#39;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#39;title&#39;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#39;title&#39;</span>])
</span></span></code></pre></div><ul>
<li>I exported a list of all our ISSNs from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
</span></span><span style="display:flex;"><span>COPY 3081
</span></span></code></pre></div><ul>
<li>I wrote a script to check the ISSNs against Crossref&rsquo;s API: <code>crossref-issn-lookup.py</code>
<ul>
<li>I suspect Crossref might have better data actually&hellip;</li>
</ul>
</li>
</ul>
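<ul>
<li>For reference, a minimal sketch of how such an ISSN check against Crossref could look. This is not the actual <code>crossref-issn-lookup.py</code>: the function names and the User-Agent string are made up for illustration, and it assumes the exported file has one ISSN per line and that Crossref&rsquo;s public <code>/journals/{issn}</code> endpoint returns the journal name in <code>message.title</code>:</li>
</ul>

```python
import json
import re
import urllib.error
import urllib.request

# Public Crossref REST API endpoint for journals, keyed by ISSN
CROSSREF_JOURNALS = "https://api.crossref.org/journals/"
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dX]$")


def normalize_issn(value):
    """Return a cleaned ISSN like 0011-183X, or None if the value is malformed."""
    candidate = value.strip().upper()
    return candidate if ISSN_PATTERN.match(candidate) else None


def extract_title(payload):
    """Pull the journal title out of a decoded Crossref journal response."""
    return payload.get("message", {}).get("title")


def lookup_issn(issn, timeout=10):
    """Ask Crossref for the journal with this ISSN; return None if it is unknown."""
    request = urllib.request.Request(
        CROSSREF_JOURNALS + issn,
        headers={"User-Agent": "issn-lookup-sketch/0.1"},  # hypothetical UA string
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return extract_title(json.load(response))
    except urllib.error.HTTPError as error:
        if error.code == 404:  # Crossref has no journal for this ISSN
            return None
        raise


def check_issns(path):
    """Yield (issn, title) for each well-formed ISSN in a one-value-per-line file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            issn = normalize_issn(line)
            if issn is not None:
                yield issn, lookup_issn(issn)
```

<ul>
<li>Iterating over <code>check_issns('/tmp/2021-03-31-issns.csv')</code> would then give one ISSN and title pair per line of the export, with <code>None</code> as the title where Crossref has no match; malformed values are skipped rather than sent to the API</li>
</ul>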
</article>
</div> <!-- /.blog-main -->
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>