<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="January, 2021" />
<meta property="og:description" content="2021-01-03
Peter notified me that some filters on AReS were broken again
It&rsquo;s the same issue with the field names getting .keyword appended to the end that I already filed an issue on OpenRXV about last month
I fixed the broken filters (careful to not edit any others, lest they break too!)
Fix an issue with start page number for the DSpace REST API and statistics API in OpenRXV
The start page had been &ldquo;1&rdquo; in the UI, but in the backend they were doing some gymnastics to adjust to the zero-based offset/limit/page of the DSpace REST API and the statistics API
I adjusted it to default to 0 and added a note to the admin screen
I realized that this issue was actually causing the first page of 100 statistics to be missing&hellip;
For example, this item has 51 views on CGSpace, but 0 on AReS
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-01/" />
<meta property="article:published_time" content="2021-01-03T10:13:54+02:00" />
<meta property="article:modified_time" content="2021-01-31T16:32:16+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2021"/>
<meta name="twitter:description" content="2021-01-03
Peter notified me that some filters on AReS were broken again
It&rsquo;s the same issue with the field names getting .keyword appended to the end that I already filed an issue on OpenRXV about last month
I fixed the broken filters (careful to not edit any others, lest they break too!)
Fix an issue with start page number for the DSpace REST API and statistics API in OpenRXV
The start page had been &ldquo;1&rdquo; in the UI, but in the backend they were doing some gymnastics to adjust to the zero-based offset/limit/page of the DSpace REST API and the statistics API
I adjusted it to default to 0 and added a note to the admin screen
I realized that this issue was actually causing the first page of 100 statistics to be missing&hellip;
For example, this item has 51 views on CGSpace, but 0 on AReS
"/>
<meta name="generator" content="Hugo 0.127.0">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-01/",
"wordCount": "3157",
"datePublished": "2021-01-03T10:13:54+02:00",
"dateModified": "2021-01-31T16:32:16+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-01/">
<title>January, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-01/">January, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-01-03T10:13:54+02:00">Sun Jan 03, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-01-03">2021-01-03</h2>
<ul>
<li>Peter notified me that some filters on AReS were broken again
<ul>
<li>It&rsquo;s the same issue with the field names getting <code>.keyword</code> appended to the end that I already <a href="https://github.com/ilri/OpenRXV/issues/66">filed an issue on OpenRXV about last month</a></li>
<li>I fixed the broken filters (careful to not edit any others, lest they break too!)</li>
</ul>
</li>
<li>Fix an issue with start page number for the DSpace REST API and statistics API in OpenRXV
<ul>
<li>The start page had been &ldquo;1&rdquo; in the UI, but in the backend they were doing some gymnastics to adjust to the zero-based offset/limit/page of the DSpace REST API and the statistics API</li>
<li>I adjusted it to default to 0 and added a note to the admin screen</li>
<li>I realized that this issue was actually causing the first page of 100 statistics to be missing&hellip;</li>
<li>For example, <a href="https://cgspace.cgiar.org/handle/10568/66839">this item</a> has 51 views on CGSpace, but 0 on AReS</li>
</ul>
</li>
</ul>
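<ul>
<li>A minimal sketch of why a start page of 1 drops the first 100 records when the API pages are zero-based (the function name and the total of 250 items are hypothetical, and this is not OpenRXV&rsquo;s actual harvester code):</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import math


def harvested_ranges(start_page, total_items, page_size=100):
    """Return the (first, last_exclusive) item offsets fetched from a zero-based paged API."""
    page_count = math.ceil(total_items / page_size)
    return [
        (page * page_size, min((page + 1) * page_size, total_items))
        for page in range(start_page, page_count)
    ]


# Starting at page 1 never requests offsets 0-99, so those items get no statistics
print(harvested_ranges(1, 250))
# Starting at page 0 covers every offset
print(harvested_ranges(0, 250))
</code></pre>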
<ul>
<li>Start a re-index on AReS
<ul>
<li>First delete the old Elasticsearch temp index:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, back up the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100278,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-04
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-04&#39;</span>
</span></span></code></pre></div><h2 id="2021-01-04">2021-01-04</h2>
<ul>
<li>There is one item that appears twice in AReS: <a href="https://cgspace.cgiar.org/handle/10568/66839">10568/66839</a>
<ul>
<li>If I use the Handle filter I see it twice&hellip; whereas other items don&rsquo;t appear twice</li>
<li>I filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/67">https://github.com/ilri/OpenRXV/issues/67</a></li>
</ul>
</li>
<li>Help Peter troubleshoot an issue with Altmetric badges on AReS
<ul>
<li>He generated a report of our repository from Altmetric and noticed that many were missing scores despite having scores on CGSpace item pages</li>
<li>AReS harvests Altmetric scores in batch using the Handle prefix (10568), while CGSpace uses the DOI if it is found, and falls back to using the Handle</li>
<li>I think this is because some items were never tweeted, so Altmetric never made the link between the DOI and the Handle</li>
<li>I tweeted five items and within an hour or so the DOI API link registered the associated Handle, and within an hour or so the Handle API link was live with the same score</li>
</ul>
</li>
</ul>
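<ul>
<li>A rough sketch of the lookup order described above: prefer the DOI and fall back to the Handle. The function name and the example DOI are hypothetical, and the <code>api.altmetric.com/v1</code> URL shapes are my assumption about Altmetric&rsquo;s public details API rather than something taken from the AReS or CGSpace code:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">ALTMETRIC_API = "https://api.altmetric.com/v1"  # assumed base URL


def altmetric_lookup_url(handle, doi=None):
    """Build a details URL using the DOI if one is known, otherwise the Handle."""
    if doi:
        return f"{ALTMETRIC_API}/doi/{doi}"
    return f"{ALTMETRIC_API}/handle/{handle}"


# An item without a DOI can only be matched by its Handle
print(altmetric_lookup_url("10568/66839"))
# With a (hypothetical) DOI the DOI endpoint is used instead
print(altmetric_lookup_url("10568/66839", doi="10.1234/example"))
</code></pre>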
<h2 id="2021-01-05">2021-01-05</h2>
<ul>
<li>A user sent me <a href="https://github.com/ilri/dspace-statistics-api/issues/12">feedback about the dspace-statistics-api</a>
<ul>
<li>He noticed that the indexer fails if there are unmigrated legacy records in Solr</li>
<li>I added a UUID filter to the queries in the indexer</li>
</ul>
</li>
<li>I generated a CSV of titles and Handles for 2019 and 2020 items for Peter to tweet
<ul>
<li>We need to make sure that Altmetric has linked them all with their DOIs</li>
<li>I wrote a quick and dirty script called <a href="https://gist.github.com/alanorth/281b7624301049e8fa91742b9b8c51b9">doi-to-handle.py</a> to read the DOIs from a text file, query the database, and save the handles and titles to a CSV</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./doi-to-handle.py -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -i /tmp/dois.txt -o /tmp/out.csv
</span></span></code></pre></div><ul>
<li>Help Udana export IWMI records from AReS
<ul>
<li>He wanted me to give him CSV export permissions on CGSpace, but I told him that this requires super admin so I&rsquo;m not comfortable with it</li>
</ul>
</li>
<li>Import one item to CGSpace for Peter</li>
</ul>
<h2 id="2021-01-07">2021-01-07</h2>
<ul>
<li>Import twenty CABI book chapters for Abenet</li>
<li>Udana and some editors from IWMI are still having problems editing metadata during the workflow step
<ul>
<li>It is the same issue Peter reported last month, that values he edits are not saved when the item gets archived</li>
<li>I added myself to the edit and approval steps of <a href="https://dspacetest.cgiar.org/handle/10568/81589">the collection</a> on DSpace Test and asked Udana to submit an item there for me to test</li>
</ul>
</li>
<li>Atmire got back to me about the duplicate data in Solr
<ul>
<li>They want to arrange a time for us to do the stats processing so they can monitor it</li>
<li>I proposed that I set everything up with a fresh Solr snapshot from CGSpace and then let them start the stats process</li>
</ul>
</li>
</ul>
<h2 id="2021-01-10">2021-01-10</h2>
<ul>
<li>Dominique from IWMI asked about API access to the IWMI collections
<ul>
<li>A partner of theirs called AMCOW is interested in harvesting their publications</li>
<li>I told her that they can use the REST API or OAI to get them from the <a href="https://cgspace.cgiar.org/handle/10568/36185">IWMI Journal Articles collection</a>:
<ul>
<li>CGSpace REST API: <a href="https://cgspace.cgiar.org/rest/collections/c2618391-184e-4091-8a93-280fdf01238b/items">https://cgspace.cgiar.org/rest/collections/c2618391-184e-4091-8a93-280fdf01238b/items</a></li>
<li>CGSpace OAI API: <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=col_10568_36185">https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=col_10568_36185</a></li>
</ul>
</li>
</ul>
</li>
<li>Udana submitted an item to <a href="https://dspacetest.cgiar.org/handle/10568/81589">the collection</a> on DSpace Test that I discussed last week
<ul>
<li>I was able to take the task, add a new AGROVOC subject, approve the task, and commit it to archive</li>
<li>The final item had my new AGROVOC subject, so I don&rsquo;t see the issue</li>
<li>Perhaps the issue only occurs when we replace an existing field? Or only on IWMI fields? I don&rsquo;t know&hellip;</li>
<li>Also there is this warning that occurs in the DSpace log during editing (and many other operations):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID=&#34;TX35636856957739531161091194485578658698&#34;)
</span></span></code></pre></div><ul>
<li>I filed <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=907">a bug on Atmire&rsquo;s issue tracker</a></li>
<li>Peter asked me to move the CGIAR Gender Platform community to the top level of CGSpace, but I get an error when I use the community-filiator command:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>Exception: null
</span></span><span style="display:flex;"><span>java.lang.UnsupportedOperationException
</span></span><span style="display:flex;"><span> at java.util.AbstractList.remove(AbstractList.java:161)
</span></span><span style="display:flex;"><span> at java.util.AbstractList$Itr.remove(AbstractList.java:374)
</span></span><span style="display:flex;"><span> at java.util.AbstractCollection.remove(AbstractCollection.java:293)
</span></span><span style="display:flex;"><span> at org.dspace.administer.CommunityFiliator.defiliate(CommunityFiliator.java:264)
</span></span><span style="display:flex;"><span> at org.dspace.administer.CommunityFiliator.main(CommunityFiliator.java:164)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>There is apparently <a href="https://jira.lyrasis.org/browse/DS-3914">a bug</a> in DSpace 6.x that makes community-filiator not work
<ul>
<li>There is <a href="https://github.com/DSpace/DSpace/pull/2178">a patch</a> for the as-of-yet unreleased DSpace 6.4 so I will try that</li>
<li>I tested the patch on DSpace Test and it worked, so I will do the same on CGSpace tomorrow</li>
</ul>
</li>
<li>Udana had asked about exporting IWMI&rsquo;s community on CGSpace, but we don&rsquo;t want to give him super admin permissions to do that
<ul>
<li>I suggested that he use AReS, but there are some fields missing because we don&rsquo;t harvest them all</li>
<li>I added a few more fields to the configuration and will start a fresh harvest.</li>
</ul>
</li>
<li>Start a re-index on AReS
<ul>
<li>First delete the old Elasticsearch temp index:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span><span style="display:flex;"><span># ... after ten hours
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100411,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span></code></pre></div><ul>
<li>Looking over the last month of Solr stats I see a familiar bot that <em>should</em> have been marked as a bot months ago:</li>
</ul>
<blockquote>
<p>Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)</p>
</blockquote>
<ul>
<li>There are 51,961 hits from this bot on 64.62.202.71 and 64.62.202.73
<ul>
<li>Ah! Actually I had already added the bot pattern to the Tomcat Crawler Session Manager Valve, which mitigated the abuse of Tomcat sessions:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat log/dspace.log.2020-12-2* | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>0
</span></span></code></pre></div><ul>
<li>So now I should really add it to the DSpace spider agent list so it doesn&rsquo;t create Solr hits
<ul>
<li>I added it to the &ldquo;ilri&rdquo; lists of spider agent patterns</li>
</ul>
</li>
<li>I purged the existing hits using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
</span></span></code></pre></div><h2 id="2021-01-11">2021-01-11</h2>
<ul>
<li>The AReS indexing finished this morning and I moved the <code>openrxv-items-temp</code> index to <code>openrxv-items</code> (see above)
<ul>
<li>I sorted the explorer results by Altmetric attention score and I see a few new ones on the top so I think the recent tweeting of Handles by Peter and myself worked</li>
</ul>
</li>
<li>I deployed the community-filiator fix on CGSpace and moved the Gender Platform community to the top level of CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
</span></span></code></pre></div><h2 id="2021-01-12">2021-01-12</h2>
<ul>
<li>IWMI is really pressuring us to have a periodic CSV export of their community
<ul>
<li>I decided to write a systemd timer to run <code>dspace metadata-export</code> every week, and made an nginx alias to make the result available <a href="https://cgspace.cgiar.org/iwmi.csv">publicly</a></li>
<li>It is part of the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> that I use to provision the servers</li>
</ul>
</li>
<li>I wrote to Atmire to tell them to try their CUA duplicates processor on DSpace Test whenever they get a chance this week
<ul>
<li>I verified that there were indeed duplicate metadata values in the <code>userAgent_ngram</code> and <code>userAgent_search</code> fields, even in the first few results I saw in Solr</li>
<li>For reference, the UID of the record I saw with duplicate metadata was: 50e52a06-ffb7-4597-8d92-1c608cc71c98</li>
</ul>
</li>
</ul>
<h2 id="2021-01-13">2021-01-13</h2>
<ul>
<li>I filed <a href="https://github.com/AgriculturalSemantics/cg-core/issues/30">an issue on cg-core</a> asking about how to handle series name / number
<ul>
<li>Currently the values are in format &ldquo;series name; series number&rdquo; in the <code>dc.relation.ispartofseries</code> field, but Peter wants to be able to separate them</li>
</ul>
</li>
<li>Start working on CG Core v2 migration for DSpace 6, using <a href="https://alanorth.github.io/cgspace-notes/cgspace-cgcorev2-migration/">my work</a> from last year on DSpace 5</li>
</ul>
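<ul>
<li>If we do separate them, splitting the existing &ldquo;series name; series number&rdquo; values could look roughly like this (a sketch only: the function name and the sample value are hypothetical, and I am assuming the number always follows the last semicolon):</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def split_series(value):
    """Split "series name; series number" on the last semicolon."""
    name, separator, number = value.rpartition(";")
    if not separator:
        # No semicolon, so treat the whole value as the series name
        return value.strip(), None
    return name.strip(), number.strip() or None


print(split_series("Hypothetical Report Series; 42"))
</code></pre>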
<h2 id="2021-01-14">2021-01-14</h2>
<ul>
<li>More work on the CG Core v2 migration for DSpace 6</li>
<li>Publish <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.1">v1.4.1 of the DSpace Statistics API</a> based on feedback from the community
<ul>
<li>This includes the fix for limiting the Solr query to UUIDs</li>
</ul>
</li>
</ul>
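<ul>
<li>To illustrate the idea behind the UUID filter: legacy records carry non-UUID ids, so anything that is not a canonical UUID should be skipped. The real fix is in the indexer&rsquo;s Solr queries; this is only a client-side sketch with a hypothetical helper name and a made-up legacy id:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import uuid


def is_uuid(value):
    """Return True if value is a canonical 36-character UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


# Hypothetical mix of a migrated (UUID) id and an unmigrated legacy integer id
ids = ["1e8fb96c-b994-4fe2-8f0c-0a98ab138be0", "12345"]
print([item_id for item_id in ids if is_uuid(item_id)])
</code></pre>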
<h2 id="2021-01-17">2021-01-17</h2>
<ul>
<li>Start a re-index on AReS
<ul>
<li>First delete the old Elasticsearch temp index:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, back up the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100540,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-18
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-18&#39;</span>
</span></span></code></pre></div><h2 id="2021-01-18">2021-01-18</h2>
<ul>
<li>Finish the indexing on AReS that I started yesterday</li>
<li>Udana from IWMI emailed me to ask why the iwmi.csv doesn&rsquo;t include items he approved to CGSpace this morning
<ul>
<li>I told him it is generated every Sunday night</li>
<li>I regenerated the file manually for him</li>
<li>I adjusted the script to run on Monday and Friday</li>
</ul>
</li>
<li>Meeting with Peter and Abenet about CG Core v2
<ul>
<li>We also need to remove CTA and CPWF subjects from the input form since they are both closed now and no longer submitting items</li>
<li>Peter also wants to create new fields on CGSpace for the SDGs and CGIAR Impact Areas
<ul>
<li>I suggested <code>cg.subject.sdg</code> and <code>cg.subject.impactArea</code></li>
</ul>
</li>
<li>We also agreed to remove the following fields:
<ul>
<li>cg.livestock.agegroup</li>
<li>cg.livestock.function</li>
<li>cg.message.sms</li>
<li>cg.message.voice</li>
</ul>
</li>
<li>I removed them from the input form and metadata registry, and deleted all their values in the database:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>localhost/dspace63= &gt; BEGIN;
localhost/dspace63= &gt; DELETE FROM metadatavalue WHERE metadata_field_id IN (115, 116, 117, 118);
DELETE 27
localhost/dspace63= &gt; COMMIT;
</code></pre><ul>
<li>I submitted <a href="https://github.com/AgriculturalSemantics/cg-core/issues/31">an issue</a> to CG Core v2 to propose standardizing the camel case convention for a few more fields of ours</li>
<li>I submitted <a href="https://github.com/AgriculturalSemantics/cg-core/issues/32">an issue</a> to CG Core v2 to propose removing <code>cg.series</code> and <code>cg.pages</code> in favor of <code>dcterms.isPartOf</code> and <code>dcterms.extent</code>, respectively</li>
<li>It looks like we will roll all these changes into a CG Core v2.1 release</li>
</ul>
<h2 id="2021-01-19">2021-01-19</h2>
<ul>
<li>Abenet said that the PDF reports on AReS aren&rsquo;t working
<ul>
<li>I had to install <code>unoconv</code> in the backend api container again</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker exec -it api /bin/bash
</span></span><span style="display:flex;"><span># apt update <span style="color:#f92672">&amp;&amp;</span> apt install unoconv
</span></span></code></pre></div><ul>
<li>Help Peter get a list of titles and DOIs for CGSpace items that Altmetric does not have an attention score for
<ul>
<li>He generated a list from their dashboard and I extracted the DOIs in OpenRefine (because the file was encoded as WINDOWS-1252 and csvcut couldn&rsquo;t handle it)</li>
<li>Then I looked up the titles and handles using the <code>doi-to-handle.py</code> script that I wrote last week</li>
</ul>
</li>
<li>I created <a href="https://github.com/AgriculturalSemantics/cg-core/pull/34">a pull request</a> to convert several CG Core v2 fields to consistent &ldquo;camel case&rdquo;
<ul>
<li>Marie said we should create a new minor version of CG Core v2 for this so I tagged it with the <a href="https://github.com/AgriculturalSemantics/cg-core/milestone/1">&ldquo;CG Core v2.1&rdquo; milestone</a></li>
</ul>
</li>
<li>I created <a href="https://github.com/AgriculturalSemantics/cg-core/pull/35">a pull request</a> to fix some links in cgcore.html</li>
</ul>
<h2 id="2021-01-21">2021-01-21</h2>
<ul>
<li>File <a href="https://github.com/ilri/OpenRXV/issues/68">an issue</a> for the OpenRXV backend API container&rsquo;s missing <code>unoconv</code>
<ul>
<li>This causes PDF reports to not work, and I always have to manually reinstall it after rebooting the server</li>
</ul>
</li>
<li>A little bit more work on the CG Core v2 migration in CGSpace
<ul>
<li>I updated the <code>migrate-fields.sh</code> script for DSpace 6 and created all the new fields in my test instance</li>
</ul>
</li>
</ul>
<h2 id="2021-01-24">2021-01-24</h2>
<ul>
<li>Abenet mentioned that Alan Duncan could not find one of his items on AReS, but it is on CGSpace
<ul>
<li>The item is: <a href="https://hdl.handle.net/10568/110133">https://hdl.handle.net/10568/110133</a></li>
<li>The handle does not appear on AReS when I try to filter by Handle</li>
<li>I suspect it is related to the issue of the missing Livestock CRP community and I added a comment on <a href="https://github.com/ilri/OpenRXV/issues/62">the GitHub issue</a></li>
</ul>
</li>
<li>Import fifteen items to CGSpace for Peter after doing a brief check in OpenRefine and csv-metadata-quality</li>
<li>Ben Hack asked me why I&rsquo;m still using the default favicon on CGSpace
<ul>
<li>I used an <a href="https://commons.wikimedia.org/wiki/File:CGIAR-logo.svg">SVG version of the CGIAR logo</a> with <a href="https://realfavicongenerator.net">https://realfavicongenerator.net</a> to make a better favicon setup and it is currently running on DSpace Test</li>
</ul>
</li>
<li>Start a re-index on AReS
<ul>
<li>First delete the old Elasticsearch temp index:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, back up the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100699,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-25
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-25&#39;</span>
</span></span></code></pre></div><ul>
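<li>The order of the requests above matters, so here is a minimal Python sketch that only builds the sequence without sending anything. The function name <code>plan_index_swap</code> and its parameters are my own invention for illustration; the endpoints and the <code>index.blocks.write</code> setting are the ones used in the curl commands:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import json


def plan_index_swap(base_url, live, temp, backup):
    """Return the (method, url, body) requests for swapping temp into live."""
    block_writes = json.dumps({"settings": {"index.blocks.write": True}})

    def url(path):
        return base_url.rstrip("/") + "/" + path

    return [
        # An index must be write-blocked before Elasticsearch will clone it.
        ("PUT", url(live + "/_settings"), block_writes),
        ("POST", url(live + "/_clone/" + backup), None),
        ("DELETE", url(live), None),
        ("PUT", url(temp + "/_settings"), block_writes),
        ("POST", url(temp + "/_clone/" + live), None),
        ("DELETE", url(temp), None),
        # The backup is only removed once the new live index is in place.
        ("DELETE", url(backup), None),
    ]


for method, target, body in plan_index_swap(
    "http://localhost:9200",
    "openrxv-items",
    "openrxv-items-temp",
    "openrxv-items-2021-01-25",
):
    print(method, target, body or "")
</code></pre><ul>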
<li>Resume working on CG Core v2, I realized a few things:
<ul>
<li>We are trying to move from <code>dc.identifier.issn</code> (and ISBN) to <code>cg.issn</code>, but this is currently implemented as a &ldquo;qualdrop&rdquo; input in DSpace&rsquo;s submission form, which only works to fill in the qualifier (i.e. <code>dc.identifier.xxxx</code>)
<ul>
<li>If we really want to use <code>cg.issn</code> and <code>cg.isbn</code> we would need to add a new input field for each separately</li>
</ul>
</li>
<li>We are trying to move series name/number from <code>dc.relation.ispartofseries</code> to <code>dcterms.isPartOf</code>, but this uses a special &ldquo;series&rdquo; input type in DSpace&rsquo;s submission form that joins series name and number with a semicolon (;)
<ul>
<li>If we really want to do that we need to add two separate input fields for each</li>
</ul>
</li>
</ul>
</li>
</ul>
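<ul>
<li>If the series name and number do end up in separate fields, existing joined values would have to be split at the semicolon. This is a rough Python sketch of that step, assuming the stored value looks like <code>name;number</code>; the helper name <code>split_series</code> and the sample values are hypothetical:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def split_series(value):
    """Split a stored "name;number" series value into (name, number)."""
    # Split on the last semicolon so a semicolon inside the name survives.
    name, separator, number = value.rpartition(";")
    if not separator:
        # No semicolon: the whole value is the series name.
        return value.strip(), None
    return name.strip(), number.strip() or None


print(split_series("Example Research Report;45"))
print(split_series("Example Working Paper"))
</code></pre>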
<h2 id="2021-01-25">2021-01-25</h2>
<ul>
<li>Finish indexing AReS and adjusting the indexes (see above)</li>
<li>Merged the changes for the favicon in to the <code>6_x-prod</code> branch</li>
<li>Meeting with Peter and Abenet about CG Core v2
<ul>
<li>We agreed to go ahead with it ASAP and share a list of the changes with Macaroni, Fabio, and others and give them a firm timeline</li>
<li>We also discussed the CSV export option on DSpace 6 and were surprised to see that it kinda works</li>
<li>If you do a free-text search it works properly, but if you try to use the metadata filters it doesn&rsquo;t</li>
<li>I changed the default setting to make it available to any logged in user and will deploy it on CGSpace this week</li>
</ul>
</li>
</ul>
<h2 id="2021-01-26">2021-01-26</h2>
<ul>
<li>Email some CIAT users who submitted items with upper case AGROVOC terms
<ul>
<li>I will do another global replace soon after they reply</li>
</ul>
</li>
<li>Add CGIAR Impact Areas and UN Sustainable Development Goals (SDGs) to the <code>6_x-prod</code> branch</li>
<li>Looking into the issue with exporting search results in XMLUI again
<ul>
<li>I notice that there is an HTTP 400 when you try to export search results containing a filter</li>
<li>The Tomcat logs show:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>Jan 26, 2021 10:47:23 AM org.apache.coyote.http11.AbstractHttp11Processor process
INFO: Error parsing HTTP request header
Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
java.lang.IllegalArgumentException: Invalid character found in the request target [/discover/search/csv?query=*&amp;scope=~&amp;filters=author:(Alan\%20Orth)]. The valid characters are defined in RFC 7230 and RFC 3986
at org.apache.coyote.http11.InternalInputBuffer.parseRequestLine(InternalInputBuffer.java:213)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1108)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:654)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
</code></pre><ul>
<li>This actually seems to be a simple issue, as I notice DSpace is escaping the space for some reason:
<ul>
<li>The URL that fails is: <code>https://dspacetest.cgiar.org/discover/search/csv?query=*&amp;scope=~&amp;filters=author:(Alan\%20Orth)</code></li>
<li>The URL that works is: <code>https://dspacetest.cgiar.org/discover/search/csv?query=*&amp;scope=~&amp;filters=author:(Alan%20Orth)</code></li>
</ul>
</li>
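<li>To illustrate why Tomcat rejects the request in the log above: a literal backslash is not allowed in a request target, while a percent-encoded space is. This is only a sketch using Python&rsquo;s <code>urllib.parse.quote</code>, not DSpace&rsquo;s actual code, and the helper name <code>encode_filter</code> is made up:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">from urllib.parse import quote


def encode_filter(value):
    """Percent-encode a filter value, keeping ":", "(" and ")" literal."""
    return quote(value, safe=":()")


# A plain space becomes %20, as in the URL form that works.
print(encode_filter("author:(Alan Orth)"))
# A backslash would have to be sent as %5C to be a valid request target.
print(encode_filter("author:(Alan\\ Orth)"))
</code></pre><ul>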
<li>I <a href="https://jira.lyrasis.org/browse/DS-4566">filed a bug</a> on DSpace&rsquo;s issue tracker (though I accidentally hit Enter and submitted it before I finished, and there is no edit function)</li>
<li>Looking into a Linode report that the outbound traffic rate was high this morning:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># grep -E <span style="color:#e6db74">&#39;26/Jan/2021:(08|09|10|11|12)&#39;</span> /var/log/nginx/rest.log | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</span></span></code></pre></div><ul>
<li>The culprit seems to be the ILRI publications importer, so that&rsquo;s OK</li>
<li>But I also see an IP in Jordan hitting the REST API 1,100 times today:</li>
</ul>
<pre tabindex="0"><code>80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] &#34;GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0&#34; 302 138 &#34;http://wp.local/&#34; &#34;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36&#34;
</code></pre><ul>
<li>Seems to be someone from CodeObia working on WordPress
<ul>
<li>I told them to please use a bot user agent so it doesn&rsquo;t affect our stats, and to use DSpace Test if possible</li>
</ul>
</li>
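<li>For a quick count of hits per IP without goaccess, something like this Python sketch could work. It assumes nginx&rsquo;s &ldquo;combined&rdquo; log format with no escaped quotes inside the fields, and the function name <code>count_hits</code> is hypothetical:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">from collections import Counter


def count_hits(lines, referrer=None):
    """Count requests per client IP in nginx "combined" log lines."""
    hits = Counter()
    for line in lines:
        # Splitting on double quotes yields seven pieces:
        # prefix, request, status/bytes, referrer, gap, user agent, remainder
        fields = line.split('"')
        prefix = fields[0].split()
        if len(fields) != 7 or not prefix:
            continue  # skip lines that do not look like the combined format
        if referrer is not None and fields[3] != referrer:
            continue
        hits[prefix[0]] += 1
    return hits


sample = [
    '80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] '
    '"GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0" '
    '302 138 "http://wp.local/" '
    '"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/87.0.4280.88 Safari/537.36"',
]
print(count_hits(sample, referrer="http://wp.local/").most_common(5))
</code></pre><ul>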
<li>I purged all ~3,000 statistics hits that have the <code>http://wp.local/</code> referrer:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;referrer:http\:\/\/wp\.local\/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><ul>
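<li>The escaped query in the command above is easy to get wrong by hand, so here is a small Python sketch that builds the same delete payload. The helper names are hypothetical, and escaping every non-alphanumeric character is my own simplification (it happens to reproduce the escaping used above):</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import xml.etree.ElementTree as ET


def escape_term(value):
    """Backslash-escape every non-alphanumeric character in a query term."""
    return "".join(c if c.isalnum() else "\\" + c for c in value)


def delete_by_query(field, value):
    """Build the XML body for a Solr delete-by-query update request."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = field + ":" + escape_term(value)
    return ET.tostring(root, encoding="unicode")


print(delete_by_query("referrer", "http://wp.local/"))
</code></pre><ul>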
<li>Tag version 0.4.3 of the csv-metadata-quality tool on GitHub: <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3">https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3</a>
<ul>
<li>I just realized that I never submitted this to CGSpace as a Big Data Platform output</li>
<li>I used my previous <a href="https://hdl.handle.net/10568/99143">DSpace Statistics API submission</a> as a reference and submitted it to CGSpace</li>
</ul>
</li>
</ul>
<h2 id="2021-01-27">2021-01-27</h2>
<ul>
<li>Abenet approved my submission to CGSpace for the CSV metadata quality checker: <a href="https://hdl.handle.net/10568/110997">https://hdl.handle.net/10568/110997</a></li>
<li>Add SDGs and Impact Areas to the XMLUI item display</li>
<li>Last week Atmire got back to me about the duplicates in Solr
<ul>
<li>The deduplicator appears to be working, but you need to limit the number of records, for example <code>-r 100</code> so it doesn&rsquo;t crash due to memory</li>
<li>They pointed to a few records <code>solr_update_time_stamp:1605635765897</code> that have hundreds of duplicates which are now gone (still present if you look on the production server)</li>
<li>I need to try this again before doing it on CGSpace</li>
</ul>
</li>
</ul>
<h2 id="2021-01-28">2021-01-28</h2>
<ul>
<li>I did some more work on CG Core v2
<ul>
<li>I tested using <code>cg.number</code> twice in the submission form: once for journal issue, and once for series number</li>
<li>DSpace gets confused and ends up storing the number twice, even if you only enter it in one of the fields</li>
<li>I suggested to Marie that we use <code>cg.issue</code> for journal issue, since we&rsquo;re already going to use <code>cg.volume</code></li>
<li>That would free up <code>cg.number</code> for use by series number</li>
</ul>
</li>
<li>I deployed the SDGs, Impact Areas, and favicon changes to CGSpace and posted a note on Yammer for the editors
<ul>
<li>Also ran all system updates and rebooted the server (linode18)</li>
</ul>
</li>
</ul>
<h2 id="2021-01-31">2021-01-31</h2>
<ul>
<li>AReS Explorer has been down since yesterday for some reason
<ul>
<li>First I ran all updates and rebooted the server (linode20)</li>
<li>Then start a re-index, first deleting the old Elasticsearch temp index:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Sent out emails about CG Core v2 to Macaroni Bros, Fabio, Hector at CCAFS, Dani and Tariku</li>
<li>A bit more minor work on testing the series/report/journal changes for CG Core v2</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-06/">June, 2024</a></li>
<li><a href="/cgspace-notes/2024-05/">May, 2024</a></li>
<li><a href="/cgspace-notes/2024-04/">April, 2024</a></li>
<li><a href="/cgspace-notes/2024-03/">March, 2024</a></li>
<li><a href="/cgspace-notes/2024-02/">February, 2024</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>