<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2021" />
<meta property="og:description" content="2021-09-02
Troubleshooting the missing Altmetric scores on AReS
Turns out that I didn&rsquo;t actually fix them last month because the check for content.altmetric still exists, and I can&rsquo;t access the DOIs using _h.source.DOI for some reason
I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
I will change DOI to tomato in the repository setup and start a re-harvest&hellip; I need to see if this is some kind of reserved word or something&hellip;
Even as tomato I can&rsquo;t access that field as _h.source.tomato in Angular, but it does work as a filter source&hellip; sigh
I&rsquo;m having problems using the OpenRXV API
The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search query properly&hellip;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-09/" />
<meta property="article:published_time" content="2021-09-01T09:14:07+03:00" />
<meta property="article:modified_time" content="2021-10-04T11:10:54+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2021"/>
<meta name="twitter:description" content="2021-09-02
Troubleshooting the missing Altmetric scores on AReS
Turns out that I didn&rsquo;t actually fix them last month because the check for content.altmetric still exists, and I can&rsquo;t access the DOIs using _h.source.DOI for some reason
I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
I will change DOI to tomato in the repository setup and start a re-harvest&hellip; I need to see if this is some kind of reserved word or something&hellip;
Even as tomato I can&rsquo;t access that field as _h.source.tomato in Angular, but it does work as a filter source&hellip; sigh
I&rsquo;m having problems using the OpenRXV API
The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search query properly&hellip;
"/>
<meta name="generator" content="Hugo 0.133.1">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-09/",
"wordCount": "2864",
"datePublished": "2021-09-01T09:14:07+03:00",
"dateModified": "2021-10-04T11:10:54+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-09/">
<title>September, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-09/">September, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-09-01T09:14:07+03:00">Wed Sep 01, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-09-02">2021-09-02</h2>
<ul>
<li>Troubleshooting the missing Altmetric scores on AReS
<ul>
<li>Turns out that I didn&rsquo;t actually fix them last month because the check for <code>content.altmetric</code> still exists, and I can&rsquo;t access the DOIs using <code>_h.source.DOI</code> for some reason</li>
<li>I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!</li>
<li>I will change <code>DOI</code> to <code>tomato</code> in the repository setup and start a re-harvest&hellip; I need to see if this is some kind of reserved word or something&hellip;</li>
<li>Even as <code>tomato</code> I can&rsquo;t access that field as <code>_h.source.tomato</code> in Angular, but it does work as a filter source&hellip; sigh</li>
</ul>
</li>
<li>I&rsquo;m having problems using the OpenRXV API
<ul>
<li>The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search query properly&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2021-09-05">2021-09-05</h2>
<ul>
<li>Update Docker images on AReS server (linode20) and rebuild OpenRXV:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose build
</span></span></code></pre></div><ul>
<li>Then run system updates and reboot the server
<ul>
<li>After the system came back up I started a fresh re-harvesting</li>
</ul>
</li>
</ul>
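<ul>
<li>A minimal Python sketch of what the image-update pipeline above does with the <code>docker images</code> output, assuming the default column layout with the repository and tag in the first two columns; the image names in the sample are hypothetical, and the sketch only prints the pull commands instead of running them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def image_refs(docker_images_output):
    """Return repository:tag references from `docker images` text output."""
    refs = []
    for line in docker_images_output.splitlines():
        columns = line.split()
        # Skip blank lines and the header row (the `grep -v ^REPO` step).
        if not columns or columns[0] == "REPOSITORY":
            continue
        # Join the first two columns with a colon (the `sed` and `cut` steps).
        refs.append(":".join(columns[:2]))
    return refs


# Hypothetical sample output, not taken from the AReS server.
sample = """REPOSITORY   TAG       IMAGE ID       CREATED        SIZE
redis        6         0123456789ab   2 weeks ago    105MB
nginx        stable    ba9876543210   3 weeks ago    133MB
"""

for ref in image_refs(sample):
    print("docker pull", ref)
</code></pre></div>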
<h2 id="2021-09-07">2021-09-07</h2>
<ul>
<li>Checking last month&rsquo;s Solr statistics to see if there are any new bots that I need to purge and add to the list
<ul>
<li>78.203.225.68 made 50,000 requests on one day in August, and it is using this user agent: <code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36</code></li>
<li>It&rsquo;s a fixed line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser</li>
<li>130.255.162.154 is in Sweden and made 46,000 requests in August and it is using this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0</code></li>
<li>35.174.144.154 is on Amazon and made 28,000 requests with this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36</code></li>
<li>192.121.135.6 is in Sweden and made 9,000 requests with this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0</code></li>
<li>185.38.40.66 is in Germany and made 6,000 requests with this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4</code></li>
<li>3.225.28.105 is on Amazon and made 3,000 requests with this user agent: <code>Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36</code></li>
<li>I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li>
<li>I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.</li>
<li>I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again</li>
<li>While looking at the MSN requests I noticed tons of requests from another strange host using reverse IP DNS: malta2095.startdedicated.com., astra5139.startdedicated.com., and many others</li>
<li>They must be related, because I see them all using the exact same user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li>
<li>So this startdedicated.com DNS is some Bing bot also&hellip;</li>
</ul>
</li>
<li>I extracted all the IPs and purged them using my <code>check-spider-ip-hits.sh</code> script
<ul>
<li>In total I purged 225,000 hits&hellip;</li>
</ul>
</li>
</ul>
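<ul>
<li>A rough Python sketch of the reverse DNS check described above, assuming the hostnames have already been resolved (for example with <code>socket.gethostbyaddr</code>); the function name and the <code>example.org.</code> host are hypothetical, and the suffix list contains only the two domains mentioned in these notes:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Reverse DNS suffixes that the notes above treat as Bing/MSN crawlers.
BOT_SUFFIXES = (".search.msn.com", ".startdedicated.com")


def is_bot_host(hostname):
    """Return True if a reverse DNS name ends with a known bot suffix."""
    # PTR records are usually written with a trailing dot, so strip it first.
    return hostname.rstrip(".").lower().endswith(BOT_SUFFIXES)


hosts = [
    "msnbot-40-77-167-105.search.msn.com.",
    "malta2095.startdedicated.com.",
    "example.org.",  # hypothetical non-bot host for comparison
]

for host in hosts:
    print(host, is_bot_host(host))
</code></pre></div>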
<h2 id="2021-09-12">2021-09-12</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2021-09-13">2021-09-13</h2>
<ul>
<li>Mishell Portilla asked me about thumbnails on CGSpace being small
<ul>
<li>For example, <a href="https://cgspace.cgiar.org/handle/10568/114576">10568/114576</a> has a lot of white space on the left side</li>
<li>I created a new thumbnail with vipsthumbnail:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</span></span></code></pre></div><ul>
<li>Looking at the PDF&rsquo;s metadata I see:
<ul>
<li>Producer: iLovePDF</li>
<li>Creator: Adobe InDesign 15.0 (Windows)</li>
<li>Format: PDF-1.7</li>
</ul>
</li>
<li>Eventually I should do more tests on this and perhaps file a bug with DSpace&hellip;</li>
<li>Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
<ul>
<li>I told them I can give them access to DSpace Test and that we should have a meeting soon</li>
<li>We need to figure out what controlled vocabularies they should use</li>
</ul>
</li>
</ul>
<h2 id="2021-09-14">2021-09-14</h2>
<ul>
<li>Some people from the Alliance contacted me last week about AICCRA metadata
<ul>
<li>They have internal things called Components and Clusters, so they were asking how to store these in CGSpace</li>
<li>I suggested adding new metadata values: <code>cg.subject.aiccraComponent</code> and <code>cg.subject.aiccraCluster</code></li>
<li>On second thought, these are identifiers so perhaps this is better: <code>cg.identifier.aiccraComponent</code> and <code>cg.identifier.aiccraCluster</code></li>
</ul>
</li>
</ul>
<h2 id="2021-09-15">2021-09-15</h2>
<ul>
<li>Add ORCID identifier for new ILRI staff to our controlled vocabulary
<ul>
<li>Also tag their twenty-five existing items on CGSpace:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-09-15-add-orcids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Kotchofa, Pacem&#34;,&#34;Pacem Kotchofa: 0000-0002-1640-8807&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuuu&#39;</span>
</span></span></code></pre></div><ul>
<li>Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
<ul>
<li>I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using</li>
<li>I also told them that I would create some documentation listing the metadata fields, which ones are mandatory, and their respective controlled vocabularies</li>
</ul>
</li>
</ul>
<h2 id="2021-09-16">2021-09-16</h2>
<ul>
<li>Start writing a Python script to parse <code>input-forms.xml</code> to create documentation for submissions
<ul>
<li>Found a bug in the DSpace 6.3 REST API: it returns HTTP 500 for <code>dc.title</code> even though the field exists in the registry: <a href="https://demo.dspace.org/rest/registries/schema/dc/metadata-fields/title">https://demo.dspace.org/rest/registries/schema/dc/metadata-fields/title</a></li>
<li>Seems to be with any field that does not have a qualifier</li>
<li>I filed an issue: <a href="https://github.com/DSpace/DSpace/issues/7946">https://github.com/DSpace/DSpace/issues/7946</a></li>
</ul>
</li>
<li>I decided to update all the metadata field descriptions in our registry so I can use that instead of the &ldquo;hint&rdquo; for each field in the input form
<ul>
<li>I will include examples as well so that it becomes a better resource</li>
</ul>
</li>
</ul>
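<ul>
<li>A small Python sketch of how a metadata field maps to its registry URL, which is where fields without a qualifier hit the HTTP 500 described above; the helper names are hypothetical, and the path layout for qualified fields (qualifier appended as an extra path segment) is my assumption based on the unqualified URL shown above:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">BASE_URL = "https://demo.dspace.org/rest"


def field_name(schema, element, qualifier=None):
    """Build a dotted field name such as dc.title or dc.contributor.author."""
    parts = [schema, element]
    if qualifier:
        parts.append(qualifier)
    return ".".join(parts)


def registry_url(schema, element, qualifier=None):
    """Build the registry URL for a field (the qualified form is an assumption)."""
    path = [BASE_URL, "registries", "schema", schema, "metadata-fields", element]
    if qualifier:
        path.append(qualifier)
    return "/".join(path)


# Unqualified field: the case that returned HTTP 500 in the notes above.
print(field_name("dc", "title"), registry_url("dc", "title"))
# Qualified field, for comparison.
print(field_name("dc", "contributor", "author"), registry_url("dc", "contributor", "author"))
</code></pre></div>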
<h2 id="2021-09-17">2021-09-17</h2>
<ul>
<li>I filed <a href="https://github.com/AgriculturalSemantics/cg-core/issues/41">an issue about using SPDX License Identifiers in CG Core v2</a></li>
<li>Peter Ballantyne emailed me to say that CGSpace was very slow
<ul>
<li>The front page was returning a blank white page</li>
<li>I looked at the database and the connections look low:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>63
</span></span></code></pre></div><ul>
<li>Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin</li>
<li>But the DSpace log file shows tons of database issues:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-17
</span></span><span style="display:flex;"><span>14779
</span></span></code></pre></div><ul>
<li>The earliest one I see is around midnight (now is 2PM):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
</span></span><span style="display:flex;"><span>2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</span></span></code></pre></div><ul>
<li>But I was definitely logged into the site this morning so there were no issues then&hellip;</li>
<li>It seems that a few errors are normal, but there&rsquo;s obviously something wrong today:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-*
</span></span><span style="display:flex;"><span>dspace.log.2021-09-01:116
</span></span><span style="display:flex;"><span>dspace.log.2021-09-02:163
</span></span><span style="display:flex;"><span>dspace.log.2021-09-03:77
</span></span><span style="display:flex;"><span>dspace.log.2021-09-04:13
</span></span><span style="display:flex;"><span>dspace.log.2021-09-05:310
</span></span><span style="display:flex;"><span>dspace.log.2021-09-06:0
</span></span><span style="display:flex;"><span>dspace.log.2021-09-07:29
</span></span><span style="display:flex;"><span>dspace.log.2021-09-08:86
</span></span><span style="display:flex;"><span>dspace.log.2021-09-09:24
</span></span><span style="display:flex;"><span>dspace.log.2021-09-10:26
</span></span><span style="display:flex;"><span>dspace.log.2021-09-11:12
</span></span><span style="display:flex;"><span>dspace.log.2021-09-12:5
</span></span><span style="display:flex;"><span>dspace.log.2021-09-13:10
</span></span><span style="display:flex;"><span>dspace.log.2021-09-14:102
</span></span><span style="display:flex;"><span>dspace.log.2021-09-15:542
</span></span><span style="display:flex;"><span>dspace.log.2021-09-16:368
</span></span><span style="display:flex;"><span>dspace.log.2021-09-17:15235
</span></span></code></pre></div><ul>
<li>I restarted the server and DSpace came up fine&hellip; so it must have been some kind of fluke</li>
<li>Continue working on cleaning up and annotating the metadata registry on CGSpace
<ul>
<li>I removed two old metadata fields that we stopped using earlier this year with the CG Core v2 migration: <code>cg.targetaudience</code> and <code>cg.title.journal</code></li>
</ul>
</li>
</ul>
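<ul>
<li>The per-day counts from the <code>grep -c</code> output above could be checked for unusual days with a sketch like this one; the function names are hypothetical and the threshold of ten times the median is an arbitrary assumption, not something I have tuned. Only the last five days of the real output are included to keep it short:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from statistics import median

# Last five lines of the grep -c output shown above.
grep_output = """dspace.log.2021-09-13:10
dspace.log.2021-09-14:102
dspace.log.2021-09-15:542
dspace.log.2021-09-16:368
dspace.log.2021-09-17:15235
"""


def parse_counts(text):
    """Map each log file name to its error count."""
    counts = {}
    for line in text.splitlines():
        # Split on the last colon so the file name stays intact.
        name, _, count = line.rpartition(":")
        counts[name] = int(count)
    return counts


def unusual_days(counts, factor=10):
    """Return log files whose count exceeds factor times the median count."""
    threshold = factor * median(counts.values())
    return [name for name, count in counts.items() if count &gt; threshold]


print(unusual_days(parse_counts(grep_output)))
</code></pre></div>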
<h2 id="2021-09-18">2021-09-18</h2>
<ul>
<li>Make more progress on parsing and documenting the CGSpace submission form
<ul>
<li>Publish on GitHub: <a href="https://github.com/ilri/cgspace-submission-guidelines">https://github.com/ilri/cgspace-submission-guidelines</a></li>
</ul>
</li>
</ul>
<h2 id="2021-09-19">2021-09-19</h2>
<ul>
<li>Improve CGSpace Submission Guidelines metadata parsing and documentation
<ul>
<li>GitHub Pages is live now: <a href="https://ilri.github.io/cgspace-submission-guidelines/">https://ilri.github.io/cgspace-submission-guidelines/</a></li>
</ul>
</li>
<li>Start a full harvest on AReS
<ul>
<li>The harvest completed successfully, but for some reason there were only 92,000 items&hellip;</li>
<li>I updated all Docker images, rebuilt the application, then ran all system updates and rebooted the system:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose build
</span></span></code></pre></div><h2 id="2021-09-20">2021-09-20</h2>
<ul>
<li>I synchronized the production CGSpace PostgreSQL, Solr, and Assetstore data with DSpace Test</li>
<li>Over the weekend a few users reported that they could not log into CGSpace
<ul>
<li>I checked LDAP and it seems there is something wrong:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-ldap-account@cgiarad.org&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=someaccountnametocheck)&#34;</span>
</span></span><span style="display:flex;"><span>Enter LDAP Password:
</span></span><span style="display:flex;"><span>ldap_sasl_bind(SIMPLE): Can&#39;t contact LDAP server (-1)
</span></span></code></pre></div><ul>
<li>I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
<ul>
<li>It turns out that CGNET created a new Active Directory server (AZCGNEROOT3.cgiarad.org) and decommissioned the old one last week</li>
<li>I updated the configuration on CGSpace and confirmed that it is working</li>
</ul>
</li>
<li>Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p <span style="color:#e6db74">&#39;fuuuuuuuu&#39;</span>
</span></span></code></pre></div><ul>
<li>I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
<ul>
<li>According to my notes from <a href="/cgspace-notes/2020-10/">2020-10</a> the account must be in the admin group in order to submit via the REST API</li>
</ul>
</li>
<li>Run <code>dspace cleanup -v</code> process on CGSpace to clean up old bitstreams</li>
<li>Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 80901
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.donor&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 1274
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 8091
</span></span></code></pre></div><h2 id="2021-09-23">2021-09-23</h2>
<ul>
<li>Peter sent me back the corrections for the affiliations
<ul>
<li>It is about 1,280 corrections and fourteen deletions</li>
<li>I cleaned them up in csv-metadata-quality and then extracted the deletes and fixes to separate files to run with <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code>:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> /tmp/affiliations.csv &gt; /tmp/affiliations-delete.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -r <span style="color:#e6db74">&#39;^.+$&#39;</span> /tmp/affiliations.csv | csvgrep -i -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> &gt; /tmp/affiliations-fix.csv
</span></span><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">211</span>
</span></span><span style="display:flex;"><span>$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-delete.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -m <span style="color:#ae81ff">211</span>
</span></span></code></pre></div><ul>
<li>Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-09-23-affiliations.csv | sed 1d &gt; /tmp/affiliations.txt
</span></span></code></pre></div><ul>
<li>Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too</li>
<li>Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne</li>
<li>Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 1139
</span></span></code></pre></div><h2 id="2021-09-24">2021-09-24</h2>
<ul>
<li>Peter and Abenet agreed that we should consider converting more of our UPPER CASE metadata values to Title Case
<ul>
<li>It seems that these fields are all still using UPPER CASE:
<ul>
<li>cg.subject.alliancebiovciat</li>
<li>cg.species.breed</li>
<li>cg.subject.bioversity</li>
<li>cg.subject.ccafs</li>
<li>cg.subject.ciat</li>
<li>cg.subject.cip</li>
<li>cg.identifier.iitatheme</li>
<li>cg.subject.iita</li>
<li>cg.subject.ilri</li>
<li>cg.subject.pabra</li>
<li>cg.river.basin</li>
<li>cg.coverage.subregion (done)</li>
<li>dcterms.audience (done)</li>
<li>cg.subject.wle</li>
</ul>
</li>
<li>We can do some of these without even asking anyone, for example <code>cg.coverage.subregion</code>, <code>cg.river.basin</code>, and <code>dcterms.audience</code></li>
</ul>
</li>
<li>First, I will look at <code>cg.coverage.subregion</code>
<ul>
<li>These should ideally come from ISO 3166-2 subdivisions</li>
<li>I will title case them and then create a controlled vocabulary from those that match (and worry about cleaning the rest up later)</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
</span></span><span style="display:flex;"><span>UPDATE 2903
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.coverage.subregion&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
</span></span><span style="display:flex;"><span>COPY 1200
</span></span></code></pre></div><ul>
<li>Then I process the list for matches with my <code>subdivision-lookup.py</code> script, and extract only the values that matched:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m <span style="color:#e6db74">&#39;true&#39;</span> /tmp/subregions.csv | csvcut -c <span style="color:#ae81ff">1</span> | sed 1d &gt; /tmp/subregions-matched.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/subregions-matched.txt
</span></span><span style="display:flex;"><span>81 /tmp/subregions-matched.txt
</span></span></code></pre></div><ul>
<li>Then I updated the controlled vocabulary in the submission forms</li>
<li>I did the same for <code>dcterms.audience</code>, taking special care with a few all-caps values:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != &#39;NGOS&#39; AND text_value != &#39;CGIAR&#39;;
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=&#39;NGOs&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;NGOS&#39;;
</span></span></code></pre></div><ul>
<li>Update submission form comment for DOIs because it was still recommending people use the &ldquo;dx.doi.org&rdquo; format even though I batch updated all DOIs to the &ldquo;doi.org&rdquo; format a few times in the last year
<ul>
<li>Then I updated all existing metadata to the new format again:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
</span></span><span style="display:flex;"><span>UPDATE 49
</span></span></code></pre></div><h2 id="2021-09-26">2021-09-26</h2>
<ul>
<li>Mohammed Salem told me last week that MELSpace and WorldFish have been upgraded to DSpace 6 so I updated the repository setup in AReS to use the UUID field instead of IDs
<ul>
<li>This could explain why I had problems harvesting last week, when I only had 90,000 items&hellip;</li>
</ul>
</li>
<li>I started a fresh harvest on AReS
<ul>
<li>I realized that the sitemap on MELSpace is missing so AReS skips it, which means we cannot harvest right now&hellip; ouch</li>
<li>I sent a message to Salem and he fixed it quickly</li>
<li>I added WorldFish&rsquo;s DSpace Statistics API instance to AReS before starting the plugins and now our numbers are much higher, nice!</li>
</ul>
</li>
</ul>
<h2 id="2021-09-27">2021-09-27</h2>
<ul>
<li>Add CGIAR Action Area (cg.subject.actionArea) to CGSpace as Peter had asked me a few days ago</li>
</ul>
<h2 id="2021-09-28">2021-09-28</h2>
<ul>
<li>Francesca from the Alliance asked for help moving a bunch of reports from one collection to another on CGSpace
<ul>
<li>She is having problems with the &ldquo;move&rdquo; dialog taking minutes for each item</li>
<li>I exported the collection and sent her a copy with just the few fields she would need in order to mark the ones that need to move, then I can do the rest:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,collection,dc.title[en_US]&#39;</span> ~/Downloads/10568-106990.csv &gt; /tmp/2021-09-28-alliance-reports.csv
</span></span></code></pre></div><ul>
<li>She sent it back fairly quickly with a new column marked &ldquo;Move&rdquo; so I extracted those items that matched and set them to the new owning collection:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c Move -m <span style="color:#e6db74">&#39;Yes&#39;</span> ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed <span style="color:#e6db74">&#39;s_10568/106990_10568/111506_&#39;</span> &gt; /tmp/alliance-move.csv
</span></span></code></pre></div><ul>
<li>Maria from the Alliance emailed us to say that approving submissions was slow on CGSpace
<ul>
<li>I looked at the PostgreSQL activity and it seems low:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_stat_activity&#39; | wc -l
</span></span><span style="display:flex;"><span>59
</span></span></code></pre></div><ul>
<li>Locks look high though:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | sort | uniq -c | wc -l
</span></span><span style="display:flex;"><span>1154
</span></span></code></pre></div><ul>
<li>Indeed it seems something started causing locks to increase yesterday:</li>
</ul>
<p><img src="/cgspace-notes/2021/09/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
<ul>
<li>And query length increasing since yesterday:</li>
</ul>
<p><img src="/cgspace-notes/2021/09/postgres_querylength_ALL-week.png" alt="PostgreSQL query length week"></p>
<ul>
<li>The number of DSpace sessions is normal, hovering around 1,000&hellip;</li>
<li>Looking closer at the PostgreSQL activity log, I see the locks are all held by the <code>dspaceCli</code> user&hellip; which seems weird:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceCli&#39;;&#34; | wc -l
</span></span><span style="display:flex;"><span>1096
</span></span></code></pre></div><ul>
<li>Now I&rsquo;m wondering why there are no connections from <code>dspaceApi</code> or <code>dspaceWeb</code>. Could it be that our Tomcat JDBC pooling via JNDI isn&rsquo;t working?
<ul>
<li>I see the same thing on DSpace Test hmmmm</li>
<li>The configuration in <code>server.xml</code> is correct, but something might have broken when I switched to pulling the updated JDBC driver in via <code>pom.xml</code> instead of dropping it in the Tomcat lib directory&hellip;</li>
<li>I downloaded the latest JDBC jar and put it in Tomcat&rsquo;s lib directory on DSpace Test and after restarting Tomcat I can see connections from <code>dspaceWeb</code> and <code>dspaceApi</code> again</li>
<li>I will do the same on CGSpace and then revert the JDBC change in Ansible and DSpace <code>pom.xml</code></li>
</ul>
</li>
</ul>
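<ul>
<li>A minimal sketch of how the lock rows above could be tallied per <code>application_name</code> instead of counted with <code>wc -l</code>. The sample rows and the <code>locks_by_application</code> helper are hypothetical; in practice the pairs would come from the same <code>pg_locks</code> and <code>pg_stat_activity</code> join used above:</li>
</ul>

```python
from collections import Counter

# Hypothetical sample of (pid, application_name) pairs, standing in for rows
# returned by joining pg_locks to pg_stat_activity on pid.
sample_lock_rows = [
    (101, "dspaceCli"),
    (101, "dspaceCli"),
    (102, "dspaceCli"),
    (103, "dspaceWeb"),
    (104, ""),
]


def locks_by_application(rows):
    """Count lock rows per application_name, labelling blanks as 'unknown'."""
    return Counter(name or "unknown" for _pid, name in rows)


tally = locks_by_application(sample_lock_rows)
for name, count in tally.most_common():
    print(f"{name}: {count}")
```

<ul>
<li>A per-application tally like this would show right away whether pools such as <code>dspaceWeb</code> and <code>dspaceApi</code> are missing entirely, which is what pointed to the JDBC driver problem here</li>
</ul>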
<h2 id="2021-09-29">2021-09-29</h2>
<ul>
<li>Export a list of ILRI subjects from CGSpace to validate against AGROVOC for Peter and Abenet:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
</span></span><span style="display:flex;"><span>COPY 149
</span></span></code></pre></div><ul>
<li>Then validate and format the matches:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
</span></span><span style="display:flex;"><span>$ csvcut -c subject,<span style="color:#e6db74">&#39;match type&#39;</span> /tmp/2021-09-29-ilri-subjects.csv | sed -e <span style="color:#e6db74">&#39;s/match type/matched/&#39;</span> -e <span style="color:#e6db74">&#39;s/\(alt\|pref\)Label/yes/&#39;</span> &gt; /tmp/2021-09-29-ilri-subjects2.csv
</span></span></code></pre></div><ul>
<li>I talked to Salem about depositing from MEL to CGSpace
<ul>
<li>He mentioned that the one issue is that when you deposit to a workflow you don&rsquo;t get a Handle or any kind of identifier back!</li>
<li>We might have to come to some kind of agreement that they deposit items without going into the workflow but that we have some kind of edit role in MEL</li>
<li>He also said that they are looking into using the Research Organization Registry (RoR) in MEL, at least adding the <code>ror_id</code> and storing it</li>
<li>I need to propose this to Peter again and perhaps start aligning our affiliations more closely (I could even do something like the country codes, with a process that scans every day)</li>
</ul>
</li>
<li>Talk to Moayad about OpenRXV
<ul>
<li>We decided that we&rsquo;d keep harvesting all the Handles from the Altmetric prefix API, but then have a plugin to retrieve DOI scores that we can run manually</li>
</ul>
</li>
</ul>
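<ul>
<li>The <code>csvcut</code> and <code>sed</code> formatting step earlier in this entry could also be sketched in Python. This is only an illustration: the sample rows, the <code>extra</code> column and the <code>format_matches</code> helper are hypothetical, and only the <code>subject</code> and <code>match type</code> column names come from the commands above:</li>
</ul>

```python
import csv
import io

# Hypothetical input standing in for the agrovoc-lookup.py output; only the
# "subject" and "match type" column names come from the commands in this entry.
sample_csv = (
    "subject,match type,extra\n"
    "LIVESTOCK,prefLabel,x\n"
    "CATTLE,altLabel,y\n"
    "NOT A TERM,,z\n"
)


def format_matches(text):
    """Keep subject and match type, rename the latter and collapse labels to 'yes'."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["subject", "matched"])
    for row in csv.DictReader(io.StringIO(text)):
        match = row["match type"]
        # Only altLabel and prefLabel become "yes"; other values pass through
        matched = "yes" if match in ("altLabel", "prefLabel") else match
        writer.writerow([row["subject"], matched])
    return out.getvalue()


result = format_matches(sample_csv)
print(result)
```

<ul>
<li>Unlike the <code>sed</code> expressions, which substitute anywhere on a line, this version only rewrites the match column, so a subject that happened to contain the text <code>prefLabel</code> would be left alone</li>
</ul>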
<h2 id="2021-09-30">2021-09-30</h2>
<ul>
<li>Look over 292 non-IWMI publications from Udana for inclusion into the Virtual library on water management collection on CGSpace
<ul>
<li>I did some minor cleanup to remove blank columns and ran it through the csv-metadata-quality tool</li>
<li>I told him to add licenses and journal volume/issue and asked Abenet for input as well</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
<li><a href="/cgspace-notes/2024-08/">August, 2024</a></li>
<li><a href="/cgspace-notes/2024-07/">July, 2024</a></li>
<li><a href="/cgspace-notes/2024-06/">June, 2024</a></li>
<li><a href="/cgspace-notes/2024-05/">May, 2024</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>