<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="June, 2018" />
<meta property="og:description" content="2018-06-04
Test the DSpace 5.8 module upgrades from Atmire (#378)
There seems to be a problem with the CUA and L&amp;R versions in pom.xml because they are using SNAPSHOT and it doesn&rsquo;t build
I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
I proofed and tested the ILRI author corrections that Peter sent back to me this week:
$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
Time to index ~70,000 items on CGSpace:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
sys 2m7.289s
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-06/" />
<meta property="article:published_time" content="2018-06-04T19:49:54-07:00" />
<meta property="article:modified_time" content="2020-02-17T11:38:34+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2018"/>
<meta name="twitter:description" content="2018-06-04
Test the DSpace 5.8 module upgrades from Atmire (#378)
There seems to be a problem with the CUA and L&amp;R versions in pom.xml because they are using SNAPSHOT and it doesn&rsquo;t build
I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
I proofed and tested the ILRI author corrections that Peter sent back to me this week:
$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
Time to index ~70,000 items on CGSpace:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
<meta name="generator" content="Hugo 0.100.2" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "June, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-06/",
"wordCount": "2894",
"datePublished": "2018-06-04T19:49:54-07:00",
"dateModified": "2020-02-17T11:38:34+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-06/">
<title>June, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2018-06/">June, 2018</a></h2>
<p class="blog-post-meta">
<time datetime="2018-06-04T19:49:54-07:00">Mon Jun 04, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2018-06-04">2018-06-04</h2>
<ul>
<li>Test the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 module upgrades from Atmire</a> (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>)
<ul>
<li>There seems to be a problem with the CUA and L&amp;R versions in <code>pom.xml</code> because they are using SNAPSHOT and it doesn&rsquo;t build</li>
</ul>
</li>
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
sys 2m7.289s
</code></pre><h2 id="2018-06-06">2018-06-06</h2>
<ul>
<li>It turns out that I needed to add a server block for <code>atmire.com-snapshots</code> to my Maven settings, so now the Atmire code builds</li>
<li>Now Maven and Ant run properly, but I&rsquo;m getting SQL migration errors in <code>dspace.log</code> after starting Tomcat</li>
<li>I&rsquo;ve updated my ticket on Atmire&rsquo;s bug tracker: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560</a></li>
</ul>
<h2 id="2018-06-07">2018-06-07</h2>
<ul>
<li>Proofing 200 IITA records on DSpace Test for Sisay: <a href="https://dspacetest.cgiar.org/handle/10568/95391">IITA_Junel_06 (10568/95391)</a>
<ul>
<li>Misspelled authorship type: &ldquo;CGAIR single center&rdquo; should be &ldquo;CGIAR single centre&rdquo;</li>
<li>I see some encoding errors in author affiliations, for example:
<ul>
<li>Universidade de SÆo Paulo</li>
<li>Institut National des Recherches Agricoles du B nin</li>
<li>Centre de Coop ration Internationale en Recherche Agronomique pour le D veloppement</li>
<li>Institut des Recherches Agricoles du B nin</li>
<li>Institut des Savannes, C te d&rsquo; Ivoire</li>
<li>Institut f r Pflanzenpathologie und Pflanzenschutz der Universit t, Germany</li>
<li>Projet de Gestion des Ressources Naturelles, B nin</li>
<li>Universit t Hannover</li>
<li>Universit F lix Houphouet-Boigny</li>
</ul>
</li>
</ul>
</li>
<li>I uploaded fixes for all those now, but I will continue with the rest of the data later</li>
<li>Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>delete from schema_version where version = &#39;5.6.2015.12.03.2&#39;;
update schema_version set version = &#39;5.6.2015.12.03.2&#39; where version = &#39;5.5.2015.12.03.2&#39;;
update schema_version set version = &#39;5.8.2015.12.03.3&#39; where version = &#39;5.5.2015.12.03.3&#39;;
</code></pre><ul>
<li>And then I need to run the ignored migrations:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database migrate ignored
</code></pre><ul>
<li>Now DSpace starts up properly!</li>
<li>Gabriela from CIP got back to me about the author names we were correcting on CGSpace</li>
<li>I did a quick sanity check on them and then did a test import with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-values.py</code></a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
</code></pre><ul>
<li>I will apply them on CGSpace tomorrow I think&hellip;</li>
</ul>
<h2 id="2018-06-09">2018-06-09</h2>
<ul>
<li>It&rsquo;s pretty annoying, but the JVM monitoring for Munin was never set up when I migrated DSpace Test to its new server a few months ago</li>
<li>I ran the tomcat and munin-node tags in Ansible again and now the stuff is all wired up and recording stats properly</li>
<li>I applied the CIP author corrections on CGSpace and DSpace Test and re-ran the Discovery indexing</li>
</ul>
<h2 id="2018-06-10">2018-06-10</h2>
<ul>
<li>I spent some time removing the Atmire Metadata Quality Module (MQM) from the proposed DSpace 5.8 changes</li>
<li>After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it removed, but there is a Spring error during Tomcat startup:</li>
</ul>
<pre tabindex="0"><code> INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0&#39; defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name &#39;itemCollectionPlugin&#39; defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
</code></pre><ul>
<li>I can fix this by commenting out the <code>ItemCollectionPlugin</code> line of <code>discovery.xml</code>, but from looking at the git log I&rsquo;m not actually sure if that is related to MQM or not</li>
<li>I will have to ask Atmire</li>
<li>I continued to look at Sisay&rsquo;s IITA records from last week
<ul>
<li>I normalized all DOIs to use HTTPS and &ldquo;doi.org&rdquo; instead of &ldquo;dx.doi.org&rdquo;</li>
<li>I cleaned up white space in <code>cg.subject.iita</code> and <code>dc.subject</code></li>
<li>Even a bunch of IITA and AGROVOC subjects are missing accents, ie &ldquo;FERTILIT DU SOL&rdquo;</li>
<li>More organization names in <code>dc.description.sponsorship</code> are incorrect (ie, missing accents) or inconsistent (ie, CGIAR centers should be spelled in English or multiple spellings of the same one, like &ldquo;Rockefeller Foundation&rdquo; and &ldquo;Rockefeller foundation&rdquo;)</li>
<li>A few dozen items have abstracts with character encoding errors, ie:
<ul>
<li>33.7øC</li>
<li>MgSO4ú7H2O</li>
<li>ha??1&amp;/sup;</li>
<li>En gen6ral</li>
<li>dÕpassÕ</li>
</ul>
</li>
<li>Also the abstracts have missing accents, ie &ldquo;recherche sur le d veloppement&rdquo;</li>
</ul>
</li>
<li>I will have to tell IITA people to redo these entirely I think&hellip;</li>
</ul>
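<p>The DOI normalization mentioned above (HTTPS and &ldquo;doi.org&rdquo; instead of &ldquo;dx.doi.org&rdquo;) could be sketched in Python roughly as follows. This is only an illustration and not the OpenRefine transformation that was actually used; the function name <code>normalize_doi</code> and the sample DOI <code>10.1000/xyz123</code> are hypothetical.</p>

```python
import re

# Optional scheme followed by the doi.org or dx.doi.org host.
# Assumption: these are the only URL forms present in the data.
DOI_PREFIX = re.compile(r"^(?:https?://)?(?:dx\.)?doi\.org/", re.IGNORECASE)


def normalize_doi(value: str) -> str:
    """Rewrite a DOI URL to the https://doi.org/ form.

    Values that do not look like a doi.org URL are returned unchanged
    (apart from trimming surrounding whitespace).
    """
    value = value.strip()
    return DOI_PREFIX.sub("https://doi.org/", value, count=1)


if __name__ == "__main__":
    # Hypothetical sample value, not taken from the IITA records
    print(normalize_doi("http://dx.doi.org/10.1000/xyz123"))
```

<p>Values that are already in the preferred form pass through unchanged, so the function should be safe to apply to a whole column more than once.</p>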
<h2 id="2018-06-11">2018-06-11</h2>
<ul>
<li>Sisay sent a new version of the last IITA records that he created from the original CSV from IITA</li>
<li>The 200 records are in the <a href="https://dspacetest.cgiar.org/handle/10568/95870">IITA_Junel_11 (10568/95870)</a> collection</li>
<li>Many errors:
<ul>
<li>Authorship types: &ldquo;CGIAR ans advanced research institute&rdquo;, &ldquo;CGAIR and advanced research institute&rdquo;, &ldquo;CGIAR and advanced research institutes&rdquo;, &ldquo;CGAIR single center&rdquo;</li>
<li>Lots of inconsistencies and misspellings in author affiliations:
<ul>
<li>&ldquo;Institut des Recherches Agricoles du Bénin&rdquo; and &ldquo;Institut National des Recherche Agricoles du Benin&rdquo; and &ldquo;National Agricultural Research Institute, Benin&rdquo;</li>
<li>International Insitute of Tropical Agriculture</li>
<li>Centro Internacional de Agricultura Tropical</li>
<li>&ldquo;Rivers State University of Science and Technology&rdquo; and &ldquo;Rivers State University&rdquo;</li>
<li>&ldquo;Institut de la Recherche Agronomique, Cameroon&rdquo; and &ldquo;Institut de Recherche Agronomique, Cameroon&rdquo;</li>
</ul>
</li>
<li>Inconsistency in countries: &ldquo;COTE D’IVOIRE&rdquo; and &ldquo;COTE D&#39;IVOIRE&rdquo;</li>
<li>A few DOIs with spaces or invalid characters</li>
<li>Inconsistency in IITA subjects, for example &ldquo;PRODUCTION VEGETALE&rdquo; and &ldquo;PRODUCTION VÉGÉTALE&rdquo; and several others</li>
<li>I ran <code>value.unescape('javascript')</code> on the abstract and citation fields because it looks like this data came from a SQL database and some stuff was escaped</li>
</ul>
</li>
<li>It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede&rsquo;s original file it doesn&rsquo;t have all those corrections</li>
<li>So I told Sisay to re-create the collection using Abenet&rsquo;s XLS from last week (<code>Mercy1805_AY.xls</code>)</li>
<li>I was curious to see if I could create a GREL expression for use with a custom text facet in OpenRefine to find cells with two or more consecutive spaces</li>
<li>I always use the built-in trim and collapse transformations anyways, but this seems to work to find the offending cells: <code>isNotNull(value.match(/.*?\s{2,}.*?/))</code></li>
<li>I wonder if I should start checking for &ldquo;smart&rdquo; quotes like ’ (hex 2019)</li>
</ul>
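<p>For checking the same things outside of OpenRefine, a rough Python equivalent of the consecutive-spaces facet and the smart quote (hex 2019) check might look like the sketch below. The function names are hypothetical. As far as I understand, GREL&rsquo;s <code>match()</code> has to match the whole string, which is why the GREL pattern is wrapped in <code>.*?</code>, whereas Python&rsquo;s <code>search()</code> looks anywhere in the value.</p>

```python
import re

# Two or more consecutive whitespace characters anywhere in the value
CONSECUTIVE_SPACES = re.compile(r"\s{2,}")

# U+2019 RIGHT SINGLE QUOTATION MARK, the "smart" quote mentioned above
SMART_QUOTE = "\u2019"


def has_consecutive_spaces(value: str) -> bool:
    """Return True if the value contains a run of two or more whitespace characters."""
    return CONSECUTIVE_SPACES.search(value) is not None


def has_smart_quote(value: str) -> bool:
    """Return True if the value contains U+2019."""
    return SMART_QUOTE in value


if __name__ == "__main__":
    for cell in ["Rockefeller  Foundation", "COTE D\u2019IVOIRE", "Rivers State University"]:
        print(repr(cell), has_consecutive_spaces(cell), has_smart_quote(cell))
```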
<h2 id="2018-06-12">2018-06-12</h2>
<ul>
<li>Udana from IWMI asked about the OAI base URL for their community on CGSpace
<ul>
<li>I think it should be this: <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_16814">https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_16814</a></li>
<li>The style sheet obfuscates the data, but if you look at the source it is all there, including information about pagination of results</li>
</ul>
</li>
<li>Regarding Udana&rsquo;s Book Chapters and Reports on DSpace Test last week, Abenet told him to fix some character encoding and CRP issues, then I told him I&rsquo;d check them after that</li>
<li>The latest batch of IITA&rsquo;s 200 records (based on Abenet&rsquo;s version <code>Mercy1805_AY.xls</code>) are now in the <a href="https://dspacetest.cgiar.org/handle/10568/96071">IITA_Jan_9_II_Ab</a> collection</li>
<li>So here are some corrections:
<ul>
<li>use of Unicode smart quote (hex 2019) in countries and affiliations, for example &ldquo;COTE D’IVOIRE&rdquo; and &ldquo;Institut d’Economic Rurale, Mali&rdquo;</li>
<li>inconsistencies in <code>cg.contributor.affiliation</code>:
<ul>
<li>&ldquo;Centro Internacional de Agricultura Tropical&rdquo; and &ldquo;Centro International de Agricultura Tropical&rdquo; should use the English name of CIAT (International Center for Tropical Agriculture)</li>
<li>&ldquo;Institut International d&rsquo;Agriculture Tropicale&rdquo; should use the English name of IITA (International Institute of Tropical Agriculture)</li>
<li>&ldquo;East and Southern Africa Regional Center&rdquo; and &ldquo;Eastern and Southern Africa Regional Centre&rdquo;</li>
<li>&ldquo;Institut de la Recherche Agronomique, Cameroon&rdquo; and &ldquo;Institut de Recherche Agronomique, Cameroon&rdquo;</li>
<li>&ldquo;Institut des Recherches Agricoles du Bénin&rdquo; and &ldquo;Institut National des Recherche Agricoles du Benin&rdquo; and &ldquo;National Agricultural Research Institute, Benin&rdquo;</li>
<li>&ldquo;Institute of Agronomic Research, Cameroon&rdquo; and &ldquo;Institute of Agronomy Research, Cameroon&rdquo;</li>
<li>&ldquo;Rivers State University&rdquo; and &ldquo;Rivers State University of Science and Technology&rdquo;</li>
<li>&ldquo;Universität Hannover&rdquo; and &ldquo;University of Hannover&rdquo;</li>
</ul>
</li>
<li>inconsistencies in <code>cg.subject.iita</code>:
<ul>
<li>&ldquo;AMELIORATION DES PLANTES&rdquo; and &ldquo;AMÉLIORATION DES PLANTES&rdquo;</li>
<li>&ldquo;PRODUCTION VEGETALE&rdquo; and &ldquo;PRODUCTION VÉGÉTALE&rdquo;</li>
<li>&ldquo;CONTRÔLE DE MALADIES&rdquo; and &ldquo;CONTROLE DES MALADIES&rdquo;</li>
<li>&ldquo;HANDLING, TRANSPORT, STORAGE AND PROTECTION OF AGRICULTURAL PRODUCT&rdquo; and &ldquo;HANDLING, TRANSPORT, STORAGE AND PROTECTION OF AGRICULTURAL PRODUCTS&rdquo;</li>
<li>&ldquo;RAVAGEURS DE PLANTES&rdquo; and &ldquo;RAVAGEURS DES PLANTES&rdquo;</li>
<li>&ldquo;SANTE DES PLANTES&rdquo; and &ldquo;SANTÉ DES PLANTES&rdquo;</li>
<li>&ldquo;SOCIOECONOMIE&rdquo; and &ldquo;SOCIOECONOMY&rdquo;</li>
</ul>
</li>
<li>inconsistencies in <code>dc.description.sponsorship</code>:
<ul>
<li>&ldquo;Belgian Corporation&rdquo; and &ldquo;Belgium Corporation&rdquo;</li>
</ul>
</li>
<li>inconsistencies in <code>dc.subject</code>:
<ul>
<li>&ldquo;AFRICAN CASSAVA MOSAIC&rdquo; and &ldquo;AFRICAN CASSAVA MOSAIC DISEASE&rdquo;</li>
<li>&ldquo;ASPERGILLU FLAVUS&rdquo; and &ldquo;ASPERGILLUS FLAVUS&rdquo;</li>
<li>&ldquo;BIOTECHNOLOGIES&rdquo; and &ldquo;BIOTECHNOLOGY&rdquo;</li>
<li>&ldquo;CASSAVA MOSAIC DISEASE&rdquo; and &ldquo;CASSAVA MOSAIC DISEASES&rdquo; and &ldquo;CASSAVA MOSAIC VIRUS&rdquo;</li>
<li>&ldquo;CASSAVA PROCESSING&rdquo; and &ldquo;CASSAVA PROCESSING TECHNOLOGY&rdquo;</li>
<li>&ldquo;CROPPING SYSTEM&rdquo; and &ldquo;CROPPING SYSTEMS&rdquo;</li>
<li>&ldquo;DRY SEASON&rdquo; and &ldquo;DRY-SEASON&rdquo;</li>
<li>&ldquo;FERTILIZER&rdquo; and &ldquo;FERTILIZERS&rdquo;</li>
<li>&ldquo;LEGUME&rdquo; and &ldquo;LEGUMES&rdquo;</li>
<li>&ldquo;LEGUMINOSAE&rdquo; and &ldquo;LEGUMINOUS&rdquo;</li>
<li>&ldquo;LEGUMINOUS COVER CROP&rdquo; and &ldquo;LEGUMINOUS COVER CROPS&rdquo;</li>
<li>&ldquo;MATÉRIEL DE PLANTATION&rdquo; and &ldquo;MATÉRIELS DE PLANTATION&rdquo;</li>
</ul>
</li>
<li>I noticed that some records do have encoding errors in the <code>dc.description.abstract</code> field, but only four of them so probably not from Abenet&rsquo;s handling of the XLS file</li>
<li>Based on manually eyeballing the text I used a custom text facet with this GREL to identify the records:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>or(
value.contains(&#39;&#39;),
value.contains(&#39;6g&#39;),
value.contains(&#39;6m&#39;),
value.contains(&#39;6d&#39;),
value.contains(&#39;6e&#39;)
)
</code></pre><ul>
<li>So IITA should double check the abstracts for these:
<ul>
<li><a href="https://dspacetest.cgiar.org/10568/96184">https://dspacetest.cgiar.org/10568/96184</a></li>
<li><a href="https://dspacetest.cgiar.org/10568/96141">https://dspacetest.cgiar.org/10568/96141</a></li>
<li><a href="https://dspacetest.cgiar.org/10568/96118">https://dspacetest.cgiar.org/10568/96118</a></li>
<li><a href="https://dspacetest.cgiar.org/10568/96113">https://dspacetest.cgiar.org/10568/96113</a></li>
</ul>
</li>
</ul>
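<p>Several of the subject inconsistencies listed above differ only by accents or letter case, for example &ldquo;PRODUCTION VEGETALE&rdquo; and &ldquo;PRODUCTION VÉGÉTALE&rdquo;. A minimal Python sketch for surfacing those automatically could group values by an accent-stripped, case-folded key. This is not how the list above was produced (that was done by eye in OpenRefine), and the function names <code>fold</code> and <code>find_variants</code> are hypothetical.</p>

```python
import unicodedata
from collections import defaultdict


def fold(value: str) -> str:
    """Build a comparison key by stripping accents, case, and surrounding whitespace."""
    decomposed = unicodedata.normalize("NFKD", value.strip())
    # Drop the combining marks left behind by decomposition (e.g. the acute in É)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def find_variants(values):
    """Return {key: sorted spellings} for every key with more than one distinct spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fold(value)].add(value)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


if __name__ == "__main__":
    subjects = [
        "PRODUCTION VEGETALE",
        "PRODUCTION VÉGÉTALE",
        "SANTE DES PLANTES",
        "SANTÉ DES PLANTES",
        "SOCIOECONOMIE",
    ]
    for key, spellings in find_variants(subjects).items():
        print(key, spellings)
```

<p>This only catches accent and case differences. Singular and plural pairs like &ldquo;LEGUME&rdquo; and &ldquo;LEGUMES&rdquo; or genuinely different wordings would still need clustering in OpenRefine or manual review.</p>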
<h2 id="2018-06-13">2018-06-13</h2>
<ul>
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>The contents of <code>2018-06-13-Robin-Buruchara.csv</code> were:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Buruchara, Robin&#34;,Robin Buruchara: 0000-0003-0934-1218
&#34;Buruchara, Robin A.&#34;,Robin Buruchara: 0000-0003-0934-1218
</code></pre><ul>
<li>On a hunch I checked to see if CGSpace&rsquo;s bitstream cleanup was working properly and of course it&rsquo;s broken:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(152402) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>As always, the solution is to clear the reference to that bitstream ID manually in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);&#39;
UPDATE 1
</code></pre><h2 id="2018-06-14">2018-06-14</h2>
<ul>
<li>Check through Udana&rsquo;s IWMI records from last week on DSpace Test</li>
<li>There were only some minor whitespace and one or two syntax errors, but they look very good otherwise</li>
<li>I uploaded the twenty-four reports to the IWMI Reports collection: <a href="https://cgspace.cgiar.org/handle/10568/36188">https://cgspace.cgiar.org/handle/10568/36188</a></li>
<li>I uploaded the seventy-six book chapters to the IWMI Book Chapters collection: <a href="https://cgspace.cgiar.org/handle/10568/36178">https://cgspace.cgiar.org/handle/10568/36178</a></li>
</ul>
<h2 id="2018-06-24">2018-06-24</h2>
<ul>
<li>I was restoring a PostgreSQL dump on my test machine and found a way to restore the CGSpace dump as the <code>postgres</code> user, but have the owner of the schema be the <code>dspacetest</code> user:</li>
</ul>
<pre tabindex="0"><code>$ dropdb -h localhost -U postgres dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest /tmp/cgspace_2018-06-24.backup
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
</code></pre><ul>
<li>The <code>-O</code> option to <code>pg_restore</code> makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore</li>
<li>I always prefer to use the <code>postgres</code> user locally because it&rsquo;s just easier than remembering the <code>dspacetest</code> user&rsquo;s password, but then I couldn&rsquo;t figure out why the resulting schema was owned by <code>postgres</code></li>
<li>So with this you connect as the <code>postgres</code> superuser and then switch roles to <code>dspacetest</code> (also, make sure this user has <code>superuser</code> privileges before the restore)</li>
<li>Last week Linode emailed me to say that our Linode 8192 instance used for DSpace Test qualified for an upgrade</li>
<li>Apparently they announced some <a href="https://blog.linode.com/2018/05/17/updated-linode-plans-new-larger-linodes/">upgrades to most of their plans in 2018-05</a></li>
<li>After the upgrade I see we have more disk space available in the instance&rsquo;s dashboard, so I shut the instance down and resized it from 98GB to 160GB</li>
<li>The resize was very quick (less than one minute) and after booting the instance back up I now have 160GB for the root filesystem!</li>
<li>I will move the DSpace installation directory back to the root file system and delete the extra 300GB block storage, as it was actually kinda slow when we put Solr there and now we don&rsquo;t actually need it anymore because running the production Solr on this instance didn&rsquo;t work well with 8GB of RAM</li>
<li>Also, the larger instance we&rsquo;re using for CGSpace will go from 24GB of RAM to 32, and will also get a storage increase from 320GB to 640GB&hellip; that means we don&rsquo;t need to consider using block storage right now!</li>
<li>The smaller instances get increased storage and network speed but I doubt many are actually using much of their current allocations so we probably don&rsquo;t need to bother with upgrading them</li>
<li>Last week Abenet asked if we could add <code>dc.language.iso</code> to the advanced search filters</li>
<li>There is already a search filter for this field defined in <code>discovery.xml</code> but we aren&rsquo;t using it, so I quickly enabled and tested it, then merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/380">#380</a>)</li>
<li>Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:</li>
</ul>
<pre tabindex="0"><code>Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
</code></pre><ul>
<li>It took me a while to figure out that this migration is for MQM, which I removed after Atmire&rsquo;s original advice about the migrations, so we actually need to delete this migration instead of updating it</li>
<li>So I need to make sure to run the following during the DSpace 5.8 upgrade:</li>
</ul>
<pre tabindex="0"><code>-- Delete existing CUA 4 migration if it exists
delete from schema_version where version = &#39;5.6.2015.12.03.2&#39;;
-- Update version of CUA 4 migration
update schema_version set version = &#39;5.6.2015.12.03.2&#39; where version = &#39;5.5.2015.12.03.2&#39;;
-- Delete MQM migration since we&#39;re no longer using it
delete from schema_version where version = &#39;5.5.2015.12.03.3&#39;;
</code></pre><ul>
<li>After that you can run the migrations manually and then DSpace should work fine:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database migrate ignored
...
Done.
</code></pre><ul>
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis&rsquo; items on CGSpace</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p &#39;fuuu&#39;
</code></pre><ul>
<li>The contents of <code>2018-06-24-andy-jarvis-orcid.csv</code> were:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Jarvis, A.&#34;,Andy Jarvis: 0000-0001-6543-0798
&#34;Jarvis, Andy&#34;,Andy Jarvis: 0000-0001-6543-0798
&#34;Jarvis, Andrew&#34;,Andy Jarvis: 0000-0001-6543-0798
</code></pre><h2 id="2018-06-26">2018-06-26</h2>
<ul>
<li>Atmire got back to me to say that we can remove the <code>itemCollectionPlugin</code> and <code>HasBitstreamsSSIPlugin</code> beans from DSpace&rsquo;s <code>discovery.xml</code> file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore</li>
<li>I removed both those beans and did some simple tests to check item submission, media-filter of PDFs, REST API, but got an error &ldquo;No matches for the query&rdquo; when listing records in OAI</li>
<li>This warning appears in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2018-06-26 16:58:12,052 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
<li>It&rsquo;s actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting</li>
<li>Ah, I think I just need to run <code>dspace oai import</code></li>
</ul>
<h2 id="2018-06-27">2018-06-27</h2>
<ul>
<li>Vika from CIFOR sent back his annotations on the duplicates for the &ldquo;CIFOR_May_9&rdquo; archive import that I sent him last week</li>
<li>I&rsquo;ll have to figure out how to separate those we&rsquo;re keeping, deleting, and mapping into CIFOR&rsquo;s archive collection</li>
<li>First, get the 62 deletes from Vika&rsquo;s file and remove them from the collection:</li>
</ul>
<pre tabindex="0"><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E &#39;[0-9]{5}\/[0-9]{5}&#39; &gt; cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2461 10568-92904.csv
$ while read line; do sed -i &#34;\#$line#d&#34; 10568-92904.csv; done &lt; cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2399 10568-92904.csv
</code></pre><ul>
<li>This iterates over the handles for deletion and uses <code>sed</code> with an alternative pattern delimiter of &lsquo;#&rsquo; (the opening delimiter must be escaped as <code>\#</code>), because the pattern itself contains a &lsquo;/&rsquo;</li>
<li>The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:</li>
</ul>
<pre tabindex="0"><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E &#39;[0-9]{5}\/[0-9]{5}&#39; &gt; cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
</code></pre><ul>
<li>I can either get them from the database, or programmatically export the metadata using <code>dspace metadata-export -i 10568/xxxxx</code>&hellip;</li>
<li>Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the <code>id</code> and <code>collection</code> columns using <a href="https://csvkit.readthedocs.io/">csvkit</a>:</li>
</ul>
<pre tabindex="0"><code>$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done &lt; /tmp/cifor-handle-to-map.txt
$ sed &#39;/^id/d&#39; 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
</code></pre><ul>
<li>Then I can use OpenRefine to add the &ldquo;CIFOR Archive&rdquo; collection to the mappings</li>
<li>Importing the 2,398 items via <code>dspace metadata-import</code> ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000</li>
<li>After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch</li>
<li>I&rsquo;ll let Abenet take one last look and then move them to CGSpace</li>
</ul>
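<ul>
<li>A minimal sketch of the batching idea for <code>dspace metadata-import</code>, assuming the CSV fits in memory and that every batch needs to repeat the header row; the <code>split_csv</code> helper is hypothetical and not part of DSpace:</li>
</ul>
<pre tabindex="0"><code>import csv
import io


def split_csv(text, batch_size=1000):
    """Split CSV text into chunks of at most batch_size data rows.

    Every chunk starts with a copy of the original header row.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    batches = []
    for start in range(0, len(data), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(data[start:start + batch_size])
        batches.append(out.getvalue())
    return batches
</code></pre><ul>
<li>With 2,398 data rows and the default batch size this should yield three batches (1,000, 1,000, and 398 rows), each of which could be written to its own file and imported separately</li>
</ul>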
<h2 id="2018-06-28">2018-06-28</h2>
<ul>
<li>DSpace Test appears to have crashed last night</li>
<li>There is nothing in the Tomcat or DSpace logs, but I see the following in <code>dmesg -T</code>:</li>
</ul>
<pre tabindex="0"><code>[Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
[Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
[Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>Look over IITA&rsquo;s <a href="https://dspacetest.cgiar.org/handle/10568/96071">IITA_Jan_9_II_Ab</a> collection from earlier this month on DSpace Test</li>
<li>Bosede fixed a few things (and seems to have removed many French IITA subjects like <code>AMÉLIORATION DES PLANTES</code> and <code>SANTÉ DES PLANTES</code>)</li>
<li>I still see at least one issue with author affiliations, and I didn&rsquo;t bother to check the AGROVOC subjects because it&rsquo;s such a mess anyway</li>
<li>I suggested that IITA provide an updated list of subjects to us so we can include their controlled vocabulary in CGSpace, which would also make it easier to do automated validation</li>
</ul>
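<ul>
<li>A rough sketch of what automated validation against a controlled vocabulary could look like, assuming subjects arrive as a single multi-value field separated by <code>||</code> and that matching should ignore case; the <code>invalid_subjects</code> helper and the example vocabulary terms are hypothetical:</li>
</ul>
<pre tabindex="0"><code>def invalid_subjects(field_value, vocabulary, separator="||"):
    """Return the subjects in a multi-value field that are not in the vocabulary.

    Comparison is case-insensitive and ignores surrounding whitespace.
    """
    allowed = {term.strip().upper() for term in vocabulary}
    subjects = [s.strip() for s in field_value.split(separator) if s.strip()]
    return [s for s in subjects if s.upper() not in allowed]


# Hypothetical vocabulary; a real list would have to come from IITA
vocabulary = ["PLANT BREEDING", "PLANT HEALTH"]
rejected = invalid_subjects("PLANT BREEDING||AMÉLIORATION DES PLANTES", vocabulary)
</code></pre><ul>
<li>Here <code>rejected</code> would contain only <code>AMÉLIORATION DES PLANTES</code>, which is the kind of value that could then be flagged for correction before import</li>
</ul>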
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2022-03/">March, 2022</a></li>
<li><a href="/cgspace-notes/2022-02/">February, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>