<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="March, 2018" />
<meta property="og:description" content="2018-03-02
Export a CSV of the IITA community metadata for Martin Mueller
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-03/" />
<meta property="article:published_time" content="2018-03-02T16:07:54+02:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2018"/>
<meta name="twitter:description" content="2018-03-02
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.60.1" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2018",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2018-03\/",
"wordCount": "2960",
"datePublished": "2018-03-02T16:07:54+02:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-03/">
<title>March, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2018-03/">March, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-03-02T16:07:54&#43;02:00">Fri Mar 02, 2018</time> by Alan Orth in
<i class="fa fa-folder" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
</p>
</header>
<h2 id="20180302">2018-03-02</h2>
<ul>
<li>Export a CSV of the IITA community metadata for Martin Mueller</li>
</ul>
<h2 id="20180306">2018-03-06</h2>
<ul>
<li>Add three new CCAFS project tags to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/357">#357</a>)</li>
<li>Andrea from Macaroni Bros had sent me an email that CCAFS needs them</li>
<li>Give Udana more feedback on his WLE records from last month</li>
<li>There were some records using a non-breaking space in their AGROVOC subject field</li>
<li>I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
</code></pre><ul>
<li>This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character</li>
<li>Add new CRP subject &ldquo;GRAIN LEGUMES AND DRYLAND CEREALS&rdquo; to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/358">#358</a>)</li>
<li>Merge the ORCID integration stuff in to <code>5_x-prod</code> for deployment on CGSpace soon (<a href="https://github.com/ilri/DSpace/pull/359">#359</a>)</li>
<li>Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server</li>
<li>Run all system updates on DSpace Test and reboot server</li>
<li>I ran the <a href="https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c">orcid-authority-to-item.py</a> script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata</li>
</ul>
<pre><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
</code></pre><ul>
<li>I ran the DSpace cleanup script on CGSpace and it threw an error (as always):</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(150659) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
UPDATE 1
</code></pre><ul>
<li>Apply the proposed PostgreSQL indexes from DS-3636 (pull request <a href="https://github.com/DSpace/DSpace/pull/1791/">#1791</a>) on CGSpace (linode18)</li>
</ul>
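<ul>
<li>Since the cleanup error and its fix recur, here is a minimal Python sketch that pulls the bitstream IDs out of the PostgreSQL error text and builds the same <code>update bundle</code> statement used above. The function name <code>bundle_fix_sql</code> is hypothetical, and the sketch only prints the SQL rather than running it:</li>
</ul>

```python
import re

# Matches the "Detail:" line PostgreSQL prints with the foreign key error.
KEY_PATTERN = re.compile(r"Key \(bitstream_id\)=\((\d+)\)")


def bundle_fix_sql(error_text):
    """Build the UPDATE that clears primary_bitstream_id for each bitstream in the error.

    Hypothetical helper: returns None when no bitstream ID is found.
    """
    ids = sorted({int(match) for match in KEY_PATTERN.findall(error_text)})
    if not ids:
        return None
    id_list = ", ".join(str(bitstream_id) for bitstream_id in ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


error = 'Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".'
print(bundle_fix_sql(error))
```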
<h2 id="20180307">2018-03-07</h2>
<ul>
<li>Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (<a href="https://github.com/ilri/DSpace/pull/360">#360</a>)</li>
<li>Help Sisay proof 200 IITA records on DSpace Test</li>
<li>Finally import Udana's 24 items to <a href="https://cgspace.cgiar.org/handle/10568/36185">IWMI Journal Articles</a> on CGSpace</li>
<li>Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc</li>
</ul>
<h2 id="20180308">2018-03-08</h2>
<ul>
<li>Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata</li>
<li>This makes the CSV have tons of columns, for example <code>dc.title</code>, <code>dc.title[]</code>, <code>dc.title[en]</code>, <code>dc.title[eng]</code>, <code>dc.title[en_US]</code> and so on!</li>
<li>I think I can fix, or at least normalize, them in the database:</li>
</ul>
<pre><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
ethnob
en
spa
EN
En
en_
en_US
E.
EN_US
en_U
eng
fr
es_ES
es
(16 rows)
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
ethnob
en_US
spa
E.
fr
es_ES
es
(9 rows)
</code></pre><ul>
<li>On second inspection it looks like <code>dc.description.provenance</code> fields use the text_lang &ldquo;en&rdquo; so that's probably why there are over 100,000 fields changed&hellip;</li>
<li>If I skip that, there are about 2,000, which seems more like the number of fields that users have edited manually, or fucked up during CSV import, etc:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
UPDATE 2309
</code></pre><ul>
<li>I will apply this on CGSpace right now</li>
<li>In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine</li>
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
</ul>
<pre><code>or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
</code></pre><ul>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>
<pre><code>if(isBlank(value), &quot;Hernan Ceballos: 0000-0002-8744-7918&quot;, value + &quot;||Hernan Ceballos: 0000-0002-8744-7918&quot;)
</code></pre><ul>
<li>One thing that bothers me is that this won't honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code></li>
<li>I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching <code>cg.creator.id</code> fields: <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a></li>
<li>The CSV should have two columns: author name and ORCID identifier:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Orth, Alan&quot;,Alan S. Orth: 0000-0002-1735-7458
&quot;Orth, A.&quot;,Alan S. Orth: 0000-0002-1735-7458
</code></pre><ul>
<li>I didn't integrate the ORCID API lookup for author names in this script for now because I was only interested in &ldquo;tagging&rdquo; old items for a few given authors</li>
<li>I added ORCID identifiers for 187 items by CIAT's Hernan Ceballos, because that is what Elizabeth was trying to do manually!</li>
<li>Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well</li>
</ul>
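<ul>
<li>The two-column CSV and the set-or-append conditional from this entry can be sketched in Python as follows. This is only an illustration and not the code of the linked script: the helper names <code>load_orcid_map</code> and <code>add_creator_id</code> are hypothetical, the <code>||</code> separator is taken from the GREL example, and skipping duplicates is an added assumption:</li>
</ul>

```python
import csv
import io


def load_orcid_map(csv_file):
    """Map each author name variant to its "Name: ORCID" value."""
    reader = csv.DictReader(csv_file)
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


def add_creator_id(existing, creator_id, separator="||"):
    """Mirror the GREL conditional: set the value if blank, otherwise append it."""
    if existing is None or existing.strip() == "":
        return creator_id
    if creator_id in existing.split(separator):
        # Assumption: do not add the same identifier twice (the GREL version would).
        return existing
    return existing + separator + creator_id


# Same sample rows as the CSV shown in this entry.
sample = io.StringIO(
    "dc.contributor.author,cg.creator.id\n"
    '"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458\n'
    '"Orth, A.",Alan S. Orth: 0000-0002-1735-7458\n'
)
orcids = load_orcid_map(sample)
print(add_creator_id("", orcids["Orth, A."]))
print(add_creator_id("Hernan Ceballos: 0000-0002-8744-7918", orcids["Orth, Alan"]))
```

<ul>
<li>Like the OpenRefine approach, this sketch appends to the end of the value and so does not honor author order</li>
</ul>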
<h2 id="20180309">2018-03-09</h2>
<ul>
<li>Give James Stapleton input on Sisay's KRAs</li>
<li>Create a pull request to disable ORCID authority integration for <code>dc.contributor.author</code> in the submission forms and XMLUI display (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
</ul>
<h2 id="20180311">2018-03-11</h2>
<ul>
<li>Peter also wrote to say he is having issues with the Atmire Listings and Reports module</li>
<li>When I logged in to try it I got a blank white page after continuing, and I saw this in dspace.log.2018-03-11:</li>
</ul>
<pre><code>2018-03-11 11:38:15,592 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.org/jspui/listings-and-reports
-- Method: POST
-- Parameters were:
-- selected_admin_preset: &quot;ilri authors2&quot;
-- load: &quot;normal&quot;
-- next: &quot;NEXT STEP &gt;&gt;&quot;
-- step: &quot;1&quot;
org.apache.jasper.JasperException: java.lang.NullPointerException
</code></pre><ul>
<li>Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn't find them</li>
<li>I made a quick fix and it's working now (<a href="https://github.com/ilri/DSpace/pull/364">#364</a>)</li>
</ul>
<h2 id="20180312">2018-03-12</h2>
<ul>
<li>Increase upload size on CGSpace's nginx config to 85MB so Sisay can upload some data</li>
</ul>
<h2 id="20180313">2018-03-13</h2>
<ul>
<li>I created a new Linode server for DSpace Test (linode6623840) so I could try the block storage stuff, but when I went to add a 300GB volume it said that block storage capacity was exceeded in that datacenter (Newark, NJ)</li>
<li>I deleted the Linode and created another one (linode6624164) in the Fremont, CA region</li>
<li>After that I deployed the Ubuntu 16.04 image and attached a 300GB block storage volume to the image</li>
<li>Magdalena wrote to ask why there was no Altmetric donut for an item on CGSpace, but there was one on the related CCAFS publication page</li>
<li>It looks like the CCAFS publications page fetches the donut using its DOI, whereas CGSpace queries via Handle</li>
<li>I will write to Altmetric support and ask them, as perhaps it's part of a larger issue</li>
<li>CGSpace item: <a href="https://cgspace.cgiar.org/handle/10568/89643">https://cgspace.cgiar.org/handle/10568/89643</a></li>
<li>CCAFS publication page: <a href="https://ccafs.cgiar.org/publications/can-scenario-planning-catalyse-transformational-change-evaluating-climate-change-policy">https://ccafs.cgiar.org/publications/can-scenario-planning-catalyse-transformational-change-evaluating-climate-change-policy</a></li>
<li>Peter tweeted the Handle link and now Altmetric shows the donut for both the DOI and the Handle</li>
</ul>
<h2 id="20180314">2018-03-14</h2>
<ul>
<li>Help Abenet with a troublesome Listings and Reports question for CIAT author Steve Beebe</li>
<li>Continue migrating DSpace Test to the new server (linode6624164)</li>
<li>I emailed ILRI service desk to update the DNS records for dspacetest.cgiar.org</li>
<li>Abenet was having problems saving Listings and Reports configurations or layouts but I tested it and it works</li>
</ul>
<h2 id="20180315">2018-03-15</h2>
<ul>
<li>Help Abenet troubleshoot the Listings and Reports issue again</li>
<li>It looks like it's an issue with the layouts, if you create a new layout that only has one type (<code>dc.identifier.citation</code>):</li>
</ul>
<p><img src="/cgspace-notes/2018/03/layout-only-citation.png" alt="Listings and Reports layout"></p>
<ul>
<li>The error in the DSpace log is:</li>
</ul>
<pre><code>org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
</code></pre><ul>
<li>The full error is here: <a href="https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca">https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca</a></li>
<li>If I do a report for &ldquo;Orth, Alan&rdquo; with the same custom layout it works!</li>
<li>I submitted a ticket to Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589</a></li>
<li>Small fix to the example citation text in Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/365">#365</a>)</li>
</ul>
<h2 id="20180316">2018-03-16</h2>
<ul>
<li>ICT made the DNS updates for dspacetest.cgiar.org late last night</li>
<li>I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164</li>
<li>Looking at the CRP subjects on CGSpace I see there is one blank one so I'll just fix it:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
</code></pre><ul>
<li>Copy all CRP subjects to a CSV to do the mass updates:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
COPY 21
</code></pre><ul>
<li>Once I prepare the new input forms (<a href="https://github.com/ilri/DSpace/issues/362">#362</a>) I will need to do the batch corrections:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
</code></pre><ul>
<li>Create a pull request to update the input forms for the new CRP subject style (<a href="https://github.com/ilri/DSpace/pull/366">#366</a>)</li>
</ul>
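<ul>
<li>Before running batch corrections it can help to preview what a corrections CSV would change. A minimal Python sketch of such a dry run follows. It assumes the CSV has one column named after the field (<code>cg.contributor.crp</code>) and one named <code>correct</code>, inferred from the <code>-f</code> and <code>-t</code> options above. The function name <code>plan_corrections</code> and the sample values are hypothetical, and this is not the code of <code>fix-metadata-values.py</code>:</li>
</ul>

```python
import csv
import io


def plan_corrections(csv_file, from_field, to_field):
    """Yield (old, new) pairs for rows whose corrected value differs from the original."""
    for row in csv.DictReader(csv_file):
        old, new = row[from_field], row[to_field]
        if new and new != old:
            yield old, new


# Hypothetical sample rows; these are placeholders, not the real CRP corrections.
sample = io.StringIO(
    "cg.contributor.crp,correct\n"
    "OLD CRP NAME,New CRP Name\n"
    "UNCHANGED NAME,UNCHANGED NAME\n"
)
for old, new in plan_corrections(sample, "cg.contributor.crp", "correct"):
    print(f"Would fix: {old!r} -> {new!r}")
```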
<h2 id="20180319">2018-03-19</h2>
<ul>
<li>Tezira has been having problems accessing CGSpace from the ILRI Nairobi campus since last week</li>
<li>She is getting an HTTPS error apparently</li>
<li>It's working outside, and Ethiopian users seem to be having no issues so I've asked ICT to have a look</li>
<li>CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat</li>
<li>Around that time there was an increase in SQL errors:</li>
</ul>
<pre><code>2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
</code></pre><ul>
<li>But I don't even know what these errors mean, because a handful of them happen every day:</li>
</ul>
<pre><code>$ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
dspace.log.2018-03-10:13
dspace.log.2018-03-11:15
dspace.log.2018-03-12:13
dspace.log.2018-03-13:13
dspace.log.2018-03-14:14
dspace.log.2018-03-15:13
dspace.log.2018-03-16:13
dspace.log.2018-03-17:13
dspace.log.2018-03-18:15
dspace.log.2018-03-19:90
</code></pre><ul>
<li>There wasn't even a lot of traffic at the time (8–9 AM):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Mar/2018:0[89]:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.197
92 83.103.94.48
96 40.77.167.175
116 207.46.13.178
122 66.249.66.153
140 95.108.181.88
196 213.55.99.121
206 197.210.168.174
207 104.196.152.243
294 54.198.169.202
</code></pre><ul>
<li>Well there is a hint in Tomcat's <code>catalina.out</code>:</li>
</ul>
<pre><code>Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So someone was doing something heavy somehow&hellip; my guess is content and usage stats!</li>
<li>ICT responded that they &ldquo;fixed&rdquo; the CGSpace connectivity issue in Nairobi without telling me the problem</li>
<li>When I asked, Robert Okal said CGNET messed up when updating the DNS for cgspace.cgiar.org last week</li>
<li>I told him that my request last week was for dspacetest.cgiar.org, not cgspace.cgiar.org!</li>
<li>So they updated the wrong fucking DNS records</li>
<li>Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export</li>
<li>It appears to be this one: <a href="https://cgspace.cgiar.org/handle/10568/83473?show=full">https://cgspace.cgiar.org/handle/10568/83473?show=full</a></li>
<li>The title is &ldquo;Untitled&rdquo; and there is some metadata but indeed the citation is missing</li>
<li>I don't know what would cause that</li>
</ul>
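<ul>
<li>The nginx traffic check earlier in this entry (<code>grep</code>, <code>awk</code>, <code>sort</code>, <code>uniq -c</code>) can also be sketched in Python. It counts requests per client IP for lines stamped in the same two-hour window. The function name <code>top_clients</code> and the sample log lines are hypothetical, and it assumes the client IP is the first field, as in nginx's default log format:</li>
</ul>

```python
import re
from collections import Counter

# Same window as the grep -E pattern in the shell pipeline: hours 08 and 09 on 19 March.
HOUR_PATTERN = re.compile(r"19/Mar/2018:0[89]:")


def top_clients(lines, limit=10):
    """Count requests per client IP (first field) for lines inside the time window."""
    counts = Counter(
        line.split()[0]
        for line in lines
        if line.strip() and HOUR_PATTERN.search(line)
    )
    # Busiest clients first, unlike "sort -n | tail", which lists them last.
    return counts.most_common(limit)


# Hypothetical log lines for illustration only.
sample = [
    '54.198.169.202 - - [19/Mar/2018:08:01:10 +0000] "GET / HTTP/1.1" 200 512',
    '54.198.169.202 - - [19/Mar/2018:09:30:00 +0000] "GET / HTTP/1.1" 200 512',
    '104.196.152.243 - - [19/Mar/2018:09:59:59 +0000] "GET / HTTP/1.1" 200 512',
    '66.249.66.153 - - [19/Mar/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
]
print(top_clients(sample))
```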
<h2 id="20180320">2018-03-20</h2>
<ul>
<li>DSpace Test has been down for a few hours with SQL and memory errors starting this morning:</li>
</ul>
<pre><code>2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I have no idea why it crashed</li>
<li>I ran all system updates and rebooted it</li>
<li>Abenet told me that one of Lance Robinson's ORCID iDs on CGSpace is incorrect</li>
<li>I will remove it from the controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/367">#367</a>) and update any items using the old one:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
UPDATE 1
</code></pre><ul>
<li>Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits</li>
<li>Merge the changes to CRP names to the <code>5_x-prod</code> branch and deploy on CGSpace (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
<li>Run corrections for CRP names in the database:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Run all system updates on CGSpace (linode18) and reboot the server</li>
<li>I started a full Discovery re-index on CGSpace because of the updated CRPs</li>
<li>I see this error in the DSpace log:</li>
</ul>
<pre><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &quot;dc_contributor_author&quot;.
java.lang.IllegalArgumentException: No choices plugin was configured for field &quot;dc_contributor_author&quot;.
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
</code></pre><ul>
<li>I have to figure that one out&hellip;</li>
</ul>
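<ul>
<li>Related to the ORCID correction earlier in this entry: ORCID identifiers end in an ISO 7064 MOD 11-2 check character, so typos can be caught before values reach the database. The Python sketch below uses the hypothetical helper names <code>orcid_check_digit</code> and <code>is_valid_orcid</code>. Both the old and the new identifier from the <code>update</code> statement pass the check, so a checksum only guards against typing mistakes and cannot tell whether an identifier belongs to the right person:</li>
</ul>

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, digits, and check character of an identifier like 0000-0002-5224-8644."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


# The corrected and the replaced identifier from this entry.
for identifier in ("0000-0002-5224-8644", "0000-0002-6344-195X"):
    print(identifier, is_valid_orcid(identifier))
```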
<h2 id="20180321">2018-03-21</h2>
<ul>
<li>Looks like the indexing gets confused because there is still data in the <code>authority</code> column</li>
<li>Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!</li>
<li>Since we've migrated the ORCID identifiers associated with the authority data to the <code>cg.creator.id</code> field we can nullify the authorities remaining in the database:</li>
</ul>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql">dspace<span style="color:#f92672">=</span><span style="color:#f92672">#</span> <span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> authority<span style="color:#f92672">=</span><span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">WHERE</span> resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#66d9ef">AND</span> metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span> <span style="color:#66d9ef">AND</span> authority <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NOT</span> <span style="color:#66d9ef">NULL</span>;
<span style="color:#66d9ef">UPDATE</span> <span style="color:#ae81ff">195463</span>
</code></pre></div><ul>
<li>After this the indexing works as usual and item counts and facets are back to normal</li>
<li>Send Peter a list of all authors to correct:</li>
</ul>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql">dspace<span style="color:#f92672">=</span><span style="color:#f92672">#</span> <span style="color:#960050;background-color:#1e0010">\</span><span style="color:#66d9ef">copy</span> (<span style="color:#66d9ef">select</span> <span style="color:#66d9ef">distinct</span> text_value, <span style="color:#66d9ef">count</span>(<span style="color:#f92672">*</span>) <span style="color:#66d9ef">as</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">from</span> metadatavalue <span style="color:#66d9ef">where</span> metadata_field_id <span style="color:#f92672">=</span> (<span style="color:#66d9ef">select</span> metadata_field_id <span style="color:#66d9ef">from</span> metadatafieldregistry <span style="color:#66d9ef">where</span> element <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;</span><span style="color:#e6db74">contributor</span><span style="color:#e6db74">&#39;</span> <span style="color:#66d9ef">and</span> qualifier <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;</span><span style="color:#e6db74">author</span><span style="color:#e6db74">&#39;</span>) <span style="color:#66d9ef">AND</span> resource_type_id <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">group</span> <span style="color:#66d9ef">by</span> text_value <span style="color:#66d9ef">order</span> <span style="color:#66d9ef">by</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">desc</span>) <span style="color:#66d9ef">to</span> <span style="color:#f92672">/</span>tmp<span style="color:#f92672">/</span>authors.csv <span style="color:#66d9ef">with</span> csv header;
<span style="color:#66d9ef">COPY</span> <span style="color:#ae81ff">56156</span>
</code></pre></div><ul>
<li>Afterwards we'll want to do some batch tagging of ORCID identifiers to these names</li>
<li>CGSpace crashed again this afternoon, I'm not sure of the cause but there are a lot of SQL errors in the DSpace log:</li>
</ul>
<pre><code>2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection has already been closed.
</code></pre><ul>
<li>I have no idea why so many connections were abandoned this afternoon:</li>
</ul>
<pre><code># grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
268
</code></pre><ul>
<li>DSpace Test crashed again due to Java heap space, this is from the DSpace log:</li>
</ul>
<pre><code>2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>And this is from the Tomcat Catalina log:</li>
</ul>
<pre><code>Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>But there are tons of heap space errors on DSpace Test actually:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
319
</code></pre><ul>
<li>I guess we need to give it more RAM because it now has CGSpace's large Solr core</li>
<li>I will increase the memory from 3072m to 4096m</li>
<li>Update <a href="https://github.com/ilri/rmg-ansible-public">Ansible playbooks</a> to use <a href="https://jdbc.postgresql.org/">PostgreSQL JDBC driver</a> 42.2.2</li>
<li>Deploy the new JDBC driver on DSpace Test</li>
<li>I'm also curious to see how long the <code>dspace index-discovery -b</code> takes on DSpace Test where the DSpace installation directory is on one of Linode's new block storage volumes</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 208m19.155s
user 8m39.138s
sys 2m45.135s
</code></pre><ul>
<li>So that's about three times as long as it took on CGSpace this morning</li>
<li>I should also check the raw read speed with <code>hdparm -tT /dev/sdc</code></li>
<li>Looking at Peter's author corrections there are some mistakes due to Windows 1252 encoding</li>
<li>I need to find a way to filter these easily with OpenRefine</li>
<li>For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields</li>
<li>I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:</li>
</ul>
<pre><code>isNotNull(value.match(/.*\ufffd.*/))
</code></pre><ul>
<li>I need to be able to add many common characters though so that it is useful to copy and paste into a new project to find issues</li>
</ul>
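<ul>
<li>As a rough cross-check outside OpenRefine, here is a minimal Python sketch of the same idea: it shows how Windows-1252 bytes decoded as UTF-8 end up as U+FFFD, and reports each suspicious character by position and Unicode name. The <code>SUSPICIOUS</code> set and the <code>find_suspicious</code> helper are my own illustrative names, and the set is an assumption rather than a complete list:</li>
</ul>

```python
import unicodedata

# Characters that usually signal an encoding problem rather than a real name.
# This set is an illustrative assumption, not an exhaustive list.
SUSPICIOUS = {
    "\ufffd",  # REPLACEMENT CHARACTER
    "\u00a0",  # NO-BREAK SPACE
    "\u200a",  # HAIR SPACE
}


def find_suspicious(value):
    """Return (index, code point, Unicode name) for each suspicious character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(value)
        if ch in SUSPICIOUS
    ]


# Windows-1252 bytes decoded as UTF-8 turn an accented letter into U+FFFD
mangled = "José".encode("cp1252").decode("utf-8", errors="replace")
print(find_suspicious(mangled))
```

<ul>
<li>Adding another character to check is then just one more entry in the set, which is easier to maintain than a growing list of regular expressions</li>
</ul>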
<h2 id="20180322">2018-03-22</h2>
<ul>
<li>Add ORCID identifier for Silvia Alonso</li>
<li>Update my Mirage 2 setup notes for Ubuntu 18.04: <a href="https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5">https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5</a></li>
</ul>
<h2 id="20180324">2018-03-24</h2>
<ul>
<li>More work on the Ubuntu 18.04 readiness stuff for the <a href="https://github.com/ilri/rmg-ansible-public">Ansible playbooks</a></li>
<li>The playbook now uses the system's Ruby and Node.js so I don't have to manually install RVM and NVM after</li>
</ul>
<h2 id="20180325">2018-03-25</h2>
<ul>
<li>Looking at Peter's author corrections and trying to work out a way to find errors in OpenRefine easily</li>
<li>I can find all names that contain acceptable characters using a GREL expression like:</li>
</ul>
<pre><code>isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
</code></pre><ul>
<li>But it's probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):</li>
</ul>
<pre><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/))
)
</code></pre><ul>
<li>And here's one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it's time to add delete support to my <code>fix-metadata-values.py</code> script):</li>
</ul>
<pre><code>or(
isNotNull(value.match(/.*delete.*/i)),
isNotNull(value.match(/.*remove.*/i)),
isNotNull(value.match(/.*check.*/i))
)
</code></pre><ul>
<li>
<p>So I guess the routine in OpenRefine is:</p>
<ul>
<li>Transform: trim leading/trailing whitespace</li>
<li>Transform: collapse consecutive whitespace</li>
<li>Custom text facet for items to delete/check</li>
<li>Custom text facet for illegal characters</li>
</ul>
</li>
<li>
<p>Test the corrections and deletions locally, then run them on CGSpace:</p>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
</code></pre><ul>
<li>Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test</li>
<li>CGSpace took 76m28.292s</li>
<li>DSpace Test took 194m56.048s</li>
</ul>
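<ul>
<li>The OpenRefine routine from today (trim, collapse whitespace, flag delete/check markers, flag illegal characters) could also be sketched in Python for checking a CSV before it goes near the database. This is only a sketch under my own assumptions: <code>clean</code> and <code>classify</code> are hypothetical helpers, not part of <code>fix-metadata-values.py</code>, and the patterns simply mirror the GREL facets above:</li>
</ul>

```python
import re

# Patterns mirroring the two custom text facets in these notes
FLAGGED = re.compile(r"delete|remove|check", re.IGNORECASE)
ILLEGAL = re.compile(r"[(|)\uFFFD\u00A0\u200A]")

# Only ASCII whitespace is collapsed on purpose: Python's \s and str.strip()
# also match U+00A0 and U+200A, which would hide the characters we want to flag.
ASCII_WS = re.compile(r"[ \t\r\n]+")


def clean(value):
    """Collapse runs of ASCII whitespace and trim both ends."""
    return ASCII_WS.sub(" ", value).strip(" ")


def classify(value):
    """Return the cleaned value plus the facets it would fall into."""
    cleaned = clean(value)
    return {
        "value": cleaned,
        "flagged": bool(FLAGGED.search(cleaned)),
        "illegal": bool(ILLEGAL.search(cleaned)),
    }


print(classify("  Doe,\t Jane (delete) "))
```

<ul>
<li>Rows with <code>flagged</code> set would go to the deletion/check CSV and rows with <code>illegal</code> set would need manual correction first</li>
</ul>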
<h2 id="20180326">2018-03-26</h2>
<ul>
<li>Atmire got back to me about the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589">Listings and Reports issue</a> and said it's caused by items that have missing <code>dc.identifier.citation</code> fields</li>
<li>They will send a fix</li>
</ul>
<h2 id="20180327">2018-03-27</h2>
<ul>
<li>Atmire got back with an updated quote about the DSpace 5.8 compatibility so I've forwarded it to Peter</li>
</ul>
<h2 id="20180328">2018-03-28</h2>
<ul>
<li>DSpace Test crashed due to heap space so I've increased it from 4096m to 5120m</li>
<li>The error in Tomcat's <code>catalina.out</code> was:</li>
</ul>
<pre><code>Exception in thread &quot;RMI TCP Connection(idle)&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Add ISI Journal (<code>cg.isijournal</code>) as an option in Atmire's Listings and Reports layout (<a href="https://github.com/ilri/DSpace/pull/370">#370</a>) for Abenet</li>
<li>I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
Fixed 31 occurences of: HUMIDTROPICS
Fixed 21 occurences of: MAIZE
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
</code></pre><ul>
<li>That's weird because we just updated them last week&hellip;</li>
<li>Create a pull request to enable searching by ORCID identifier (<code>cg.creator.id</code>) in Discovery and Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/371">#371</a>)</li>
<li>I will test it on DSpace Test first!</li>
<li>Fix one missing XMLUI string for &ldquo;Access Status&rdquo; (cg.identifier.status)</li>
<li>Run all system updates on DSpace Test and reboot the machine</li>
</ul>
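<ul>
<li>To catch the old capitalized CRP values before they pile up again, a check like the following might help. It is a minimal sketch: <code>is_old_style</code> is a hypothetical helper, and the title-case values in the sample list are illustrative assumptions rather than values taken from the database:</li>
</ul>

```python
def is_old_style(value):
    """True when every cased character is uppercase, e.g. 'GENEBANKS'."""
    return value.isupper()


# Sample values: the uppercase ones appear in the output above,
# the title-case ones are hypothetical examples of the newer formatting.
values = [
    "GENEBANKS",
    "Maize",
    "WATER, LAND AND ECOSYSTEMS",
    "Water, Land and Ecosystems",
]

old_style = [v for v in values if is_old_style(v)]
print(old_style)
```

<ul>
<li>Running something like this against an export of <code>cg.contributor.crp</code> would show whether new submissions are still coming in with the old formatting</li>
</ul>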
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2019-12/">December, 2019</a></li>
<li><a href="/cgspace-notes/2019-11/">November, 2019</a></li>
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
<li><a href="/cgspace-notes/2019-10/">October, 2019</a></li>
<li><a href="/cgspace-notes/2019-09/">September, 2019</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>