<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="March, 2018" />
<meta property="og:description" content="2018-03-02
Export a CSV of the IITA community metadata for Martin Mueller
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-03/" />
<meta property="article:published_time" content="2018-03-02T16:07:54+02:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2018"/>
<meta name="twitter:description" content="2018-03-02
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.133.1">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-03/",
"wordCount": "2960",
"datePublished": "2018-03-02T16:07:54+02:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-03/">
<title>March, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2018-03/">March, 2018</a></h2>
<p class="blog-post-meta">
<time datetime="2018-03-02T16:07:54+02:00">Fri Mar 02, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2018-03-02">2018-03-02</h2>
<ul>
<li>Export a CSV of the IITA community metadata for Martin Mueller</li>
</ul>
<h2 id="2018-03-06">2018-03-06</h2>
<ul>
<li>Add three new CCAFS project tags to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/357">#357</a>)</li>
<li>Andrea from Macaroni Bros had sent me an email that CCAFS needs them</li>
<li>Give Udana more feedback on his WLE records from last month</li>
<li>There were some records using a non-breaking space in their AGROVOC subject field</li>
<li>I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3
</code></pre><ul>
<li>This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character</li>
<li>Add new CRP subject &ldquo;GRAIN LEGUMES AND DRYLAND CEREALS&rdquo; to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/358">#358</a>)</li>
<li>Merge the ORCID integration stuff in to <code>5_x-prod</code> for deployment on CGSpace soon (<a href="https://github.com/ilri/DSpace/pull/359">#359</a>)</li>
<li>Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server</li>
<li>Run all system updates on DSpace Test and reboot server</li>
<li>I ran the <a href="https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c">orcid-authority-to-item.py</a> script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata</li>
</ul>
<pre tabindex="0"><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p &#39;fuuu&#39; -s http://localhost:8081/solr -d
</code></pre><ul>
<li>I ran the DSpace cleanup script on CGSpace and it threw an error (as always):</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(150659) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);&#39;
UPDATE 1
</code></pre><ul>
<li>Apply the proposed PostgreSQL indexes from DS-3636 (pull request <a href="https://github.com/DSpace/DSpace/pull/1791/">#1791</a>) on CGSpace (linode18)</li>
</ul>
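<p>The cleanup fix above follows a fixed pattern: read the bitstream ID out of the error detail, then null the matching <code>primary_bitstream_id</code>. A minimal Python sketch of that step, assuming the error text always has the shape shown above. The function name <code>build_cleanup_fix</code> is hypothetical, and the sketch only builds the SQL string rather than running it:</p>

```python
import re

# Matches the "Detail:" line of the foreign key error shown in the log above.
DETAIL_RE = re.compile(
    r'Key \(bitstream_id\)=\((\d+)\) is still referenced from table "bundle"'
)


def build_cleanup_fix(error_text):
    """Return an UPDATE statement for every bitstream ID named in the error, or None."""
    ids = sorted({int(match) for match in DETAIL_RE.findall(error_text)})
    if not ids:
        return None
    id_list = ", ".join(str(i) for i in ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


sample_error = (
    'Error: ERROR: update or delete on table "bitstream" violates foreign key '
    'constraint "bundle_primary_bitstream_id_fkey" on table "bundle"\n'
    'Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".'
)
print(build_cleanup_fix(sample_error))
```

<p>The statement it prints could then be passed to <code>psql dspace -c</code> as in the command above, after checking it by eye.</p>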
<h2 id="2018-03-07">2018-03-07</h2>
<ul>
<li>Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (<a href="https://github.com/ilri/DSpace/pull/360">#360</a>)</li>
<li>Help Sisay proof 200 IITA records on DSpace Test</li>
<li>Finally import Udana&rsquo;s 24 items to <a href="https://cgspace.cgiar.org/handle/10568/36185">IWMI Journal Articles</a> on CGSpace</li>
<li>Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc</li>
</ul>
<h2 id="2018-03-08">2018-03-08</h2>
<ul>
<li>Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata</li>
<li>This makes the CSV have tons of columns, for example <code>dc.title</code>, <code>dc.title[]</code>, <code>dc.title[en]</code>, <code>dc.title[eng]</code>, <code>dc.title[en_US]</code> and so on!</li>
<li>I think I can fix, or at least normalize, them in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
ethnob
en
spa
EN
En
en_
en_US
E.
EN_US
en_U
eng
fr
es_ES
es
(16 rows)
dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and text_lang in (&#39;en&#39;,&#39;EN&#39;,&#39;En&#39;,&#39;en_&#39;,&#39;EN_US&#39;,&#39;en_U&#39;,&#39;eng&#39;);
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
ethnob
en_US
spa
E.
fr
es_ES
es
(9 rows)
</code></pre><ul>
<li>On second inspection it looks like <code>dc.description.provenance</code> fields use the text_lang &ldquo;en&rdquo; so that&rsquo;s probably why there are over 100,000 fields changed&hellip;</li>
<li>If I skip that, there are about 2,000, which seems more like the number of fields users have edited manually, or fucked up during CSV import, etc:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and text_lang in (&#39;EN&#39;,&#39;En&#39;,&#39;en_&#39;,&#39;EN_US&#39;,&#39;en_U&#39;,&#39;eng&#39;);
UPDATE 2309
</code></pre><ul>
<li>I will apply this on CGSpace right now</li>
<li>In other news, I was playing with adding ORCID identifiers to a dump of CIAT&rsquo;s community via CSV in OpenRefine</li>
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
</ul>
<pre tabindex="0"><code>or(value.contains(&#39;Ceballos, Hern&#39;), value.contains(&#39;Hernández Ceballos&#39;))
</code></pre><ul>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>
<pre tabindex="0"><code>if(isBlank(value), &#34;Hernan Ceballos: 0000-0002-8744-7918&#34;, value + &#34;||Hernan Ceballos: 0000-0002-8744-7918&#34;)
</code></pre><ul>
<li>One thing that bothers me is that this won&rsquo;t honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code></li>
<li>I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching <code>cg.creator.id</code> fields: <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a></li>
<li>The CSV should have two columns: author name and ORCID identifier:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&#34;Orth, Alan&#34;,Alan S. Orth: 0000-0002-1735-7458
&#34;Orth, A.&#34;,Alan S. Orth: 0000-0002-1735-7458
</code></pre><ul>
<li>I didn&rsquo;t integrate the ORCID API lookup for author names in this script for now because I was only interested in &ldquo;tagging&rdquo; old items for a few given authors</li>
<li>I added ORCID identifiers for 187 items by CIAT&rsquo;s Hernan Ceballos, because that is what Elizabeth was trying to do manually!</li>
<li>Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well</li>
</ul>
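<p>The two-column CSV and the set-or-append conditional from the GREL expression above can be sketched together in Python. This is an illustration only, not the linked <code>add-orcid-identifiers-csv.py</code> script: the helper names <code>load_orcid_map</code> and <code>append_identifier</code> are hypothetical, blank is taken to mean <code>None</code> or an empty string, and the duplicate check is an addition that the GREL expression does not have:</p>

```python
import csv
import io

# Same two-column layout as the CSV example above.
SAMPLE_CSV = '''dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
'''


def load_orcid_map(csv_text):
    """Map each author name variant to its ORCID identifier string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


def append_identifier(existing, identifier):
    """Set the value when the cell is blank, otherwise append with the || separator."""
    if existing is None or existing == "":
        return identifier
    if identifier in existing.split("||"):
        return existing  # already tagged, so avoid a duplicate
    return existing + "||" + identifier


orcid_map = load_orcid_map(SAMPLE_CSV)
cell = append_identifier("", orcid_map["Orth, A."])
cell = append_identifier(cell, "Hernan Ceballos: 0000-0002-8744-7918")
print(cell)
```

<p>Like the GREL version, this appends identifiers in the order they are applied, so it does not honor author order either.</p>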
<h2 id="2018-03-09">2018-03-09</h2>
<ul>
<li>Give James Stapleton input on Sisay&rsquo;s KRAs</li>
<li>Create a pull request to disable ORCID authority integration for <code>dc.contributor.author</code> in the submission forms and XMLUI display (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
</ul>
<h2 id="2018-03-11">2018-03-11</h2>
<ul>
<li>Peter also wrote to say he is having issues with the Atmire Listings and Reports module</li>
<li>When I logged in to try it I got a blank white page after continuing, and I see this in <code>dspace.log.2018-03-11</code>:</li>
</ul>
<pre tabindex="0"><code>2018-03-11 11:38:15,592 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.org/jspui/listings-and-reports
-- Method: POST
-- Parameters were:
-- selected_admin_preset: &#34;ilri authors2&#34;
-- load: &#34;normal&#34;
-- next: &#34;NEXT STEP &gt;&gt;&#34;
-- step: &#34;1&#34;
org.apache.jasper.JasperException: java.lang.NullPointerException
</code></pre><ul>
<li>Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn&rsquo;t find them</li>
<li>I made a quick fix and it&rsquo;s working now (<a href="https://github.com/ilri/DSpace/pull/364">#364</a>)</li>
</ul>
<h2 id="2018-03-12">2018-03-12</h2>
<ul>
<li>Increase upload size on CGSpace&rsquo;s nginx config to 85MB so Sisay can upload some data</li>
</ul>
<h2 id="2018-03-13">2018-03-13</h2>
<ul>
<li>I created a new Linode server for DSpace Test (linode6623840) so I could try the block storage stuff, but when I went to add a 300GB volume it said that block storage capacity was exceeded in that datacenter (Newark, NJ)</li>
<li>I deleted the Linode and created another one (linode6624164) in the Fremont, CA region</li>
<li>After that I deployed the Ubuntu 16.04 image and attached a 300GB block storage volume to the image</li>
<li>Magdalena wrote to ask why there was no Altmetric donut for an item on CGSpace, but there was one on the related CCAFS publication page</li>
<li>It looks like the CCAFS publication page fetches the donut using its DOI, whereas CGSpace queries via Handle</li>
<li>I will write to Altmetric support and ask them, as perhaps it&rsquo;s part of a larger issue</li>
<li>CGSpace item: <a href="https://cgspace.cgiar.org/handle/10568/89643">https://cgspace.cgiar.org/handle/10568/89643</a></li>
<li>CCAFS publication page: <a href="https://ccafs.cgiar.org/publications/can-scenario-planning-catalyse-transformational-change-evaluating-climate-change-policy">https://ccafs.cgiar.org/publications/can-scenario-planning-catalyse-transformational-change-evaluating-climate-change-policy</a></li>
<li>Peter tweeted the Handle link and now Altmetric shows the donut for both the DOI and the Handle</li>
</ul>
<h2 id="2018-03-14">2018-03-14</h2>
<ul>
<li>Help Abenet with a troublesome Listings and Reports question for CIAT author Steve Beebe</li>
<li>Continue migrating DSpace Test to the new server (linode6624164)</li>
<li>I emailed ILRI service desk to update the DNS records for dspacetest.cgiar.org</li>
<li>Abenet was having problems saving Listings and Reports configurations or layouts but I tested it and it works</li>
</ul>
<h2 id="2018-03-15">2018-03-15</h2>
<ul>
<li>Help Abenet troubleshoot the Listings and Reports issue again</li>
<li>It looks like it&rsquo;s an issue with the layouts, if you create a new layout that only has one type (<code>dc.identifier.citation</code>):</li>
</ul>
<p><img src="/cgspace-notes/2018/03/layout-only-citation.png" alt="Listings and Reports layout"></p>
<ul>
<li>The error in the DSpace log is:</li>
</ul>
<pre tabindex="0"><code>org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
</code></pre><ul>
<li>The full error is here: <a href="https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca">https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca</a></li>
<li>If I do a report for &ldquo;Orth, Alan&rdquo; with the same custom layout it works!</li>
<li>I submitted a ticket to Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589</a></li>
<li>Small fix to the example citation text in Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/365">#365</a>)</li>
</ul>
<h2 id="2018-03-16">2018-03-16</h2>
<ul>
<li>ICT made the DNS updates for dspacetest.cgiar.org late last night</li>
<li>I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164</li>
<li>Looking at the CRP subjects on CGSpace I see there is one blank one so I&rsquo;ll just fix it:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value=&#39;&#39;;
</code></pre><ul>
<li>Copy all CRP subjects to a CSV to do the mass updates:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
COPY 21
</code></pre><ul>
<li>Once I prepare the new input forms (<a href="https://github.com/ilri/DSpace/issues/362">#362</a>) I will need to do the batch corrections:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.crp -t correct -m 230 -n -d
</code></pre><ul>
<li>Create a pull request to update the input forms for the new CRP subject style (<a href="https://github.com/ilri/DSpace/pull/366">#366</a>)</li>
</ul>
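<p>The count-per-value export above (distinct values with their counts, most frequent first, blank values removed) can be mirrored outside the database. A minimal Python sketch, assuming the values are already in a list; the sample values are made up and the function names are hypothetical:</p>

```python
import csv
import io
from collections import Counter


def count_values(values):
    """Count non-blank values and return (value, count) pairs, most frequent first."""
    counts = Counter(v for v in values if v.strip())
    return counts.most_common()


def to_csv(rows):
    """Write the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text_value", "count"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample values, not real CGSpace data.
sample = ["MAIZE", "HUMIDTROPICS", "MAIZE", "", "GENEBANKS", "MAIZE", "HUMIDTROPICS"]
print(to_csv(count_values(sample)))
```

<p>A list like this is a convenient starting point for a corrections file, since the most common variants end up at the top.</p>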
<h2 id="2018-03-19">2018-03-19</h2>
<ul>
<li>Tezira has been having problems accessing CGSpace from the ILRI Nairobi campus since last week</li>
<li>She is getting an HTTPS error apparently</li>
<li>It&rsquo;s working outside, and Ethiopian users seem to be having no issues so I&rsquo;ve asked ICT to have a look</li>
<li>CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat</li>
<li>Around that time there was an increase in SQL errors:</li>
</ul>
<pre tabindex="0"><code>2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
</code></pre><ul>
<li>But I don&rsquo;t even know what these errors mean, because a handful of them happen every day:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#39;ERROR org.dspace.storage.rdbms.DatabaseManager&#39; dspace.log.2018-03-1*
dspace.log.2018-03-10:13
dspace.log.2018-03-11:15
dspace.log.2018-03-12:13
dspace.log.2018-03-13:13
dspace.log.2018-03-14:14
dspace.log.2018-03-15:13
dspace.log.2018-03-16:13
dspace.log.2018-03-17:13
dspace.log.2018-03-18:15
dspace.log.2018-03-19:90
</code></pre><ul>
<li>There wasn&rsquo;t even a lot of traffic at the time (8 to 9 AM):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;19/Mar/2018:0[89]:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.197
92 83.103.94.48
96 40.77.167.175
116 207.46.13.178
122 66.249.66.153
140 95.108.181.88
196 213.55.99.121
206 197.210.168.174
207 104.196.152.243
294 54.198.169.202
</code></pre><ul>
<li>Well there is a hint in Tomcat&rsquo;s <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-280&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So someone was doing something heavy somehow&hellip; my guess is content and usage stats!</li>
<li>ICT responded that they &ldquo;fixed&rdquo; the CGSpace connectivity issue in Nairobi without telling me the problem</li>
<li>When I asked, Robert Okal said CGNET messed up when updating the DNS for cgspace.cgiar.org last week</li>
<li>I told him that my request last week was for dspacetest.cgiar.org, not cgspace.cgiar.org!</li>
<li>So they updated the wrong fucking DNS records</li>
<li>Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export</li>
<li>It appears to be this one: <a href="https://cgspace.cgiar.org/handle/10568/83473?show=full">https://cgspace.cgiar.org/handle/10568/83473?show=full</a></li>
<li>The title is &ldquo;Untitled&rdquo; and there is some metadata but indeed the citation is missing</li>
<li>I don&rsquo;t know what would cause that</li>
</ul>
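<p>The per-day error counts from the <code>grep -c</code> output above make the crash day easy to spot by eye, and the same comparison can be automated. A rough Python sketch that parses that output and flags days well above the usual background level; the threshold of three times the median is an arbitrary assumption, and the function names are hypothetical:</p>

```python
from statistics import median

# Output of the grep -c command above, one "file:count" pair per line.
GREP_OUTPUT = """dspace.log.2018-03-10:13
dspace.log.2018-03-11:15
dspace.log.2018-03-12:13
dspace.log.2018-03-13:13
dspace.log.2018-03-14:14
dspace.log.2018-03-15:13
dspace.log.2018-03-16:13
dspace.log.2018-03-17:13
dspace.log.2018-03-18:15
dspace.log.2018-03-19:90"""


def parse_counts(text):
    """Turn 'dspace.log.YYYY-MM-DD:N' lines into a {date: count} dict."""
    counts = {}
    for line in text.splitlines():
        name, _, count = line.rpartition(":")
        counts[name.replace("dspace.log.", "")] = int(count)
    return counts


def unusual_days(counts, factor=3):
    """Return days whose count exceeds factor times the median of all days."""
    baseline = median(counts.values())
    return [day for day, n in counts.items() if n > factor * baseline]


counts = parse_counts(GREP_OUTPUT)
print(unusual_days(counts))
```

<p>With the numbers above only 2018-03-19 is flagged, which matches the day of the crash; the 13 to 15 errors on other days look like normal background noise.</p>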
<h2 id="2018-03-20">2018-03-20</h2>
<ul>
<li>DSpace Test has been down for a few hours with SQL and memory errors starting this morning:</li>
</ul>
<pre tabindex="0"><code>2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I have no idea why it crashed</li>
<li>I ran all system updates and rebooted it</li>
<li>Abenet told me that one of Lance Robinson&rsquo;s ORCID iDs on CGSpace is incorrect</li>
<li>I will remove it from the controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/367">#367</a>) and update any items using the old one:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Lance W. Robinson: 0000-0002-5224-8644&#39; where resource_type_id=2 and metadata_field_id=240 and text_value like &#39;%0000-0002-6344-195X%&#39;;
UPDATE 1
</code></pre><ul>
<li>Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits</li>
<li>Merge the changes to CRP names to the <code>5_x-prod</code> branch and deploy on CGSpace (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
<li>Run corrections for CRP names in the database:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Run all system updates on CGSpace (linode18) and reboot the server</li>
<li>I started a full Discovery re-index on CGSpace because of the updated CRPs</li>
<li>I see this error in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &#34;dc_contributor_author&#34;.
java.lang.IllegalArgumentException: No choices plugin was configured for field &#34;dc_contributor_author&#34;.
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
</code></pre><ul>
<li>I have to figure that one out&hellip;</li>
</ul>
<h2 id="2018-03-21">2018-03-21</h2>
<ul>
<li>Looks like the indexing gets confused that there is still data in the <code>authority</code> column</li>
<li>Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!</li>
<li>Since we&rsquo;ve migrated the ORCID identifiers associated with the authority data to the <code>cg.creator.id</code> field we can nullify the authorities remaining in the database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>dspace<span style="color:#f92672">=#</span> <span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> authority<span style="color:#f92672">=</span><span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">WHERE</span> resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#66d9ef">AND</span> metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span> <span style="color:#66d9ef">AND</span> authority <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NOT</span> <span style="color:#66d9ef">NULL</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> <span style="color:#ae81ff">195463</span>
</span></span></code></pre></div><ul>
<li>After this the indexing works as usual and item counts and facets are back to normal</li>
<li>Send Peter a list of all authors to correct:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>dspace<span style="color:#f92672">=#</span> <span style="color:#960050;background-color:#1e0010">\</span><span style="color:#66d9ef">copy</span> (<span style="color:#66d9ef">select</span> <span style="color:#66d9ef">distinct</span> text_value, <span style="color:#66d9ef">count</span>(<span style="color:#f92672">*</span>) <span style="color:#66d9ef">as</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">from</span> metadatavalue <span style="color:#66d9ef">where</span> metadata_field_id <span style="color:#f92672">=</span> (<span style="color:#66d9ef">select</span> metadata_field_id <span style="color:#66d9ef">from</span> metadatafieldregistry <span style="color:#66d9ef">where</span> element <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;contributor&#39;</span> <span style="color:#66d9ef">and</span> qualifier <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;author&#39;</span>) <span style="color:#66d9ef">AND</span> resource_type_id <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">group</span> <span style="color:#66d9ef">by</span> text_value <span style="color:#66d9ef">order</span> <span style="color:#66d9ef">by</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">desc</span>) <span style="color:#66d9ef">to</span> <span style="color:#f92672">/</span>tmp<span style="color:#f92672">/</span>authors.csv <span style="color:#66d9ef">with</span> csv header;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">COPY</span> <span style="color:#ae81ff">56156</span>
</span></span></code></pre></div><ul>
<li>Afterwards we&rsquo;ll want to do some batch tagging of ORCID identifiers to these names</li>
<li>CGSpace crashed again this afternoon; I&rsquo;m not sure of the cause, but there are a lot of SQL errors in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection has already been closed.
</code></pre><ul>
<li>I have no idea why so many connections were abandoned this afternoon:</li>
</ul>
<pre tabindex="0"><code># grep &#39;Mar 21, 2018&#39; /var/log/tomcat7/catalina.out | grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39;
268
</code></pre><ul>
<li>DSpace Test crashed again due to Java heap space; this is from the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>And this is from the Tomcat Catalina log:</li>
</ul>
<pre tabindex="0"><code>Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>But there are tons of heap space errors on DSpace Test actually:</li>
</ul>
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
319
</code></pre><ul>
<li>I guess we need to give it more RAM because it now has CGSpace&rsquo;s large Solr core</li>
<li>I will increase the memory from 3072m to 4096m</li>
<li>Update <a href="https://github.com/ilri/rmg-ansible-public">Ansible playbooks</a> to use <a href="https://jdbc.postgresql.org/">PostgreSQL JDBC driver</a> 42.2.2</li>
<li>Deploy the new JDBC driver on DSpace Test</li>
<li>I&rsquo;m also curious to see how long the <code>dspace index-discovery -b</code> takes on DSpace Test where the DSpace installation directory is on one of Linode&rsquo;s new block storage volumes</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 208m19.155s
user 8m39.138s
sys 2m45.135s
</code></pre><ul>
<li>So that&rsquo;s about three times as long as it took on CGSpace this morning</li>
<li>I should also check the raw read speed with <code>hdparm -tT /dev/sdc</code></li>
<li>Looking at Peter&rsquo;s author corrections there are some mistakes due to Windows 1252 encoding</li>
<li>I need to find a way to filter these easily with OpenRefine</li>
<li>For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields</li>
<li>I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:</li>
</ul>
<pre tabindex="0"><code>isNotNull(value.match(/.*\ufffd.*/))
</code></pre><ul>
<li>I need to be able to add many common characters though so that it is useful to copy and paste into a new project to find issues</li>
</ul>
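<p>One plausible way for U+FFFD to end up in a field is that text saved as Windows-1252 gets read as UTF-8 with invalid bytes replaced. The Python sketch below reproduces that effect and applies the same check as the GREL expression above. It is an assumption about the mechanism, not a confirmed account of what happened to these corrections, and the function names are hypothetical:</p>

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_as_utf8(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")


def has_replacement_char(value):
    """Python equivalent of the GREL check for U+FFFD above."""
    return REPLACEMENT in value


name = "Hernández Ceballos"
good = decode_as_utf8(name.encode("utf-8"))   # correct source encoding
bad = decode_as_utf8(name.encode("cp1252"))   # Windows-1252 bytes read as UTF-8

print(has_replacement_char(good), has_replacement_char(bad))
```

<p>The accented character is lost for good in the second case, so values flagged this way have to be corrected by hand from the original name.</p>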
<h2 id="2018-03-22">2018-03-22</h2>
<ul>
<li>Add ORCID identifier for Silvia Alonso</li>
<li>Update my Mirage 2 setup notes for Ubuntu 18.04: <a href="https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5">https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5</a></li>
</ul>
<h2 id="2018-03-24">2018-03-24</h2>
<ul>
<li>More work on the Ubuntu 18.04 readiness stuff for the <a href="https://github.com/ilri/rmg-ansible-public">Ansible playbooks</a></li>
<li>The playbook now uses the system&rsquo;s Ruby and Node.js so I don&rsquo;t have to manually install RVM and NVM after</li>
</ul>
<h2 id="2018-03-25">2018-03-25</h2>
<ul>
<li>Looking at Peter&rsquo;s author corrections and trying to work out a way to find errors in OpenRefine easily</li>
<li>I can find all names that contain at least one acceptable character using a GREL expression like:</li>
</ul>
<pre tabindex="0"><code>isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
</code></pre><ul>
<li>But it&rsquo;s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):</li>
</ul>
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/))
)
</code></pre><ul>
<li>And here&rsquo;s one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it&rsquo;s time to add delete support to my <code>fix-metadata-values.py</code> script):</li>
</ul>
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*delete.*/i)),
isNotNull(value.match(/.*remove.*/i)),
isNotNull(value.match(/.*check.*/i))
)
</code></pre><ul>
<li>
<p>So I guess the routine in OpenRefine is:</p>
<ul>
<li>Transform: trim leading/trailing whitespace</li>
<li>Transform: collapse consecutive whitespace</li>
<li>Custom text facet for items to delete/check</li>
<li>Custom text facet for illegal characters</li>
</ul>
</li>
<li>
<p>Test the corrections and deletions locally, then run them on CGSpace:</p>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test</li>
<li>CGSpace took 76m28.292s</li>
<li>DSpace Test took 194m56.048s</li>
</ul>
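<p>The OpenRefine routine above (trim, collapse whitespace, then two custom facets) could also be run as a script before the corrections are applied. A minimal Python sketch under that assumption; the function names and sample names are hypothetical, and only ASCII whitespace is collapsed so that a non-breaking space is still reported as an illegal character rather than silently normalized:</p>

```python
import re

# Characters treated as invalid in author names, following the GREL facet above:
# parentheses, pipe, U+FFFD, non-breaking space (U+00A0) and hair space (U+200A).
ILLEGAL_RE = re.compile("[(|)\ufffd\u00a0\u200a]")
# Reviewer markers meaning the value should be deleted or checked by hand.
FLAG_RE = re.compile(r"delete|remove|check", re.IGNORECASE)


def normalize_whitespace(value):
    """Collapse runs of ASCII whitespace to one space and trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")


def review(value):
    """Return the cleaned value plus flags mirroring the two custom text facets."""
    cleaned = normalize_whitespace(value)
    return {
        "value": cleaned,
        "flagged": bool(FLAG_RE.search(cleaned)),
        "illegal": bool(ILLEGAL_RE.search(cleaned)),
    }


# Made-up sample values for illustration.
for raw in ["  Orth,   Alan ", "Orth, A. (ILRI)", "Smith, J. DELETE", "Orth,\u00a0Alan"]:
    print(review(raw))
```

<p>Values with <code>flagged</code> or <code>illegal</code> set would go to a separate CSV for manual review, as with the flagged rows in OpenRefine.</p>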
<h2 id="2018-03-26">2018-03-26</h2>
<ul>
<li>Atmire got back to me about the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589">Listings and Reports issue</a> and said it&rsquo;s caused by items that have missing <code>dc.identifier.citation</code> fields</li>
<li>They will send a fix</li>
</ul>
<h2 id="2018-03-27">2018-03-27</h2>
<ul>
<li>Atmire got back with an updated quote about the DSpace 5.8 compatibility so I&rsquo;ve forwarded it to Peter</li>
</ul>
<h2 id="2018-03-28">2018-03-28</h2>
<ul>
<li>DSpace Test crashed due to heap space so I&rsquo;ve increased it from 4096m to 5120m</li>
<li>The error in Tomcat&rsquo;s <code>catalina.out</code> was:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &#34;RMI TCP Connection(idle)&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Add ISI Journal (cg.isijournal) as an option in Atmire&rsquo;s Listings and Reports layout (<a href="https://github.com/ilri/DSpace/pull/370">#370</a>) for Abenet</li>
<li>I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p &#39;fuuu&#39;
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
Fixed 31 occurences of: HUMIDTROPICS
Fixed 21 occurences of: MAIZE
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
</code></pre><ul>
<li>That&rsquo;s weird because we just updated them last week&hellip;</li>
<li>Create a pull request to enable searching by ORCID identifier (<code>cg.creator.id</code>) in Discovery and Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/371">#371</a>)</li>
<li>I will test it on DSpace Test first!</li>
<li>Fix one missing XMLUI string for &ldquo;Access Status&rdquo; (cg.identifier.status)</li>
<li>Run all system updates on DSpace Test and reboot the machine</li>
</ul>
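<p>To put a number on the CRP corrections above, the script output can be parsed and totalled. A small Python sketch, assuming each line has the <code>Fixed N occurences of: VALUE</code> shape shown above (spelling as printed by the script); the function name <code>parse_fixes</code> is hypothetical:</p>

```python
import re

# Output lines from the fix-metadata-values.py run above.
OUTPUT = """Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
Fixed 31 occurences of: HUMIDTROPICS
Fixed 21 occurences of: MAIZE
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
"""

LINE_RE = re.compile(r"^Fixed (\d+) occurences of: (.+)$")


def parse_fixes(text):
    """Return a {subject: number of fixed values} dict from the script output."""
    fixes = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            fixes[match.group(2)] = int(match.group(1))
    return fixes


fixes = parse_fixes(OUTPUT)
print(len(fixes), sum(fixes.values()))
```

<p>That comes to 254 values across 10 subjects, with ROOTS, TUBERS AND BANANAS alone accounting for 100 of them.</p>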
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-12/">December, 2024</a></li>
<li><a href="/cgspace-notes/2024-11/">November, 2024</a></li>
<li><a href="/cgspace-notes/2024-10/">October, 2024</a></li>
<li><a href="/cgspace-notes/2024-09/">September, 2024</a></li>
<li><a href="/cgspace-notes/2024-08/">August, 2024</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>