<!DOCTYPE html>
<html lang="en">

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="August, 2018" />
<meta property="og:description" content="2018-08-01


DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:


[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat&rsquo;s
I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;
Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it


" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-08/" />


<meta property="article:published_time" content="2018-08-01T11:52:54&#43;03:00"/>

<meta property="article:modified_time" content="2018-08-16T18:59:45&#43;03:00"/>


<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2018"/>
<meta name="twitter:description" content="2018-08-01


DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:


[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat&rsquo;s
I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;
Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it


"/>
<meta name="generator" content="Hugo 0.46" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "August, 2018",
  "url": "https://alanorth.github.io/cgspace-notes/2018-08/",
  "wordCount": "1376",
  "datePublished": "2018-08-01T11:52:54&#43;03:00",
  "dateModified": "2018-08-16T18:59:45&#43;03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-08/">

    <title>August, 2018 | CGSpace Notes</title>

    <!-- combined, minified CSS -->
    <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-Upm5uY/SXdvbjuIGH6fBjF5vOYUr9DguqBskM&#43;EQpLBzO9U&#43;9fMVmWEt&#43;TTlGrWQ" crossorigin="anonymous">

    
  </head>

  <body>

    
    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>
    

    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>
    

    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">

          
<article class="blog-post">
  <header>
    <h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-08/">August, 2018</a></h2>
    <p class="blog-post-meta"><time datetime="2018-08-01T11:52:54&#43;03:00">Wed Aug 01, 2018</time> by Alan Orth in 

<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
  </header>
  <h2 id="2018-08-01">2018-08-01</h2>

<ul>
<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
</ul>

<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre>

<ul>
<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li>
<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat&rsquo;s</li>
<li>I&rsquo;m not sure why Tomcat didn&rsquo;t crash with an OutOfMemoryError&hellip;</li>
<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li>
<li>The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
</ul>
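
<ul>
<li>A minimal sketch, assuming Python 3, of how the <code>Killed process</code> line above can be read: it pulls out the memory counters and converts the resident size to MiB so it can be compared with the 5120m heap setting. The helper names <code>parse_oom_kill</code> and <code>kb_to_mib</code> are hypothetical, not part of any existing tool:</li>
</ul>

<pre><code class="language-python">import re

# Sample line copied from the dmesg output above
line = '[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB'

def parse_oom_kill(line):
    # Return the pid, command name, and memory counters (in kB) from a kill line
    match = re.search(r'Killed process (\d+) \((\S+)\)', line)
    if match is None:
        return None
    # Counters look like name:123kB; the timestamp has digits before its colons, so it is skipped
    counters = {name: int(kb) for name, kb in re.findall(r'([a-z-]+):(\d+)kB', line)}
    return {'pid': int(match.group(1)), 'command': match.group(2), **counters}

def kb_to_mib(kb):
    return kb / 1024

info = parse_oom_kill(line)
print(info['command'], info['pid'], round(kb_to_mib(info['anon-rss'])), 'MiB resident')
</code></pre>

<ul>
<li>For the line above this prints <code>java 1394 5230 MiB resident</code>, so the killed process was holding a bit more than the configured 5120m heap on a server with 8GB of RAM</li>
</ul>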

<p></p>

<ul>
<li>I started looking over the latest round of IITA batch records from Sisay on DSpace Test: <a href="https://dspacetest.cgiar.org/handle/10568/103250">IITA July_30</a>

<ul>
<li>incorrect authorship types</li>
<li>dozens of inconsistencies, spelling mistakes, and white space in author affiliations</li>
<li>minor issues in countries (California is not a country)</li>
<li>minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects</li>
</ul></li>
</ul>

<h2 id="2018-08-02">2018-08-02</h2>

<ul>
<li>DSpace Test crashed again and the only error I see is this in <code>dmesg</code>:</li>
</ul>

<pre><code>[Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
</code></pre>

<ul>
<li>I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?</li>
<li>The risk we run there is that we&rsquo;ll start getting OutOfMemory errors from Tomcat</li>
<li>So basically we need a new test server with more RAM very soon&hellip;</li>
<li>Abenet asked about the workflow statistics in the Atmire CUA module again</li>
<li>Last year Atmire told me that it&rsquo;s disabled by default but you can enable it with <code>workflow.stats.enabled = true</code> in the CUA configuration file</li>
<li>There was a bug with adding users so they sent a patch, but I didn&rsquo;t merge it because it was <a href="https://github.com/ilri/DSpace/pull/319">very dirty</a> and I wasn&rsquo;t sure it actually fixed the problem</li>
<li>I just tried to enable the stats again on DSpace Test now that we&rsquo;re on DSpace 5.8 with updated Atmire modules, but every user I search for shows &ldquo;No data available&rdquo;</li>
<li>As a test I submitted a new item and I was able to see it in the workflow statistics &ldquo;data&rdquo; tab, but not in the graph</li>
</ul>

<h2 id="2018-08-15">2018-08-15</h2>

<ul>
<li>Run through Peter&rsquo;s list of author affiliations from earlier this month</li>
<li>I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors</li>
<li>Finally I did a test run with the <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-values.py</code></a> script:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre>

<h2 id="2018-08-16">2018-08-16</h2>

<ul>
<li>Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
</code></pre>

<ul>
<li>Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month</li>
<li>I might need to overhaul the <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration</li>
<li>After checking a few examples I see that checking only the <code>text_value</code> and <code>place</code> when adding ORCID fields is not enough anymore</li>
<li>It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission</li>
<li>Now it is better to check if there is <em>any</em> existing ORCID identifier for a given author for the item&hellip;</li>
<li>I will have to update my script to extract the ORCID identifier and search for that</li>
<li>Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:</li>
</ul>

<pre><code>$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre>
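
<ul>
<li>The ORCID check described above (look for <em>any</em> existing identifier on the item rather than comparing <code>text_value</code> and <code>place</code>) could be sketched roughly like this. This is not the actual <code>add-orcid-identifiers-csv.py</code> code: the function names and the sample existing value are hypothetical, and the database lookup that would supply the existing values is left out:</li>
</ul>

<pre><code class="language-python">import re

# ORCID identifiers are four groups of four characters; the last one may be the check character X
ORCID_PATTERN = re.compile(r'\d{4}-\d{4}-\d{4}-\d{3}[\dX]')

def extract_orcid(value):
    # Pull the bare identifier out of a 'Name: 0000-0000-0000-0000' style value
    match = ORCID_PATTERN.search(value)
    return match.group(0) if match else None

def needs_orcid(existing_values, new_value):
    # True when none of the existing values on the item carry the same identifier
    wanted = extract_orcid(new_value)
    if wanted is None:
        return False
    existing = {extract_orcid(value) for value in existing_values}
    return wanted not in existing

# Hypothetical cg.creator.id values already on an item, with a name edited by hand
existing = ['Verchot, Louis V.: 0000-0001-8309-6754']
print(needs_orcid(existing, 'Louis Verchot: 0000-0001-8309-6754'))
print(needs_orcid(existing, 'Kris Wyckhuys: 0000-0003-0922-488X'))
</code></pre>

<ul>
<li>This prints <code>False</code> and then <code>True</code>: comparing only the extracted identifier means a manually edited name in the existing field no longer causes a duplicate to be added</li>
</ul>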

<h2 id="2018-08-19">2018-08-19</h2>

<ul>
<li>Keep working on the CIAT ORCID identifiers from Elizabeth</li>
<li>In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same person (i.e. &ldquo;Schultze-Kraft, Rainer&rdquo; and &ldquo;Schultze-Kraft, R.&rdquo;) I will just tag them with ORCID identifiers too</li>
<li>This is less obvious and more error prone with names like &ldquo;Peters&rdquo; where there are many more authors</li>
<li>I see some errors in the variations of names as well, for example:</li>
</ul>

<pre><code>Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
</code></pre>

<ul>
<li>I&rsquo;ll just tag them all with Louis Verchot&rsquo;s ORCID identifier&hellip;</li>
<li>In the end, I&rsquo;ll run the following CSV with my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script:</li>
</ul>

<pre><code>dc.contributor.author,cg.creator.id
&quot;Campbell, Bruce&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, Bruce M.&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, B.M&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Peters, Michael&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Peters, M.&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Peters, M.K.&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Tamene, Lulseged&quot;,Lulseged Tamene: 0000-0002-3806-8890
&quot;Desta, Lulseged Tamene&quot;,Lulseged Tamene: 0000-0002-3806-8890
&quot;Läderach, Peter&quot;,Peter Läderach: 0000-0001-8708-6318
&quot;Lundy, Mark&quot;,Mark Lundy: 0000-0002-5241-3777
&quot;Schultze-Kraft, Rainer&quot;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&quot;Schultze-Kraft, R.&quot;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&quot;Verchot, Louis&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L. V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L.V&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L.V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, LV&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, Louis V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Mukankusi, Clare&quot;,Clare Mukankusi: 0000-0001-7837-4545
&quot;Mukankusi, Clare M.&quot;,Clare Mukankusi: 0000-0001-7837-4545
&quot;Wyckhuys, Kris&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Wyckhuys, Kris A. G.&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Wyckhuys, Kris A.G.&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Chirinda, Ngonidzashe&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&quot;Chirinda, Ngoni&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&quot;Ngonidzashe, Chirinda&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
</code></pre>

<ul>
<li>The invocation would be:</li>
</ul>

<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers</li>
<li>Looking at the list of author affiliations from Peter one last time</li>
<li>I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:</li>
</ul>

<pre><code>or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
</code></pre>

<ul>
<li>This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n</li>
<li>I will run the following on DSpace Test and CGSpace:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre>
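
<ul>
<li>For checking values outside of Open Refine, the GREL expression for invalid characters above can be approximated in Python. This is only a sketch under the assumption that the same five code points are the ones of interest; the name <code>has_suspicious_character</code> is hypothetical:</li>
</ul>

<pre><code class="language-python">import re

# Same code points as the GREL expression: replacement character, no-break space,
# hair space, right single quotation mark, and acute accent
SUSPICIOUS = re.compile('[\uFFFD\u00A0\u200A\u2019\u00B4]')

def has_suspicious_character(value):
    # True if the value contains any of the characters listed above
    return SUSPICIOUS.search(value) is not None

# A stray acute accent after the letter, as in the broken affiliations
print(has_suspicious_character('Investigacio\u00b4n'))
# The correctly encoded precomposed letter (U+00F3)
print(has_suspicious_character('Investigaci\u00f3n'))
</code></pre>

<ul>
<li>This prints <code>True</code> for the broken value and <code>False</code> for the correctly encoded one</li>
</ul>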

<ul>
<li>Then force an update of the Discovery index on DSpace Test:</li>
</ul>

<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    72m12.570s
user    6m45.305s
sys     2m2.461s
</code></pre>

<ul>
<li>And then on CGSpace:</li>
</ul>

<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    79m44.392s
user    8m50.730s
sys     2m20.248s
</code></pre>

<ul>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19 
1724
</code></pre>

<ul>
<li>I don&rsquo;t even know how it&rsquo;s possible for the bot to use MORE sessions than total requests&hellip;</li>
<li>The user agent is:</li>
</ul>

<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre>

<ul>
<li>So I&rsquo;m thinking we should add &ldquo;crawl&rdquo; to the Tomcat Crawler Session Manager valve, as we already have &ldquo;bot&rdquo; that catches Googlebot, Bingbot, etc.</li>
</ul>
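
<ul>
<li>A quick sketch of why &ldquo;bot&rdquo; alone misses this user agent while adding &ldquo;crawl&rdquo; would catch it. The two patterns below are assumptions for illustration, not the actual valve configuration on the server, and I am assuming the valve requires the whole user agent string to match, which <code>re.fullmatch</code> imitates here:</li>
</ul>

<pre><code class="language-python">import re

# User agent copied from the log excerpt above
user_agent = 'Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'

# Assumed patterns for illustration only; the real valve configuration may differ
current_pattern = '.*[bB]ot.*'
proposed_pattern = '.*[bB]ot.*|.*[cC]rawl.*'

def is_crawler(pattern, agent):
    # The whole user agent string has to match the pattern
    return re.fullmatch(pattern, agent) is not None

print(is_crawler(current_pattern, user_agent))
print(is_crawler(proposed_pattern, user_agent))
</code></pre>

<ul>
<li>This prints <code>False</code> and then <code>True</code>, since the MegaIndex user agent contains &ldquo;crawler&rdquo; but not &ldquo;bot&rdquo;</li>
</ul>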

<!-- vim: set sw=2 ts=2: -->

  
</article> 


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">
  

        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2018-08/">August, 2018</a></li>

<li><a href="/cgspace-notes/2018-07/">July, 2018</a></li>

<li><a href="/cgspace-notes/2018-06/">June, 2018</a></li>

<li><a href="/cgspace-notes/2018-05/">May, 2018</a></li>

<li><a href="/cgspace-notes/2018-04/">April, 2018</a></li>

    </ol>
  </section>

  
  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">
      
      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
      
      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
      
      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
      
    </ol>
  </section>
  
</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->
    

    <footer class="blog-footer">
      <p>
      
      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
      
      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>
    

  </body>

</html>