<!-- mirror of https://github.com/alanorth/cgspace-notes.git, synced 2024-12-18 19:22:18 +01:00 -->
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="August, 2018" />
<meta property="og:description" content="2018-08-01

DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it

" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-08/" />

<meta property="article:published_time" content="2018-08-01T11:52:54+03:00"/>

<meta property="article:modified_time" content="2018-08-26T09:38:15+03:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2018"/>
<meta name="twitter:description" content="2018-08-01

DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:

[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it

"/>
<meta name="generator" content="Hugo 0.46" />

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "August, 2018",
  "url": "https://alanorth.github.io/cgspace-notes/2018-08/",
  "wordCount": "2426",
  "datePublished": "2018-08-01T11:52:54+03:00",
  "dateModified": "2018-08-26T09:38:15+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-08/">

<title>August, 2018 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-Upm5uY/SXdvbjuIGH6fBjF5vOYUr9DguqBskM+EQpLBzO9U+9fMVmWEt+TTlGrWQ" crossorigin="anonymous">

</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-08/">August, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-08-01T11:52:54+03:00">Wed Aug 01, 2018</time> by Alan Orth in

<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
</header>
<h2 id="2018-08-01">2018-08-01</h2>

<ul>
<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
</ul>

<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre>

<ul>
<li>Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight</li>
<li>From the DSpace log I see that eventually Solr stopped responding, so I guess the <code>java</code> process that was OOM killed above was Tomcat’s</li>
<li>I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…</li>
<li>Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core</li>
<li>The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
</ul>
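<ul>
<li>As a rough sketch of the memory arithmetic above (in Python, not something that was run on the server): the <code>anon-rss</code> figure from the <code>dmesg</code> line converts to about 5230 MiB, slightly more than the 5120m heap, which would be consistent with the kernel killing the process before the JVM itself ran out of heap. The 8192 MiB total is an assumption (8GB taken as 8 GiB, all of it usable), and <code>anon_rss_mib</code> and <code>headroom_mib</code> are hypothetical helper names:</li>
</ul>

<pre><code>import re

# "Killed process" line copied from the dmesg output above.
LINE = ("[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) "
        "total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB")

# Assumption: "8GB" means 8 GiB and all of it is usable.
TOTAL_RAM_MIB = 8 * 1024


def anon_rss_mib(line):
    """Return the anon-rss value of an OOM killer line, converted from kB to MiB."""
    match = re.search(r"anon-rss:(\d+)kB", line)
    return int(match.group(1)) / 1024


def headroom_mib(heap_mib):
    """Return the RAM left for everything other than the Java heap."""
    return TOTAL_RAM_MIB - heap_mib


print(round(anon_rss_mib(LINE)))      # 5230
for heap in (5120, 6144):
    print(heap, headroom_mib(heap))   # 5120 3072, then 6144 2048
</code></pre>

<ul>
<li>Under those assumptions, raising the heap from 5120m to 6144m would shrink what is left for the OS, PostgreSQL, and batch processes from 3072 MiB to 2048 MiB</li>
</ul>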

<p></p>

<ul>
<li>I started looking over the latest round of IITA batch records from Sisay on DSpace Test: <a href="https://dspacetest.cgiar.org/handle/10568/103250">IITA July_30</a>

<ul>
<li>incorrect authorship types</li>
<li>dozens of inconsistencies, spelling mistakes, and white space in author affiliations</li>
<li>minor issues in countries (California is not a country)</li>
<li>minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects</li>
</ul></li>
</ul>

<h2 id="2018-08-02">2018-08-02</h2>

<ul>
<li>DSpace Test crashed again and the only error I see is this in <code>dmesg</code>:</li>
</ul>

<pre><code>[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
</code></pre>

<ul>
<li>I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?</li>
<li>The risk we run there is that we’ll start getting OutOfMemory errors from Tomcat</li>
<li>So basically we need a new test server with more RAM very soon…</li>
<li>Abenet asked about the workflow statistics in the Atmire CUA module again</li>
<li>Last year Atmire told me that it’s disabled by default but you can enable it with <code>workflow.stats.enabled = true</code> in the CUA configuration file</li>
<li>There was a bug with adding users so they sent a patch, but I didn’t merge it because it was <a href="https://github.com/ilri/DSpace/pull/319">very dirty</a> and I wasn’t sure it actually fixed the problem</li>
<li>I just tried to enable the stats again on DSpace Test now that we’re on DSpace 5.8 with updated Atmire modules, but every user I search for shows “No data available”</li>
<li>As a test I submitted a new item and I was able to see it in the workflow statistics “data” tab, but not in the graph</li>
</ul>

<h2 id="2018-08-15">2018-08-15</h2>

<ul>
<li>Run through Peter’s list of author affiliations from earlier this month</li>
<li>I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors</li>
<li>Finally I did a test run with the <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-values.py</code></a> script:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre>

<h2 id="2018-08-16">2018-08-16</h2>

<ul>
<li>Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
</code></pre>

<ul>
<li>Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month</li>
<li>I might need to overhaul the <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration</li>
<li>After checking a few examples I see that checking only the <code>text_value</code> and <code>place</code> when adding ORCID fields is not enough anymore</li>
<li>It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission</li>
<li>Now it is better to check if there is <em>any</em> existing ORCID identifier for a given author for the item…</li>
<li>I will have to update my script to extract the ORCID identifier and search for that</li>
<li>Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:</li>
</ul>

<pre><code>$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre>

<h2 id="2018-08-19">2018-08-19</h2>

<ul>
<li>Keep working on the CIAT ORCID identifiers from Elizabeth</li>
<li>In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (i.e. “Schultze-Kraft, Rainer” and “Schultze-Kraft, R.”) I will just tag them with ORCID identifiers too</li>
<li>This is less obvious and more error prone with names like “Peters” where there are many more authors</li>
<li>I see some errors in the variations of names as well, for example:</li>
</ul>

<pre><code>Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
</code></pre>

<ul>
<li>I’ll just tag them all with Louis Verchot’s ORCID identifier…</li>
<li>In the end, I’ll run the following CSV with my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script:</li>
</ul>

<pre><code>dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
"Peters, Michael",Michael Peters: 0000-0003-4237-3916
"Peters, M.",Michael Peters: 0000-0003-4237-3916
"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
"Verchot, L",Louis Verchot: 0000-0001-8309-6754
"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
</code></pre>
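<ul>
<li>A minimal sketch of the “extract the ORCID identifier and search for that” idea from 2018-08-16, assuming the <code>cg.creator.id</code> values keep the <code>Name: 0000-0000-0000-0000</code> shape used in the CSV above. This is not the code in <code>add-orcid-identifiers-csv.py</code>; <code>extract_orcid</code> and <code>item_has_orcid</code> are hypothetical names, and the list of existing values would come from a database query that is not shown:</li>
</ul>

<pre><code>import re

# An ORCID iD is four hyphen-separated groups of four characters;
# the final character is a check digit that may be "X".
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def extract_orcid(creator_id):
    """Return the bare ORCID iD found in a cg.creator.id value, or None."""
    match = ORCID_PATTERN.search(creator_id)
    return match.group(0) if match else None


def item_has_orcid(existing_values, new_value):
    """True if any existing value on the item carries the same ORCID iD as new_value."""
    wanted = extract_orcid(new_value)
    if wanted is None:
        return False
    return any(extract_orcid(value) == wanted for value in existing_values)


# The existing value is a made-up variant of the display name, as an editor might have typed it.
existing = ["Wyckhuys, Kris A.G.: 0000-0003-0922-488X"]
print(item_has_orcid(existing, "Kris Wyckhuys: 0000-0003-0922-488X"))   # True
print(item_has_orcid(existing, "Louis Verchot: 0000-0001-8309-6754"))   # False
</code></pre>

<ul>
<li>Comparing only the identifier means a differently formatted display name in front of it no longer causes a duplicate ORCID field to be added</li>
</ul>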

<ul>
<li>The invocation of <code>add-orcid-identifiers-csv.py</code> with that CSV would be:</li>
</ul>

<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers</li>
<li>Looking at the list of author affiliations from Peter one last time</li>
<li>I notice that I should add the Unicode character U+00B4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:</li>
</ul>

<pre><code>or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
</code></pre>
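<ul>
<li>For checking the same thing outside Open Refine, a small Python sketch of the GREL expression above. It flags a value if it contains any of the same five code points; the name <code>is_suspicious</code> is hypothetical:</li>
</ul>

<pre><code>import re

# The same five code points as the GREL expression: replacement character,
# no-break space, hair space, right single quotation mark, acute accent.
SUSPICIOUS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4]")


def is_suspicious(value):
    """True if the value contains any of the characters listed above."""
    return SUSPICIOUS.search(value) is not None


print(is_suspicious("De\u00b4veloppement"))   # True: stray acute accent after the "e"
print(is_suspicious("D\u00e9veloppement"))    # False: properly encoded e-acute
</code></pre>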

<ul>
<li>On its own, the acute accent character (U+00B4) is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n</li>
<li>I will run the following on DSpace Test and CGSpace:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre>

<ul>
<li>Then force an update of the Discovery index on DSpace Test:</li>
</ul>

<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 72m12.570s
user 6m45.305s
sys 2m2.461s
</code></pre>

<ul>
<li>And then on CGSpace:</li>
</ul>

<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 79m44.392s
user 8m50.730s
sys 2m20.248s
</code></pre>

<ul>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
</code></pre>

<ul>
<li>I don’t even know how it’s possible for the bot to use MORE sessions than total requests… though <code>grep -c</code> counts matching log lines rather than distinct sessions, and the two commands read different logs, so the numbers may not be directly comparable</li>
<li>The user agent is:</li>
</ul>

<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre>

<ul>
<li>So I’m thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.</li>
</ul>
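<ul>
<li>A quick way to sanity check that idea before touching Tomcat, sketched in Python. It assumes the valve’s <code>crawlerUserAgents</code> setting is a regular expression that has to match the whole <code>User-Agent</code> header, and both patterns below are illustrative rather than copied from the actual <code>server.xml</code>:</li>
</ul>

<pre><code>import re

# Illustrative patterns only; the real value lives in Tomcat's server.xml.
BOT_ONLY = re.compile(r".*[bB]ot.*")
BOT_OR_CRAWL = re.compile(r".*[bB]ot.*|.*[cC]rawl.*")

MEGAINDEX_UA = "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def is_crawler(pattern, user_agent):
    """True if the pattern matches the entire user agent string."""
    return pattern.fullmatch(user_agent) is not None


print(is_crawler(BOT_ONLY, GOOGLEBOT_UA))       # True
print(is_crawler(BOT_ONLY, MEGAINDEX_UA))       # False: no "bot" anywhere in the string
print(is_crawler(BOT_OR_CRAWL, MEGAINDEX_UA))   # True: "crawler" in the URL matches
</code></pre>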

<h2 id="2018-08-20">2018-08-20</h2>

<ul>
<li>Help Sisay with some UTF-8 encoding issues in a file Peter sent him</li>
<li>Finish up reconciling Atmire’s pull request for DSpace 5.8 changes with the latest status of our <code>5_x-prod</code> branch</li>
<li>I had to do some <code>git rev-list --reverse --no-merges oldestcommit..newestcommit</code> and <code>git cherry-pick -S</code> hackery to get everything all in order</li>
<li>After building I ran the Atmire schema migrations and forced old migrations, then did the <code>ant update</code></li>
<li>I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set <code>JAVA_OPTS</code> to 1024m and tried the <code>mvn package</code> again</li>
<li>Still the <code>mvn package</code> takes forever and essentially hangs on processing the xmlui-mirage2 overlay (though after building all the themes)</li>
<li>I will try to reduce Tomcat memory from 4608m to 4096m and then retry the <code>mvn package</code> with 1024m of <code>JAVA_OPTS</code> again</li>
<li>After running the <code>mvn package</code> for the third time and waiting an hour, I attached <code>strace</code> to the Java process and saw that it was indeed reading XMLUI theme data… so I guess I just need to wait more</li>
<li>After waiting two hours the Maven process completed and installation was successful</li>
<li>I restarted Tomcat and it seems everything is working well, so I’ll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th</li>
<li>I merged <a href="https://github.com/ilri/DSpace/pull/378">Atmire’s pull request</a> into our <code>5_x-dspace-5.8</code> temporary branch and then cherry-picked all the changes from <code>5_x-prod</code> since April, 2018 when that temporary branch was created</li>
<li>As the branch histories are very different I cannot merge the new 5.8 branch into the current <code>5_x-prod</code> branch</li>
<li>Instead, I will archive the current <code>5_x-prod</code> DSpace 5.5 branch as <code>5_x-prod-dspace-5.5</code> and then hard reset <code>5_x-prod</code> based on <code>5_x-dspace-5.8</code></li>
<li>Unfortunately this will mess up the references in pull requests and issues on GitHub</li>
</ul>

<h2 id="2018-08-21">2018-08-21</h2>

<ul>
<li>Something must have happened, as the <code>mvn package</code> <em>always</em> takes about two hours now, stopping for a very long time near the end at this step:</li>
</ul>

<pre><code>[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
</code></pre>

<ul>
<li>It’s the same on DSpace Test, my local laptop, and CGSpace…</li>
<li>It wasn’t this way before when I was constantly building the previous 5.8 branch with Atmire patches…</li>
<li>I will restore the previous <code>5_x-dspace-5.8</code> and <code>atmire-module-upgrades-5.8</code> branches to see if the build time is different there</li>
<li>… it seems that the <code>atmire-module-upgrades-5.8</code> branch still takes 1 hour and 23 minutes on my local machine…</li>
<li>Let me try to build the old <code>5_x-prod-dspace-5.5</code> branch on my local machine and see how long it takes</li>
<li>That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch; now I should try vanilla DSpace 5.8</li>
<li>I notice that the step this pauses at is:</li>
</ul>

<pre><code>[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
</code></pre>

<ul>
<li>And I notice that Atmire changed something in the XMLUI module’s <code>pom.xml</code> as part of the DSpace 5.8 changes, specifically to remove the exclude for <code>node_modules</code> in the <code>maven-war-plugin</code> step</li>
<li>This exclude is <em>present</em> in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!</li>
<li>It makes sense that it would take longer to complete this step because the <code>node_modules</code> folder has tens of thousands of files, and we have 27 themes!</li>
<li>I need to test to see if this has any side effects when deployed…</li>
<li>In other news, I see there was a pull request in DSpace 5.9 that fixes the issue with not being able to have blank lines in CSVs when importing via command line or webui (<a href="https://jira.duraspace.org/browse/DS-3245">DS-3245</a>)</li>
</ul>

<h2 id="2018-08-23">2018-08-23</h2>

<ul>
<li>Skype meeting with CKM people to meet new web dev guy Tariku</li>
<li>They say they want to start working on the ContentDM harvester middleware again</li>
<li>I sent a list of the top 1500 author affiliations on CGSpace to CodeObia so we can compare ours with the ones on MELSpace</li>
<li>Discuss CTA items with Sisay, who was trying to figure out how to do the collection mapping in combination with SAFBuilder</li>
<li>It appears that the web UI’s upload interface <em>requires</em> you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the <code>collections</code> file inside each item in the bundle</li>
<li>I imported the CTA items on CGSpace for Sisay:</li>
</ul>

<pre><code>$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
</code></pre>

<h2 id="2018-08-26">2018-08-26</h2>

<ul>
<li>Doing the DSpace 5.8 upgrade on CGSpace (linode18)</li>
<li>I already finished the Maven build, now I’ll take a backup of the PostgreSQL database and do a database cleanup just in case:</li>
</ul>

<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
$ dspace cleanup -v
</code></pre>

<ul>
<li>Now I can stop Tomcat and do the install:</li>
</ul>

<pre><code>$ cd dspace/target/dspace-installer
$ ant update clean_backups update_geolite
</code></pre>

<ul>
<li>After the successful Ant update I can run the database migrations:</li>
</ul>

<pre><code>$ psql dspace dspace

dspace=> \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql
DELETE 0
UPDATE 1
DELETE 1
dspace=> \q

$ dspace database migrate ignored
</code></pre>

<ul>
<li>Then I’ll run all system updates and reboot the server:</li>
</ul>

<pre><code>$ sudo su -
# apt update && apt full-upgrade
# apt clean && apt autoclean && apt autoremove
# reboot
</code></pre>

<ul>
<li>After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine</li>
<li>Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now</li>
<li>They want a CSV with <em>all</em> metadata, which the Atmire Listings and Reports module can’t do</li>
<li>I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject <code>GENDER</code> or <code>GENDER POVERTY AND INSTITUTIONS</code>, and CRP <code>Water, Land and Ecosystems</code></li>
<li>Then I extracted the Handle links from the report so I could export each item’s metadata as CSV</li>
</ul>

<pre><code>$ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
</code></pre>
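<ul>
<li>One caveat with that pattern, shown as a Python sketch: <code>[0-9]{0,5}</code> stops after five digits, so a handle with a six-digit suffix (like the DSpace Test handle 10568/103250 linked on 2018-08-01) would be cut short. Whether any handle in this export is that long is not known; the wider pattern is an assumption, not what was run:</li>
</ul>

<pre><code>import re

# The pattern from the grep command above.
AS_RUN = re.compile(r"[0-9]{5}/[0-9]{0,5}")
# Assumption for illustration: accept a suffix of any length.
WIDER = re.compile(r"[0-9]{5}/[0-9]+")

# Handle URL taken from the 2018-08-01 entry in these notes.
SAMPLE = "https://dspacetest.cgiar.org/handle/10568/103250"

print(AS_RUN.findall(SAMPLE))   # ['10568/10325']
print(WIDER.findall(SAMPLE))    # ['10568/103250']
</code></pre>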

<ul>
<li>Then on the DSpace server I exported the metadata for each item one by one:</li>
</ul>

<pre><code>$ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
</code></pre>

<ul>
<li>But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them</li>
<li>I’m not sure how to proceed without writing some script to parse and join the CSVs, and I don’t think it’s worth my time</li>
</ul>
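<ul>
<li>For what it’s worth, a minimal sketch of what such a script could look like in Python, using only the standard library: collect the union of all column names, then write every row against that combined header and leave missing cells empty. The two sample CSVs are made up and much smaller than a real <code>metadata-export</code> file, and <code>merge_csv_texts</code> and <code>write_merged</code> are hypothetical names:</li>
</ul>

<pre><code>import csv
import io


def merge_csv_texts(csv_texts):
    """Return (fieldnames, rows) for CSV documents whose headers may differ."""
    fieldnames = []
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        # Keep columns in first-seen order so the output is stable.
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        rows.extend(reader)
    return fieldnames, rows


def write_merged(csv_texts):
    """Return one CSV string with the combined header; missing cells are empty."""
    fieldnames, rows = merge_csv_texts(csv_texts)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample exports with different columns.
FIRST = "id,dc.title\n1,First\n"
SECOND = "id,dc.title,dc.subject\n2,Second,GENDER\n"
print(write_merged([FIRST, SECOND]))
</code></pre>

<ul>
<li>With real exports the strings would be read from the per-item files in <code>/tmp</code> instead of being defined inline</li>
</ul>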

<!-- vim: set sw=2 ts=2: -->

</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">

<li><a href="/cgspace-notes/2018-08/">August, 2018</a></li>

<li><a href="/cgspace-notes/2018-07/">July, 2018</a></li>

<li><a href="/cgspace-notes/2018-06/">June, 2018</a></li>

<li><a href="/cgspace-notes/2018-05/">May, 2018</a></li>

<li><a href="/cgspace-notes/2018-04/">April, 2018</a></li>

</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">

<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>

<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>

<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>

</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>

Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.

</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>