<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2017" />
<meta property="og:description" content="2017-04-02
Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): https://github.com/ilri/DSpace/pull/317
Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-04/" />
<meta property="article:published_time" content="2017-04-02T17:08:52+02:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2017"/>
<meta name="twitter:description" content="2017-04-02
Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): https://github.com/ilri/DSpace/pull/317
Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.105.0">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "April, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-04/",
"wordCount": "2917",
"datePublished": "2017-04-02T17:08:52+02:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-04/">
<title>April, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-04/">April, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-04-02T17:08:52+02:00">Sun Apr 02, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-04-02">2017-04-02</h2>
<ul>
<li>Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): <a href="https://github.com/ilri/DSpace/pull/317">https://github.com/ilri/DSpace/pull/317</a></li>
<li>Quick proof-of-concept hack to add <code>dc.rights</code> to the input form, including some inline instructions/hints:</li>
</ul>
<p><img src="/cgspace-notes/2017/04/dc-rights.png" alt="dc.rights in the submission form"></p>
<ul>
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-03">2017-04-03</h2>
<ul>
<li>Continue testing the CMYK patch on more communities:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
</code></pre><ul>
<li>So far almost 500 CMYK PDFs have been detected:</li>
</ul>
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
484
</code></pre><ul>
<li>Looking at the CG Core document again, I&rsquo;ll send some feedback to Peter and Abenet:
<ul>
<li>We use cg.contributor.crp to indicate the CRP(s) affiliated with the item</li>
<li>DSpace has dc.date.available, but this field isn&rsquo;t particularly meaningful other than as an automatic timestamp at the time of item accession (and is identical to dc.date.accessioned)</li>
<li>dc.relation exists in CGSpace, but isn&rsquo;t used; instead we use dc.relation.ispartofseries, which is used ~5,000 times to record the series name and number within that series</li>
</ul>
</li>
<li>Also, I&rsquo;m noticing some weird outliers in <code>cg.coverage.region</code>, need to remember to go correct these later:</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
</code></pre><h2 id="2017-04-04">2017-04-04</h2>
<ul>
<li>The <code>filter-media</code> script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:</li>
</ul>
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
1584
</code></pre><ul>
<li>Trying to find a way to get the number of items submitted by a certain user in 2016</li>
<li>It&rsquo;s not possible in the DSpace search / module interfaces, but it might be derivable from <code>dc.description.provenance</code>, as that field contains the name and email of the submitter/approver, ie:</li>
</ul>
<pre tabindex="0"><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
No. of bitstreams: 1^M
ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
</code></pre><ul>
<li>This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a &ldquo;checksum&rdquo; (ie, there was a bitstream in the submission):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>Then this one does the same, but for fields that don&rsquo;t contain checksums (ie, there was no bitstream in the submission):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*&#39; and text_value !~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.</li>
<li>It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled&hellip;</li>
<li>In that case it might just be better to see how many the user submitted (both <em>with</em> and <em>without</em> bitstreams):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^Submitted.*giampieri.*2016-.*&#39;;
</code></pre><h2 id="2017-04-05">2017-04-05</h2>
<ul>
<li>After doing a few more large communities it seems this is the final count of CMYK PDFs:</li>
</ul>
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
2505
</code></pre><h2 id="2017-04-06">2017-04-06</h2>
<ul>
<li>After reading the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">notes for DCAT April 2017</a> I am testing some new settings for PostgreSQL on DSpace Test:
<ul>
<li><code>db.maxconnections</code> 30→70 (the default PostgreSQL config allows 100 connections, so DSpace&rsquo;s default of 30 is quite low)</li>
<li><code>db.maxwait</code> 5000→10000</li>
<li><code>db.maxidle</code> 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)</li>
</ul>
</li>
<li>I need to look at the Munin graphs after a few days to see if the load has changed</li>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>Discussing harvesting CIFOR&rsquo;s DSpace via OAI</li>
<li>Sisay added their OAI as a source to a new collection, but using the Simple Dublin Core method, so many fields are unqualified and duplicated</li>
<li>Looking at the <a href="https://wiki.lyrasis.org/display/DSDOC5x/XMLUI+Configuration+and+Customization">documentation</a> it seems that we probably want to be using DSpace Intermediate Metadata</li>
</ul>
<h2 id="2017-04-10">2017-04-10</h2>
<ul>
<li>Adjust Linode CPU usage alerts on DSpace servers
<ul>
<li>CGSpace from 200 to 250%</li>
<li>DSpace Test from 100 to 150%</li>
</ul>
</li>
<li>Remove James from Linode access</li>
<li>Look into having CIFOR use a sub prefix of 10568 like 10568.01</li>
<li>Handle.net calls this <a href="https://www.handle.net/faq.html#4">&ldquo;derived prefixes&rdquo;</a> and it seems this would work with DSpace if we wanted to go that route</li>
<li>CIFOR is starting to test aligning their metadata more with CGSpace/CG Core</li>
<li>They shared a <a href="https://data.cifor.org/dspace/xmlui/handle/11463/947?show=full">test item</a> which is using <code>cg.coverage.country</code>, <code>cg.subject.cifor</code>, <code>dc.subject</code>, and <code>dc.date.issued</code></li>
<li>Looking at their OAI I&rsquo;m not sure it has updated as I don&rsquo;t see the new fields: <a href="https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc///col_11463_6/900">https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc///col_11463_6/900</a></li>
<li>Maybe they need to make sure they are running the OAI cache refresh cron job, or maybe OAI doesn&rsquo;t export these?</li>
<li>I added <code>cg.subject.cifor</code> to the metadata registry and I&rsquo;m waiting for the harvester to re-harvest to see if it picks up more data now</li>
<li>Another possibility is that we could use a crosswalk&hellip; but I&rsquo;ve never done it.</li>
</ul>
<h2 id="2017-04-11">2017-04-11</h2>
<ul>
<li>Looking at the item from CIFOR it hasn&rsquo;t been updated yet, maybe they aren&rsquo;t running the cron job</li>
<li>I emailed Usman from CIFOR to ask if he&rsquo;s running the cron job</li>
</ul>
<h2 id="2017-04-12">2017-04-12</h2>
<ul>
<li>CIFOR says they have cleaned their OAI cache and that the cron job for OAI import is enabled</li>
<li>Now I see updated fields, like <code>dc.date.issued</code> but none from the CG or CIFOR namespaces</li>
<li>Also, DSpace Test hasn&rsquo;t re-harvested this item yet, so I will wait one more day before forcing a re-harvest</li>
<li>Looking at CIFOR&rsquo;s OAI using different metadata formats, like qualified Dublin Core and DSpace Intermediate Metadata:
<ul>
<li>QDC: <a href="https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=qdc///col_11463_6/900">https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=qdc///col_11463_6/900</a></li>
<li>DIM: <a href="https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=dim///col_11463_6/900">https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=dim///col_11463_6/900</a></li>
</ul>
</li>
<li>Looking at one of CGSpace&rsquo;s items in OAI it doesn&rsquo;t seem that metadata fields other than those in the DC schema are exported:
<ul>
<li><a href="https://cgspace.cgiar.org/handle/10568/33346?show=full">https://cgspace.cgiar.org/handle/10568/33346?show=full</a></li>
<li><a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=dim&amp;set=col_10568_68619">https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=dim&amp;set=col_10568_68619</a></li>
</ul>
</li>
<li>Side note: WTF, I just saw an item on CGSpace&rsquo;s OAI that is using <code>dc.cplace.country</code> and <code>dc.rplace.region</code>, which we stopped using in 2016 after the metadata migrations:</li>
</ul>
<p><img src="/cgspace-notes/2017/04/cplace.png" alt="stale metadata in OAI"></p>
<ul>
<li>The particular item is <a href="http://hdl.handle.net/10568/6">10568/6</a> and, for what it&rsquo;s worth, the stale metadata only appears in the OAI view:
<ul>
<li>XMLUI: <a href="https://cgspace.cgiar.org/handle/10568/6?show=full">https://cgspace.cgiar.org/handle/10568/6?show=full</a></li>
<li>OAI: <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6">https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6</a></li>
</ul>
</li>
<li>I don&rsquo;t see these fields anywhere in our source code or the database&rsquo;s metadata registry, so maybe it&rsquo;s just a cache issue</li>
<li>I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace</li>
<li>Running <code>dspace oai import</code> and <code>dspace oai clean-cache</code> has zero effect, but this seems to rebuild the cache from scratch:</li>
</ul>
<pre tabindex="0"><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
...
63900 items imported so far...
64000 items imported so far...
Total: 64056 items
Purging cached OAI responses.
OAI 2.0 manager action ended. It took 829 seconds.
</code></pre><ul>
<li>After reading some threads on the DSpace mailing list, I see that <code>clean-cache</code> is actually only for caching <em>responses</em>, ie to client requests in the OAI web application</li>
<li>These are stored in <code>[dspace]/var/oai/requests/</code></li>
<li>The import command should theoretically catch situations like this where an item&rsquo;s metadata was updated, but in this case we changed the metadata schema and it doesn&rsquo;t seem to catch it (could be a bug!)</li>
<li>Attempting a full rebuild of OAI on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
...
58700 items imported so far...
Total: 58789 items
Purging cached OAI responses.
OAI 2.0 manager action ended. It took 1032 seconds.
real 17m20.156s
user 4m35.293s
sys 1m29.310s
</code></pre><ul>
<li>Now the data for 10568/6 is correct in OAI: <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6">https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6</a></li>
<li>Perhaps I need to file a bug for this, or at least ask on the dspace-tech mailing list?</li>
<li>I wonder if we could use a crosswalk to convert to a format that CG Core wants, like <code>&lt;date Type=&quot;Available&quot;&gt;</code></li>
</ul>
<h2 id="2017-04-13">2017-04-13</h2>
<ul>
<li>Checking the <a href="https://dspacetest.cgiar.org/handle/11463/947?show=full">CIFOR item on DSpace Test</a>, it still doesn&rsquo;t have the new metadata</li>
<li>The collection status shows this message from the harvester:</li>
</ul>
<blockquote>
<p>Last Harvest Result: OAI server did not contain any updates on 2017-04-13 02:19:47.964</p>
</blockquote>
<ul>
<li>I don&rsquo;t know why there were no updates detected, so I will reset and reimport the collection</li>
<li>Usman has set up a custom crosswalk called <code>dimcg</code> that now shows CG and CIFOR metadata namespaces, but we can&rsquo;t use it because DSpace can only harvest DIM by default (from the harvesting user interface)</li>
<li>Also worth noting that the REST interface exposes all fields in the item, including CG and CIFOR fields: <a href="https://data.cifor.org/dspace/rest/items/944?expand=metadata">https://data.cifor.org/dspace/rest/items/944?expand=metadata</a></li>
<li>After re-importing the CIFOR collection it looks <em>very</em> good!</li>
<li>It seems like they have done a full metadata migration with <code>dc.date.issued</code> and <code>cg.coverage.country</code> etc</li>
<li>Submit pull request to upstream DSpace for the PDF thumbnail bug (DS-3516): <a href="https://github.com/DSpace/DSpace/pull/1709">https://github.com/DSpace/DSpace/pull/1709</a></li>
</ul>
<h2 id="2017-04-14">2017-04-14</h2>
<ul>
<li>DSpace committers reviewed my patch for DS-3516 and proposed a simpler idea involving incorrect use of <code>SelfRegisteredInputFormats</code></li>
<li>I tested the idea and it works, so I made a new patch: <a href="https://github.com/DSpace/DSpace/pull/1709">https://github.com/DSpace/DSpace/pull/1709</a></li>
<li>I discovered that we can override metadata formats in OAI by creating a new &ldquo;context&rdquo;: <a href="https://wiki.lyrasis.org/display/DSDOC5x/OAI+2.0+Server">https://wiki.lyrasis.org/display/DSDOC5x/OAI+2.0+Server</a></li>
<li>This allows us to have, say, a default &ldquo;request&rdquo; context and a &ldquo;cgiar&rdquo; context, both of which implement the DSpace Intermediate Metadata formats, but have the latter use an overridden version that exposes CG metadata</li>
<li>Compare the following results:
<ul>
<li><a href="https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6">https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6</a></li>
<li><a href="https://dspacetest.cgiar.org/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6">https://dspacetest.cgiar.org/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6</a></li>
</ul>
</li>
<li>Reboot DSpace Test server to get new Linode kernel</li>
</ul>
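<ul>
<li>A minimal Python sketch of how the two OAI contexts above map to request URLs, assuming the context name is simply the path segment after <code>/oai/</code>, as in the links being compared. The function name <code>oai_get_record_url</code> and its parameters are hypothetical, not part of DSpace:</li>
</ul>
<pre tabindex="0"><code>from urllib.parse import urlencode


def oai_get_record_url(base_url, context, host, handle, metadata_prefix="dim"):
    # The OAI context is the path segment after /oai/, e.g. "request" or "cgiar"
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": f"oai:{host}:{handle}",
        },
        # Keep ":" and "/" readable, matching the style of the URLs in these notes
        safe=":/",
    )
    return f"{base_url}/oai/{context}?{query}"


# Print the same record in the default and the custom context
for context in ("request", "cgiar"):
    print(oai_get_record_url("https://dspacetest.cgiar.org", context, "dspacetest.cgiar.org", "10568/6"))
</code></pre>
<ul>
<li>Fetching both URLs and diffing the returned fields would be the natural next step, but that part is left out here because it depends on the servers being reachable</li>
</ul>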
<h2 id="2017-04-17">2017-04-17</h2>
<ul>
<li>CIFOR has now implemented a new &ldquo;cgiar&rdquo; context in their OAI that exposes CG fields, so I am re-harvesting that to see how it looks in the Discovery sidebars and searches</li>
<li>See: <a href="https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947">https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947</a></li>
<li>One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see <code>harvester.autoStart</code> in <code>dspace/config/modules/oai.cfg</code>)</li>
<li>Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(435) is still referenced from table &#34;bundle&#34;.
</code></pre><h2 id="2017-04-18">2017-04-18</h2>
<ul>
<li>Helping Tsega test his new <a href="https://github.com/ilri/ckm-cgspace-rest-api">CGSpace REST API Rails app</a> on DSpace Test</li>
<li>Set up and run with:</li>
</ul>
<pre tabindex="0"><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
$ cd ckm-cgspace-rest-api/app
$ gem install bundler
$ bundle
$ cd ..
$ rails server
</code></pre><ul>
<li>I used Ansible to create a PostgreSQL user that only has <code>SELECT</code> privileges on the tables it needs:</li>
</ul>
<pre tabindex="0"><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a &#39;db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present&#39;
</code></pre><ul>
<li>Need to look into <a href="https://github.com/puma/puma/blob/master/docs/systemd.md">running this via systemd</a></li>
<li>This is interesting for creating runnable commands from <code>bundle</code>:</li>
</ul>
<pre tabindex="0"><code>$ bundle binstubs puma --path ./sbin
</code></pre><h2 id="2017-04-19">2017-04-19</h2>
<ul>
<li>Usman sent another link to their OAI interface, where the country names are now capitalized: <a href="https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947">https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947</a></li>
<li>Looking at the same item in XMLUI, the countries are not capitalized: <a href="https://data.cifor.org/dspace/xmlui/handle/11463/947?show=full">https://data.cifor.org/dspace/xmlui/handle/11463/947?show=full</a></li>
<li>So it seems he did it in the crosswalk!</li>
<li>Keep working on Ansible stuff for deploying the CKM REST API</li>
<li>We can use systemd&rsquo;s <code>Environment</code> stuff to pass the database parameters to Rails</li>
<li>Abenet noticed that the &ldquo;Workflow Statistics&rdquo; option is missing now, but we have screenshots from a presentation in 2016 when it was there</li>
<li>I filed a ticket with Atmire</li>
<li>Looking at 933 CIAT records from Sisay, he&rsquo;s having problems creating a SAF bundle to import to DSpace Test</li>
<li>I started by looking at his CSV in OpenRefine, and I see there are a <em>bunch</em> of fields with whitespace issues that I cleaned up:</li>
</ul>
<pre tabindex="0"><code>value.replace(&#34; ||&#34;,&#34;||&#34;).replace(&#34;|| &#34;,&#34;||&#34;).replace(&#34; || &#34;,&#34;||&#34;)
</code></pre><ul>
<li>Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:</li>
</ul>
<pre tabindex="0"><code>unescape(value,&#34;url&#34;)
</code></pre><ul>
<li>Then create the filename column using the following transform from URL:</li>
</ul>
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1].replace(/#.*$/,&#34;&#34;)
</code></pre><ul>
<li>The <code>replace</code> part is because some URLs have an anchor like <code>#page=14</code> which we obviously don&rsquo;t want on the filename</li>
<li>Also, we need to only use the PDF on the item corresponding with page 1, so we don&rsquo;t end up with literally hundreds of duplicate PDFs</li>
<li>Alternatively, I could export each page to a standalone PDF&hellip;</li>
</ul>
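<ul>
<li>For reference, a rough Python equivalent of the OpenRefine transforms above, in case the same cleanup is ever needed outside OpenRefine. This is only a sketch: the function names are made up, and the separator cleanup is slightly more aggressive than the GREL version because it strips any run of whitespace around <code>||</code>, not just a single space:</li>
</ul>
<pre tabindex="0"><code>import re
from urllib.parse import unquote


def normalize_separators(value):
    # Drop whitespace on either side of the "||" multi-value separator
    return re.sub(r"\s*\|\|\s*", "||", value)


def filename_from_url(url):
    # Decode URL escapes first, like unescape(value, "url") in OpenRefine
    decoded = unquote(url)
    # Keep only the last path segment, then drop any anchor such as "#page=14"
    last_segment = decoded.split("/")[-1]
    return re.sub(r"#.*$", "", last_segment)


# Hypothetical example values, not taken from the CIAT data
print(normalize_separators("BEANS || CASSAVA|| RICE"))
print(filename_from_url("https://example.org/docs/My%20Book.pdf#page=14"))
</code></pre>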
<h2 id="2017-04-20">2017-04-20</h2>
<ul>
<li>Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful</li>
<li>I re-enabled it with a hidden config key <code>workflow.stats.enabled = true</code> on DSpace Test and will evaluate adding it on CGSpace</li>
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
<li>Cleaning them up with OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace(/\|\|$/,&#34;&#34;)
</code></pre><ul>
<li>Working with the CIAT data in OpenRefine to blank the filename column for all but the first item that references a particular PDF: many items point to the same PDF, which would cause hundreds of duplicates to be added if we included them all in the SAF bundle</li>
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
</ul>
<p><img src="/cgspace-notes/2017/04/openrefine-flagging-duplicates.png" alt="Flagging and filtering duplicates in OpenRefine"></p>
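<ul>
<li>The same &ldquo;keep the filename only on the first item&rdquo; idea could be sketched in Python roughly like this. It assumes the rows are already in the desired order, so that the first row referencing a PDF is the one that should keep it, and the column name <code>filename</code> and the function name are hypothetical:</li>
</ul>
<pre tabindex="0"><code>def keep_first_filename(rows, column="filename"):
    # rows is a list of dicts, one per item; the input is not modified
    seen = set()
    result = []
    for row in rows:
        updated = dict(row)
        name = updated.get(column, "")
        if name and name in seen:
            # A previous item already carries this PDF, so blank the repeat
            updated[column] = ""
        elif name:
            seen.add(name)
        result.append(updated)
    return result


# Hypothetical rows: two chapters pointing at the same PDF
rows = [
    {"title": "Chapter 1", "filename": "book.pdf"},
    {"title": "Chapter 2", "filename": "book.pdf"},
    {"title": "Other report", "filename": "report.pdf"},
]
for row in keep_first_filename(rows):
    print(row["title"], "|", row["filename"])
</code></pre>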
<ul>
<li>Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace</li>
<li>Unbelievable, there are also metadata values like:</li>
</ul>
<pre tabindex="0"><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
</code></pre><ul>
<li>Add a description to the file names using:</li>
</ul>
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test import of 933 records:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
</code></pre><ul>
<li>Run system updates on CGSpace and reboot server</li>
<li>This includes switching nginx to using upstream with keepalive instead of direct <code>proxy_pass</code></li>
<li>Re-deploy CGSpace to latest <code>5_x-prod</code>, including the PABRA and RTB XMLUI themes, as well as the PDF processing and CMYK changes</li>
<li>More work on Ansible infrastructure stuff for Tsega&rsquo;s CKM DSpace REST API</li>
<li>I&rsquo;m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p &#34;ImageMagick PDF Thumbnail&#34; &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-22">2017-04-22</h2>
<ul>
<li>Someone on the dspace-tech mailing list responded with a suggestion about the foreign key violation in the <code>cleanup</code> task</li>
<li>The solution is to remove the ID (ie set to NULL) from the <code>primary_bitstream_id</code> column in the <code>bundle</code> table</li>
<li>After doing that and running the <code>cleanup</code> task again I find more bitstreams that are affected and end up with a long list of IDs that need to be fixed:</li>
</ul>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
</code></pre><h2 id="2017-04-24">2017-04-24</h2>
<ul>
<li>Two users mentioned some items they recently approved not showing up in the search / XMLUI</li>
<li>I looked at the logs from yesterday and it seems the Discovery indexing has been crashing:</li>
</ul>
<pre tabindex="0"><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
2017-04-24 00:00:15,586 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:331)
at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:315)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:803)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
at org.dspace.discovery.IndexClient.main(IndexClient.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
</code></pre><ul>
<li>Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:</li>
</ul>
<pre tabindex="0"><code># grep -c &#39;IndexWriter is closed&#39; [dspace]/log/dspace.log.2017-04-*
[dspace]/log/dspace.log.2017-04-01:0
[dspace]/log/dspace.log.2017-04-02:0
[dspace]/log/dspace.log.2017-04-03:0
[dspace]/log/dspace.log.2017-04-04:0
[dspace]/log/dspace.log.2017-04-05:0
[dspace]/log/dspace.log.2017-04-06:0
[dspace]/log/dspace.log.2017-04-07:0
[dspace]/log/dspace.log.2017-04-08:0
[dspace]/log/dspace.log.2017-04-09:0
[dspace]/log/dspace.log.2017-04-10:0
[dspace]/log/dspace.log.2017-04-11:0
[dspace]/log/dspace.log.2017-04-12:0
[dspace]/log/dspace.log.2017-04-13:0
[dspace]/log/dspace.log.2017-04-14:0
[dspace]/log/dspace.log.2017-04-15:0
[dspace]/log/dspace.log.2017-04-16:0
[dspace]/log/dspace.log.2017-04-17:0
[dspace]/log/dspace.log.2017-04-18:0
[dspace]/log/dspace.log.2017-04-19:0
[dspace]/log/dspace.log.2017-04-20:2293
[dspace]/log/dspace.log.2017-04-21:5992
[dspace]/log/dspace.log.2017-04-22:13278
[dspace]/log/dspace.log.2017-04-23:22720
[dspace]/log/dspace.log.2017-04-24:21422
</code></pre><ul>
<li>I restarted Tomcat and re-ran the discovery process manually:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace index-discovery
</code></pre><ul>
<li>Now everything is ok</li>
<li>Finally finished manually running the cleanup task over and over and null&rsquo;ing the conflicting IDs:</li>
</ul>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
</code></pre><ul>
<li>Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it&rsquo;s likely we haven&rsquo;t had a cleanup task complete successfully in years&hellip;</li>
</ul>
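<ul>
<li>Since the cleanup task only reports one conflicting ID per run, collecting them by hand is tedious. A small Python sketch that could help, assuming the error output of several runs has been saved into one text blob: it pulls the IDs out of the &ldquo;Key (bitstream_id)=(&hellip;)&rdquo; detail lines and builds the same kind of <code>update</code> statement used above. The function names are made up for illustration:</li>
</ul>
<pre tabindex="0"><code>import re

# Matches the detail line PostgreSQL prints for the foreign key violation
KEY_PATTERN = re.compile(r"Key \(bitstream_id\)=\((\d+)\)")


def conflicting_bitstream_ids(log_text):
    # Collect unique IDs in order of first appearance
    ids = []
    for match in KEY_PATTERN.finditer(log_text):
        bitstream_id = int(match.group(1))
        if bitstream_id not in ids:
            ids.append(bitstream_id)
    return ids


def build_update(ids):
    # Refuse to build a statement with an empty IN list, which would be invalid SQL
    if not ids:
        raise ValueError("no bitstream IDs given")
    id_list = ", ".join(str(i) for i in ids)
    return f"update bundle set primary_bitstream_id=NULL where primary_bitstream_id in ({id_list});"


sample = "Detail: Key (bitstream_id)=(435) is still referenced from table \"bundle\"."
print(build_update(conflicting_bitstream_ids(sample)))
</code></pre>
<ul>
<li>The generated statement should still be reviewed before running it against the database</li>
</ul>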
<h2 id="2017-04-25">2017-04-25</h2>
<ul>
<li>Finally finished running the PDF thumbnail re-processing on CGSpace, the final count of CMYK PDFs is about 2751</li>
<li>Preparing to run the cleanup task on CGSpace, I want to see how many files are in the assetstore:</li>
</ul>
<pre tabindex="0"><code># find [dspace]/assetstore/ -type f | wc -l
113104
</code></pre><ul>
<li>Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning, after finishing at 100% it has this error:</li>
</ul>
<pre tabindex="0"><code>[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
[==================================================&gt;]100% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.dspace.statistics.content.SpecifiedDSODatasetGenerator
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:254)
at org.dspace.statistics.content.StatisticsDisplay.&lt;init&gt;(SourceFile:203)
at com.atmire.statistics.display.StatisticsGraph.&lt;init&gt;(SourceFile:116)
at com.atmire.statistics.display.StatisticsGraphFactory.getStatisticsDisplay(SourceFile:25)
at com.atmire.statistics.display.StatisticsDisplayFactory.parseStatisticsDisplay(SourceFile:67)
at com.atmire.statistics.display.StatisticsDisplayFactory.getStatisticsDisplays(SourceFile:49)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:178)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:111)
at com.atmire.utils.ReportSender$ReportRunnable.run(SourceFile:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.SpecifiedDSODatasetGenerator
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1858)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1701)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.atmire.statistics.statlet.XmlParser.parsedatasetGenerator(SourceFile:299)
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:250)
... 13 more
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpaceObjectDatasetGenerator
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:254)
at org.dspace.statistics.content.StatisticsDisplay.&lt;init&gt;(SourceFile:203)
at com.atmire.statistics.display.StatisticsGraph.&lt;init&gt;(SourceFile:116)
at com.atmire.statistics.display.StatisticsGraphFactory.getStatisticsDisplay(SourceFile:25)
at com.atmire.statistics.display.StatisticsDisplayFactory.parseStatisticsDisplay(SourceFile:67)
at com.atmire.statistics.display.StatisticsDisplayFactory.getStatisticsDisplays(SourceFile:49)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:178)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:111)
at com.atmire.utils.ReportSender$ReportRunnable.run(SourceFile:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpaceObjectDatasetGenerator
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1858)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1701)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.atmire.statistics.statlet.XmlParser.parsedatasetGenerator(SourceFile:299)
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:250)
</code></pre><ul>
<li>Run system updates on DSpace Test and reboot the server (new Java 8 131)</li>
<li>Run the SQL cleanups on the bundle table on CGSpace and run the <code>[dspace]/bin/dspace cleanup</code> task</li>
<li>I will be interested to see the file count in the assetstore as well as the database size after the next backup (last backup size is 111MB)</li>
<li>Final file count after the cleanup task finished: 77843</li>
<li>So that is about 35,000 files removed (113104 - 77843 = 35261), and about 7GB</li>
<li>Add logging to the cleanup cron task</li>
</ul>
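<p>The before and after counts above come from <code>find [dspace]/assetstore/ -type f | wc -l</code>. As a rough sketch of the same check in Python, the helper below counts regular files under a directory and reports the difference between two counts. The names <code>count_files</code> and <code>files_removed</code> are illustrative only.</p>

```python
import os


def count_files(root):
    """Count regular files under root, similar to `find root -type f | wc -l`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so only regular files are counted, as -type f does.
            if os.path.isfile(path) and not os.path.islink(path):
                total += 1
    return total


def files_removed(before, after):
    """Return how many files disappeared between two counts."""
    return before - after


if __name__ == "__main__":
    # Counts recorded in the notes before and after the cleanup task.
    print(files_removed(113104, 77843))  # 35261
```

<p>Calling <code>count_files</code> on the assetstore path before and after a cleanup run would give the two inputs for <code>files_removed</code>.</p>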
<h2 id="2017-04-26">2017-04-26</h2>
<ul>
<li>The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though</li>
<li>Update RVM&rsquo;s Ruby from 2.3.0 to 2.4.0 on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
$ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
# ... reload shell to get new Ruby
$ gem install sass -v 3.3.14
$ gem install compass -v 1.0.3
</code></pre><ul>
<li>Help Tsega re-deploy the ckm-cgspace-rest-api on DSpace Test</li>
</ul>
</article>
</div> <!-- /.blog-main -->
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
</footer>
</body>
</html>