cgspace-notes/public/2016-02/index.html
Alan Orth b92dc95915
Update notes for 2016-02-15
Signed-off-by: Alan Orth <alan.orth@gmail.com>
2016-02-15 11:56:09 +02:00


<!DOCTYPE html>
<html lang="en-us">
<head prefix="og: http://ogp.me/ns#">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1" />
<meta property="og:title" content=" February, 2016 &middot; CGSpace Notes" />
<meta property="og:site_name" content="CGSpace Notes" />
<meta property="og:url" content="/cgspace-notes/2016-02/" />
<meta property="og:type" content="article" />
<meta property="og:article:published_time" content="2016-02-05T13:18:00&#43;03:00" />
<meta property="og:article:tag" content="notes" />
<title>
February, 2016 &middot; CGSpace Notes
</title>
<link rel="stylesheet" href="/cgspace-notes/css/bootstrap.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/main.css" />
<link rel="stylesheet" href="/cgspace-notes/css/font-awesome.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/github.css" />
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Source+Sans+Pro:200,300,400" type="text/css">
<link rel="shortcut icon" href="/cgspace-notes/images/favicon.ico" />
<link rel="apple-touch-icon" href="/cgspace-notes/images/apple-touch-icon.png" />
</head>
<body>
<header class="global-header" style="background-image:url(../images/bg.jpg )">
<section class="header-text">
<h1><a href="/cgspace-notes/">CGSpace Notes</a></h1>
<div class="sns-links hidden-print">
</div>
<a href="/cgspace-notes/" class="btn-header btn-back hidden-xs">
<i class="fa fa-angle-left" aria-hidden="true"></i>
&nbsp;Home
</a>
</section>
</header>
<main class="container">
<article>
<header>
<h1 class="text-primary">February, 2016</h1>
<div class="post-meta clearfix">
<div class="post-date pull-left">
Posted on
<time datetime="2016-02-05T13:18:00&#43;03:00">
Feb 5, 2016
</time>
</div>
<div class="pull-right">
<span class="post-tag small"><a href="/cgspace-notes//tags/notes">#notes</a></span>
</div>
</div>
</header>
<section>
<h2 id="2016-02-05:124a59adbaa8ef13e1518d003fc03981">2016-02-05</h2>
<ul>
<li>Looking at some DAGRIS data for Abenet Yabowork</li>
<li>Lots of issues with spaces, newlines, etc. causing the import to fail</li>
<li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li>
</ul>
<p><img src="../images/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p>
<ul>
<li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li>
<li>Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li>
</ul>
<h2 id="2016-02-06:124a59adbaa8ef13e1518d003fc03981">2016-02-06</h2>
<ul>
<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>
<pre><code>dspacetest=# select * from metadatafieldregistry;
</code></pre>
<ul>
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
</code></pre>
<ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
</code></pre>
<ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
</code></pre>
<ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
<li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 &ldquo;|||&rdquo; countries are still there</li>
<li>Maybe I need to do a full re-index&hellip;</li>
<li>Yep! The full re-index seems to work.</li>
<li>Process the empty countries on CGSpace</li>
</ul>
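<p>The lookup-then-delete steps above can be rehearsed on a throwaway database before touching DSpace. This is a minimal sketch, assuming an in-memory SQLite table that only mimics the four <code>metadatavalue</code> columns used here. The sample rows and the helper names are made up for illustration, and real DSpace runs on PostgreSQL, so treat it as a rehearsal rather than the actual procedure:</p>

```python
import sqlite3

COUNTRY_FIELD_ID = 78  # field ID from the notes; look it up in metadatafieldregistry
ITEM_TYPE_ID = 2       # resource type 2 means "item"

def find_empty_values(conn, field_id):
    """Return resource_ids of items whose value for field_id is empty or NULL."""
    rows = conn.execute(
        "SELECT resource_id FROM metadatavalue "
        "WHERE resource_type_id = ? AND metadata_field_id = ? "
        "AND (text_value = '' OR text_value IS NULL) "
        "ORDER BY resource_id",
        (ITEM_TYPE_ID, field_id),
    ).fetchall()
    return [row[0] for row in rows]

def delete_empty_values(conn, field_id):
    """Delete the empty or NULL values for field_id and return the number of rows removed."""
    cur = conn.execute(
        "DELETE FROM metadatavalue "
        "WHERE resource_type_id = ? AND metadata_field_id = ? "
        "AND (text_value = '' OR text_value IS NULL)",
        (ITEM_TYPE_ID, field_id),
    )
    conn.commit()
    return cur.rowcount

def make_demo_db():
    """Build a hypothetical in-memory stand-in for the metadatavalue table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadatavalue ("
        "resource_id INTEGER, resource_type_id INTEGER, "
        "metadata_field_id INTEGER, text_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadatavalue VALUES (?, ?, ?, ?)",
        [
            (22678, 2, 78, ""),       # empty country
            (22679, 2, 78, None),     # NULL country
            (22680, 2, 78, "KENYA"),  # valid country
            (22681, 2, 64, ""),       # empty value in a different field
        ],
    )
    return conn
```

<p>Unlike the <code>delete</code> statement in the notes, this sketch also matches NULL values and restricts itself to items, so the select and the delete cover exactly the same rows.</p>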
<h2 id="2016-02-07:124a59adbaa8ef13e1518d003fc03981">2016-02-07</h2>
<ul>
<li>Working on cleaning up Abenet&rsquo;s DAGRIS data with OpenRefine</li>
<li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape(&quot;javascript&quot;)</code>; the latter shows whitespace characters like <code>\r\n</code>!</li>
<li>For some reason, when you import an Excel file into OpenRefine it exports dates like 1949 as 1949.0 in the CSV</li>
<li>I re-import the resulting CSV and run a GREL expression on the date issued column: <code>value.replace(&quot;\.0&quot;, &quot;&quot;)</code></li>
<li>I need to start running DSpace on Mac OS X instead of in a Linux VM</li>
<li>Install PostgreSQL from Homebrew, then configure it and import the CGSpace database dump:</li>
</ul>
<pre><code>$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql postgres
postgres=# alter user dspacetest createuser;
postgres=# \q
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup
$ psql postgres
postgres=# alter user dspacetest nocreateuser;
postgres=# \q
$ vacuumdb dspacetest
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
</code></pre>
<ul>
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li>
</ul>
<pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
$ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai
$ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr
$ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
</code></pre>
<ul>
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>
<pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
</code></pre>
<ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace index-discovery -b
</code></pre>
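<p>Going back to the OpenRefine cleanup at the start of this day's notes: the same three operations can be approximated in plain Python when a CSV needs fixing outside OpenRefine. This is a rough sketch, and the function names are made up for illustration. The year fix here is anchored to the end of the value, which is a slightly stricter choice than the GREL expression in the notes:</p>

```python
import re

def clean_cell(value):
    """Trim surrounding whitespace, like GREL's value.trim()."""
    return value.strip()

def fix_year(value):
    """Drop a trailing '.0' that the spreadsheet import adds to whole-number years."""
    return re.sub(r"\.0$", "", value.strip())

def show_whitespace(value):
    """Make control characters visible, roughly like value.escape("javascript")."""
    return value.encode("unicode_escape").decode("ascii")
```

<p>For example, <code>fix_year("1949.0")</code> gives <code>1949</code> and leaves a value such as <code>1949</code> unchanged.</p>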
<h2 id="2016-02-08:124a59adbaa8ef13e1518d003fc03981">2016-02-08</h2>
<ul>
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme&rsquo;s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
</ul>
<p><img src="../images/2016/02/submit-button-ilri.png" alt="ILRI submission buttons" />
<img src="../images/2016/02/submit-button-drylands.png" alt="Drylands submission buttons" /></p>
<h2 id="2016-02-09:124a59adbaa8ef13e1518d003fc03981">2016-02-09</h2>
<ul>
<li>Re-sync DSpace Test with CGSpace</li>
<li>Help Sisay with OpenRefine</li>
<li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li>
</ul>
<pre><code>$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
# add port 443 to firewall rules
$ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org
$ sudo service nginx start
$ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass
</code></pre>
<ul>
<li>We should install it in /opt/letsencrypt and then script the renewal, but first we have to wire up some variables and template stuff based on the script here: <a href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li>
<li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs&hellip;</li>
<li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li>
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li>
</ul>
<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre>
<ul>
<li>or</li>
</ul>
<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre>
<ul>
<li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
</ul>
<pre><code># free -m
             total       used       free     shared    buffers     cached
Mem:          3950       3902         48          9         37       1311
-/+ buffers/cache:       2552       1397
Swap:          255         57        198
</code></pre>
<ul>
<li>So I&rsquo;ll bump up the Tomcat heap to 2048m (CGSpace production server is using 3GB)</li>
</ul>
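<p>The authority cleanup from earlier in this entry can also be done outside OpenRefine. This is a minimal sketch that reuses the same regular expression as the GREL expression. The function name and the sample values in the usage note are made up for illustration:</p>

```python
import re

# Same pattern as the GREL expression: "::", a UUID-shaped token, then "::600"
AUTHORITY_SUFFIX = re.compile(r"::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600")

def strip_authority(value):
    """Remove every authority UUID and confidence suffix from an author cell."""
    return AUTHORITY_SUFFIX.sub("", value)
```

<p>Calling <code>strip_authority("Doe, Jane::0a1b2c3d-1111-2222-3333-0123456789ab::600")</code> returns <code>Doe, Jane</code>. Like the GREL version, it only handles a confidence value of 600, so other confidence values would need a broader pattern.</p>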
<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>
<ul>
<li>Massaging some CIAT data in OpenRefine</li>
<li>There are 1200 records that have PDFs and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre><code>value.split('/')[-1]
</code></pre>
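<p>The same transform is easy to reproduce in a script that reads the CSV. This is a small sketch, assuming the URLs are ordinary HTTP links. The function name and the example URL in the usage note are hypothetical. Unlike the plain split, it ignores any query string or fragment after the path:</p>

```python
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    return urlsplit(url).path.split("/")[-1]
```

<p>For example, <code>filename_from_url("https://example.org/files/64661.pdf?download=1")</code> returns <code>64661.pdf</code>. A URL that ends in a slash yields an empty string, which is worth checking for before downloading.</p>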
<ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>
<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&gt; Downloading 64195.pdf
&gt; Creating thumbnail for 64195.pdf
</code></pre>
<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
<ul>
<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
<li>A few items are using the same exact PDF</li>
<li>A few items are using HTM or DOC files</li>
<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or ResearchGate</li>
<li>A few items have no file</li>
<li>Also, if we import these items, I&rsquo;m not sure whether we will remove the <code>dc.identifier.url</code> field from the records</li>
</ul>
<h2 id="2016-02-12-1:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
<ul>
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on SlideShare, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre><code>$ ls | grep -c -E &quot;%&quot;
265
</code></pre>
<ul>
<li>I suggest that we import the ~850 clean ones first, then do the rest once I find a clean, reliable way to decode the filenames</li>
<li>This Python 2 snippet seems to work in the CLI, but not so well in OpenRefine:</li>
</ul>
<pre><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre>
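<p>In Python 3 the <code>unquote</code> function lives in <code>urllib.parse</code> and decodes percent-escapes as UTF-8 by default. This is a sketch of the same decoding as a reusable function, plus a check for which names need it. Both function names are made up for illustration:</p>

```python
from urllib.parse import unquote

def decode_filename(name):
    """Decode percent-encoded bytes as UTF-8, leaving other characters untouched."""
    return unquote(name, encoding="utf-8")

def needs_decoding(name):
    """True if the name changes when decoded, i.e. it contains percent-escapes."""
    return decode_filename(name) != name
```

<p>Running <code>decode_filename</code> over only the names where <code>needs_decoding</code> is true would leave the clean filenames alone.</p>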
<ul>
<li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li>
<li>They will be deployed on CGSpace the next time I re-deploy</li>
</ul>
</section>
<footer>
<section class="author-info row">
<div class="author-avatar col-md-2">
</div>
<div class="author-meta col-md-6">
<h1 class="author-name text-primary">Alan Orth</h1>
</div>
</section>
<ul class="pager">
<li class="previous"><a href="/cgspace-notes/2016-01/"><span aria-hidden="true">&larr;</span> Older</a></li>
<li class="next disabled"><a href="#">Newer <span aria-hidden="true">&rarr;</span></a></li>
</ul>
</footer>
</article>
</main>
<footer class="container global-footer">
<div class="copyright-note pull-left">
</div>
<div class="sns-links hidden-print">
</div>
</footer>
<script src="/cgspace-notes/js/highlight.pack.js"></script>
<script>
hljs.initHighlightingOnLoad();
</script>
</body>
</html>