mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-05 06:43:00 +01:00
<!DOCTYPE html>
<html lang="en-us">
<head prefix="og: http://ogp.me/ns#">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1" />
<meta property="og:title" content=" February, 2016 · CGSpace Notes" />
<meta property="og:site_name" content="CGSpace Notes" />
<meta property="og:url" content="/cgspace-notes/2016-02/" />
<meta property="og:type" content="article" />
<meta property="og:article:published_time" content="2016-02-05T13:18:00+03:00" />
<meta property="og:article:tag" content="notes" />
<title>
February, 2016 · CGSpace Notes
</title>
<link rel="stylesheet" href="/cgspace-notes/css/bootstrap.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/main.css" />
<link rel="stylesheet" href="/cgspace-notes/css/font-awesome.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/github.css" />
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Source+Sans+Pro:200,300,400" type="text/css">
<link rel="shortcut icon" href="/cgspace-notes/images/favicon.ico" />
<link rel="apple-touch-icon" href="/cgspace-notes/images/apple-touch-icon.png" />
</head>
<body>
<header class="global-header" style="background-image:url(../images/bg.jpg )">
<section class="header-text">
<h1><a href="/cgspace-notes/">CGSpace Notes</a></h1>
<div class="sns-links hidden-print">
</div>
<a href="/cgspace-notes/" class="btn-header btn-back hidden-xs">
<i class="fa fa-angle-left" aria-hidden="true"></i>
Home
</a>
</section>
</header>
<main class="container">
<article>
<header>
<h1 class="text-primary">February, 2016</h1>
<div class="post-meta clearfix">
<div class="post-date pull-left">
Posted on
<time datetime="2016-02-05T13:18:00+03:00">
Feb 5, 2016
</time>
</div>
<div class="pull-right">
<span class="post-tag small"><a href="/cgspace-notes//tags/notes">#notes</a></span>
</div>
</div>
</header>
<section>
<h2 id="2016-02-05:124a59adbaa8ef13e1518d003fc03981">2016-02-05</h2>
<ul>
<li>Looking at some DAGRIS data for Abenet Yabowork</li>
<li>Lots of issues with spaces, newlines, etc. causing the import to fail</li>
<li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li>
</ul>
<p><img src="../images/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p>
<ul>
<li>Not only are there 49,000 countries, we have some blanks (25)…</li>
<li>Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”</li>
</ul>
<h2 id="2016-02-06:124a59adbaa8ef13e1518d003fc03981">2016-02-06</h2>
<ul>
<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>
<pre><code>dspacetest=# select * from metadatafieldregistry;
</code></pre>
<ul>
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
</code></pre>
<ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
</code></pre>
<ul>
<li>It’s 25 items so editing in the web UI is annoying, let’s try SQL!</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
</code></pre>
<ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice…</li>
<li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 “|||” countries are still there</li>
<li>Maybe I need to do a full re-index…</li>
<li>Yep! The full re-index seems to work.</li>
<li>Process the empty countries on CGSpace</li>
</ul>
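<ul>
<li>A minimal Python sketch of the null/empty check and clean-up from this entry, run against an in-memory SQLite table rather than the real PostgreSQL database. The table is a cut-down, hypothetical stand-in for <code>metadatavalue</code> and the rows are made up. Unlike the <code>delete</code> in this entry, the sketch also removes NULL values, which is an assumption about what would be wanted:</li>
</ul>
<pre><code>import sqlite3

# Hypothetical stand-in for DSpace's metadatavalue table (SQLite, not PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadatavalue ("
    "resource_id INTEGER, resource_type_id INTEGER, "
    "metadata_field_id INTEGER, text_value TEXT)"
)
rows = [
    (1, 2, 78, "KENYA"),  # a real value, must survive
    (2, 2, 78, ""),       # empty string
    (3, 2, 78, None),     # NULL
    (4, 2, 64, ""),       # empty, but a different field, so left alone
]
conn.executemany("INSERT INTO metadatavalue VALUES (?, ?, ?, ?)", rows)

# Field 78 is the country field in these notes; match both '' and NULL.
EMPTY = "metadata_field_id = 78 AND (text_value = '' OR text_value IS NULL)"

# Items (resource type 2) with a blank country value.
blank_ids = [row[0] for row in conn.execute(
    "SELECT resource_id FROM metadatavalue "
    "WHERE resource_type_id = 2 AND " + EMPTY + " ORDER BY resource_id"
)]
print(blank_ids)

# Delete the blank values and report how many rows were removed and how many remain.
deleted = conn.execute("DELETE FROM metadatavalue WHERE " + EMPTY).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM metadatavalue").fetchone()[0]
print(deleted, remaining)
</code></pre>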
<h2 id="2016-02-07:124a59adbaa8ef13e1518d003fc03981">2016-02-07</h2>
<ul>
<li>Working on cleaning up Abenet’s DAGRIS data with OpenRefine</li>
<li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape("javascript")</code>, which shows whitespace characters like <code>\r\n</code>!</li>
<li>For some reason when you import an Excel file into OpenRefine it exports dates like 1949 as 1949.0 in the CSV</li>
<li>I re-import the resulting CSV and run a GREL on the date issued column: <code>value.replace("\.0", "")</code></li>
<li>I need to start running DSpace in Mac OS X instead of a Linux VM</li>
<li>Install PostgreSQL from Homebrew, then configure and import the CGSpace database dump:</li>
</ul>
<pre><code>$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql postgres
postgres=# alter user dspacetest createuser;
postgres=# \q
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup
$ psql postgres
postgres=# alter user dspacetest nocreateuser;
postgres=# \q
$ vacuumdb dspacetest
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
</code></pre>
<ul>
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat’s webapps folder:</li>
</ul>
<pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
$ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai
$ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr
$ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
</code></pre>
<ul>
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>
<pre><code>CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
</code></pre>
<ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace index-discovery -b
</code></pre>
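<ul>
<li>For reference, the OpenRefine clean-up described at the start of this entry could be sketched in Python roughly as follows. The function names are made up for illustration, and the sketch assumes only a trailing <code>.0</code> should be dropped from a year, which is narrower than a plain replace:</li>
</ul>
<pre><code>def show_whitespace(value):
    # Make control characters such as carriage returns and newlines visible,
    # similar in spirit to value.escape("javascript") in OpenRefine.
    return value.encode("unicode_escape").decode("ascii")


def clean_year(value):
    # Trim surrounding whitespace (like value.trim()), then drop a trailing
    # ".0" of the kind left behind by the Excel import.
    value = value.strip()
    if value.endswith(".0"):
        value = value[:-2]
    return value


print(show_whitespace("Ethiopia\r\n"))
print(clean_year(" 1949.0\r\n"))
</code></pre>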
<h2 id="2016-02-08:124a59adbaa8ef13e1518d003fc03981">2016-02-08</h2>
<ul>
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme’s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
</ul>
<p><img src="../images/2016/02/submit-button-ilri.png" alt="ILRI submission buttons" />
<img src="../images/2016/02/submit-button-drylands.png" alt="Drylands submission buttons" /></p>
<h2 id="2016-02-09:124a59adbaa8ef13e1518d003fc03981">2016-02-09</h2>
<ul>
<li>Re-sync DSpace Test with CGSpace</li>
<li>Help Sisay with OpenRefine</li>
<li>Enable HTTPS on DSpace Test using Let’s Encrypt:</li>
</ul>
<pre><code>$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
# add port 443 to firewall rules
$ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org
$ sudo service nginx start
$ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass
</code></pre>
<ul>
<li>We should install it in /opt/letsencrypt and then script the renewal, but first we have to wire up some variables and template stuff based on the script here: <a href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li>
<li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs…</li>
<li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")</code></li>
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don’t always show anything right when it fails, but eventually one of these appears:</li>
</ul>
<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre>
<ul>
<li>or</li>
</ul>
<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre>
<ul>
<li>Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
</ul>
<pre><code># free -m
             total       used       free     shared    buffers     cached
Mem:          3950       3902         48          9         37       1311
-/+ buffers/cache:       2552       1397
Swap:          255         57        198
</code></pre>
<ul>
<li>So I’ll bump up the Tomcat heap to 2048m (CGSpace production server is using 3GB)</li>
</ul>
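<ul>
<li>Outside OpenRefine, the authority clean-up from this entry could be sketched in Python with the same pattern. The function name and the sample author values and UUIDs are made up for illustration:</li>
</ul>
<pre><code>import re

# Same pattern as the GREL expression: "::", a UUID-shaped run, then "::600".
AUTHORITY_SUFFIX = re.compile(r"::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600")


def strip_authority(value):
    # Remove every authority suffix found in the metadata value.
    return AUTHORITY_SUFFIX.sub("", value)


# Made-up author name and UUID, for illustration only.
sample = "Doe, Jane::0b4a1c2d-1234-5678-9abc-def012345678::600"
print(strip_authority(sample))
</code></pre>
<ul>
<li>Values without an authority suffix pass through unchanged, so the function should be safe to apply to a whole column.</li>
</ul>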
<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>
<ul>
<li>Massaging some CIAT data in OpenRefine</li>
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre><code>value.split('/')[-1]
</code></pre>
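<ul>
<li>A rough Python equivalent of that transform, as a sketch only. The function name and the example URL are hypothetical:</li>
</ul>
<pre><code>def filename_from_url(url):
    # Same idea as the GREL transform: keep whatever follows the last "/".
    return url.split("/")[-1]


# Hypothetical URL, for illustration only.
print(filename_from_url("http://example.org/files/64661.pdf"))
</code></pre>
<ul>
<li>A URL that ends in a slash yields an empty string, so such rows would need checking by hand.</li>
</ul>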
<ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>
<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
> Downloading 64661.pdf
> Creating thumbnail for 64661.pdf
Processing 64195.pdf
> Downloading 64195.pdf
> Creating thumbnail for 64195.pdf
</code></pre>
<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
<ul>
<li>Looking at CIAT’s records again, there are some problems with a dozen or so files (out of 1200)</li>
<li>A few items are using the exact same PDF</li>
<li>A few items are using HTM or DOC files</li>
<li>A few items link to PDFs on IFPRI’s e-Library or ResearchGate</li>
<li>A few items have no item</li>
<li>Also, I’m not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li>
</ul>
<h2 id="2016-02-12-1:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
<ul>
<li>Looking at CIAT’s records again, there are some files linking to PDFs on SlideShare, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre><code>$ ls | grep -c -E "%"
265
</code></pre>
<ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
</ul>
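<ul>
<li>One possible way to decode those filenames is Python’s standard <code>urllib.parse.unquote</code>. This is a minimal sketch, not a tested fix for the CIAT files: the function name and the sample filenames are made up, and it assumes the names were percent-encoded as UTF-8:</li>
</ul>
<pre><code>from urllib.parse import unquote


def decode_filename(name):
    # Decode percent-escapes such as %20; names without "%" come back unchanged.
    return unquote(name)


# Made-up filenames, for illustration only.
for name in ["annual%20report%202015.pdf", "64661.pdf"]:
    print(name, "becomes", decode_filename(name))
</code></pre>
<ul>
<li>Decoded names can contain spaces or collide with existing files, so they would still need a check before renaming anything.</li>
</ul>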
</section>
<footer>
<section class="author-info row">
<div class="author-avatar col-md-2">
</div>
<div class="author-meta col-md-6">
<h1 class="author-name text-primary">Alan Orth</h1>
</div>
</section>
<ul class="pager">
<li class="previous"><a href="/cgspace-notes/2016-01/"><span aria-hidden="true">←</span> Older</a></li>
<li class="next disabled"><a href="#">Newer <span aria-hidden="true">→</span></a></li>
</ul>
</footer>
</article>
</main>
<footer class="container global-footer">
<div class="copyright-note pull-left">
</div>
<div class="sns-links hidden-print">
</div>
</footer>
<script src="/cgspace-notes/js/highlight.pack.js"></script>
<script>
hljs.initHighlightingOnLoad();
</script>
</body>
</html>