cgspace-notes/public/2016-02/index.html

<!DOCTYPE html>
<html lang="en-us">
<head prefix="og: http://ogp.me/ns#">
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1" />
  <meta property="og:title" content=" February, 2016 &middot;  CGSpace Notes" />
  
  <meta property="og:site_name" content="CGSpace Notes" />
  <meta property="og:url" content="/cgspace-notes/2016-02/" />
  
  
  <meta property="og:type" content="article" />
  
  <meta property="og:article:published_time" content="2016-02-05T13:18:00&#43;03:00" />
  
  <meta property="og:article:tag" content="notes" />
  
  
  <title>
     February, 2016 &middot;  CGSpace Notes
  </title>

  <link rel="stylesheet" href="/cgspace-notes/css/bootstrap.min.css" />
  <link rel="stylesheet" href="/cgspace-notes/css/main.css" />
  <link rel="stylesheet" href="/cgspace-notes/css/font-awesome.min.css" />
  <link rel="stylesheet" href="/cgspace-notes/css/github.css" />
  <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Source+Sans+Pro:200,300,400" type="text/css">
  <link rel="shortcut icon" href="/cgspace-notes/images/favicon.ico" />
  <link rel="apple-touch-icon" href="/cgspace-notes/images/apple-touch-icon.png" />
  
</head>
<body>
    <header class="global-header"  style="background-image:url(../images/bg.jpg )">
    <section class="header-text">
      <h1><a href="/cgspace-notes/">CGSpace Notes</a></h1>
      
      <div class="sns-links hidden-print">
  
  
</div>

      
      <a href="/cgspace-notes/" class="btn-header btn-back hidden-xs">
        <i class="fa fa-angle-left" aria-hidden="true"></i>
        &nbsp;Home
      </a>
      
      
    </section>
  </header>
  <main class="container">


<article>
  <header>
    <h1 class="text-primary">February, 2016</h1>
    <div class="post-meta clearfix">
      <div class="post-date pull-left">
        Posted on
        <time datetime="2016-02-05T13:18:00&#43;03:00">
          Feb 5, 2016
        </time>
      </div>
      <div class="pull-right">
        
        <span class="post-tag small"><a href="/cgspace-notes//tags/notes">#notes</a></span>
        
      </div>
    </div>
  </header>
  <section>
    

<h2 id="2016-02-05:124a59adbaa8ef13e1518d003fc03981">2016-02-05</h2>

<ul>
<li>Looking at some DAGRIS data for Abenet Yabowork</li>
<li>Lots of issues with spaces, newlines, etc. causing the import to fail</li>
<li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li>
</ul>

<p><img src="../images/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p>

<ul>
<li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li>
<li>Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li>
</ul>

<h2 id="2016-02-06:124a59adbaa8ef13e1518d003fc03981">2016-02-06</h2>

<ul>
<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>

<pre><code>dspacetest=# select * from metadatafieldregistry;
</code></pre>

<ul>
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>

<pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
</code></pre>

<ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>

<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
</code></pre>

<ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>

<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
</code></pre>

<ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
<li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 &ldquo;|||&rdquo; countries are still there</li>
<li>Maybe I need to do a full re-index&hellip;</li>
<li>Yep! The full re-index seems to work.</li>
<li>Process the empty countries on CGSpace</li>
</ul>
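
<ul>
<li>The select-then-delete steps above can be sketched in Python. This is only an illustration, assuming a toy in-memory SQLite table that mimics the few <code>metadatavalue</code> columns used in the queries; it is not the real DSpace schema or a PostgreSQL connection. Unlike the <code>delete</code> above, which matched only empty strings, this variant also removes NULL values and restricts itself to items (<code>resource_type_id=2</code>)</li>
</ul>

```python
import sqlite3

# Toy stand-in for DSpace's metadatavalue table (assumption: only the
# columns used in the queries above, not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadatavalue ("
    "resource_id INTEGER, resource_type_id INTEGER, "
    "metadata_field_id INTEGER, text_value TEXT)"
)
rows = [
    (22678, 2, 78, ""),       # empty country
    (22679, 2, 78, None),     # NULL country
    (22680, 2, 78, "KENYA"),  # real value, must survive
    (22681, 2, 64, ""),       # empty value in a different field
]
conn.executemany("INSERT INTO metadatavalue VALUES (?, ?, ?, ?)", rows)

FIELD_ID = 78  # the country field found in metadatafieldregistry
ITEM = 2       # resource_type_id for items

BLANK = (
    "WHERE resource_type_id = ? AND metadata_field_id = ? "
    "AND (text_value = '' OR text_value IS NULL)"
)


def blank_items(conn):
    """Return the resource_ids of items whose field value is empty or NULL."""
    cur = conn.execute(
        "SELECT resource_id FROM metadatavalue " + BLANK + " ORDER BY resource_id",
        (ITEM, FIELD_ID),
    )
    return [row[0] for row in cur]


def delete_blanks(conn):
    """Delete the empty/NULL values and return how many rows were removed."""
    cur = conn.execute("DELETE FROM metadatavalue " + BLANK, (ITEM, FIELD_ID))
    return cur.rowcount


before = blank_items(conn)
deleted = delete_blanks(conn)
after = blank_items(conn)
print(before, deleted, after)
```

<ul>
<li>Checking the list of affected IDs before deleting makes it easy to confirm that the row count reported by the delete matches what was expected</li>
</ul>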

<h2 id="2016-02-07:124a59adbaa8ef13e1518d003fc03981">2016-02-07</h2>

<ul>
<li>Working on cleaning up Abenet&rsquo;s DAGRIS data with OpenRefine</li>
<li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape(&quot;javascript&quot;)</code>; the latter shows whitespace characters like <code>\r\n</code>!</li>
<li>For some reason, when you import an Excel file into OpenRefine, dates like 1949 are exported as 1949.0 in the CSV</li>
<li>I re-import the resulting CSV and run a GREL expression on the date issued column: <code>value.replace(&quot;\.0&quot;, &quot;&quot;)</code></li>
<li>I need to start running DSpace on Mac OS X instead of in a Linux VM</li>
<li>Install PostgreSQL from Homebrew, then configure it and import the CGSpace database dump:</li>
</ul>

<pre><code>$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql postgres
postgres=# alter user dspacetest createuser;
postgres=# \q
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup 
$ psql postgres
postgres=# alter user dspacetest nocreateuser;
postgres=# \q
$ vacuumdb dspacetest
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
</code></pre>

<ul>
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li>
</ul>

<pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
$ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai
$ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr
$ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
</code></pre>

<ul>
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>

<pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
</code></pre>

<ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>

<pre><code>$ ~/dspace/bin/dspace index-discovery -b
</code></pre>
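
<ul>
<li>For the DAGRIS cleanup noted earlier in this entry, here is a rough Python analogue of the OpenRefine steps. It is a sketch, not what OpenRefine does internally: <code>show_whitespace</code> and <code>clean_year</code> are hypothetical helper names, and the date fix is anchored to the end of the value so that only a trailing <code>.0</code> is removed</li>
</ul>

```python
import re


def show_whitespace(value):
    """Make control characters such as \\r and \\n visible, roughly like
    value.escape("javascript"). Note: non-ASCII characters are escaped too."""
    return value.encode("unicode_escape").decode("ascii")


def clean_year(value):
    """Trim surrounding whitespace, then drop a trailing ".0" such as the
    one the Excel import added to years."""
    return re.sub(r"\.0$", "", value.strip())


print(show_whitespace("Boran \r\n"))
print(clean_year(" 1949.0\r\n"))
```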

<h2 id="2016-02-08:124a59adbaa8ef13e1518d003fc03981">2016-02-08</h2>

<ul>
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme&rsquo;s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
</ul>

<p><img src="../images/2016/02/submit-button-ilri.png" alt="ILRI submission buttons" />
<img src="../images/2016/02/submit-button-drylands.png" alt="Drylands submission buttons" /></p>

<h2 id="2016-02-09:124a59adbaa8ef13e1518d003fc03981">2016-02-09</h2>

<ul>
<li>Re-sync DSpace Test with CGSpace</li>
<li>Help Sisay with OpenRefine</li>
<li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li>
</ul>

<pre><code>$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
# add port 443 to firewall rules
$ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org
$ sudo service nginx start
$ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass
</code></pre>

<ul>
<li>We should install it in /opt/letsencrypt and then script the renewal, but first we have to wire up some variables and templates based on the script here: <a href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li>
<li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs&hellip;</li>
<li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li>
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li>
</ul>

<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre>

<ul>
<li>or</li>
</ul>

<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre>

<ul>
<li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM (about 1.4GB once buffers and cache are counted as free):</li>
</ul>

<pre><code># free -m
             total       used       free     shared    buffers     cached
Mem:          3950       3902         48          9         37       1311
-/+ buffers/cache:       2552       1397
Swap:          255         57        198
</code></pre>

<ul>
<li>So I&rsquo;ll bump up the Tomcat heap to 2048m (the CGSpace production server is using 3GB)</li>
</ul>
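
<ul>
<li>The authority cleanup from earlier in this entry can also be sketched in Python, using the same pattern as the GREL expression. The author names and UUIDs below are made up for illustration, and <code>strip_authority</code> is a hypothetical helper name</li>
</ul>

```python
import re

# Same pattern as the GREL expression: "::<uuid>::600" appended to a value.
AUTHORITY_SUFFIX = re.compile(r"::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600")


def strip_authority(value):
    """Remove every authority UUID/confidence suffix from a metadata value."""
    return AUTHORITY_SUFFIX.sub("", value)


# Made-up example values; real exports may separate multiple authors with "||".
single = "Orth, Alan::0a1b2c3d-1111-2222-3333-444455556666::600"
multi = single + "||Doe, Jane::99999999-aaaa-bbbb-cccc-dddddddddddd::600"

print(strip_authority(single))
print(strip_authority(multi))
```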

<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>

<ul>
<li>Massaging some CIAT data in OpenRefine</li>
<li>There are 1200 records that have PDFs and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>

<pre><code>value.split('/')[-1]
</code></pre>

<ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>

<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&gt; Downloading 64195.pdf
&gt; Creating thumbnail for 64195.pdf
</code></pre>

<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>

<ul>
<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
<li>A few items are using the same exact PDF</li>
<li>A few items are using HTM or DOC files</li>
<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or ResearchGate</li>
<li>A few items have no file</li>
<li>Also, I&rsquo;m not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li>
</ul>

<h2 id="2016-02-12-1:124a59adbaa8ef13e1518d003fc03981">2016-02-15</h2>

<ul>
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on SlideShare, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>

<pre><code>$ ls | grep -c -E &quot;%&quot;
265
</code></pre>

<ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
<li>This Python 2 snippet seems to work in the CLI, but not so well in OpenRefine:</li>
</ul>

<pre><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre>
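
<ul>
<li>For reference, a Python 3 version of the same decoding, as a minimal sketch: in Python 3 the function lives in <code>urllib.parse</code> and decodes the percent-escapes as UTF-8 by default. <code>decode_filename</code> is a hypothetical helper name</li>
</ul>

```python
from urllib.parse import unquote


def decode_filename(name):
    """Decode %XX escapes in a filename (UTF-8 is unquote's default)."""
    return unquote(name)


encoded = (
    "CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_"
    "y_cultivo_de_protoplastos_de_yuca.pdf"
)
print(decode_filename(encoded))
```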

<ul>
<li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li>
<li>They will be deployed on CGSpace the next time I re-deploy</li>
</ul>

<h2 id="2016-02-16:124a59adbaa8ef13e1518d003fc03981">2016-02-16</h2>

<ul>
<li>Turns out OpenRefine has an unescape function!</li>
</ul>

<pre><code>value.unescape(&quot;url&quot;)
</code></pre>

<ul>
<li>This turns the URLs into human-readable versions that we can use as proper filenames</li>
<li>Run web server and system updates on DSpace Test and reboot</li>
<li>To merge <code>dc.identifier.url</code> and <code>dc.identifier.url[]</code>, rename the second column so it doesn&rsquo;t have the brackets, like <code>dc.identifier.url2</code></li>
<li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between</li>
<li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li>
<li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li>
</ul>
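
<ul>
<li>The column merge and the filename transform above could look roughly like this in Python. This is a sketch under assumptions: the URLs are made-up examples, and <code>merge_values</code> and <code>filenames_from_urls</code> are hypothetical helper names, not part of OpenRefine or the download script</li>
</ul>

```python
SEP = "||"  # separator used for multi-value fields in the notes above


def merge_values(first, second, sep=SEP):
    """Join the non-blank values of two columns with the separator."""
    return sep.join(v for v in (first, second) if v)


def filenames_from_urls(value, sep=SEP):
    """Keep only the last path segment of each URL in a multi-value field."""
    return sep.join(url.split("/")[-1] for url in value.split(sep))


# Made-up URLs for illustration only.
url_a = "http://example.org/docs/64661.pdf"
url_b = "http://example.org/docs/64195.pdf"

merged = merge_values(url_a, url_b)
print(merged)
print(filenames_from_urls(merged))
```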

<h2 id="2016-02-17:124a59adbaa8ef13e1518d003fc03981">2016-02-17</h2>

<ul>
<li>Re-deploy CGSpace, run all system updates, and reboot</li>
<li>More work on CIAT data, cleaning and doing a last metadata-only import into DSpace Test</li>
<li>SAFBuilder has a bug preventing it from processing filenames containing more than one consecutive underscore</li>
<li>Need to re-process the filename column to replace multiple underscores with one: <code>value.replace(/_{2,}/, &quot;_&quot;)</code></li>
</ul>
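
<ul>
<li>The same underscore cleanup expressed in Python, as a small sketch with a made-up filename; <code>collapse_underscores</code> is a hypothetical helper name</li>
</ul>

```python
import re


def collapse_underscores(name):
    """Replace every run of two or more underscores with a single one."""
    return re.sub(r"_{2,}", "_", name)


print(collapse_underscores("CIAT__COLOMBIA___report.pdf"))
```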

<h2 id="2016-02-20:124a59adbaa8ef13e1518d003fc03981">2016-02-20</h2>

<ul>
<li>Turns out the &ldquo;bug&rdquo; in SAFBuilder isn&rsquo;t a bug, it&rsquo;s a feature that allows you to encode extra information like the destination bundle in the filename</li>
<li>Also, it seems DSpace&rsquo;s SAF import tool doesn&rsquo;t like importing filenames that have accents in them:</li>
</ul>

<pre><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
</code></pre>

<ul>
<li>Need to rename files to have no accents or umlauts, etc&hellip;</li>
<li>Useful custom text facet for URLs ending with &ldquo;.pdf&rdquo;: <code>value.endsWith(&quot;.pdf&quot;)</code></li>
</ul>
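
<ul>
<li>One possible way to do the renaming in Python, sketched under assumptions: decompose each character with <code>unicodedata</code> and drop the combining marks. <code>strip_accents</code> and <code>is_pdf_url</code> are hypothetical helper names. Letters with no decomposition (for example &ldquo;ø&rdquo; or &ldquo;ß&rdquo;) pass through unchanged and would need separate handling</li>
</ul>

```python
import unicodedata


def strip_accents(name):
    """Drop accents and umlauts by removing Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_pdf_url(value):
    """Python counterpart of the facet expression value.endsWith(".pdf")."""
    return value.endswith(".pdf")


original = "CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf"
print(strip_accents(original))
print(is_pdf_url(original))
```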

  </section>
  <footer>
    
    <section class="author-info row">
      <div class="author-avatar col-md-2">
        
      </div>
      <div class="author-meta col-md-6">
        
        <h1 class="author-name text-primary">Alan Orth</h1>
        
        
      </div>
      
    </section>
    <ul class="pager">
      
      <li class="previous"><a href="/cgspace-notes/2016-01/"><span aria-hidden="true">&larr;</span> Older</a></li>
      
      
      <li class="next disabled"><a href="#">Newer <span aria-hidden="true">&rarr;</span></a></li>
      
    </ul>
  </footer>
</article>

  </main>
  <footer class="container global-footer">
    <div class="copyright-note pull-left">
      
    </div>
    <div class="sns-links hidden-print">
  
  
</div>

  </footer>

  <script src="/cgspace-notes/js/highlight.pack.js"></script>
  <script>
    hljs.initHighlightingOnLoad();
  </script>
  
  
</body>
</html>