cgspace-notes/public/2016-02/index.html

<!DOCTYPE html>
<html lang="en-us">
<head prefix="og: http://ogp.me/ns#">
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1" />
  <meta property="og:title" content=" February, 2016 &middot;  CGSpace Notes" />
  
  <meta property="og:site_name" content="CGSpace Notes" />
  <meta property="og:url" content="/cgspace-notes/2016-02/" />
  
  
  <meta property="og:type" content="article" />
  
  <meta property="og:article:published_time" content="2016-02-05T13:18:00&#43;03:00" />
  
  <meta property="og:article:tag" content="notes" />
  
  
  <title>
     February, 2016 &middot;  CGSpace Notes
  </title>

  <link rel="stylesheet" href="/cgspace-notes/css/bootstrap.min.css" />
  <link rel="stylesheet" href="/cgspace-notes/css/main.css" />
  <link rel="stylesheet" href="/cgspace-notes/css/font-awesome.min.css" />
  <link rel="stylesheet" href="/cgspace-notes/css/github.css" />
  <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Source+Sans+Pro:200,300,400" type="text/css">
  <link rel="shortcut icon" href="/cgspace-notes/images/favicon.ico" />
  <link rel="apple-touch-icon" href="/cgspace-notes/images/apple-touch-icon.png" />
  
</head>
<body>
    <header class="global-header"  style="background-image:url(../images/bg.jpg )">
    <section class="header-text">
      <h1><a href="/cgspace-notes/">CGSpace Notes</a></h1>
      
      <div class="sns-links hidden-print">
  
  
</div>

      
      <a href="/cgspace-notes/" class="btn-header btn-back hidden-xs">
        <i class="fa fa-angle-left" aria-hidden="true"></i>
        &nbsp;Home
      </a>
      
      
    </section>
  </header>
  <main class="container">


<article>
  <header>
    <h1 class="text-primary">February, 2016</h1>
    <div class="post-meta clearfix">
      <div class="post-date pull-left">
        Posted on
        <time datetime="2016-02-05T13:18:00&#43;03:00">
          Feb 5, 2016
        </time>
      </div>
      <div class="pull-right">
        
        <span class="post-tag small"><a href="/cgspace-notes//tags/notes">#notes</a></span>
        
      </div>
    </div>
  </header>
  <section>
    

<h2 id="2016-02-05:124a59adbaa8ef13e1518d003fc03981">2016-02-05</h2>

<ul>
<li>Looking at some DAGRIS data for Abenet Yabowork</li>
<li>Lots of issues with spaces, newlines, etc. causing the import to fail</li>
<li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li>
</ul>

<p><img src="../images/2016/02/cgspace-countries.png" alt="CGSpace country list" /></p>

<ul>
<li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li>
<li>Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li>
</ul>

<h2 id="2016-02-06:124a59adbaa8ef13e1518d003fc03981">2016-02-06</h2>

<ul>
<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>

<pre><code>dspacetest=# select * from metadatafieldregistry;
</code></pre>

<ul>
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>

<pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
</code></pre>

<ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>

<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
</code></pre>

<ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>

<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
</code></pre>

<ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
<li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 &ldquo;|||&rdquo; countries are still there</li>
<li>Maybe I need to do a full re-index&hellip;</li>
<li>Yep! The full re-index seems to work.</li>
<li>Process the empty countries on CGSpace</li>
</ul>
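
<p>The lookup and cleanup above can be rehearsed offline before touching a real database. This is a minimal sketch, not the DSpace schema: it assumes a cut-down <code>metadatavalue</code> table in an in-memory SQLite database with made-up rows, and the helper names <code>make_demo_db</code>, <code>find_empty</code>, and <code>delete_empty</code> are hypothetical. It reuses the <code>SELECT</code> predicate for the <code>DELETE</code>, so NULL values and the item resource type are covered as well as empty strings.</p>

```python
import sqlite3

# Same predicate as the SELECT in the notes: items (type 2), country field (78),
# value either empty or NULL.
EMPTY_COUNTRY = (
    "resource_type_id = 2 AND metadata_field_id = 78 "
    "AND (text_value = '' OR text_value IS NULL)"
)


def make_demo_db():
    # Hypothetical miniature of the metadatavalue table; the real one has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadatavalue ("
        "resource_id INTEGER, resource_type_id INTEGER, "
        "metadata_field_id INTEGER, text_value TEXT)"
    )
    rows = [
        (22678, 2, 78, ""),       # empty country
        (22679, 2, 78, None),     # NULL country
        (22680, 2, 78, "KENYA"),  # valid country, must survive
        (22681, 2, 64, ""),       # empty value in a different field, must survive
    ]
    conn.executemany("INSERT INTO metadatavalue VALUES (?, ?, ?, ?)", rows)
    return conn


def find_empty(conn):
    # List the resource_id of every item with an empty or NULL country.
    cur = conn.execute(
        "SELECT resource_id FROM metadatavalue WHERE "
        + EMPTY_COUNTRY
        + " ORDER BY resource_id"
    )
    return [row[0] for row in cur]


def delete_empty(conn):
    # Delete exactly the rows that find_empty() reports; return how many were removed.
    cur = conn.execute("DELETE FROM metadatavalue WHERE " + EMPTY_COUNTRY)
    conn.commit()
    return cur.rowcount


conn = make_demo_db()
print(find_empty(conn))
print(delete_empty(conn))
```

<p>With the made-up rows this prints <code>[22678, 22679]</code> and then <code>2</code>. Checking the <code>SELECT</code> result first shows what the <code>DELETE</code> will remove.</p>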

<h2 id="2016-02-07:124a59adbaa8ef13e1518d003fc03981">2016-02-07</h2>

<ul>
<li>Working on cleaning up Abenet&rsquo;s DAGRIS data with OpenRefine</li>
<li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape(&quot;javascript&quot;)</code> which shows whitespace characters like <code>\r\n</code>!</li>
<li>For some reason when you import an Excel file into OpenRefine it exports dates like 1949 as 1949.0 in the CSV</li>
<li>I re-import the resulting CSV and run a GREL expression on the date issued column: <code>value.replace(&quot;\.0&quot;, &quot;&quot;)</code></li>
<li>I need to start running DSpace in Mac OS X instead of a Linux VM</li>
<li>Install PostgreSQL from Homebrew, then configure and import the CGSpace database dump:</li>
</ul>

<pre><code>$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql postgres
postgres=# alter user dspacetest createuser;
postgres=# \q
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup 
$ psql postgres
postgres=# alter user dspacetest nocreateuser;
postgres=# \q
$ vacuumdb dspacetest
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
</code></pre>

<ul>
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li>
</ul>

<pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
$ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai
$ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr
$ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
</code></pre>

<ul>
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>

<pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
</code></pre>

<ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>

<pre><code>$ ~/dspace/bin/dspace index-discovery -b
</code></pre>
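
<p>The OpenRefine cleanup described at the start of this entry (trimming whitespace, revealing <code>\r\n</code>, and dropping the stray <code>.0</code> from years) can also be approximated outside OpenRefine. This is a rough Python sketch rather than a port of the GREL functions. The names <code>show_whitespace</code> and <code>clean_year</code> are made up for illustration, and the pattern is anchored to the end of the value so that only a trailing <code>.0</code> is removed.</p>

```python
import re


def show_whitespace(value):
    # Rough analogue of value.escape("javascript"): render control characters
    # such as carriage return and newline as visible backslash escapes.
    return value.encode("unicode_escape").decode("ascii")


def clean_year(value):
    # Analogue of value.trim() followed by removing the ".0" added by the Excel import.
    return re.sub(r"\.0$", "", value.strip())


print(show_whitespace("1949\r\n"))
print(clean_year(" 1949.0\r\n"))
```

<p>Anchoring the pattern with <code>$</code> is a deliberate choice: a value like <code>1950.05</code> is left alone instead of losing the <code>.0</code> in the middle.</p>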

<h2 id="2016-02-08:124a59adbaa8ef13e1518d003fc03981">2016-02-08</h2>

<ul>
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme&rsquo;s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
</ul>

<p><img src="../images/2016/02/submit-button-ilri.png" alt="ILRI submission buttons" />
<img src="../images/2016/02/submit-button-drylands.png" alt="Drylands submission buttons" /></p>

<h2 id="2016-02-09:124a59adbaa8ef13e1518d003fc03981">2016-02-09</h2>

<ul>
<li>Re-sync DSpace Test with CGSpace</li>
<li>Help Sisay with OpenRefine</li>
<li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li>
</ul>

<pre><code>$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
# add port 443 to firewall rules
$ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org
$ sudo service nginx start
$ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass
</code></pre>

<ul>
<li>We should install it in /opt/letsencrypt and then script the renewal, but first we have to wire up some variables and template stuff based on the script here: <a href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li>
<li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs&hellip;</li>
<li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li>
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li>
</ul>

<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre>

<ul>
<li>or</li>
</ul>

<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre>

<ul>
<li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
</ul>

<pre><code># free -m
             total       used       free     shared    buffers     cached
Mem:          3950       3902         48          9         37       1311
-/+ buffers/cache:       2552       1397
Swap:          255         57        198
</code></pre>
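
<p>As a worked check of the numbers in the <code>free -m</code> output above: the memory that can be reclaimed for applications is roughly free plus buffers plus cached, which is what the <code>-/+ buffers/cache</code> row reports. The variable names below are only for illustration.</p>

```python
# Figures copied from the "Mem:" row of the free -m output (in MiB).
free_mb, buffers_mb, cached_mb = 48, 37, 1311

# Memory available to applications once buffers and page cache are reclaimed.
available_mb = free_mb + buffers_mb + cached_mb

# Current and candidate Tomcat heap sizes (in MiB).
current_heap_mb, proposed_heap_mb = 1536, 2048
extra_heap_mb = proposed_heap_mb - current_heap_mb

print(available_mb)   # 1396
print(extra_heap_mb)  # 512
```

<p>The sum is within 1 MiB of the 1397 that <code>free</code> reports, presumably a rounding difference, and it comfortably exceeds the extra 512 MiB that a 2048m heap would need.</p>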

<ul>
<li>So I&rsquo;ll bump up the Tomcat heap to 2048m (CGSpace production server is using 3GB)</li>
</ul>
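
<p>The authority cleanup from the GREL expression above can be reproduced with the same regular expression in Python, for example when the export is processed by a script instead of OpenRefine. This is a minimal sketch: the function name <code>strip_authority</code> is hypothetical, and the author names and UUID in the demo are invented.</p>

```python
import re

# Same pattern as the GREL expression: "::", a UUID in 8-4-4-4-12 layout, then "::600".
AUTHORITY_SUFFIX = re.compile(r"::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600")


def strip_authority(value):
    # Remove every authority suffix found in the value, leaving the names intact.
    return AUTHORITY_SUFFIX.sub("", value)


# Invented example value with one authority suffix and a second, clean author.
sample = "Doe, Jane::0a1b2c3d-1111-2222-3333-444455556666::600||Roe, Richard"
print(strip_authority(sample))
```

<p>Values that carry no authority suffix pass through unchanged, so the function can be applied to the whole column.</p>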

<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>

<ul>
<li>Massaging some CIAT data in OpenRefine</li>
<li>There are 1200 records that have PDFs and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>

<pre><code>value.split('/')[-1]
</code></pre>

<ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>

<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&gt; Downloading 64195.pdf
&gt; Creating thumbnail for 64195.pdf
</code></pre>
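
<p>The <code>filename</code> transform above keeps the last path segment of the URL. A slightly more defensive variant in Python is sketched below. It is not taken from <code>generate-thumbnails.py</code>; the function name <code>filename_from_url</code> and the example URLs are assumptions made for illustration.</p>

```python
from posixpath import basename
from urllib.parse import urlsplit


def filename_from_url(url):
    # Keep only the last segment of the URL path, ignoring any query string
    # or fragment that a plain split on "/" would leave attached.
    return basename(urlsplit(url).path)


# Hypothetical URLs; only the file name mirrors the example run above.
print(filename_from_url("http://example.org/reports/64661.pdf"))
print(filename_from_url("http://example.org/reports/64661.pdf?download=1"))
```

<p>Both calls print <code>64661.pdf</code>. A URL that ends in a slash yields an empty string, which is an easy way to flag records that do not point at a file.</p>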

<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>

<ul>
<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
<li>A few items are using the same exact PDF</li>
<li>A few items are using HTM or DOC files</li>
<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or ResearchGate</li>
<li>A few items have no file</li>
<li>Also, I&rsquo;m not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li>
</ul>

<h2 id="2016-02-12-1:124a59adbaa8ef13e1518d003fc03981">2016-02-15</h2>

<ul>
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on SlideShare, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>

<pre><code>$ ls | grep -c -E &quot;%&quot;
265
</code></pre>

<ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
</ul>
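
<p>One possible way to decode the URL-encoded filenames mentioned above is Python&rsquo;s standard <code>urllib.parse.unquote</code>. The sketch below only builds a rename plan and does not touch the filesystem. The helper names <code>decode_filename</code> and <code>plan_renames</code> and the sample file names are hypothetical, and it assumes the percent-escapes encode UTF-8 bytes.</p>

```python
from urllib.parse import unquote


def decode_filename(name):
    # Percent-decode the name, e.g. "%20" becomes a space (bytes read as UTF-8).
    return unquote(name)


def plan_renames(names):
    # Map each URL-encoded name to its decoded form. Clean names are skipped,
    # and a decoded name that would clash with an existing one raises an error.
    seen = set(n for n in names if "%" not in n)
    plan = {}
    for name in names:
        if "%" not in name:
            continue
        decoded = decode_filename(name)
        if decoded == name:
            continue  # a literal "%" that is not part of an escape sequence
        if decoded in seen:
            raise ValueError("decoded name collides: " + decoded)
        seen.add(decoded)
        plan[name] = decoded
    return plan


# Invented file names: one URL-encoded, one already clean.
print(plan_renames(["Annual%20Report%202010.pdf", "64661.pdf"]))
```

<p>Building the plan first makes it possible to review the mapping, and to catch collisions, before any file is renamed.</p>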

  </section>
  <footer>
    
    <section class="author-info row">
      <div class="author-avatar col-md-2">
        
      </div>
      <div class="author-meta col-md-6">
        
        <h1 class="author-name text-primary">Alan Orth</h1>
        
        
      </div>
      
    </section>
    <ul class="pager">
      
      <li class="previous"><a href="/cgspace-notes/2016-01/"><span aria-hidden="true">&larr;</span> Older</a></li>
      
      
      <li class="next disabled"><a href="#">Newer <span aria-hidden="true">&rarr;</span></a></li>
      
    </ul>
  </footer>
</article>

  </main>
  <footer class="container global-footer">
    <div class="copyright-note pull-left">
      
    </div>
    <div class="sns-links hidden-print">
  
  
</div>

  </footer>

  <script src="/cgspace-notes/js/highlight.pack.js"></script>
  <script>
    hljs.initHighlightingOnLoad();
  </script>
  
  
</body>
</html>