<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>
<pre><code>dspacetest=# select * from metadatafieldregistry;
</code></pre>
<ul>
<li>In this case our country field has <code>metadata_field_id</code> 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
</code></pre>
<ul>
<li>Then you can look up an item’s handle from its <code>resource_id</code>:</li>
</ul>
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
</code></pre>
<ul>
<li>There are 25 items, so editing them in the web UI would be annoying; let’s try SQL!</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
</code></pre>
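<ul>
<li>Note that the <code>DELETE</code> above only matches empty strings, while the earlier <code>SELECT</code> also matched <code>NULL</code> values. A minimal Python sketch of the difference, assuming an in-memory SQLite table as a stand-in for <code>metadatavalue</code> (the rows are made up for illustration, not real DSpace data):</li>
</ul>
<pre><code>import sqlite3

# Hypothetical in-memory stand-in for the metadatavalue table (not real DSpace data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadatavalue ("
    "resource_id INTEGER, resource_type_id INTEGER, "
    "metadata_field_id INTEGER, text_value TEXT)"
)
rows = [
    (1, 2, 78, "Kenya"),  # normal value
    (2, 2, 78, ""),       # empty string
    (3, 2, 78, None),     # NULL
    (4, 2, 64, ""),       # empty, but a different field
]
conn.executemany("INSERT INTO metadatavalue VALUES (?, ?, ?, ?)", rows)

# Same predicate as the SELECT above: matches both empty strings and NULLs.
select_sql = (
    "SELECT resource_id FROM metadatavalue "
    "WHERE resource_type_id=2 AND metadata_field_id=78 "
    "AND (text_value='' OR text_value IS NULL) ORDER BY resource_id"
)
matched = [r[0] for r in conn.execute(select_sql)]

# Same predicate as the DELETE above: only empty strings are removed.
deleted = conn.execute(
    "DELETE FROM metadatavalue WHERE metadata_field_id=78 AND text_value=''"
).rowcount

# Re-run the SELECT: the NULL row is still there.
remaining = [r[0] for r in conn.execute(select_sql)]
print(matched, deleted, remaining)
</code></pre>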
<ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice…</li>
<li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 “|||” countries are still there</li>
<li>Maybe I need to do a full re-index…</li>
<li>Working on cleaning up Abenet’s DAGRIS data with OpenRefine</li>
<li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape("javascript")</code>, the latter of which shows whitespace characters like <code>\r\n</code>!</li>
<li>For some reason, when you import an Excel file into OpenRefine, dates like 1949 come out as 1949.0 in the exported CSV</li>
<li>I re-import the resulting CSV and run a GREL on the date issued column: <code>value.replace("\.0", "")</code></li>
<li>I need to start running DSpace on Mac OS X instead of in a Linux VM</li>
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme’s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
<li>We should install the Let’s Encrypt client in /opt/letsencrypt and then script the renewal, but first we have to wire up some variables and template stuff based on the script here: <a href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li>
<li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs…</li>
<li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")</code></li>
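<li>For cleaning the same data outside OpenRefine, a rough Python equivalent of that GREL expression might look like the sketch below. The function name and the sample cell are hypothetical, and the <code>||</code> separator between authors is an assumption about how the export joins multiple values:</li>
</ul>
<pre><code>import re

# Same pattern as the GREL expression above: "::", a UUID-shaped run of
# word characters, then "::600".
AUTHORITY_SUFFIX = re.compile(r"::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600")


def strip_authority(value):
    """Remove every authority UUID suffix found in a cell."""
    return AUTHORITY_SUFFIX.sub("", value)


# Hypothetical author cell; the names and UUIDs are made up for illustration.
cell = (
    "Doe, Jane::0a1b2c3d-1111-2222-3333-444455556666::600"
    "||Smith, John::deadbeef-aaaa-bbbb-cccc-0123456789ab::600"
)
print(strip_authority(cell))
</code></pre>
<ul>
<li>Like the GREL version, this only removes suffixes ending in <code>::600</code>; values without a suffix pass through unchanged</li>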
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don’t always show anything right when it fails, but eventually one of these appears:</li>
</ul>
<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre>
<ul>
<li>or</li>
</ul>
<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre>
<ul>
<li>Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
</ul>
<pre><code># free -m
             total       used       free     shared    buffers     cached
Mem:          3950       3902         48          9         37       1311
-/+ buffers/cache:       2552       1397
Swap:          255         57        198
</code></pre>
<ul>
<li>So I’ll bump up the Tomcat heap to 2048m (the CGSpace production server is using 3GB)</li>
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre><code>value.split('/')[-1]
</code></pre>
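<ul>
<li>The same transform can be sketched in Python for use outside OpenRefine. The function name and the example URL here are hypothetical:</li>
</ul>
<pre><code>def filename_from_url(url):
    """Return the last path segment, like value.split('/')[-1] in GREL."""
    return url.split("/")[-1]


# Hypothetical URL for illustration; note that URL encoding is preserved.
example = "http://example.org/docs/annual%20report.pdf"
print(filename_from_url(example))
</code></pre>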
<ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them</li>
<li>Looking at CIAT’s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre><code>$ ls | grep -c -E "%"
265
</code></pre>
<ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
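<li>One candidate for decoding could be <code>unquote</code> from Python’s standard <code>urllib.parse</code> module. This is only a sketch: the filenames below are made up, and it assumes the names were percent-encoded from UTF-8, which has not been verified against the real files:</li>
</ul>
<pre><code>from urllib.parse import unquote


def decode_filename(name):
    """Percent-decode a filename, e.g. '%20' becomes a space (UTF-8 assumed)."""
    return unquote(name)


# Hypothetical filenames for illustration only.
names = ["annual%20report.pdf", "caf%C3%A9.pdf", "clean.pdf"]

# Same test as the grep above: anything containing "%" counts as dirty.
dirty = [n for n in names if "%" in n]
decoded = {n: decode_filename(n) for n in dirty}

for old, new in decoded.items():
    print(old, "becomes", new)
</code></pre>
<ul>
<li>Before renaming anything it would be worth checking the decoded names for collisions and for characters that are not valid on the target filesystem</li>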