Add notes for 2022-03-04

This commit is contained in:
2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

View File

@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)…
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -145,15 +145,15 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
<pre tabindex="0"><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value=&#39;&#39; OR text_value IS NULL);
</code></pre><ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;22678&#39;;
</code></pre><ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value=&#39;&#39;;
DELETE 25
</code></pre><ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
@ -198,7 +198,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>
<pre tabindex="0"><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>CATALINA_OPTS=&#34;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&#34;
</code></pre><ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>
@ -253,7 +253,7 @@ Swap: 255 57 198
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre tabindex="0"><code>value.split('/')[-1]
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1]
</code></pre><ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>
@ -278,13 +278,13 @@ Processing 64195.pdf
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre tabindex="0"><code>$ ls | grep -c -E &quot;%&quot;
<pre tabindex="0"><code>$ ls | grep -c -E &#34;%&#34;
265
</code></pre><ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
<li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
<pre tabindex="0"><code>$ python -c &#34;import urllib, sys; print urllib.unquote(sys.argv[1])&#34; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre><ul>
<li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li>
@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>Turns out OpenRefine has an unescape function!</li>
</ul>
<pre tabindex="0"><code>value.unescape(&quot;url&quot;)
<pre tabindex="0"><code>value.unescape(&#34;url&#34;)
</code></pre><ul>
<li>This turns the URLs into human-readable versions that we can use as proper filenames</li>
<li>Run web server and system updates on DSpace Test and reboot</li>
@ -302,7 +302,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between</li>
<li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li>
<li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&quot;)</li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li>
</ul>
<h2 id="2016-02-17">2016-02-17</h2>
<ul>
@ -325,7 +325,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>To change Spanish accents to ASCII in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
<pre tabindex="0"><code>value.replace(&#39;ó&#39;,&#39;o&#39;).replace(&#39;í&#39;,&#39;i&#39;).replace(&#39;á&#39;,&#39;a&#39;).replace(&#39;é&#39;,&#39;e&#39;).replace(&#39;ñ&#39;,&#39;n&#39;)
</code></pre><ul>
<li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li>
<li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li>
@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
<li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li>
<li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;&#39;).replace(&#39;_=_&#39;,&#39;_&#39;).replace(&#39;,&#39;,&#39;&#39;).replace(&#39;[&#39;,&#39;&#39;).replace(&#39;]&#39;,&#39;&#39;).replace(&#39;(&#39;,&#39;&#39;).replace(&#39;)&#39;,&#39;&#39;).replace(&#39;_.pdf&#39;,&#39;.pdf&#39;).replace(&#39;._&#39;,&#39;_&#39;)
</code></pre><ul>
<li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li>
<li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting on the Atmire fixes for now, as the branch history is ugly</li>