
<!DOCTYPE html>
<html lang="en">

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="August, 2019" />
<meta property="og:description" content="2019-08-03


Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;


2019-08-04


Deploy ORCID identifier updates requested by Bioversity to CGSpace
Run system updates on CGSpace (linode18) and reboot it


Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;
After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.

Run system updates on DSpace Test (linode19) and reboot it
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-08/" />
<meta property="article:published_time" content="2019-08-03T12:39:51+03:00" />
<meta property="article:modified_time" content="2019-08-09T09:57:26+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2019"/>
<meta name="twitter:description" content="2019-08-03


Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;


2019-08-04


Deploy ORCID identifier updates requested by Bioversity to CGSpace
Run system updates on CGSpace (linode18) and reboot it


Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;
After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.

Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.56.3" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "August, 2019",
  "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-08\/",
  "wordCount": "1059",
  "datePublished": "2019-08-03T12:39:51\x2b03:00",
  "dateModified": "2019-08-09T09:57:26\x2b03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-08/">

    <title>August, 2019 | CGSpace Notes</title>

    <!-- combined, minified CSS -->
    <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">

    <!-- RSS 2.0 feed -->


  </head>

  <body>


    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>


    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>


    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">


<article class="blog-post">
  <header>
    <h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-08/">August, 2019</a></h2>
    <p class="blog-post-meta"><time datetime="2019-08-03T12:39:51&#43;03:00">Sat Aug 03, 2019</time> by Alan Orth in

<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
  </header>
  <h2 id="2019-08-03">2019-08-03</h2>

<ul>
<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li>
</ul>

<h2 id="2019-08-04">2019-08-04</h2>

<ul>
<li>Deploy ORCID identifier updates requested by Bioversity to CGSpace</li>
<li>Run system updates on CGSpace (linode18) and reboot it

<ul>
<li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li>
<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li>
</ul></li>
<li>Run system updates on DSpace Test (linode19) and reboot it</li>
</ul>

<h2 id="2019-08-05">2019-08-05</h2>

<ul>
<li>Update Tomcat to 7.0.96 in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Update PostgreSQL JDBC driver to 42.2.6 in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Deploy both on DSpace Test (linode19)</li>

<li><p>Looking at the 1429 records for Bioversity migration again</p>

<ul>
<li>The following items use the same exact PDF and seem to be duplicates:</li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10191">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10191</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=342">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=342</a></li>
<li>The following items use the same exact PDF, but one seems to be incorrect:</li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5347">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5347</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5340">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5340</a></li>
<li>The following PDFs are used by several items incorrectly:</li>
<li><code>Report_of_a_Working_Group_on_Allium_7.pdf</code></li>
<li><code>Report_of_a_Working_Group_on_Allium_Fourth_meeting_1696.pdf</code></li>
<li>I checked the SHA1 hashes of each PDF and found that some appear more than once&hellip;</li>
<li>The following items use the same PDF with a different name, but seem to be duplicates (pick one?):</li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=433">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=433</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10189">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10189</a></li>
<li>The following items use the same PDF with a different name, but seem to be duplicates (pick one?):</li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=332">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=332</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10187">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10187</a></li>
<li>There are about thirty PDFs that have French or Spanish filenames and there seems to be an encoding issue</li>
<li>I asked Francesco if he can give me a PDF URL column instead of a &ldquo;filename&rdquo; column so I can download the files myself</li>

<li><p>At <em>least</em> the ~50 filenames identified by the following GREL will have issues:</p>

<pre><code>or(
isNotNull(value.match(/^.*’.*$/)),
isNotNull(value.match(/^.*é.*$/)),
isNotNull(value.match(/^.*á.*$/)),
isNotNull(value.match(/^.*è.*$/)),
isNotNull(value.match(/^.*í.*$/)),
isNotNull(value.match(/^.*ó.*$/)),
isNotNull(value.match(/^.*ú.*$/)),
isNotNull(value.match(/^.*à.*$/)),
isNotNull(value.match(/^.*û.*$/))
).toString()
</code></pre></li>
</ul></li>

<li><p>I tried to extract the filenames and construct a URL to download the PDFs with my <code>generate-thumbnails.py</code> script, but there seem to be several paths for PDFs so I can&rsquo;t guess it properly</p></li>

<li><p>I will have to wait for Francesco to respond about the PDFs, or perhaps proceed with a metadata-only upload so we can do other checks on DSpace Test</p></li>
</ul>
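<p>A minimal sketch of how the SHA1 comparison mentioned above could be done in Python, assuming the PDFs are already sitting in one local directory. The function names <code>sha1_of_file</code> and <code>find_duplicate_pdfs</code> are hypothetical and are not part of any script mentioned in these notes:</p>

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_pdfs(directory):
    """Group the PDFs in a directory by SHA1 and keep groups with several files."""
    by_hash = defaultdict(list)
    for pdf in sorted(Path(directory).glob("*.pdf")):
        by_hash[sha1_of_file(pdf)].append(pdf.name)
    # Only hashes shared by more than one file point to a possible duplicate
    return {h: names for h, names in by_hash.items() if len(names) > 1}
```

<p>Files that share a hash are byte-for-byte identical even when their names differ, which is the case for the "same PDF with a different name" items listed above.</p>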

<h2 id="2019-08-06">2019-08-06</h2>

<ul>
<li>Francesca responded to the feedback I sent yesterday

<ul>
<li>I made some changes to the CSV based on her feedback (remove two duplicates, change one PDF file name, change two titles)</li>
<li>Then I found some items that have PDFs in multiple languages that only list one language in <code>dc.language.iso</code> so I changed them</li>
<li>Strangely, one item was referring to a 7zip file&hellip;</li>
<li>After removing the two duplicates there are now 1427 records</li>
<li>Fix one invalid ISSN: 1020-2002→1020-3362</li>
</ul></li>
</ul>
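<p>For reference, an invalid ISSN like the one above can be caught with the standard ISSN check digit: the first seven digits are weighted 8 down to 2, and the eighth character has to bring the sum to a multiple of 11. A small sketch, with hypothetical function names:</p>

```python
def issn_check_digit(first_seven):
    """Compute the ISSN check character for the first seven digits."""
    # Weights run from 8 down to 2 across the seven digits
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    remainder = total % 11
    if remainder == 0:
        return "0"
    check = 11 - remainder
    return "X" if check == 10 else str(check)


def is_valid_issn(issn):
    """Check the length, digits and check character of an ISSN like 1020-3362."""
    digits = issn.replace("-", "").upper()
    if len(digits) != 8 or not digits[:7].isdigit():
        return False
    return issn_check_digit(digits[:7]) == digits[7]
```

<p>With this check <code>1020-2002</code> fails (the check digit for <code>1020-200</code> would be 5) while <code>1020-3362</code> passes. A passing check digit only means the number is well formed, not that it belongs to the right journal.</p>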

<h2 id="2019-08-07">2019-08-07</h2>

<ul>
<li>Daniel Haile-Michael asked about using a logical OR with the DSpace OpenSearch, but I looked in the DSpace manual and it does not seem to be possible</li>
</ul>

<h2 id="2019-08-08">2019-08-08</h2>

<ul>
<li><p>Moayad noticed that the HTTPS certificate expired on the AReS dev server (linode20)</p>

<ul>
<li>The first problem was that there is a Docker container listening on port 80, so it conflicts with the ACME http-01 validation</li>
<li>The second problem was that we only allow access to port 80 from localhost</li>

<li><p>I adjusted the <code>renew-letsencrypt</code> systemd service so it stops/starts the Docker container and firewall:</p>

<pre><code># /opt/certbot-auto renew --standalone --pre-hook &quot;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&quot; --post-hook &quot;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&quot;
</code></pre></li>
</ul></li>

<li><p>It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains</p></li>

<li><p>Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04&rsquo;s <a href="https://ssl-config.mozilla.org/#server=nginx&amp;server-version=1.16.0&amp;config=intermediate&amp;openssl-version=1.1.0g&amp;hsts=false&amp;ocsp=false">OpenSSL 1.1.0g with nginx 1.16.0</a></p></li>

<li><p>Run all system updates on AReS dev server (linode20) and reboot it</p></li>

<li><p>Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:</p>

<pre><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
$ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs2.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
$ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt
</code></pre></li>

<li><p>(the weird sed regex removes color codes, because my generate-thumbnails script prints pretty colors)</p></li>

<li><p>Some PDFs are uploaded in different paths so I have to try a few times to get them all:</p>

<ul>
<li><code>/fileadmin/_migrated/uploads/tx_news/</code></li>
<li><code>/fileadmin/user_upload/online_library/publications/pdfs/</code></li>
<li><code>/fileadmin/user_upload/</code></li>
</ul></li>

<li><p>Even so, there are still 52 items with incorrect filenames, so I can&rsquo;t derive their PDF URLs&hellip;</p>

<ul>
<li>For example, <code>Wild_cherry_Prunus_avium_859.pdf</code> is here (with double underscore): <a href="https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf">https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf</a></li>
</ul></li>

<li><p>I will proceed with a metadata-only upload first and then let them know about the missing PDFs</p></li>

<li><p>Troubleshoot an issue we had with proxying to the new development version of AReS from DSpace Test (linode19)</p>

<ul>
<li>For some reason the <code>Host</code> header is not set in the proxy pass, so nginx on DSpace Test makes a request to the upstream nginx on an IP-based virtual host</li>
<li>The upstream nginx returns HTTP 444 because we configured it to not answer when a request does not send a valid hostname</li>

<li><p>The solution is to set the host header when proxy passing:</p>

<pre><code>proxy_set_header Host dev.ares.codeobia.com;
</code></pre></li>
</ul></li>

<li><p>Though I am really wondering why this happened now, because the configuration has been working for months&hellip;</p></li>

<li><p>Improve the output of the suspicious characters check in the <a href="https://github.com/alanorth/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.0</p></li>
</ul>
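<p>The grep and sed pipeline used above to collect failed downloads could also be sketched in Python. This assumes the log has the shape implied by that pipeline, with a <code>&gt; Downloading ...</code> line immediately followed by a line containing <code>Download failed</code>; the function name <code>failed_downloads</code> is hypothetical:</p>

```python
import re

# Same pattern as the sed expression in the notes: color codes and a few
# cursor or erase sequences that end in m, G or K
ANSI_ESCAPE = re.compile(r"\x1B\[(?:[0-9]{1,2}(?:;[0-9]{1,2})?)?[mGK]")


def failed_downloads(log_text):
    """Return the targets of "Downloading" lines directly followed by a failure."""
    lines = [ANSI_ESCAPE.sub("", line) for line in log_text.splitlines()]
    failed = []
    # Walk over each line together with the line before it, like grep -B1
    for previous, current in zip(lines, lines[1:]):
        if "Download failed" in current and "Downloading" in previous:
            target = previous.replace("> Downloading ", "", 1).replace("...", "", 1)
            failed.append(target.strip())
    return failed
```

<p>Stripping the escape sequences first keeps the later string matching simple, which is the same reason the sed step is in the shell version.</p>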

<h2 id="2019-08-09">2019-08-09</h2>

<ul>
<li>Looking at the 128 IITA records (20195TH.xls) that Sisay uploaded to DSpace Test last month: <a href="https://dspacetest.cgiar.org/handle/10568/102361">IITA_July_29</a>

<ul>
<li>The records are pretty clean because Sisay ran them through the csv-metadata-quality tool</li>
<li>I fixed one incorrect country (MELBOURNE)</li>
<li>I normalized all DOIs to use the <a href="https://doi.org">https://doi.org</a> format</li>
<li>This item is using the wrong Google Books link: <a href="https://dspacetest.cgiar.org/handle/10568/102593">https://dspacetest.cgiar.org/handle/10568/102593</a></li>
<li>The French abstract here has copy/paste errors: <a href="https://dspacetest.cgiar.org/handle/10568/102491">https://dspacetest.cgiar.org/handle/10568/102491</a></li>
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:</li>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
<li>I asked Bosede to check about twenty-five invalid AGROVOC subjects identified by the csv-metadata-quality script</li>
<li>I still need to check the sponsors and then check for duplicates</li>
</ul></li>
</ul>
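<p>The DOI normalization above was done by hand, but a rough Python sketch of the same idea might look like this. The list of prefixes it handles (<code>http://dx.doi.org/</code>, <code>doi:</code> and bare DOIs) is an assumption about what typically shows up in such records, and <code>normalize_doi</code> is a hypothetical name:</p>

```python
import re

# Prefixes that are assumed to appear in front of the bare DOI
DOI_PREFIXES = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE
)


def normalize_doi(value):
    """Rewrite common DOI spellings to the https://doi.org/ URL form."""
    bare = DOI_PREFIXES.sub("", value.strip())
    if not bare.startswith("10."):
        return value  # leave anything that does not look like a DOI untouched
    return "https://doi.org/" + bare
```

<p>Values that do not start with the <code>10.</code> DOI prefix after stripping are returned unchanged so they can be reviewed manually.</p>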

<h2 id="2019-08-10">2019-08-10</h2>

<ul>
<li>Add checks for uncommon filename extensions and replacements for unnecessary Unicode to the csv-metadata-quality script</li>
</ul>
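<p>To illustrate what such checks could look like, here is a standalone sketch. It is not the code from csv-metadata-quality: the extension allow-list, the set of characters treated as unnecessary, and both function names are assumptions made for this example:</p>

```python
import os

# Assumed allow-list of extensions, for illustration only
COMMON_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".jpg", ".png",
}

# Assumed set of characters that add nothing to metadata values
UNNECESSARY_UNICODE = {
    "\u00a0": " ",  # no-break space becomes a regular space
    "\u200b": "",   # zero width space is dropped
    "\u00ad": "",   # soft hyphen is dropped
}


def has_uncommon_extension(filename):
    """Return True when the extension is not in the allow-list."""
    extension = os.path.splitext(filename)[1].lower()
    return extension not in COMMON_EXTENSIONS


def replace_unnecessary_unicode(value):
    """Replace or drop each character listed in UNNECESSARY_UNICODE."""
    for character, replacement in UNNECESSARY_UNICODE.items():
        value = value.replace(character, replacement)
    return value
```

<p>An extension check along these lines would flag something like the 7zip file found in the Bioversity records earlier this month.</p>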

<!-- vim: set sw=2 ts=2: -->


</article>


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">


        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2019-08/">August, 2019</a></li>

<li><a href="/cgspace-notes/posts/">Posts</a></li>

<li><a href="/cgspace-notes/2019-07/">July, 2019</a></li>

<li><a href="/cgspace-notes/2019-06/">June, 2019</a></li>

<li><a href="/cgspace-notes/2019-05/">May, 2019</a></li>

    </ol>
  </section>


  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">

      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>

      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>

      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>

    </ol>
  </section>

</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->


    <footer class="blog-footer">
      <p>

      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.

      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>


  </body>

</html>