
<!DOCTYPE html>
<html lang="en" >

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">


<meta property="og:title" content="August, 2019" />
<meta property="og:description" content="2019-08-03

Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;

2019-08-04

Deploy ORCID identifier updates requested by Bioversity to CGSpace
Run system updates on CGSpace (linode18) and reboot it

Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;
After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.


Run system updates on DSpace Test (linode19) and reboot it
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-08/" />
<meta property="article:published_time" content="2019-08-03T12:39:51+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />


<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2019"/>
<meta name="twitter:description" content="2019-08-03

Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;

2019-08-04

Deploy ORCID identifier updates requested by Bioversity to CGSpace
Run system updates on CGSpace (linode18) and reboot it

Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;
After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.


Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.89.4" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "August, 2019",
  "url": "https://alanorth.github.io/cgspace-notes/2019-08/",
  "wordCount": "2703",
  "datePublished": "2019-08-03T12:39:51+03:00",
  "dateModified": "2019-10-28T13:39:25+02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-08/">

    <title>August, 2019 | CGSpace Notes</title>

    
    <!-- combined, minified CSS -->
    
    <link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
    

    <!-- minified Font Awesome for SVG icons -->
    
    <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>

    <!-- RSS 2.0 feed -->
    

  </head>

  <body>

    
    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>
    

    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>
    
    
    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">

          
<article class="blog-post">
  <header>
    <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-08/">August, 2019</a></h2>
    <p class="blog-post-meta">
<time datetime="2019-08-03T12:39:51+03:00">Sat Aug 03, 2019</time>
 in 
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>


</p>
  </header>
  <h2 id="2019-08-03">2019-08-03</h2>
<ul>
<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li>
</ul>
<h2 id="2019-08-04">2019-08-04</h2>
<ul>
<li>Deploy ORCID identifier updates requested by Bioversity to CGSpace</li>
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li>
<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode19) and reboot it</li>
</ul>
<h2 id="2019-08-05">2019-08-05</h2>
<ul>
<li>Update Tomcat to 7.0.96 in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Update PostgreSQL JDBC driver to 42.2.6 in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Deploy both on DSpace Test (linode19)</li>
<li>Looking at the 1429 records for Bioversity migration again
<ul>
<li>The following items use the same exact PDF and seem to be duplicates:
<ul>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=10191">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10191</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=342">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=342</a></li>
</ul>
</li>
<li>The following items use the same exact PDF, but one seems to be incorrect:
<ul>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=5347">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5347</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=5340">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5340</a></li>
</ul>
</li>
<li>The following PDFs are used by several items incorrectly:
<ul>
<li><code>Report_of_a_Working_Group_on_Allium_7.pdf</code></li>
<li><code>Report_of_a_Working_Group_on_Allium_Fourth_meeting_1696.pdf</code></li>
</ul>
</li>
<li>I checked the SHA1 hashes of each PDF and found that some appear more than once&hellip;</li>
<li>The following items use the same PDF with a different name, but seem to be duplicates (pick one?):
<ul>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=433">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=433</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=10189">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10189</a></li>
</ul>
</li>
<li>The following items use the same PDF with a different name, but seem to be duplicates (pick one?):
<ul>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=332">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=332</a></li>
<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=10187">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10187</a></li>
</ul>
</li>
<li>There are about thirty PDFs that have French or Spanish filenames and there seems to be an encoding issue
<ul>
<li>I asked Francesco if he can give me a PDF URL column instead of a &ldquo;filename&rdquo; column so I can download the files myself</li>
<li>At <em>least</em> the ~50 filenames identified by the following GREL will have issues:</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>or(
  isNotNull(value.match(/^.*’.*$/)),
  isNotNull(value.match(/^.*é.*$/)),
  isNotNull(value.match(/^.*á.*$/)),
  isNotNull(value.match(/^.*è.*$/)),
  isNotNull(value.match(/^.*í.*$/)),
  isNotNull(value.match(/^.*ó.*$/)),
  isNotNull(value.match(/^.*ú.*$/)),
  isNotNull(value.match(/^.*à.*$/)),
  isNotNull(value.match(/^.*û.*$/))
).toString()
</code></pre><ul>
<li>I tried to extract the filenames and construct a URL to download the PDFs with my <code>generate-thumbnails.py</code> script, but there seem to be several paths for PDFs so I can&rsquo;t guess it properly</li>
<li>I will have to wait for Francesco to respond about the PDFs, or perhaps proceed with a metadata-only upload so we can do other checks on DSpace Test</li>
</ul>
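<p>The SHA1 check mentioned above can be sketched in Python. This is only a minimal sketch, not the commands actually used for the migration: the function names <code>sha1_of_file</code> and <code>find_duplicate_pdfs</code> are hypothetical, and it assumes the PDFs have already been downloaded into a single directory.</p>
<pre tabindex="0"><code>import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_pdfs(directory):
    """Map each SHA1 digest to the PDF names sharing it, keeping only repeats."""
    by_digest = defaultdict(list)
    for pdf in sorted(Path(directory).glob("*.pdf")):
        by_digest[sha1_of_file(pdf)].append(pdf.name)
    # A digest with exactly one file is unique, so drop it
    return {d: names for d, names in by_digest.items() if len(names) != 1}
</code></pre>
<p>Calling <code>find_duplicate_pdfs</code> on the download directory returns one entry per repeated digest, so identical PDFs stored under different file names end up in the same list.</p>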
<h2 id="2019-08-06">2019-08-06</h2>
<ul>
<li>Francesca responded to address my feedback yesterday
<ul>
<li>I made some changes to the CSV based on her feedback (remove two duplicates, change one PDF file name, change two titles)</li>
<li>Then I found some items that have PDFs in multiple languages that only list one language in <code>dc.language.iso</code> so I changed them</li>
<li>Strangely, one item was referring to a 7zip file&hellip;</li>
<li>After removing the two duplicates there are now 1427 records</li>
<li>Fix one invalid ISSN: 1020-2002→1020-3362</li>
</ul>
</li>
</ul>
<h2 id="2019-08-07">2019-08-07</h2>
<ul>
<li>Daniel Haile-Michael asked about using a logical OR with the DSpace OpenSearch, but I looked in the DSpace manual and it does not seem to be possible</li>
</ul>
<h2 id="2019-08-08">2019-08-08</h2>
<ul>
<li>Moayad noticed that the HTTPS certificate expired on the AReS dev server (linode20)
<ul>
<li>The first problem was that there is a Docker container listening on port 80, so it conflicts with the ACME http-01 validation</li>
<li>The second problem was that we only allow access to port 80 from localhost</li>
<li>I adjusted the <code>renew-letsencrypt</code> systemd service so it stops/starts the Docker container and firewall:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code># /opt/certbot-auto renew --standalone --pre-hook &quot;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&quot; --post-hook &quot;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&quot;
</code></pre><ul>
<li>It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains</li>
<li>Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04&rsquo;s <a href="https://ssl-config.mozilla.org/#server=nginx&amp;server-version=1.16.0&amp;config=intermediate&amp;openssl-version=1.1.0g&amp;hsts=false&amp;ocsp=false">OpenSSL 1.1.0g with nginx 1.16.0</a></li>
<li>Run all system updates on AReS dev server (linode20) and reboot it</li>
<li>Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:</li>
</ul>
<pre tabindex="0"><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
$ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs2.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
$ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt
</code></pre><ul>
<li>
<p>(the weird sed regex removes color codes, because my generate-thumbnails script prints pretty colors)</p>
</li>
<li>
<p>Some PDFs are uploaded in different paths so I have to try a few times to get them all:</p>
<ul>
<li><code>/fileadmin/_migrated/uploads/tx_news/</code></li>
<li><code>/fileadmin/user_upload/online_library/publications/pdfs/</code></li>
<li><code>/fileadmin/user_upload/</code></li>
</ul>
</li>
<li>
<p>Even so, there are still 52 items with incorrect filenames, so I can&rsquo;t derive their PDF URLs&hellip;</p>
<ul>
<li>For example, <code>Wild_cherry_Prunus_avium_859.pdf</code> is here (with double underscore): <a href="https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf">https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf</a></li>
</ul>
</li>
<li>
<p>I will proceed with a metadata-only upload first and then let them know about the missing PDFs</p>
</li>
<li>
<p>Troubleshoot an issue we had with proxying to the new development version of AReS from DSpace Test (linode19)</p>
<ul>
<li>For some reason the host header in the proxy pass is not set so nginx on DSpace Test makes a request to the upstream nginx on an IP-based virtual host</li>
<li>The upstream nginx returns HTTP 444 because we configured it to not answer when a request does not send a valid hostname</li>
<li>The solution is to set the host header when proxy passing:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>proxy_set_header Host dev.ares.codeobia.com;
</code></pre><ul>
<li>Though I am really wondering why this happened now, because the configuration has been working for months&hellip;</li>
<li>Improve the output of the suspicious characters check in <a href="https://github.com/alanorth/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.0</li>
</ul>
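<p>For reference, the color-stripping <code>sed</code> expression used on the download log above can be written in Python with the same pattern. This is a minimal sketch: the name <code>strip_ansi</code> and the sample line are made up for illustration.</p>
<pre tabindex="0"><code>import re

# Same pattern as the sed expression: ESC, "[", optional numeric
# parameters, then one of the final characters m, G or K
ANSI_ESCAPE = re.compile(r"\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]")


def strip_ansi(text):
    """Remove ANSI color and erase sequences from terminal output."""
    return ANSI_ESCAPE.sub("", text)


if __name__ == "__main__":
    # Hypothetical colored line, not real script output
    colored = "\x1b[0;32mDownloading\x1b[0m example.pdf"
    print(strip_ansi(colored))
</code></pre>
<p>The pattern only covers the short color and erase sequences named in the comment, which is enough for this log but not a general terminal-escape parser.</p>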
<h2 id="2019-08-09">2019-08-09</h2>
<ul>
<li>Looking at the 128 IITA records (20195TH.xls) that Sisay uploaded to DSpace Test last month: <a href="https://dspacetest.cgiar.org/handle/10568/102361">IITA_July_29</a>
<ul>
<li>The records are pretty clean because Sisay ran them through the csv-metadata-quality tool</li>
<li>I fixed one incorrect country (MELBOURNE)</li>
<li>I normalized all DOIs to be <a href="https://doi.org">https://doi.org</a> format</li>
<li>This item is using the wrong Google Books link: <a href="https://dspacetest.cgiar.org/handle/10568/102593">https://dspacetest.cgiar.org/handle/10568/102593</a></li>
<li>The French abstract here has copy/paste errors: <a href="https://dspacetest.cgiar.org/handle/10568/102491">https://dspacetest.cgiar.org/handle/10568/102491</a></li>
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
</ul>
</li>
<li>I asked Bosede to check about twenty-five invalid AGROVOC subjects identified by csv-metadata-quality script</li>
<li>I still need to check the sponsors and then check for duplicates</li>
</ul>
</li>
</ul>
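<p>The DOI normalization mentioned above could look roughly like this in Python. It is a sketch under assumptions: <code>normalize_doi</code> is a hypothetical helper, and the list of prefixes it strips is a guess at the common variants rather than what was actually present in the IITA records.</p>
<pre tabindex="0"><code>import re

# Prefixes commonly seen in front of a DOI; this list is an assumption
DOI_PREFIX = re.compile(
    r"^(https?://(dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE
)


def normalize_doi(value):
    """Return the DOI as an https://doi.org/ URL."""
    bare = DOI_PREFIX.sub("", value.strip())
    return "https://doi.org/" + bare
</code></pre>
<p>With this sketch, <code>http://dx.doi.org/10.1000/xyz123</code>, <code>doi: 10.1000/xyz123</code> and a bare <code>10.1000/xyz123</code> (a made-up DOI) all become <code>https://doi.org/10.1000/xyz123</code>.</p>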
<h2 id="2019-08-10">2019-08-10</h2>
<ul>
<li>Add checks for uncommon filename extensions and replacements for unnecessary Unicode to the csv-metadata-quality script</li>
</ul>
<h2 id="2019-08-12">2019-08-12</h2>
<ul>
<li>Looking at the 128 IITA records again:
<ul>
<li>Validate and normalize sponsors against our 2019-02 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-02-22-sponsorships.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
</ul>
</li>
<li>I checked the collection for duplicates and found a few:
<ul>
<li><a href="https://dspacetest.cgiar.org/handle/10568/102513">https://dspacetest.cgiar.org/handle/10568/102513</a> is a duplicate of CIAT item: <a href="https://cgspace.cgiar.org/handle/10568/44158">https://cgspace.cgiar.org/handle/10568/44158</a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/102512">https://dspacetest.cgiar.org/handle/10568/102512</a> is a duplicate of CIAT item: <a href="https://cgspace.cgiar.org/handle/10568/43557">https://cgspace.cgiar.org/handle/10568/43557</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="2019-08-13">2019-08-13</h2>
<ul>
<li>Create a test user on DSpace Test for Mohammad Salem to attempt depositing:</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
</code></pre><ul>
<li>Create and merge a pull request (<a href="https://github.com/ilri/DSpace/pull/429">#429</a>) to add eleven new CCAFS Phase II Project Tags to CGSpace</li>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">Solr cores issue</a> last week, but they could not reproduce the issue
<ul>
<li>I told them not to continue, and that we would keep an eye on it and keep troubleshooting it (if necessary) in the public eye on dspace-tech and Solr mailing lists</li>
</ul>
</li>
<li>Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
...
java.lang.OutOfMemoryError: GC overhead limit exceeded
</code></pre><ul>
<li>I increased the heap size to 1536m and tried again:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1536m&quot;
$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
</code></pre><ul>
<li>This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM</li>
<li>(oops, I realize that actually I forgot to delete items I had flagged as duplicates, so the total should be 1,427 items)</li>
</ul>
<h2 id="2019-08-14">2019-08-14</h2>
<ul>
<li>I imported the 1,427 Bioversity records into DSpace Test
<ul>
<li>To make sure we didn&rsquo;t have memory issues I reduced Tomcat&rsquo;s JVM heap by 512m, increased the import process&rsquo;s heap to 512m, and split the input file into two parts with about 700 records each</li>
<li>Then I had to create a few new temporary collections on DSpace Test that had been created on CGSpace after our last sync</li>
<li>After that the import succeeded:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
$ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
$ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
</code></pre><ul>
<li>The next step is to check these items for duplicates</li>
</ul>
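<p>Splitting the input file in two, as described above, can be done with Python&rsquo;s <code>csv</code> module. This is a minimal sketch and not necessarily how the file was actually split; <code>split_csv</code> is a hypothetical helper.</p>
<pre tabindex="0"><code>import csv


def split_csv(source, first_part, second_part):
    """Write the first half of the rows to one file and the rest to another.

    Both output files repeat the header row so each can be imported alone.
    """
    with open(source, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    middle = (len(rows) + 1) // 2
    for path, chunk in ((first_part, rows[:middle]), (second_part, rows[middle:])):
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(chunk)
    # Number of data rows written to each part
    return middle, len(rows) - middle
</code></pre>
<p>For example, <code>split_csv("/tmp/bioversity.csv", "/tmp/bioversity1.csv", "/tmp/bioversity2.csv")</code> would produce the two files used in the import commands. Parsing with the <code>csv</code> module instead of cutting the file by line count keeps records with multi-line quoted values (such as abstracts) intact.</p>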
<h2 id="2019-08-16">2019-08-16</h2>
<ul>
<li>Email Bioversity to let them know that the 1,427 records are on DSpace Test and that Abenet should look over them</li>
</ul>
<h2 id="2019-08-18">2019-08-18</h2>
<ul>
<li>Deploy latest <code>5_x-prod</code> branch on CGSpace (linode18), including the <a href="https://github.com/ilri/DSpace/pull/429">new CCAFS project tags</a></li>
<li>Deploy Tomcat 7.0.96 and PostgreSQL JDBC 42.2.6 driver on CGSpace (linode18)</li>
<li>After restarting Tomcat one of the Solr statistics cores failed to start up:</li>
</ul>
<pre tabindex="0"><code>statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I decided to run all system updates on the server and reboot it</li>
<li>After reboot the statistics-2018 core failed to load so I restarted <code>tomcat7</code> again</li>
<li>After this last restart all Solr cores seem to be up and running</li>
</ul>
<h2 id="2019-08-20">2019-08-20</h2>
<ul>
<li>Francesco sent me a new CSV with the raw filenames and paths for the Bioversity migration
<ul>
<li>All file paths are relative to the Typo3 upload path of <code>/fileadmin</code> on the Bioversity website</li>
<li>I create a new column with the derived URL that I can use to download the PDFs with my <code>generate-thumbnails.py</code> script</li>
<li>Unfortunately now the filename column has paths too, so I have to use a simple Python/Jython script in OpenRefine to get the basename of the files in the filename column:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>import os

return os.path.basename(value)
</code></pre><ul>
<li>Then I can try to download all the files again with the script</li>
<li>I also asked Francesco about the strange filenames (.LCK, .zip, and .7z)</li>
</ul>
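<p>Outside of OpenRefine, the derived URL and the basename could be computed in plain Python along these lines. This is a sketch, assuming the paths in the CSV are relative to <code>/fileadmin</code> as described above; <code>derive_pdf_url</code> and <code>derive_filename</code> are hypothetical names.</p>
<pre tabindex="0"><code>import posixpath
from urllib.parse import quote

# Base URL for the Typo3 upload path on the Bioversity website
FILEADMIN_URL = "https://www.bioversityinternational.org/fileadmin/"


def derive_pdf_url(relative_path):
    """Build a download URL from a path relative to /fileadmin."""
    # Percent-encode spaces and accented characters, but keep the slashes
    return FILEADMIN_URL + quote(relative_path.lstrip("/"))


def derive_filename(relative_path):
    """Return only the file name, like the Jython snippet above."""
    return posixpath.basename(relative_path)
</code></pre>
<p>Percent-encoding the path matters for the French and Spanish filenames, since accented characters and spaces are not valid in a URL as-is.</p>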
<h2 id="2019-08-21">2019-08-21</h2>
<ul>
<li>Upload <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality repository to ILRI&rsquo;s GitHub organization</a></li>
<li>Fix a few invalid countries in IITA&rsquo;s <a href="https://dspacetest.cgiar.org/handle/10568/102361">July 29</a> records (aka &ldquo;20195TH.xls&rdquo;)
<ul>
<li>These were not caught by my csv-metadata-quality check script because of a logic error</li>
<li>Remove <code>dc.identifier.uri</code> fields from test data, set <code>id</code> values to &ldquo;-1&rdquo;, add collection mappings according to <code>dc.type</code>, and upload 126 IITA records to CGSpace</li>
</ul>
</li>
</ul>
<h2 id="2019-08-22">2019-08-22</h2>
<ul>
<li>Transfer original <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> repository to ILRI organization on GitHub</li>
</ul>
<h2 id="2019-08-23">2019-08-23</h2>
<ul>
<li>Run system updates on AReS / OpenRXV dev server (linode20) and reboot it</li>
<li>Fix AReS exports on DSpace Test by adding a new nginx proxy pass</li>
</ul>
<h2 id="2019-08-26">2019-08-26</h2>
<ul>
<li>Peter sent 2,943 corrections to the author dump I had originally sent him on 2019-05-27
<ul>
<li>I noticed that one correction had a missing space after the comma, ie &ldquo;Adamou,A.&rdquo; so I corrected it</li>
<li>Also, I should add that as a check to the csv-metadata-quality pipeline</li>
<li>Apply the corrections on my local dev machine in preparation for applying them on CGSpace:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
<li>Apply the corrections on CGSpace and DSpace Test
<ul>
<li>After that I started a full Discovery re-indexing on both servers:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    81m47.057s 
user    8m5.265s 
sys     2m24.715s
</code></pre><ul>
<li>
<p>Peter asked me to add related citation aka <code>cg.link.citation</code> to the item view</p>
<ul>
<li>I created a <a href="https://github.com/ilri/DSpace/pull/430">pull request</a> with a draft implementation and asked for Peter&rsquo;s feedback</li>
</ul>
</li>
<li>
<p>Add the ability to skip certain fields from the csv-metadata-quality script using <code>--exclude-fields</code></p>
<ul>
<li>For example, when I&rsquo;m working on the author corrections I want to run the basic checks on the corrected fields, but not on the original fields, so I would use <code>--exclude-fields dc.contributor.author</code></li>
</ul>
</li>
</ul>
<h2 id="2019-08-27">2019-08-27</h2>
<ul>
<li>File <a href="https://github.com/ilri/OpenRXV/issues/11">an issue on OpenRXV</a> for the bug when selecting communities</li>
<li>Peter approved the related citation changes so I merged the <a href="https://github.com/ilri/DSpace/pull/430">pull request on GitHub</a> and will deploy it to CGSpace this weekend</li>
<li>Add a safety feature to <code>fix-metadata-values.py</code> that skips correction values that contain the &lsquo;|&rsquo; character</li>
<li>Help Francesco from Bioversity with the REST and OAI APIs on CGSpace
<ul>
<li>He is contracted by Bioversity to work on the migration from Typo3</li>
<li>I told him that the OAI interface only exposes Dublin Core fields in its default configuration and that he might want to use OAI to get the latest-changed items, then use REST API to get their metadata</li>
</ul>
</li>
<li>Add a fix for missing space after commas to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.2</li>
</ul>
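<p>The missing-space-after-comma fix can be illustrated with a small regular expression. This is only a sketch of the idea and not necessarily how csv-metadata-quality implements it; <code>fix_missing_space_after_comma</code> is a hypothetical name.</p>
<pre tabindex="0"><code>import re

# A comma followed directly by a letter, as in "Adamou,A."
# [^\W\d_] matches any word character that is not a digit or underscore
COMMA_WITHOUT_SPACE = re.compile(r",(?=[^\W\d_])")


def fix_missing_space_after_comma(value):
    """Insert a space after any comma that is directly followed by a letter."""
    return COMMA_WITHOUT_SPACE.sub(", ", value)
</code></pre>
<p>Here &ldquo;Adamou,A.&rdquo; becomes &ldquo;Adamou, A.&rdquo;, while values that already have the space, and numbers like &ldquo;1,427&rdquo;, are left alone.</p>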
<h2 id="2019-08-28">2019-08-28</h2>
<ul>
<li>Skype with Jane about AReS Phase III priorities</li>
<li>I did a test to automatically fix some authors in the database using my csv-metadata-quality script
<ul>
<li>First I dumped a list of all unique authors:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
COPY 65597
</code></pre><ul>
<li>Then I created a new CSV with two author columns (edit title of second column after):</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv &gt; /tmp/all-authors.csv
</code></pre><ul>
<li>Then I ran my script on the new CSV, skipping one of the author columns:</li>
</ul>
<pre tabindex="0"><code>$ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
</code></pre><ul>
<li>This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc</li>
<li>Then I ran the corrections on my test server and there were 185 of them!</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
</code></pre><ul>
<li>I very well might run these on CGSpace soon&hellip;</li>
</ul>
<h2 id="2019-08-29">2019-08-29</h2>
<ul>
<li>Resume working on the CG Core v2 changes in the <code>5_x-cgcorev2</code> branch again
<ul>
<li>I notice that CG Core doesn&rsquo;t currently have a field for CGSpace&rsquo;s &ldquo;alternative title&rdquo; (<code>dc.title.alternative</code>), but DCTERMS has <code>dcterms.alternative</code> so I <a href="https://github.com/AgriculturalSemantics/cg-core/issues/9">raised an issue about adding it</a></li>
<li>Marie responded and said she would add <code>dcterms.alternative</code></li>
<li>I created a sed script file to perform some replacements of metadata on the XMLUI XSL files:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec ./cgcore-xsl-replacements.sed {} \;
</code></pre><ul>
<li>I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:
<ul>
<li>Need to assess the XSL changes to see if things like <code>not(@qualifier)]</code> still make sense after we move fields from DC to DCTERMS, as some fields will no longer have qualifiers</li>
<li>Do I need to edit the author links to remove <code>dc.contributor.author</code> in <code>0_CGIAR/xsl/aspect/artifactbrowser/item-list-alterations.xsl</code>?</li>
<li>Do I need to edit the author links to remove <code>dc.contributor.author</code> in <code>0_CGIAR/xsl/aspect/discovery/discovery-item-list-alterations.xsl</code>?</li>
</ul>
</li>
<li>Thierry Lewadle asked why some PDFs on CGSpace open in the browser and some download
<ul>
<li>I told him it is because of the &ldquo;content disposition&rdquo; that causes DSpace to tell the browser to open or download the file based on its file size (currently around 8 megabytes)</li>
</ul>
</li>
<li>Peter asked why <a href="https://hdl.handle.net/10568/97825">an item on CGSpace</a> has no Altmetric donut on the item view, but has one in our explorer
<ul>
<li>I looked in the network requests when loading the CGSpace item view and I see the following response to the Altmetric API call:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>&quot;handles&quot;:[&quot;10986/30568&quot;,&quot;10568/97825&quot;],&quot;handle&quot;:&quot;10986/30568&quot;
</code></pre><ul>
<li>So this is the same issue we had before, where Altmetric <em>knows</em> this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn&rsquo;t show it because it seems to be a secondary handle or something</li>
</ul>
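<p>To make the situation concrete, here is a small Python sketch that classifies a handle against the response fragment quoted above. It only illustrates the primary/secondary distinction and makes no claim about how Altmetric&rsquo;s JavaScript actually decides what to show; <code>handle_role</code> is a hypothetical helper.</p>
<pre tabindex="0"><code>import json


def handle_role(response, handle):
    """Classify a handle as 'primary', 'secondary' or 'unknown' for a response."""
    if handle == response.get("handle"):
        return "primary"
    if handle in response.get("handles", []):
        return "secondary"
    return "unknown"


if __name__ == "__main__":
    # The fragment quoted above, wrapped in braces to make it valid JSON
    fragment = '{"handles":["10986/30568","10568/97825"],"handle":"10986/30568"}'
    print(handle_role(json.loads(fragment), "10568/97825"))
</code></pre>
<p>For the CGSpace handle <code>10568/97825</code> this prints <code>secondary</code>, because the response lists <code>10986/30568</code> as its main handle.</p>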
<h2 id="2019-08-31">2019-08-31</h2>
<ul>
<li>Run system updates on DSpace Test (linode19) and reboot the server</li>
<li>Run the author fixes on DSpace Test and CGSpace and start a full Discovery re-index:</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
 
real    90m47.967s
user    8m12.826s
sys     2m27.496s
</code></pre><ul>
<li>I set up a test environment for CG Core v2 on my local environment and ran all the field migrations
<ul>
<li>DSpace comes up and runs, but there are some graphical issues, like missing community names</li>
<li>It turns out that my sed script was replacing some XSL code that was responsible for printing community names</li>
<li>See: <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/custom/communitylist.xsl</code></li>
<li>After reading the code I see that XSLT is reading the community titles from the DIM representation (stored in the <code>$dim</code> variable) created from METS</li>
<li>I modified the patterns in my sed script so that those lines are not replaced and then the community list works again</li>
<li>This is actually not a problem at all because this metadata is only used in the HTML meta tags in XMLUI community lists and has nothing to do with item metadata</li>
</ul>
</li>
</ul>

  
</article> 


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">
  

        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2021-12/">December, 2021</a></li>

<li><a href="/cgspace-notes/2021-11/">November, 2021</a></li>

<li><a href="/cgspace-notes/2021-10/">October, 2021</a></li>

<li><a href="/cgspace-notes/2021-09/">September, 2021</a></li>

<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>

    </ol>
  </section>

  
  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">
      
      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
      
      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
      
      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
      
    </ol>
  </section>
  
</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->
    

    <footer class="blog-footer">
      <p dir="auto">
      
      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
      
      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>
    

  </body>

</html>
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								<!DOCTYPE html>
-												Update theme submodule and regenerate public

											
										
										
											2019-10-11 11:19:42 +03:00
+								<html lang="en" >
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
 								  <head>
 								    <meta charset="utf-8">
 								<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
-												Add notes for 2020-12

											
										
										
											2020-12-06 16:53:29 +02:00
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								<meta property="og:title" content="August, 2019" />
 								<meta property="og:description" content="2019-08-03
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
 								2019-08-04
 								Deploy ORCID identifier updates requested by Bioversity to CGSpace
 								Run system updates on CGSpace (linode18) and reboot it
 								Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
 								Run system updates on DSpace Test (linode19) and reboot it
 								" />
 								<meta property="og:type" content="article" />
 								<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-08/" />
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								<meta property="article:published_time" content="2019-08-03T12:39:51+03:00" />
-												Regenerate public

											
										
										
											2019-10-28 13:43:25 +02:00
+								<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
-												Add notes for 2020-12

											
										
										
											2020-12-06 16:53:29 +02:00
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								<meta name="twitter:card" content="summary"/>
 								<meta name="twitter:title" content="August, 2019"/>
 								<meta name="twitter:description" content="2019-08-03
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
 								2019-08-04
 								Deploy ORCID identifier updates requested by Bioversity to CGSpace
 								Run system updates on CGSpace (linode18) and reboot it
 								Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
 								Run system updates on DSpace Test (linode19) and reboot it
 								"/>
-												Add notes for 2021-11-22

											
										
										
											2021-11-22 16:47:50 +02:00
+								<meta name="generator" content="Hugo 0.89.4" />
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
 								<script type="application/ld+json">
 								{
 								  "@context": "http://schema.org",
 								  "@type": "BlogPosting",
 								  "headline": "August, 2019",
-												Regenerate public

											
										
										
											2020-04-02 10:55:42 +03:00
+								  "url": "https://alanorth.github.io/cgspace-notes/2019-08/",
-												Add notes for 2019-09-27

											
										
										
											2019-09-27 17:53:18 +03:00
+								  "wordCount": "2703",
-												Update theme submodule and regenerate public

											
										
										
											2019-10-11 11:19:42 +03:00
+								  "datePublished": "2019-08-03T12:39:51+03:00",
-												Regenerate public

											
										
										
											2019-10-28 13:43:25 +02:00
+								  "dateModified": "2019-10-28T13:39:25+02:00",
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								  "author": {
 								    "@type": "Person",
 								    "name": "Alan Orth"
 								  },
 								  "keywords": "Notes"
 								}
 								</script>
 								    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-08/">
 								    <title>August, 2019 | CGSpace Notes</title>
-												Update theme submodule and regenerate public

											
										
										
											2019-10-11 11:19:42 +03:00
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								    <!-- combined, minified CSS -->
-												Regenerate docs

											
										
										
											2020-01-23 20:19:38 +02:00
-												Update docs

											
										
										
											2021-01-24 09:46:27 +02:00
+								    <link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
-												Update theme submodule and regenerate public

											
										
										
											2019-10-11 11:19:42 +03:00
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
-												Regenerate docs

											
										
										
											2020-01-28 12:01:42 +02:00
+								    <!-- minified Font Awesome for SVG icons -->
-												Regenerate docs

											
										
										
											2021-09-28 10:32:32 +03:00
+								    <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
-												Regenerate docs

											
										
										
											2020-01-28 12:01:42 +02:00
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								    <!-- RSS 2.0 feed -->
 								  </head>
 								  <body>
 								    <div class="blog-masthead">
 								      <div class="container">
 								        <nav class="nav blog-nav">
 								          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
 								        </nav>
 								      </div>
 								    </div>
 								    <header class="blog-header">
 								      <div class="container">
-												Update theme submodule and regenerate public

											
										
										
											2019-10-11 11:19:42 +03:00
+								        <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
 								        <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								      </div>
 								    </header>
 								    <div class="container">
 								      <div class="row">
 								        <div class="col-sm-8 blog-main">
 								<article class="blog-post">
 								  <header>
-												Update theme submodule and regenerate public

											
										
										
											2019-10-11 11:19:42 +03:00
+								    <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-08/">August, 2019</a></h2>
-												Regenerate docs

											
										
										
											2020-11-16 10:54:00 +02:00
+								    <p class="blog-post-meta">
 								<time datetime="2019-08-03T12:39:51+03:00">Sat Aug 03, 2019</time>
 								 in
-												Regenerate docs

											
										
										
											2020-01-28 12:01:42 +02:00
+								<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
 								</p>
 								  </header>
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								  <h2 id="2019-08-03">2019-08-03</h2>
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								<ul>
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								<li>Look at Bioversity&rsquo;s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name&hellip;</li>
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								</ul>
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-04">2019-08-04</h2>
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								<ul>
 								<li>Deploy ORCID identifier updates requested by Bioversity to CGSpace</li>
 								<li>Run system updates on CGSpace (linode18) and reboot it
 								<ul>
 								<li>Before updating it I checked Solr and verified that all statistics cores were loaded properly&hellip;</li>
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								<li>After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s lucky.</li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
-												Add notes for 2019-08

											
										
										
											2019-08-04 22:49:04 +03:00
+								<li>Run system updates on DSpace Test (linode19) and reboot it</li>
 								</ul>
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-05">2019-08-05</h2>
-												Add notes for 2019-08-05

											
										
										
											2019-08-05 16:49:31 +03:00
+								<ul>
 								<li>Update Tomcat to 7.0.96 in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
 								<li>Update PostgreSQL JDBC driver to 42.2.6 in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
 								<li>Deploy both on DSpace Test (linode19)</li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>Looking at the 1429 records for Bioversity migration again
 								<ul>
 								<li>The following items use the same exact PDF and seem to be duplicates:
 								<ul>
-												Add notes for 2019-12-30

											
										
										
											2019-12-30 11:28:49 +02:00
+								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=10191">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10191</a></li>
 								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=342">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=342</a></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
 								<li>The following items use the same exact PDF, but one seems to be incorrect:
 								<ul>
-												Add notes for 2019-12-30

											
										
										
											2019-12-30 11:28:49 +02:00
+								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=5347">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5347</a></li>
 								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=5340">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=5340</a></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
 								<li>The following PDFs are used by several items incorrectly:
-												Add notes for 2019-08-05

											
										
										
											2019-08-05 16:49:31 +03:00
+								<ul>
 								<li><code>Report_of_a_Working_Group_on_Allium_7.pdf</code></li>
 								<li><code>Report_of_a_Working_Group_on_Allium_Fourth_meeting_1696.pdf</code></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
-												Add notes for 2019-08-06

											
										
										
											2019-08-06 20:07:44 +03:00
+								<li>I checked the SHA1 hashes of each PDF and found that some appear more than once&hellip;</li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>The following items use the same PDF with a different name, but seem to be duplicates (pick one?):
 								<ul>
-												Add notes for 2019-12-30

											
										
										
											2019-12-30 11:28:49 +02:00
+								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=433">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=433</a></li>
 								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=10189">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10189</a></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
 								<li>The following items use the same PDF with a different name, but seem to be duplicates (pick one?):
 								<ul>
-												Add notes for 2019-12-30

											
										
										
											2019-12-30 11:28:49 +02:00
+								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=332">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=332</a></li>
 								<li><a href="https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1%5Bnews%5D=10187">https://www.bioversityinternational.org/index.php?id=244&amp;tx_news_pi1[news]=10187</a></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
 								<li>There are about thirty PDFs that have French or Spanish filenames and there seems to be an encoding issue
 								<ul>
-												Add notes for 2019-08-05

											
										
										
											2019-08-05 16:49:31 +03:00
+								<li>I asked Francesco if he can give me a PDF URL column instead of a &ldquo;filename&rdquo; column so I can download the files myself</li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>At <em>least</em> the ~50 filenames identified by the following GREL will have issues:</li>
 								</ul>
 								</li>
 								</ul>
 								</li>
 								</ul>
-												Add notes for 2021-09-13

											
										
										
											2021-09-13 16:21:16 +03:00
+								<pre tabindex="0"><code>or(
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								  isNotNull(value.match(/^.*’.*$/)),
 								  isNotNull(value.match(/^.*é.*$/)),
 								  isNotNull(value.match(/^.*á.*$/)),
 								  isNotNull(value.match(/^.*è.*$/)),
 								  isNotNull(value.match(/^.*í.*$/)),
 								  isNotNull(value.match(/^.*ó.*$/)),
 								  isNotNull(value.match(/^.*ú.*$/)),
 								  isNotNull(value.match(/^.*à.*$/)),
 								  isNotNull(value.match(/^.*û.*$/))
-												Add notes for 2019-08-05

											
										
										
											2019-08-05 16:49:31 +03:00
+								).toString()
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</code></pre><ul>
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								<li>I tried to extract the filenames and construct a URL to download the PDFs with my <code>generate-thumbnails.py</code> script, but there seem to be several paths for PDFs so I can&rsquo;t guess it properly</li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>I will have to wait for Francesco to respond about the PDFs, or perhaps proceed with a metadata-only upload so we can do other checks on DSpace Test</li>
-												Add notes for 2019-08-05

											
										
										
											2019-08-05 16:49:31 +03:00
+								</ul>
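<p>As a rough cross-check outside OpenRefine, the GREL expression above can be approximated in Python. This is only a sketch: the character list is copied from the GREL, while the function name and the sample filenames are hypothetical and do not come from the Bioversity CSV.</p>

```python
import re

# Characters copied from the GREL expression above. Any other accented
# character would have to be added here to be detected.
SUSPICIOUS = re.compile("[’éáèíóúàû]")


def has_suspicious_chars(filename: str) -> bool:
    """Return True if the filename contains one of the listed characters."""
    return SUSPICIOUS.search(filename) is not None


# Hypothetical filenames, for illustration only
filenames = [
    "Report_of_a_Working_Group_on_Allium_7.pdf",
    "Catálogo_de_semillas.pdf",
    "L’agrobiodiversité.pdf",
]

flagged = [name for name in filenames if has_suspicious_chars(name)]
print(flagged)
```

<p>Like the GREL, this only flags the listed characters, so it gives a lower bound on the number of problematic filenames.</p>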
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-06">2019-08-06</h2>
-												Add notes for 2019-08-06

											
										
										
											2019-08-06 20:07:44 +03:00
+								<ul>
 								<li>Francesca responded to address my feedback yesterday
 								<ul>
 								<li>I made some changes to the CSV based on her feedback (remove two duplicates, change one PDF file name, change two titles)</li>
 								<li>Then I found some items that have PDFs in multiple languages that only list one language in <code>dc.language.iso</code> so I changed them</li>
 								<li>Strangely, one item was referring to a 7zip file&hellip;</li>
 								<li>After removing the two duplicates there are now 1427 records</li>
-												Update notes for 2019-08-06

											
										
										
											2019-08-06 20:11:27 +03:00
+								<li>Fix one invalid ISSN: 1020-2002→1020-3362</li>
-												Add notes for 2019-08-06

											
										
										
											2019-08-06 20:07:44 +03:00
+								</ul>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</li>
 								</ul>
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-07">2019-08-07</h2>
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								<ul>
 								<li>Daniel Haile-Michael asked about using a logical OR with the DSpace OpenSearch, but I looked in the DSpace manual and it does not seem to be possible</li>
 								</ul>
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-08">2019-08-08</h2>
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								<ul>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>Moayad noticed that the HTTPS certificate expired on the AReS dev server (linode20)
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								<ul>
 								<li>The first problem was that there is a Docker container listening on port 80, so it conflicts with the ACME http-01 validation</li>
 								<li>The second problem was that we only allow access to port 80 from localhost</li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>I adjusted the <code>renew-letsencrypt</code> systemd service so it stops/starts the Docker container and firewall:</li>
 								</ul>
 								</li>
 								</ul>
-												Add notes for 2021-09-13

											
										
										
											2021-09-13 16:21:16 +03:00
+								<pre tabindex="0"><code># /opt/certbot-auto renew --standalone --pre-hook &quot;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&quot; --post-hook &quot;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&quot;
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</code></pre><ul>
 								<li>It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains</li>
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								<li>Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04&rsquo;s <a href="https://ssl-config.mozilla.org/#server=nginx&amp;server-version=1.16.0&amp;config=intermediate&amp;openssl-version=1.1.0g&amp;hsts=false&amp;ocsp=false">OpenSSL 1.1.0g with nginx 1.16.0</a></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>Run all system updates on AReS dev server (linode20) and reboot it</li>
 								<li>Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:</li>
 								</ul>
-												Add notes for 2021-09-13

											
										
										
											2021-09-13 16:21:16 +03:00
+								<pre tabindex="0"><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
 								$ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
 								$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs2.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
 								$ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</code></pre><ul>
 								<li>
 								<p>(the weird sed regex removes color codes, because my generate-thumbnails script prints pretty colors)</p>
 								</li>
 								<li>
 								<p>Some PDFs are uploaded in different paths so I have to try a few times to get them all:</p>
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								<ul>
 								<li><code>/fileadmin/_migrated/uploads/tx_news/</code></li>
 								<li><code>/fileadmin/user_upload/online_library/publications/pdfs/</code></li>
 								<li><code>/fileadmin/user_upload/</code></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
 								<li>
-												Add notes for 2020-01-27

											
										
										
											2020-01-27 16:20:44 +02:00
+								<p>Even so, there are still 52 items with incorrect filenames, so I can&rsquo;t derive their PDF URLs&hellip;</p>
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								<ul>
 								<li>For example, <code>Wild_cherry_Prunus_avium_859.pdf</code> is here (with double underscore): <a href="https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf">https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf</a></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
 								<li>
 								<p>I will proceed with a metadata-only upload first and then let them know about the missing PDFs</p>
 								</li>
 								<li>
 								<p>Troubleshoot an issue we had with proxying to the new development version of AReS from DSpace Test (linode19)</p>
-												Update notes for 2019-08-08

											
										
										
											2019-08-09 01:42:13 +03:00
+								<ul>
 								<li>For some reason the host header in the proxy pass is not set so nginx on DSpace Test makes a request to the upstream nginx on an IP-based virtual host</li>
 								<li>The upstream nginx returns HTTP 444 because we configured it to not answer when a request does not send a valid hostname</li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>The solution is to set the host header when proxy passing:</li>
 								</ul>
 								</li>
 								</ul>
-												Add notes for 2021-09-13

											
										
										
											2021-09-13 16:21:16 +03:00
+								<pre tabindex="0"><code>proxy_set_header Host dev.ares.codeobia.com;
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</code></pre><ul>
 								<li>Though I am really wondering why this happened now, because the configuration has been working for months&hellip;</li>
 								<li>Improve the output of the suspicious characters check in the <a href="https://github.com/alanorth/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.0</li>
-												Add notes for 2019-08-08

											
										
										
											2019-08-08 18:10:44 +03:00
+								</ul>
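<p>The color-code stripping and the retries over several upload paths described above can be sketched in Python. The regular expression is the same one used in the sed command and the three paths are the ones listed above. The function names are hypothetical, and the sample log line only imitates what the sed expressions imply about the output of <code>generate-thumbnails.py</code>.</p>

```python
import re

# Same pattern as the sed expression above:
# ESC [ <1-2 digits>[;<1-2 digits>] followed by m, G, or K
ANSI_RE = re.compile(r"\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]")

BASE_URL = "https://www.bioversityinternational.org"

# The three upload paths listed above, tried in order
PATHS = [
    "/fileadmin/_migrated/uploads/tx_news/",
    "/fileadmin/user_upload/online_library/publications/pdfs/",
    "/fileadmin/user_upload/",
]


def strip_ansi(text: str) -> str:
    """Remove terminal color codes from a line of log output."""
    return ANSI_RE.sub("", text)


def candidate_urls(filename: str) -> list[str]:
    """Build one candidate download URL per known upload path."""
    return [f"{BASE_URL}{path}{filename}" for path in PATHS]


# Assumed log line format, for illustration only
line = "\x1b[32m> Downloading Wild_cherry__Prunus_avium__859.pdf...\x1b[0m"
print(strip_ansi(line))

for url in candidate_urls("Wild_cherry__Prunus_avium__859.pdf"):
    print(url)
```

<p>This does not help with the 52 filenames that are themselves wrong, because no path prefix can fix a filename that differs from the one on the server.</p>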
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-09">2019-08-09</h2>
-												Add notes for 2019-08-09

											
										
										
											2019-08-09 09:57:26 +03:00
+								<ul>
 								<li>Looking at the 128 IITA records (20195TH.xls) that Sisay uploaded to DSpace Test last month: <a href="https://dspacetest.cgiar.org/handle/10568/102361">IITA_July_29</a>
 								<ul>
 								<li>The records are pretty clean because Sisay ran them through the csv-metadata-quality tool</li>
 								<li>I fixed one incorrect country (MELBOURNE)</li>
 								<li>I normalized all DOIs to be <a href="https://doi.org">https://doi.org</a> format</li>
 								<li>This item is using the wrong Google Books link: <a href="https://dspacetest.cgiar.org/handle/10568/102593">https://dspacetest.cgiar.org/handle/10568/102593</a></li>
 								<li>The French abstract here has copy/paste errors: <a href="https://dspacetest.cgiar.org/handle/10568/102491">https://dspacetest.cgiar.org/handle/10568/102491</a></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
 								<ul>
-												Add notes for 2019-08-09

											
										
										
											2019-08-09 09:57:26 +03:00
+								<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
 								<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
-												Add notes for 2019-08-09

											
										
										
											2019-08-09 09:57:26 +03:00
+								<li>I asked Bosede to check about twenty-five invalid AGROVOC subjects identified by the csv-metadata-quality script</li>
 								<li>I still need to check the sponsors and then check for duplicates</li>
 								</ul>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</li>
 								</ul>
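<p>The DOI normalization mentioned above could be done along these lines. This is a minimal sketch that assumes the values arrive as bare DOIs, <code>doi:</code>-prefixed strings, or older <code>dx.doi.org</code> URLs. The prefix list, the function name, and the example DOI are illustrative and are not taken from the IITA records.</p>

```python
import re

# Prefixes that may appear in front of a DOI. This list is an assumption
# and may need extending for real data.
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value: str) -> str:
    """Return the DOI as an https://doi.org/ URL."""
    doi = DOI_PREFIX.sub("", value.strip())
    return f"https://doi.org/{doi}"


# Hypothetical inputs, for illustration only
for raw in [
    "10.1000/xyz123",
    "doi: 10.1000/xyz123",
    "http://dx.doi.org/10.1000/xyz123",
    "https://doi.org/10.1000/xyz123",
]:
    print(normalize_doi(raw))
```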
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-10">2019-08-10</h2>
-												Add notes for 2019-08-10

											
										
										
											2019-08-11 00:12:28 +03:00
+								<ul>
 								<li>Add checks for uncommon filename extensions and replacements for unnecessary Unicode to the csv-metadata-quality script</li>
 								</ul>
-												Add notes for 2019-12-17

											
										
										
											2019-12-17 14:49:24 +02:00
+								<h2 id="2019-08-12">2019-08-12</h2>
-												Add notes for 2019-08-12

											
										
										
											2019-08-12 18:08:00 +03:00
+								<ul>
 								<li>Looking at the 128 IITA records again:
 								<ul>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								<li>Validate and normalize sponsors against our 2019-02 list using reconcile-csv and OpenRefine:
 								<ul>
-												Add notes for 2019-08-12

											
										
										
											2019-08-12 18:08:00 +03:00
+								<li><code>$ lein run ~/src/git/DSpace/2019-02-22-sponsorships.csv name id</code></li>
 								<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</ul>
 								</li>
 								<li>I checked the collection for duplicates and found a few:
 								<ul>
-												Add notes for 2019-08-12

											
										
										
											2019-08-12 18:08:00 +03:00
+								<li><a href="https://dspacetest.cgiar.org/handle/10568/102513">https://dspacetest.cgiar.org/handle/10568/102513</a> is a duplicate of CIAT item: <a href="https://cgspace.cgiar.org/handle/10568/44158">https://cgspace.cgiar.org/handle/10568/44158</a></li>
 								<li><a href="https://dspacetest.cgiar.org/handle/10568/102512">https://dspacetest.cgiar.org/handle/10568/102512</a> is a duplicate of CIAT item: <a href="https://cgspace.cgiar.org/handle/10568/43557">https://cgspace.cgiar.org/handle/10568/43557</a></li>
 								</ul>
-												Add notes for 2019-11-28

											
										
										
											2019-11-28 17:30:45 +02:00
+								</li>
 								</ul>
 								</li>
 								</ul>
 								<h2 id="2019-08-13">2019-08-13</h2>
 								<ul>
 								<li>Create a test user on DSpace Test for Mohammad Salem to attempt depositing:</li>
 								</ul>
 								<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
 								</code></pre><ul>
 								<li>Create and merge a pull request (<a href="https://github.com/ilri/DSpace/pull/429">#429</a>) to add eleven new CCAFS Phase II Project Tags to CGSpace</li>
 								<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">Solr cores issue</a> last week, but they could not reproduce the issue
 								<ul>
 								<li>I told them not to continue, and that we would keep an eye on it and keep troubleshooting it (if necessary) in the public eye on the dspace-tech and Solr mailing lists</li>
 								</ul>
 								</li>
 								<li>Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:</li>
 								</ul>
 								<pre tabindex="0"><code>$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
 								...
 								java.lang.OutOfMemoryError: GC overhead limit exceeded
 								</code></pre><ul>
 								<li>I increased the heap size to 1536m and tried again:</li>
 								</ul>
 								<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1536m&quot;
 								$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
 								</code></pre><ul>
 								<li>This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM</li>
 								<li>(oops, I realize that actually I forgot to delete items I had flagged as duplicates, so the total should be 1,427 items)</li>
 								</ul>
 								<h2 id="2019-08-14">2019-08-14</h2>
 								<ul>
 								<li>I imported the 1,427 Bioversity records into DSpace Test
 								<ul>
 								<li>To make sure we didn&rsquo;t have memory issues I reduced Tomcat&rsquo;s JVM heap by 512m, increased the import process&rsquo;s heap to 512m, and split the input file into two parts with about 700 items each</li>
 								<li>Then I had to create a few new temporary collections on DSpace Test that had been created on CGSpace after our last sync</li>
 								<li>After that the import succeeded:</li>
 								</ul>
 								</li>
 								</ul>
 								<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
 								$ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
 								$ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
 								</code></pre><ul>
 								<li>The next step is to check these items for duplicates</li>
 								</ul>
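<p>Splitting the input file into two parts, as described above, only requires repeating the header row in each part. The following is a rough sketch of one way to do that with Python's <code>csv</code> module; the function name and the sample rows are hypothetical, and it is not necessarily how the file was actually split.</p>

```python
import csv
import io
from typing import List


def split_csv(text: str, parts: int = 2) -> List[str]:
    """Split CSV text into roughly equal parts, repeating the header in each."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Ceiling division so that no rows are dropped; at least one row per part.
    size = max(1, -(-len(body) // parts))
    chunks = []
    for start in range(0, len(body), size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(body[start:start + size])
        chunks.append(out.getvalue())
    return chunks


if __name__ == "__main__":
    # Made-up sample data; the real file had about 1,427 rows.
    sample = "id,dc.title\n1,First\n2,Second\n3,Third\n"
    for number, chunk in enumerate(split_csv(sample), start=1):
        print(f"--- part {number} ---")
        print(chunk, end="")
```

<p>Each part can then be passed to a separate <code>dspace metadata-import</code> run, which keeps the memory needed per run lower.</p>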
 								<h2 id="2019-08-16">2019-08-16</h2>
 								<ul>
 								<li>Email Bioversity to let them know that the 1,427 records are on DSpace Test and that Abenet should look over them</li>
 								</ul>
 								<h2 id="2019-08-18">2019-08-18</h2>
 								<ul>
 								<li>Deploy latest <code>5_x-prod</code> branch on CGSpace (linode18), including the <a href="https://github.com/ilri/DSpace/pull/429">new CCAFS project tags</a></li>
 								<li>Deploy Tomcat 7.0.96 and PostgreSQL JDBC 42.2.6 driver on CGSpace (linode18)</li>
 								<li>After restarting Tomcat one of the Solr statistics cores failed to start up:</li>
 								</ul>
 								<pre tabindex="0"><code>statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
 								</code></pre><ul>
 								<li>I decided to run all system updates on the server and reboot it</li>
 								<li>After reboot the statistics-2018 core failed to load so I restarted <code>tomcat7</code> again</li>
 								<li>After this last restart all Solr cores seem to be up and running</li>
 								</ul>
 								<h2 id="2019-08-20">2019-08-20</h2>
 								<ul>
 								<li>Francesco sent me a new CSV with the raw filenames and paths for the Bioversity migration
 								<ul>
 								<li>All file paths are relative to the Typo3 upload path of <code>/fileadmin</code> on the Bioversity website</li>
 								<li>I create a new column with the derived URL that I can use to download the PDFs with my <code>generate-thumbnails.py</code> script</li>
 								<li>Unfortunately now the filename column has paths too, so I have to use a simple Python/Jython script in OpenRefine to get the basename of the files in the filename column:</li>
 								</ul>
 								</li>
 								</ul>
 								<pre tabindex="0"><code>import os
 								return os.path.basename(value)
 								</code></pre><ul>
 								<li>Then I can try to download all the files again with the script</li>
 								<li>I also asked Francesco about the strange filenames (.LCK, .zip, and .7z)</li>
 								</ul>
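<p>The two derived columns described above (a download URL built from the <code>/fileadmin</code>-relative path, and a bare filename) can be sketched outside OpenRefine as follows. The base URL here is a placeholder, not the real Bioversity host, and the function names and sample path are hypothetical.</p>

```python
import posixpath
from urllib.parse import quote

# Placeholder host: the real Bioversity site URL is not given in these notes.
BASE_URL = "https://example.org/fileadmin"


def derive_url(relative_path: str, base_url: str = BASE_URL) -> str:
    """Join the upload root and the relative path, percent-encoding unsafe characters."""
    return base_url.rstrip("/") + "/" + quote(relative_path.lstrip("/"))


def derive_filename(relative_path: str) -> str:
    """Equivalent of the Jython os.path.basename(value) expression."""
    return posixpath.basename(relative_path)


if __name__ == "__main__":
    path = "user_upload/reports/Annual Report.pdf"  # made-up example path
    print(derive_url(path))
    print(derive_filename(path))
```

<p>Using <code>posixpath</code> rather than <code>os.path</code> keeps the behaviour the same on any platform, since the paths in the CSV use forward slashes.</p>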
 								<h2 id="2019-08-21">2019-08-21</h2>
 								<ul>
 								<li>Upload <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality repository to ILRI&rsquo;s GitHub organization</a></li>
 								<li>Fix a few invalid countries in IITA&rsquo;s <a href="https://dspacetest.cgiar.org/handle/10568/102361">July 29</a> records (aka &ldquo;20195TH.xls&rdquo;)
 								<ul>
 								<li>These were not caught by my csv-metadata-quality check script because of a logic error</li>
 								<li>Remove <code>dc.identifier.uri</code> fields from test data, set <code>id</code> values to &ldquo;-1&rdquo;, add collection mappings according to <code>dc.type</code>, and upload 126 IITA records to CGSpace</li>
 								</ul>
 								</li>
 								</ul>
 								<h2 id="2019-08-22">2019-08-22</h2>
 								<ul>
 								<li>Transfer original <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> repository to ILRI organization on GitHub</li>
 								</ul>
 								<h2 id="2019-08-23">2019-08-23</h2>
 								<ul>
 								<li>Run system updates on AReS / OpenRXV dev server (linode20) and reboot it</li>
 								<li>Fix AReS exports on DSpace Test by adding a new nginx proxy pass</li>
 								</ul>
 								<h2 id="2019-08-26">2019-08-26</h2>
 								<ul>
 								<li>Peter sent 2,943 corrections to the author dump I had originally sent him on 2019-05-27
 								<ul>
 								<li>I noticed that one correction had a missing space after the comma, ie &ldquo;Adamou,A.&rdquo; so I corrected it</li>
 								<li>Also, I should add that as a check to the csv-metadata-quality pipeline</li>
 								<li>Apply the corrections on my local dev machine in preparation for applying them on CGSpace:</li>
 								</ul>
 								</li>
 								</ul>
 								<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
 								</code></pre><ul>
 								<li>Apply the corrections on CGSpace and DSpace Test
 								<ul>
 								<li>After that I started a full Discovery re-indexing on both servers:</li>
 								</ul>
 								</li>
 								</ul>
 								<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
 								real    81m47.057s
 								user    8m5.265s
 								sys     2m24.715s
 								</code></pre><ul>
 								<li>
 								<p>Peter asked me to add related citation aka <code>cg.link.citation</code> to the item view</p>
 								<ul>
 								<li>I created a <a href="https://github.com/ilri/DSpace/pull/430">pull request</a> with a draft implementation and asked for Peter&rsquo;s feedback</li>
 								</ul>
 								</li>
 								<li>
 								<p>Add the ability to skip certain fields from the csv-metadata-quality script using <code>--exclude-fields</code></p>
 								<ul>
 								<li>For example, when I&rsquo;m working on the author corrections I want to do the basic checks on the corrected fields, but not on the original fields, so I would use <code>--exclude-fields dc.contributor.author</code></li>
 								</ul>
 								</li>
 								</ul>
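<p>The missing-space check and the field exclusion mentioned above can be illustrated together. This is only a sketch under assumptions: the regular expression, the function names, and the <code>correct</code> column name are illustrative, and the actual csv-metadata-quality implementation may differ.</p>

```python
import re

# A comma directly followed by a letter, as in "Adamou,A.".
# Digits are excluded on purpose so that numbers like "1,000" are left alone.
_COMMA_NO_SPACE = re.compile(r",(?=[^\W\d_])")


def fix_missing_space_after_comma(value: str) -> str:
    """Insert a space after any comma that is immediately followed by a letter."""
    return _COMMA_NO_SPACE.sub(", ", value)


def check_row(row: dict, exclude_fields=()) -> dict:
    """Apply the fix to every field except those listed in exclude_fields."""
    return {
        field: value if field in exclude_fields else fix_missing_space_after_comma(value)
        for field, value in row.items()
    }


if __name__ == "__main__":
    # The original column is excluded, so only the correction column is changed.
    row = {"dc.contributor.author": "Adamou,A.", "correct": "Adamou,A."}
    print(check_row(row, exclude_fields=("dc.contributor.author",)))
```

<p>Leaving the original column untouched matters because that value is what gets matched against the database when the corrections are applied.</p>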
 								<h2 id="2019-08-27">2019-08-27</h2>
 								<ul>
 								<li>File <a href="https://github.com/ilri/OpenRXV/issues/11">an issue on OpenRXV</a> for the bug when selecting communities</li>
 								<li>Peter approved the related citation changes so I merged the <a href="https://github.com/ilri/DSpace/pull/430">pull request on GitHub</a> and will deploy it to CGSpace this weekend</li>
 								<li>Add a safety feature to <code>fix-metadata-values.py</code> that skips correction values that contain the &lsquo;|&rsquo; character</li>
 								<li>Help Francesco from Bioversity with the REST and OAI APIs on CGSpace
 								<ul>
 								<li>He is contracted by Bioversity to work on the migration from Typo3</li>
 								<li>I told him that the OAI interface only exposes Dublin Core fields in its default configuration and that he might want to use OAI to get the latest-changed items, then use REST API to get their metadata</li>
 								</ul>
 								</li>
 								<li>Add a fix for missing space after commas to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.2</li>
 								</ul>
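<p>The approach suggested to Francesco above (use OAI to find recently changed items, then REST to fetch their full metadata) might look roughly like the sketch below, which only builds the request URLs. The endpoint paths <code>/oai/request</code> and <code>/rest/handle/…</code> and the <code>expand=metadata</code> parameter are assumptions based on a typical DSpace 5 setup and should be checked against the live server; the handle used in the example is the one that appears elsewhere in these notes.</p>

```python
from urllib.parse import urlencode

# Assumed endpoint locations for a typical DSpace 5 installation.
OAI_ENDPOINT = "https://cgspace.cgiar.org/oai/request"
REST_ENDPOINT = "https://cgspace.cgiar.org/rest"


def oai_list_identifiers_url(from_date: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListIdentifiers request for items changed since from_date."""
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": metadata_prefix,
        "from": from_date,  # YYYY-MM-DD granularity
    }
    return f"{OAI_ENDPOINT}?{urlencode(params)}"


def rest_item_url(handle: str) -> str:
    """Build a REST request for one item's full metadata, looked up by handle."""
    return f"{REST_ENDPOINT}/handle/{handle}?expand=metadata"


if __name__ == "__main__":
    print(oai_list_identifiers_url("2019-08-01"))
    print(rest_item_url("10568/97825"))
```

<p>The OAI response would be parsed for item handles, and each handle passed to the REST call to get all fields rather than only Dublin Core.</p>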
 								<h2 id="2019-08-28">2019-08-28</h2>
 								<ul>
 								<li>Skype with Jane about AReS Phase III priorities</li>
 								<li>I did a test to automatically fix some authors in the database using my csv-metadata-quality script
 								<ul>
 								<li>First I dumped a list of all unique authors:</li>
 								</ul>
 								</li>
 								</ul>
 								<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
 								COPY 65597
 								</code></pre><ul>
 								<li>Then I created a new CSV with two author columns (edit title of second column after):</li>
 								</ul>
 								<pre tabindex="0"><code>$ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv &gt; /tmp/all-authors.csv
 								</code></pre><ul>
 								<li>Then I ran my script on the new CSV, skipping one of the author columns:</li>
 								</ul>
 								<pre tabindex="0"><code>$ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
 								</code></pre><ul>
 								<li>This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc</li>
 								<li>Then I ran the corrections on my test server and there were 185 of them!</li>
 								</ul>
 								<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
 								</code></pre><ul>
 								<li>I very well might run these on CGSpace soon&hellip;</li>
 								</ul>
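<p>After the quality script runs, only the rows where the cleaned value differs from the original are real corrections (185 of them in the test above). A small sketch for listing those rows from a two-column CSV follows; the column names and the sample data are assumptions for illustration, not the actual contents of <code>/tmp/authors.csv</code>.</p>

```python
import csv
import io
from typing import List, Tuple


def changed_rows(text: str,
                 original: str = "dc.contributor.author",
                 corrected: str = "correct") -> List[Tuple[str, str]]:
    """Return (original, corrected) pairs for rows where the two columns differ."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row[original], row[corrected])
            for row in reader
            if row[original] != row[corrected]]


if __name__ == "__main__":
    # Made-up sample: one row needs a correction, one is already clean.
    sample = (
        "dc.contributor.author,correct\n"
        '"Adamou,A.","Adamou, A."\n'
        '"Orth, A.","Orth, A."\n'
    )
    for before, after in changed_rows(sample):
        print(f"{before!r} -> {after!r}")
```

<p>Reviewing this short list before running the fixes against a production database is cheaper than auditing all of the dumped author values.</p>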
 								<h2 id="2019-08-29">2019-08-29</h2>
 								<ul>
 								<li>Resume working on the CG Core v2 changes in the <code>5_x-cgcorev2</code> branch again
 								<ul>
 								<li>I notice that CG Core doesn&rsquo;t currently have a field for CGSpace&rsquo;s &ldquo;alternative title&rdquo; (<code>dc.title.alternative</code>), but DCTERMS has <code>dcterms.alternative</code> so I <a href="https://github.com/AgriculturalSemantics/cg-core/issues/9">raised an issue about adding it</a></li>
 								<li>Marie responded and said she would add <code>dcterms.alternative</code></li>
 								<li>I created a sed script file to perform some replacements of metadata on the XMLUI XSL files:</li>
 								</ul>
 								</li>
 								</ul>
 								<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec ./cgcore-xsl-replacements.sed {} \;
 								</code></pre><ul>
 								<li>I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:
 								<ul>
 								<li>Need to assess the XSL changes to see if things like <code>not(@qualifier)]</code> still make sense after we move fields from DC to DCTERMS, as some fields will no longer have qualifiers</li>
 								<li>Do I need to edit the author links to remove <code>dc.contributor.author</code> in <code>0_CGIAR/xsl/aspect/artifactbrowser/item-list-alterations.xsl</code>?</li>
 								<li>Do I need to edit the author links to remove <code>dc.contributor.author</code> in <code>0_CGIAR/xsl/aspect/discovery/discovery-item-list-alterations.xsl</code>?</li>
 								</ul>
 								</li>
 								<li>Thierry Lewadle asked why some PDFs on CGSpace open in the browser and some download
 								<ul>
 								<li>I told him it is because of the &ldquo;content disposition&rdquo; that causes DSpace to tell the browser to open or download the file based on its file size (currently around 8 megabytes)</li>
 								</ul>
 								</li>
 								<li>Peter asked why <a href="https://hdl.handle.net/10568/97825">an item on CGSpace</a> has no Altmetric donut on the item view, but has one in our explorer
 								<ul>
 								<li>I looked in the network requests when loading the CGSpace item view and I see the following response to the Altmetric API call:</li>
 								</ul>
 								</li>
 								</ul>
 								<pre tabindex="0"><code>&quot;handles&quot;:[&quot;10986/30568&quot;,&quot;10568/97825&quot;],&quot;handle&quot;:&quot;10986/30568&quot;
 								</code></pre><ul>
 								<li>So this is the same issue we had before, where Altmetric <em>knows</em> this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn&rsquo;t show it because it seems to be a secondary handle or something</li>
 								</ul>
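<p>The distinction seen in the Altmetric response above, between the single <code>handle</code> value and the longer <code>handles</code> list, can be checked programmatically. In this sketch the response fragment from the notes is wrapped in braces to make it valid JSON; the function name and the "primary"/"secondary" labels are my own shorthand, not Altmetric terminology.</p>

```python
import json


def handle_role(response_text: str, handle: str) -> str:
    """Classify a handle as 'primary', 'secondary', or 'unknown' in an Altmetric response."""
    data = json.loads(response_text)
    if data.get("handle") == handle:
        return "primary"
    if handle in data.get("handles", []):
        return "secondary"
    return "unknown"


if __name__ == "__main__":
    # The fragment quoted in the notes, wrapped in braces so it parses as JSON.
    response = '{"handles":["10986/30568","10568/97825"],"handle":"10986/30568"}'
    print(handle_role(response, "10568/97825"))
    print(handle_role(response, "10986/30568"))
```

<p>A "secondary" result for the CGSpace handle would be consistent with the donut not being shown on the item view, though the exact client-side logic is not confirmed here.</p>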
 								<h2 id="2019-08-31">2019-08-31</h2>
 								<ul>
 								<li>Run system updates on DSpace Test (linode19) and reboot the server</li>
 								<li>Run the author fixes on DSpace Test and CGSpace and start a full Discovery re-index:</li>
 								</ul>
 								<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
 								real    90m47.967s
 								user    8m12.826s
 								sys     2m27.496s
 								</code></pre><ul>
 								<li>I set up a test environment for CG Core v2 on my local machine and ran all the field migrations
 								<ul>
 								<li>DSpace comes up and runs, but there are some graphical issues, like missing community names</li>
 								<li>It turns out that my sed script was replacing some XSL code that was responsible for printing community names</li>
 								<li>See: <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/custom/communitylist.xsl</code></li>
 								<li>After reading the code I see that XSLT is reading the community titles from the DIM representation (stored in the <code>$dim</code> variable) created from METS</li>
 								<li>I modified the patterns in my sed script so that those lines are not replaced and then the community list works again</li>
 								<li>This is actually not a problem at all because this metadata is only used in the HTML meta tags in XMLUI community lists and has nothing to do with item metadata</li>
 								</ul>
 								</li>
 								</ul>
 								<!-- raw HTML omitted -->
 								</article>
 								        </div> <!-- /.blog-main -->
 								        <aside class="col-sm-3 ml-auto blog-sidebar">
 								        <section class="sidebar-module">
 								    <h4>Recent Posts</h4>
 								    <ol class="list-unstyled">
 								<li><a href="/cgspace-notes/2021-12/">December, 2021</a></li>
 								<li><a href="/cgspace-notes/2021-11/">November, 2021</a></li>
 								<li><a href="/cgspace-notes/2021-10/">October, 2021</a></li>
 								<li><a href="/cgspace-notes/2021-09/">September, 2021</a></li>
 								<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
 								    </ol>
 								  </section>
 								  <section class="sidebar-module">
 								    <h4>Links</h4>
 								    <ol class="list-unstyled">
 								      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
 								      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
 								      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
 								    </ol>
 								  </section>
 								</aside>
 								      </div> <!-- /.row -->
 								    </div> <!-- /.container -->
 								    <footer class="blog-footer">
 								      <p dir="auto">
 								      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
 								      </p>
 								      <p>
 								      <a href="#">Back to top</a>
 								      </p>
 								    </footer>
 								  </body>
 								</html>