
<!DOCTYPE html>
<html lang="en" >

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="September, 2019" />
<meta property="og:description" content="2019-09-01

Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
    440 17.58.101.255
    441 157.55.39.101
    485 207.46.13.43
    728 169.60.128.125
    730 207.46.13.108
    758 157.55.39.9
    808 66.160.140.179
    814 207.46.13.212
   2472 163.172.71.23
   6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
     33 2a01:7e00::f03c:91ff:fe16:fcb
     57 3.83.192.124
     57 3.87.77.25
     57 54.82.1.8
    822 2a01:9cc0:47:1:1a:4:0:2
   1223 45.5.184.72
   1633 172.104.229.92
   5112 205.186.128.185
   7249 2a01:7e00::f03c:91ff:fe18:7396
   9124 45.5.186.2
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-09/" />
<meta property="article:published_time" content="2019-09-01T10:17:51+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2019"/>
<meta name="twitter:description" content="2019-09-01

Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
    440 17.58.101.255
    441 157.55.39.101
    485 207.46.13.43
    728 169.60.128.125
    730 207.46.13.108
    758 157.55.39.9
    808 66.160.140.179
    814 207.46.13.212
   2472 163.172.71.23
   6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
     33 2a01:7e00::f03c:91ff:fe16:fcb
     57 3.83.192.124
     57 3.87.77.25
     57 54.82.1.8
    822 2a01:9cc0:47:1:1a:4:0:2
   1223 45.5.184.72
   1633 172.104.229.92
   5112 205.186.128.185
   7249 2a01:7e00::f03c:91ff:fe18:7396
   9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.67.0" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "September, 2019",
  "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-09\/",
  "wordCount": "2870",
  "datePublished": "2019-09-01T10:17:51+03:00",
  "dateModified": "2019-10-28T13:39:25+02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-09/">

    <title>September, 2019 | CGSpace Notes</title>


    <!-- combined, minified CSS -->

    <link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">


    <!-- minified Font Awesome for SVG icons -->

    <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.90e14c13cee52929ac33e1c21694a3cc95063a194eb22aad9f7976434e1a9125.js" integrity="sha256-kOFME87lKSmsM&#43;HCFpSjzJUGOhlOsiqtn3l2Q04akSU=" crossorigin="anonymous"></script>

    <!-- RSS 2.0 feed -->


  </head>

  <body>


    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>


    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>


    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">


<article class="blog-post">
  <header>
    <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-09/">September, 2019</a></h2>
    <p class="blog-post-meta"><time datetime="2019-09-01T10:17:51&#43;03:00">Sun Sep 01, 2019</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>


</p>
  </header>
  <h2 id="2019-09-01">2019-09-01</h2>
<ul>
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    440 17.58.101.255
    441 157.55.39.101
    485 207.46.13.43
    728 169.60.128.125
    730 207.46.13.108
    758 157.55.39.9
    808 66.160.140.179
    814 207.46.13.212
   2472 163.172.71.23
   6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     33 2a01:7e00::f03c:91ff:fe16:fcb
     57 3.83.192.124
     57 3.87.77.25
     57 54.82.1.8
    822 2a01:9cc0:47:1:1a:4:0:2
   1223 45.5.184.72
   1633 172.104.229.92
   5112 205.186.128.185
   7249 2a01:7e00::f03c:91ff:fe18:7396
   9124 45.5.186.2
</code></pre><ul>
<li><code>3.94.211.189</code> is MauiBot, and most of its requests are to Discovery and get rate limited with HTTP 503</li>
<li><code>163.172.71.23</code> is some IP on Online SAS in France and its user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>It actually got mostly HTTP 200 responses:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
   1775 200
    703 499
     72 503
</code></pre><ul>
<li>And it was mostly requesting Discover pages:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
   2350 discover
     71 handle
</code></pre><ul>
<li>I&rsquo;m not sure why the outbound traffic rate was so high&hellip;</li>
</ul>
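<ul>
<li>For reference, the same per-IP and per-status tallies can be sketched in Python. This is only a rough equivalent of the <code>awk | sort | uniq -c</code> pipelines above, assuming nginx&rsquo;s default &ldquo;combined&rdquo; log format; the function name <code>tally</code> and the sample requests are made up for illustration (only the IP addresses come from the logs above):</li>
</ul>

```python
from collections import Counter

# Hypothetical sample lines in nginx's default "combined" log format; the
# addresses are reused from the tables above, the requests are invented.
SAMPLE_LOG = """\
3.94.211.189 - - [01/Sep/2019:03:10:01 +0200] "GET /discover?query=a HTTP/1.1" 503 206 "-" "MauiBot"
163.172.71.23 - - [01/Sep/2019:04:11:02 +0200] "GET /discover?query=b HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
163.172.71.23 - - [01/Sep/2019:05:12:03 +0200] "GET /handle/10568/1 HTTP/1.1" 499 0 "-" "Mozilla/5.0"
163.172.71.23 - - [01/Sep/2019:12:00:00 +0200] "GET /discover?query=c HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
"""


def tally(log_text, when="01/Sep/2019:0"):
    """Count requests per client IP and per HTTP status for lines containing `when`."""
    ips, statuses = Counter(), Counter()
    for line in log_text.splitlines():
        if when not in line:
            continue  # same filter as grep "01/Sep/2019:0"
        fields = line.split()
        ips[fields[0]] += 1       # awk '{print $1}'
        statuses[fields[8]] += 1  # awk '{print $9}'
    return ips, statuses


ips, statuses = tally(SAMPLE_LOG)
for ip, count in ips.most_common(10):
    print(count, ip)
```

<ul>
<li>The fixed field positions only hold for the combined format, so a custom <code>log_format</code> would need different indexes</li>
</ul>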
<h2 id="2019-09-02">2019-09-02</h2>
<ul>
<li>Follow up with Carol and Francesca from Bioversity as they were on holiday during mid-to-late August
<ul>
<li>I told them to check the <a href="https://dspacetest.cgiar.org/handle/10568/103999">temporary collection on DSpace Test</a> where I uploaded the 1,427 items so they can see how it will look</li>
<li>Also, I told them to advise me about the strange file extensions (.7z, .zip, .lck)</li>
<li>Also, I reminded Abenet to check the metadata, as the institutional authors at least will need some modification</li>
</ul>
</li>
</ul>
<h2 id="2019-09-10">2019-09-10</h2>
<ul>
<li>Altmetric responded to say that they have fixed an issue with their badge code so now research outputs with multiple handles are showing badges!
<ul>
<li>See: <a href="https://hdl.handle.net/handle/10568/97825">https://hdl.handle.net/handle/10568/97825</a></li>
</ul>
</li>
<li>Follow up with Bosede about the mixup with PDFs in the items uploaded in 2018-12 (aka Daniel1807.xls)
<ul>
<li>These are the same ones that Peter noticed last week, and that Bosede and I had been discussing earlier this year but never sorted out</li>
<li>It looks like these items were uploaded by Sisay on 2018-12-19 so we can use the <a href="https://cgspace.cgiar.org/handle/10568/68616/discover?filtertype_1=dateAccessioned&amp;filter_relational_operator_1=contains&amp;filter_1=2018-12-19&amp;submit_apply_filter=&amp;query=">accession date as a filter</a> to narrow it down to 230 items (of which only 104 have PDFs, according to the Daniel1807.xls input file)</li>
<li>Now I just checked a few manually and they are correct in the original input file, so something must have happened when Sisay was processing them for upload</li>
<li>I have asked Sisay to fix them&hellip;</li>
</ul>
</li>
<li>Continue working on CG Core v2 migration, focusing on the crosswalk mappings
<ul>
<li>I think we can skip the MODS crosswalk for now because it is only used in <a href="https://wiki.duraspace.org/display/DSDOC5x/DSpace+AIP+Format#DSpaceAIPFormat-MODSSchema">AIP exports that are meant for non-DSpace systems</a></li>
<li>We should probably do the QDC crosswalk as well as those in <code>xhtml-head-item.properties</code>&hellip;</li>
<li>Ouch, there is potentially a lot of work in the OAI metadata formats like DIM, METS, and QDC (see <code>dspace/config/crosswalks/oai/*.xsl</code>)</li>
<li>In general I think I should only modify the left side of the crosswalk mappings (ie, where metadata is coming from) so we maintain the same exact output for search engines, etc</li>
</ul>
</li>
</ul>
<h2 id="2019-09-11">2019-09-11</h2>
<ul>
<li>Maria Garruccio asked me to add two new Bioversity ORCID identifiers to CGSpace so I created a <a href="https://github.com/ilri/DSpace/pull/431">pull request</a></li>
<li>Marissa Van Epp asked me to add new CCAFS Phase II project tags to CGSpace so I created a <a href="https://github.com/ilri/DSpace/pull/432">pull request</a>
<ul>
<li>I will wait until I hear from her to merge it because there is one tag that seems to be a duplicate because its name (PII-WA_agrosylvopast) is similar to one that already exists (PII-WA_AgroSylvopastoralSystems)</li>
</ul>
</li>
<li>More work on the CG Core v2 migrations
<ul>
<li>I have updated my <a href="https://gist.github.com/alanorth/2db39e91f48d116e00a4edffd6ba6409">notes on the possible changes</a> and done more work on the XMLUI replacements</li>
</ul>
</li>
</ul>
<h2 id="2019-09-12">2019-09-12</h2>
<ul>
<li>Deploy <a href="https://jdbc.postgresql.org/">PostgreSQL JDBC driver</a> version 42.2.7 on DSpace Test and update the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
</ul>
<h2 id="2019-09-15">2019-09-15</h2>
<ul>
<li>Deploy Bioversity ORCID identifier updates to CGSpace</li>
<li>Deploy PostgreSQL JDBC driver 42.2.7 on CGSpace</li>
<li>Run system updates on CGSpace (linode18) and restart the server
<ul>
<li>After restarting the system Tomcat came back up, but not all Solr statistics cores were loaded</li>
<li>I had to restart Tomcat one more time until the cores were loaded (verified in the Solr admin)</li>
</ul>
</li>
<li>Update nginx TLS cipher suite to the latest <a href="https://ssl-config.mozilla.org/#server=nginx&amp;server-version=1.16.1&amp;config=intermediate&amp;openssl-version=1.0.2g">Mozilla intermediate recommendations for nginx 1.16.0 and openssl 1.0.2</a>
<ul>
<li>DSpace Test (linode19) is running Ubuntu 18.04 with nginx 1.17.x and openssl 1.1.1 so it can even use TLS v1.3 if we override the nginx ssl protocol in its host vars</li>
</ul>
</li>
<li>XMLUI item view pages are blank on CGSpace right now
<ul>
<li>Like earlier this year, I see the following error in the Cocoon log while browsing:</li>
</ul>
</li>
</ul>
<pre><code>2019-09-15 15:32:18,137 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
</code></pre><ul>
<li>Around the same time I see the following in the DSpace log:</li>
</ul>
<pre><code>2019-09-15 15:32:18,079 INFO  org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644
2019-09-15 15:32:18,135 WARN  org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name=&quot;METSRIGHTS&quot;
</code></pre><ul>
<li>I see a lot of these errors today, but not earlier this month:</li>
</ul>
<pre><code># grep -c 'Cannot find named plugin' dspace.log.2019-09-*
dspace.log.2019-09-01:0
dspace.log.2019-09-02:0
dspace.log.2019-09-03:0
dspace.log.2019-09-04:0
dspace.log.2019-09-05:0
dspace.log.2019-09-06:0
dspace.log.2019-09-07:0
dspace.log.2019-09-08:0
dspace.log.2019-09-09:0
dspace.log.2019-09-10:0
dspace.log.2019-09-11:0
dspace.log.2019-09-12:0
dspace.log.2019-09-13:0
dspace.log.2019-09-14:0
dspace.log.2019-09-15:808
</code></pre><ul>
<li>Something must have happened when I restarted Tomcat a few hours ago, because earlier in the DSpace log I see a bunch of errors like this:</li>
</ul>
<pre><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.METSRightsCrosswalk&quot;, name=&quot;METSRIGHTS&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.OREDisseminationCrosswalk&quot;, name=&quot;ore&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.DIMDisseminationCrosswalk&quot;, name=&quot;dim&quot;
</code></pre><ul>
<li>I restarted Tomcat and the item views came back, but then the Solr statistics cores didn&rsquo;t all load properly
<ul>
<li>After restarting Tomcat once again, both the item views and the Solr statistics cores all came back OK</li>
</ul>
</li>
</ul>
<h2 id="2019-09-19">2019-09-19</h2>
<ul>
<li>For some reason my podman PostgreSQL container isn&rsquo;t working so I had to use Docker to re-create it for my testing work today:</li>
</ul>
<pre><code># docker pull docker.io/library/postgres:9.6-alpine
# docker volume create dspacedb_data
# docker run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2019-08-31.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><ul>
<li>Elizabeth from CIAT sent me a list of sixteen authors who need to have their ORCID identifiers tagged with their publications
<ul>
<li>I manually checked the ORCID profile links to make sure they matched the names</li>
<li>Then I created an input file to use with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Kihara, Job&quot;,&quot;Job Kihara: 0000-0002-4394-9553&quot;
&quot;Twyman, Jennifer&quot;,&quot;Jennifer Twyman: 0000-0002-8581-5668&quot;
&quot;Ishitani, Manabu&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
&quot;Arango, Jacobo&quot;,&quot;Jacobo Arango: 0000-0002-4828-9398&quot;
&quot;Chavarriaga Aguirre, Paul&quot;,&quot;Paul Chavarriaga-Aguirre: 0000-0001-7579-3250&quot;
&quot;Paul, Birthe&quot;,&quot;Birthe Paul: 0000-0002-5994-5354&quot;
&quot;Eitzinger, Anton&quot;,&quot;Anton Eitzinger: 0000-0001-7317-3381&quot;
&quot;Hoek, Rein van der&quot;,&quot;Rein van der Hoek: 0000-0003-4528-7669&quot;
&quot;Aranzales Rondón, Ericson&quot;,&quot;Ericson Aranzales Rondon: 0000-0001-7487-9909&quot;
&quot;Staiger-Rivas, Simone&quot;,&quot;Simone Staiger: 0000-0002-3539-0817&quot;
&quot;de Haan, Stef&quot;,&quot;Stef de Haan: 0000-0001-8690-1886&quot;
&quot;Pulleman, Mirjam&quot;,&quot;Mirjam Pulleman: 0000-0001-9950-0176&quot;
&quot;Abera, Wuletawu&quot;,&quot;Wuletawu Abera: 0000-0002-3657-5223&quot;
&quot;Tamene, Lulseged&quot;,&quot;Lulseged Tamene: 0000-0002-3806-8890&quot;
&quot;Andrieu, Nadine&quot;,&quot;Nadine Andrieu: 0000-0001-9558-9302&quot;
&quot;Ramírez-Villegas, Julián&quot;,&quot;Julian Ramirez-Villegas: 0000-0002-8044-583X&quot;
</code></pre><ul>
<li>I tested the file on my local development machine with the following invocation:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>In my test environment this added 390 ORCID identifiers</li>
<li>I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update</li>
<li>Update the PostgreSQL JDBC driver to version 42.2.8 in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>
<ul>
<li>There is only <a href="https://github.com/pgjdbc/pgjdbc/issues/1567">one minor fix to a use case we aren&rsquo;t using</a> so I will deploy this on the servers the next time I do updates</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode19) and reboot it</li>
<li>Start looking at IITA&rsquo;s latest round of batch updates that Sisay had <a href="https://dspacetest.cgiar.org/handle/10568/105486">uploaded to DSpace Test</a> earlier this month
<ul>
<li>For posterity, IITA&rsquo;s original input file was 20196th.xls and Sisay uploaded it as &ldquo;IITA_Sep_06&rdquo; to DSpace Test</li>
<li>Sisay said he ran the csv-metadata-quality script on the records, but I assume he didn&rsquo;t run the unsafe fixes or AGROVOC checks because I still see unnecessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields</li>
<li>In addition, a few records were missing authorship type</li>
<li>I deleted two invalid AGROVOC terms because they were ambiguous</li>
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
</ul>
</li>
<li>I also looked through the IITA subjects to normalize some values</li>
</ul>
</li>
<li>Follow up with Marissa again about the CCAFS phase II project tags</li>
<li>Generate a list of the top 1500 authors on CGSpace:</li>
</ul>
<pre><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
</code></pre><ul>
<li>Then I used <code>csvcut</code> to select the column of author names, strip the header and quote characters, and saved the sorted file:</li>
</ul>
<pre><code>$ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/&quot;//g' | sort &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre><ul>
<li>After adding the XML formatting back to the file I formatted it using XML tidy:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre><ul>
<li>I created and merged <a href="https://github.com/ilri/DSpace/pull/433">a pull request for the updates</a>
<ul>
<li>This is the first time we&rsquo;ve updated this controlled vocabulary since 2018-09</li>
</ul>
</li>
</ul>
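<ul>
<li>As a complement to manually checking the ORCID profile links, the identifiers in the <code>cg.creator.id</code> column of the input file above could also be sanity checked for typos. ORCID iDs end in an ISO 7064 MOD 11-2 check character, so a minimal sketch might look like the following. The function names are hypothetical and this is not part of <code>add-orcid-identifiers-csv.py</code>:</li>
</ul>

```python
import re

# Matches the identifier at the end of values like "Job Kihara: 0000-0002-4394-9553"
ORCID_RE = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def has_valid_orcid(creator_id):
    """Return True if the value ends with a well-formed ORCID iD whose check character matches."""
    match = ORCID_RE.search(creator_id)
    if match is None:
        return False
    digits = match.group(1).replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


print(has_valid_orcid("Job Kihara: 0000-0002-4394-9553"))
print(has_valid_orcid("Julian Ramirez-Villegas: 0000-0002-8044-583X"))
```

<ul>
<li>This only catches mistyped identifiers; it says nothing about whether the identifier belongs to the named person, so the manual check is still needed</li>
</ul>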
<h2 id="2019-09-20">2019-09-20</h2>
<ul>
<li>Deploy a fresh snapshot of CGSpace&rsquo;s PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations</li>
<li>Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
<ul>
<li>They want to do some enrichment of the metadata to add countries and regions</li>
<li>Also, they noticed that some items have a blank ISSN in the citation like &ldquo;ISSN:&rdquo;</li>
<li>I told them it&rsquo;s probably best if we have Francesco produce a new export from Typo 3</li>
<li>But on second thought I think that I&rsquo;ve already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs</li>
<li>Other corrections would be to replace &ldquo;Inst.&rdquo; and &ldquo;Instit.&rdquo; with &ldquo;Institute&rdquo; and remove those blank ISSNs from the citations</li>
<li>I will rename the files with multiple underscores so they match the filename column in the CSV using this command:</li>
</ul>
</li>
</ul>
<pre><code>$ perl-rename -n 's/_{2,3}/_/g' *.pdf
</code></pre><ul>
<li>I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
<ul>
<li>There are a <em>few dozen</em> that have completely fucked up names due to some encoding error</li>
<li>To make matters worse, when I tried to download them, some of the links in the &ldquo;URL&rdquo; column that Francesco included are wrong, so I had to go to the permalink and get a link that worked</li>
<li>After downloading everything I had to use Ubuntu&rsquo;s version of rename to get rid of all the double and triple underscores:</li>
</ul>
</li>
</ul>
<pre><code>$ rename -v 's/___/_/g'  *.pdf
$ rename -v 's/__/_/g'  *.pdf
</code></pre><ul>
<li>I&rsquo;m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I&rsquo;ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</li>
<li>I wrote two fairly long GREL expressions to clean up the institutional author names in the <code>dc.contributor.author</code> and <code>dc.identifier.citation</code> fields using OpenRefine
<ul>
<li>The first targets acronyms in parentheses like &ldquo;International Livestock Research Institute (ILRI)&rdquo;:</li>
</ul>
</li>
</ul>
<pre><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&quot;&quot;)
</code></pre><ul>
<li>The second targets cities and countries after names like &ldquo;International Livestock Research Institute, Kenya&rdquo;:</li>
</ul>
<pre><code>replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&quot;&quot;)
</code></pre><ul>
<li>I imported the 1,427 Bioversity records with bitstreams to a new collection called <a href="https://dspacetest.cgiar.org/handle/10568/103688">2019-09-20 Bioversity Migration Test</a> on DSpace Test (after splitting them in two batches of about 700 each):</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
</code></pre><ul>
<li>After that I exported the collection again and started doing some quality checks and cleanups:
<ul>
<li>Change all DOIs to use <a href="https://doi.org">https://doi.org</a> format</li>
<li>Change all bioversityinternational.org links to use https://</li>
<li>Fix ten authors with invalid names like &ldquo;Orth,.&rdquo; by checking the correct name in the citation</li>
<li>Fix several invalid ISBNs, but there are several more that contain incorrect ISBNs in their PDFs!</li>
<li>Fix some citations that were using &ldquo;ISSN&rdquo; instead of ISBN</li>
</ul>
</li>
<li>The next steps are:
<ul>
<li>Check for duplicates</li>
<li>Continue with institutional author normalization</li>
<li>Ask which collections the items with type Brochure, Journal Item, and Thesis should be mapped to</li>
</ul>
</li>
</ul>
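<ul>
<li>For testing outside of OpenRefine, the first GREL expression above translates fairly directly to Python&rsquo;s <code>re</code> module. This is a sketch with a deliberately abbreviated acronym list; the names <code>ACRONYMS</code> and <code>strip_parenthesized</code> are made up for illustration:</li>
</ul>

```python
import re

# Abbreviated, illustrative subset of the acronyms in the GREL expression above
ACRONYMS = ["CIAT", "FAO", "ILRI", "IPGRI", "Italy", "Kenya"]

# Optional comma and space, then one of the acronyms wrapped in parentheses
PAREN_RE = re.compile(r",? ?\((?:" + "|".join(map(re.escape, ACRONYMS)) + r")\)")


def strip_parenthesized(value):
    """Remove parenthesized acronyms such as ' (ILRI)' from an author or citation string."""
    return PAREN_RE.sub("", value)


print(strip_parenthesized("International Livestock Research Institute (ILRI)"))
```

<ul>
<li>Using <code>re.escape</code> on each entry avoids having to hand-escape values like <code>ECP/GR</code> or <code>LI-BIRD</code> if the full list is used</li>
</ul>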
<h2 id="2019-09-21">2019-09-21</h2>
<ul>
<li>Re-upload the <a href="https://dspacetest.cgiar.org/handle/10568/105116">IITA Sept 6 (20196th.xls) records to DSpace Test</a> after I did the re-sync yesterday
<ul>
<li>Then I looked at the records again and sent some feedback about three duplicates to Bosede</li>
<li>Also I noticed that many journal articles have the journal and page information in the citation, but are missing <code>dc.source</code> and <code>dc.format.extent</code> fields</li>
</ul>
</li>
<li>Play with language identification using the langdetect, fasttext, polyglot, and langid libraries
<ul>
<li>polyglot requires too many system things to compile</li>
<li>langdetect didn&rsquo;t seem as accurate as the others</li>
<li>fasttext is likely the best, but <a href="https://github.com/facebookresearch/fastText/issues/909">prints a blank line to the console when loading a model</a></li>
<li>langid seems to be the best considering the above experiences</li>
</ul>
</li>
<li>I added very experimental language detection to the <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> module
<ul>
<li>It works by checking the predicted language of the <code>dc.title</code> field against the item&rsquo;s <code>dc.language.iso</code> field</li>
<li>I tested it on the Bioversity migration data set and it actually helped me correct eleven language fields in their records!</li>
</ul>
</li>
</ul>
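<ul>
<li>The idea behind the language check can be sketched roughly as follows. This is not the actual csv-metadata-quality code: the function name is hypothetical, it assumes <code>dc.language.iso</code> holds two-letter ISO 639-1 codes, and the detector is passed in as a callable so the example runs without any third-party library (the toy detector below is only a stand-in):</li>
</ul>

```python
def language_mismatch(title, declared, detect):
    """Return the predicted language code if it differs from the declared one, else None.

    `detect` is any callable mapping text to an ISO 639-1 code. With the langid
    library installed, one option is: lambda text: langid.classify(text)[0]
    """
    predicted = detect(title)
    if predicted != declared.strip().lower():
        return predicted
    return None


def toy_detect(text):
    """Stand-in detector for illustration only: guesses French if it sees 'des'."""
    return "fr" if "des" in text.lower().split() else "en"


print(language_mismatch("Conservation des ressources", "en", toy_detect))
print(language_mismatch("Banana diversity", "en", toy_detect))
```

<ul>
<li>Titles are short, so any real detector will produce some false positives; treating a mismatch as a warning to review rather than an automatic fix seems safer</li>
</ul>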
<h2 id="2019-09-24">2019-09-24</h2>
<ul>
<li>Bosede fixed a few of the things I mentioned in her Sept 6 batch records, but there were still issues
<ul>
<li>I sent her a bit more feedback because when I asked her to delete a duplicate, she deleted the <em>existing</em> item on DSpace Test rather than the new one in the new batch file!</li>
<li>I fixed two incorrect languages after analyzing it with my beta language detection in the csv-metadata-quality tool</li>
</ul>
</li>
</ul>
<h2 id="2019-09-26">2019-09-26</h2>
<ul>
<li>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.0">version 0.3.0 of the csv-metadata-quality</a> tool
<ul>
<li>This version includes the experimental validation of languages using the Python <code>langid</code> library</li>
<li>I also included updated pytest tests and test files that specifically test this functionality</li>
</ul>
</li>
<li>Give more feedback to Bosede about the <a href="https://dspacetest.cgiar.org/handle/10568/105116">IITA Sept 6 (20196th.xls) records on DSpace Test</a>
<ul>
<li>I told her to delete one item that appears to be a duplicate, or to fix its citation to be correct if she thinks it is not a duplicate</li>
<li>I deleted another item that I had previously identified as a duplicate that she had fixed by incorrectly deleting the original (ugh)</li>
</ul>
</li>
<li>Get a list of institutions from CCAFS&rsquo;s Clarisa API and try to parse it with <code>jq</code>, do some small cleanups and add a header in <code>sed</code>, and then pass it through <code>csvcut</code> to add line numbers:</li>
</ul>
<pre><code>$ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
</code></pre><ul>
<li>The csv-metadata-quality tool caught a few records with excessive spacing and unnecessary Unicode</li>
<li>I could potentially use this with reconcile-csv and OpenRefine as a source to validate our institutional authors against&hellip;</li>
</ul>
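<ul>
<li>The <code>jq</code>, <code>awk</code>, <code>sed</code> and <code>csvcut</code> pipeline above could probably be replaced with a few lines of Python. A minimal sketch, assuming the export is a JSON array of objects with a <code>name</code> key (as the <code>jq</code> filter implies); the function name and sample records are invented:</li>
</ul>

```python
import csv
import io
import json


def institutions_to_csv(json_text):
    """Convert a JSON array of objects with a "name" key to CSV with id,name columns."""
    # Collapse runs of whitespace and trim each name while reading
    names = [" ".join(item["name"].split()) for item in json.loads(json_text)]
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(["id", "name"])
    # Number the rows from 1, like csvcut -l does
    for number, name in enumerate(names, start=1):
        writer.writerow([number, name])
    return output.getvalue()


# Invented sample records, not real Clarisa data
sample = '[{"name": "Example  University "}, {"name": "Sample Institute, Kenya"}]'
print(institutions_to_csv(sample))
```

<ul>
<li>Unlike the <code>sed</code> approach, the <code>csv</code> module quotes names that contain commas, so they stay in a single column</li>
</ul>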
<h2 id="2019-09-27">2019-09-27</h2>
<ul>
<li>Skype with Peter and Abenet about CGSpace actions
<ul>
<li>Peter will respond to ICARDA&rsquo;s request to deposit items into CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc</li>
<li>We discussed using ISO 3166 for countries, though Peter doesn&rsquo;t like the formal names like &ldquo;Moldova, Republic of&rdquo; and &ldquo;Tanzania, United Republic of&rdquo;
<ul>
<li>The Debian <code>iso-codes</code> package has ISO 3166-1 with &ldquo;common name&rdquo;, &ldquo;name&rdquo;, and &ldquo;official name&rdquo; representations, for example:
<ul>
<li>common_name: Tanzania</li>
<li>name: Tanzania, United Republic of</li>
<li>official_name: United Republic of Tanzania</li>
</ul>
</li>
<li>There are still some unfortunate ones there, though:
<ul>
<li>name: Korea, Democratic People&rsquo;s Republic of</li>
<li>official_name: Democratic People&rsquo;s Republic of Korea</li>
</ul>
</li>
<li>And this, which isn&rsquo;t even in English&hellip;
<ul>
<li>name: Côte d&rsquo;Ivoire</li>
<li>official_name: Republic of Côte d&rsquo;Ivoire</li>
</ul>
</li>
<li>The other alternative is to just keep using the names we have, which are mostly compliant with AGROVOC</li>
</ul>
</li>
<li>Peter said that a new server for DSpace Test is fine, so I can proceed with the normal process of getting approval from Michael Victor and ICT when I have time (recommend moving from $40 to $80/month Linode, with 16GB RAM)</li>
<li>I need to ask Atmire for a quote to upgrade CGSpace to DSpace 6 with all current modules so we can see how many more credits we need</li>
</ul>
</li>
<li>A little bit more work on the Sept 6 IITA batch records
<ul>
<li>Bosede deleted the one item that I told her was a duplicate</li>
<li>I checked the AGROVOC subjects and fixed one incorrect one</li>
<li>Then I told her that I think the items are ready to go to CGSpace and asked Abenet for a final comment</li>
</ul>
</li>
</ul>
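<ul>
<li>If we did go with ISO 3166-1, one way to avoid the formal names would be to prefer the &ldquo;common name&rdquo; wherever one exists. A small sketch using records shaped like the entries quoted above; the key names follow the labels used in those examples and <code>display_name</code> is a hypothetical helper:</li>
</ul>

```python
# Records shaped like the entries quoted above; "common_name" is optional
COUNTRIES = [
    {
        "name": "Tanzania, United Republic of",
        "common_name": "Tanzania",
        "official_name": "United Republic of Tanzania",
    },
    {
        "name": "Korea, Democratic People's Republic of",
        "official_name": "Democratic People's Republic of Korea",
    },
]


def display_name(record):
    """Prefer the common name, falling back to the formal ISO 3166-1 name."""
    return record.get("common_name") or record["name"]


for country in COUNTRIES:
    print(display_name(country))
```

<ul>
<li>Entries without a common name still fall back to the formal name, so a small local override list would be needed for those</li>
</ul>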


</article>


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">


        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2020-03/">March, 2020</a></li>

<li><a href="/cgspace-notes/2020-02/">February, 2020</a></li>

<li><a href="/cgspace-notes/2020-01/">January, 2020</a></li>

<li><a href="/cgspace-notes/2019-12/">December, 2019</a></li>

<li><a href="/cgspace-notes/2019-11/">November, 2019</a></li>

    </ol>
  </section>


  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">

      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>

      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>

      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>

    </ol>
  </section>

</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->


    <footer class="blog-footer">
      <p dir="auto">

      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.

      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>


  </body>

</html>