Add notes for 2020-01-27

2020-01-27 16:20:44 +02:00
parent 207ace0883
commit 8feb93be39
112 changed files with 11466 additions and 5158 deletions

@@ -69,7 +69,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.62.2" />
<meta name="generator" content="Hugo 0.63.1" />
@@ -99,7 +99,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel="stylesheet" integrity="sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy&#43;piAwENoVPTw=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css" rel="stylesheet" integrity="sha256-I&#43;LDKYvMjBE2wZq6MwwhHslMNvfERU6hXPTTVINwBCo=" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
@@ -146,7 +146,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-09/">September, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-09-01T10:17:51&#43;03:00">Sun Sep 01, 2019</time> by Alan Orth in
<i class="fa fa-folder" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
</p>
@@ -197,7 +197,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
2350 discover
71 handle
</code></pre><ul>
<li>I'm not sure why the outbound traffic rate was so high&hellip;</li>
<li>I&rsquo;m not sure why the outbound traffic rate was so high&hellip;</li>
</ul>
<h2 id="2019-09-02">2019-09-02</h2>
<ul>
@@ -304,7 +304,7 @@ dspace.log.2019-09-15:808
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.OREDisseminationCrosswalk&quot;, name=&quot;ore&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.DIMDisseminationCrosswalk&quot;, name=&quot;dim&quot;
</code></pre><ul>
<li>I restarted Tomcat and the item views came back, but then the Solr statistics cores didn't all load properly
<li>I restarted Tomcat and the item views came back, but then the Solr statistics cores didn&rsquo;t all load properly
<ul>
<li>After restarting Tomcat once again, both the item views and the Solr statistics cores all came back OK</li>
</ul>
@@ -312,7 +312,7 @@ dspace.log.2019-09-15:808
</ul>
<h2 id="2019-09-19">2019-09-19</h2>
<ul>
<li>For some reason my podman PostgreSQL container isn't working so I had to use Docker to re-create it for my testing work today:</li>
<li>For some reason my podman PostgreSQL container isn&rsquo;t working so I had to use Docker to re-create it for my testing work today:</li>
</ul>
<pre><code># docker pull docker.io/library/postgres:9.6-alpine
# docker volume create dspacedb_data
@@ -357,14 +357,14 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<li>I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update</li>
<li>Update the PostgreSQL JDBC driver to version 42.2.8 in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>
<ul>
<li>There is only <a href="https://github.com/pgjdbc/pgjdbc/issues/1567">one minor fix to a use case we aren't using</a> so I will deploy this on the servers the next time I do updates</li>
<li>There is only <a href="https://github.com/pgjdbc/pgjdbc/issues/1567">one minor fix to a use case we aren&rsquo;t using</a> so I will deploy this on the servers the next time I do updates</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode19) and reboot it</li>
<li>Start looking at IITA's latest round of batch updates that Sisay had <a href="https://dspacetest.cgiar.org/handle/10568/105486">uploaded to DSpace Test</a> earlier this month
<li>Start looking at IITA&rsquo;s latest round of batch updates that Sisay had <a href="https://dspacetest.cgiar.org/handle/10568/105486">uploaded to DSpace Test</a> earlier this month
<ul>
<li>For posterity, IITA's original input file was 20196th.xls and Sisay uploaded it as &ldquo;IITA_Sep_06&rdquo; to DSpace Test</li>
<li>Sisay said he ran the csv-metadata-quality script on the records, but I assume he didn't run the unsafe fixes or AGROVOC checks because I still see unnecessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields</li>
<li>For posterity, IITA&rsquo;s original input file was 20196th.xls and Sisay uploaded it as &ldquo;IITA_Sep_06&rdquo; to DSpace Test</li>
<li>Sisay said he ran the csv-metadata-quality script on the records, but I assume he didn&rsquo;t run the unsafe fixes or AGROVOC checks because I still see unnecessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields</li>
<li>In addition, a few records were missing authorship type</li>
<li>I deleted two invalid AGROVOC terms because they were ambiguous</li>
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
@@ -391,19 +391,19 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</code></pre><ul>
<li>I created and merged <a href="https://github.com/ilri/DSpace/pull/433">a pull request for the updates</a>
<ul>
<li>This is the first time we've updated this controlled vocabulary since 2018-09</li>
<li>This is the first time we&rsquo;ve updated this controlled vocabulary since 2018-09</li>
</ul>
</li>
</ul>
<h2 id="2019-09-20">2019-09-20</h2>
<ul>
<li>Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations</li>
<li>Deploy a fresh snapshot of CGSpace&rsquo;s PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations</li>
<li>Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
<ul>
<li>They want to do some enrichment of the metadata to add countries and regions</li>
<li>Also, they noticed that some items have a blank ISSN in the citation like &ldquo;ISSN:&rdquo;</li>
<li>I told them it's probably best if we have Francesco produce a new export from Typo 3</li>
<li>But on second thought I think that I've already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs</li>
<li>I told them it&rsquo;s probably best if we have Francesco produce a new export from Typo 3</li>
<li>But on second thought I think that I&rsquo;ve already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs</li>
<li>Other corrections would be to replace &ldquo;Inst.&rdquo; and &ldquo;Instit.&rdquo; with &ldquo;Institute&rdquo; and remove those blank ISSNs from the citations</li>
<li>I will rename the files with multiple underscores so they match the filename column in the CSV using this command:</li>
</ul>
@@ -415,14 +415,14 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<ul>
<li>There are a <em>few dozen</em> that have completely fucked up names due to some encoding error</li>
<li>To make matters worse, when I tried to download them, some of the links in the &ldquo;URL&rdquo; column that Francesco included are wrong, so I had to go to the permalink and get a link that worked</li>
<li>After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:</li>
<li>After downloading everything I had to use Ubuntu&rsquo;s version of rename to get rid of all the double and triple underscores:</li>
</ul>
</li>
</ul>
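As a rough Python equivalent of the underscore cleanup described above, the sketch below collapses any run of two or more underscores in a PDF's file name into a single one. The helper names `collapse_underscores` and `rename_pdfs` are made up for illustration, and skipping a rename when the target already exists is a choice made for this sketch.

```python
import re
from pathlib import Path


def collapse_underscores(name: str) -> str:
    """Replace every run of two or more underscores with a single one."""
    return re.sub(r"_{2,}", "_", name)


def rename_pdfs(directory: str) -> list:
    """Rename *.pdf files in `directory`; return (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(directory).glob("*.pdf")):
        new_name = collapse_underscores(path.name)
        target = path.with_name(new_name)
        # Skip names that are already clean, and never overwrite a file.
        if new_name == path.name or target.exists():
            continue
        path.rename(target)
        renamed.append((path.name, new_name))
    return renamed
```

A single `_{2,}` pattern handles runs of any length in one pass, so no separate triple-underscore step is needed.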
<pre><code>$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
</code></pre><ul>
<li>I'm still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)</li>
<li>I&rsquo;m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I&rsquo;ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</li>
<li>I wrote two fairly long GREL expressions to clean up the institutional author names in the <code>dc.contributor.author</code> and <code>dc.identifier.citation</code> fields using OpenRefine
<ul>
<li>The first targets acronyms in parentheses like &ldquo;International Livestock Research Institute (ILRI)&quot;:</li>
@@ -469,14 +469,14 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
<li>Play with language identification using the langdetect, fasttext, polyglot, and langid libraries
<ul>
<li>polyglot requires too many system things to compile</li>
<li>langdetect didn't seem as accurate as the others</li>
<li>langdetect didn&rsquo;t seem as accurate as the others</li>
<li>fasttext is likely the best, but <a href="https://github.com/facebookresearch/fastText/issues/909">prints a blank line to the console when loading a model</a></li>
<li>langid seems to be the best considering the above experiences</li>
</ul>
</li>
<li>I added very experimental language detection to the <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> module
<ul>
<li>It works by checking the predicted language of the <code>dc.title</code> field against the item's <code>dc.language.iso</code> field</li>
<li>It works by checking the predicted language of the <code>dc.title</code> field against the item&rsquo;s <code>dc.language.iso</code> field</li>
<li>I tested it on the Bioversity migration data set and it actually helped me correct eleven language fields in their records!</li>
</ul>
</li>
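The check described above can be sketched roughly as follows. This is not the csv-metadata-quality code: the function name `language_mismatch`, the injected `predict` callable, and the toy predictor are hypothetical, and the sketch assumes `dc.language.iso` holds a two-letter ISO 639-1 code. In practice `predict` could wrap a library such as langid.

```python
from typing import Callable, Optional


def language_mismatch(
    title: str,
    language_iso: str,
    predict: Callable[[str], str],
) -> Optional[str]:
    """Return the predicted code if it disagrees with the recorded one."""
    recorded = language_iso.strip().lower()
    if not title.strip() or not recorded:
        return None  # nothing to compare
    predicted = predict(title).strip().lower()
    return predicted if predicted != recorded else None


def toy_predict(text: str) -> str:
    """Stand-in predictor for the example only; not a real language model."""
    french_hints = {"le", "la", "les", "des", "et", "dans"}
    words = set(text.lower().split())
    return "fr" if words & french_hints else "en"
```

With the toy predictor, a French title recorded as `en` is reported as `fr`, while a matching pair returns `None`. Passing the predictor in as an argument keeps the comparison logic independent of whichever detection library is chosen.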
@@ -504,7 +504,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
<li>I deleted another item that I had previously identified as a duplicate that she had fixed by incorrectly deleting the original (ugh)</li>
</ul>
</li>
<li>Get a list of institutions from CCAFS's Clarisa API and try to parse it with <code>jq</code>, do some small cleanups and add a header in <code>sed</code>, and then pass it through <code>csvcut</code> to add line numbers:</li>
<li>Get a list of institutions from CCAFS&rsquo;s Clarisa API and try to parse it with <code>jq</code>, do some small cleanups and add a header in <code>sed</code>, and then pass it through <code>csvcut</code> to add line numbers:</li>
</ul>
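The same extraction could be sketched in Python instead of a shell pipeline. This assumes, as the `jq` filter in the command that follows does, that the JSON file is a list of objects that each have a `name` key; the function name `institutions_to_csv` and the whitespace trimming are illustrative choices, not part of the original workflow.

```python
import csv
import io
import json


def institutions_to_csv(json_text: str) -> str:
    """Convert a JSON list of {"name": ...} objects to CSV with id,name columns."""
    institutions = json.loads(json_text)
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(["id", "name"])
    # Number rows from 1, like the line numbers added by `csvcut -l`.
    for number, institution in enumerate(institutions, start=1):
        writer.writerow([number, institution["name"].strip()])
    return output.getvalue()
```

Because `csv.writer` quotes fields that contain commas, institution names with commas stay in a single column.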
<pre><code>$ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
@@ -516,8 +516,8 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
<ul>
<li>Skype with Peter and Abenet about CGSpace actions
<ul>
<li>Peter will respond to ICARDA's request to deposit items into CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc</li>
<li>We discussed using ISO 3166 for countries, though Peter doesn't like the formal names like &ldquo;Moldova, Republic of&rdquo; and &ldquo;Tanzania, United Republic of&rdquo;
<li>Peter will respond to ICARDA&rsquo;s request to deposit items into CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc</li>
<li>We discussed using ISO 3166 for countries, though Peter doesn&rsquo;t like the formal names like &ldquo;Moldova, Republic of&rdquo; and &ldquo;Tanzania, United Republic of&rdquo;
<ul>
<li>The Debian <code>iso-codes</code> package has ISO 3166-1 with &ldquo;common name&rdquo;, &ldquo;name&rdquo;, and &ldquo;official name&rdquo; representations, for example:
<ul>
@@ -528,14 +528,14 @@ $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institut
</li>
<li>There are still some unfortunate ones there, though:
<ul>
<li>name: Korea, Democratic People's Republic of</li>
<li>official_name: Democratic People's Republic of Korea</li>
<li>name: Korea, Democratic People&rsquo;s Republic of</li>
<li>official_name: Democratic People&rsquo;s Republic of Korea</li>
</ul>
</li>
<li>And this, which isn't even in English&hellip;
<li>And this, which isn&rsquo;t even in English&hellip;
<ul>
<li>name: Côte d'Ivoire</li>
<li>official_name: Republic of Côte d'Ivoire</li>
<li>name: Côte d&rsquo;Ivoire</li>
<li>official_name: Republic of Côte d&rsquo;Ivoire</li>
</ul>
</li>
<li>The other alternative is to just keep using the names we have, which are mostly compliant with AGROVOC</li>
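One way to act on the ISO 3166 discussion above is to prefer an entry's common name when it exists and fall back to its plain name otherwise. The sketch below assumes entries shaped like those in the `iso-codes` JSON data (keys `name`, `official_name`, and an optional `common_name`); the sample records and the `display_name` helper are illustrative, and the `common_name` value shown is an assumption, not copied from the package.

```python
def display_name(entry: dict) -> str:
    """Prefer the short common name, falling back to the ISO 'name' field."""
    return entry.get("common_name") or entry["name"]


# Illustrative records only; the common_name value is assumed for the example.
SAMPLE_ENTRIES = [
    {
        "name": "Tanzania, United Republic of",
        "common_name": "Tanzania",
    },
    {
        "name": "Korea, Democratic People's Republic of",
        "official_name": "Democratic People's Republic of Korea",
    },
]

for entry in SAMPLE_ENTRIES:
    print(display_name(entry))
```

Entries without a common name, such as the Korea example, still come out in the inverted form, which matches the caveat in the notes.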