Update notes for 2018-06-11

Alan Orth 2018-06-11 15:21:14 +03:00
parent da85011fae
commit a8715c203c
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
3 changed files with 57 additions and 8 deletions


@@ -105,3 +105,25 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
- dÕpassÕ
- Also the abstracts have missing accents, ie "recherche sur le d veloppement"
- I will have to tell IITA people to redo these entirely I think...
## 2018-06-11
- Sisay sent a new version of the last IITA records that he created from the original CSV from IITA
- The 200 records are in the [IITA_Junel_11 (10568/95870)](https://dspacetest.cgiar.org/handle/10568/95870) collection
- Many errors:
- Authorship types: "CGIAR ans advanced research institute", "CGAIR and advanced research institute", "CGIAR and advanced research institutes", "CGAIR single center"
- Lots of inconsistencies and misspellings in author affiliations:
- "Institut des Recherches Agricoles du Bénin" and "Institut National des Recherche Agricoles du Benin" and "National Agricultural Research Institute, Benin"
- International Insitute of Tropical Agriculture
- Centro Internacional de Agricultura Tropical
- "Rivers State University of Science and Technology" and "Rivers State University"
- "Institut de la Recherche Agronomique, Cameroon" and "Institut de Recherche Agronomique, Cameroon"
- Inconsistency in countries: "COTE DIVOIRE" and "COTE D'IVOIRE"
- A few DOIs with spaces or invalid characters
- Inconsistency in IITA subjects, for example "PRODUCTION VEGETALE" and "PRODUCTION VÉGÉTALE" and several others
- I ran `value.unescape('javascript')` on the abstract and citation fields because it looks like this data came from a SQL database and some characters were escaped
- It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede's original file it doesn't have all those corrections
- So I told Sisay to re-create the collection using Abenet's XLS from last week (`Mercy1805_AY.xls`)
- I was curious to see if I could write a GREL expression for use with a custom text facet in OpenRefine to find cells with two or more consecutive spaces
- I always use the built-in trim and collapse transformations anyway, but this seems to work to find the offending cells: `isNotNull(value.match(/.*?\s{2,}.*?/))`
- I wonder if I should start checking for "smart" quotes like ’ (hex 2019)
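
The checks described in the last few notes can also be run outside OpenRefine. The following is a minimal Python sketch, not what was actually used here: the function names and the `fold_key` grouping helper are hypothetical, and the quote check assumes that "smart" quotes means the four characters U+2018, U+2019, U+201C and U+201D. Note that Python's `\s` matches Unicode whitespace on `str` values, so it may flag slightly more cells than the GREL expression does.

```python
import re
import unicodedata

# Two or more consecutive whitespace characters, like the GREL text facet
MULTI_SPACE = re.compile(r"\s{2,}")

# Assumed set of "smart" quotes: U+2018, U+2019, U+201C, U+201D
SMART_QUOTES = re.compile("[\u2018\u2019\u201c\u201d]")


def has_consecutive_spaces(value: str) -> bool:
    """Return True if the cell contains a run of two or more whitespace characters."""
    return MULTI_SPACE.search(value) is not None


def has_smart_quotes(value: str) -> bool:
    """Return True if the cell contains any of the assumed smart quote characters."""
    return SMART_QUOTES.search(value) is not None


def fold_key(value: str) -> str:
    """Hypothetical grouping key: drop accents and punctuation, collapse spaces, uppercase."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = re.sub(r"[^\w\s]", "", no_accents)
    return " ".join(no_punct.split()).upper()
```

Values that share a `fold_key` but differ in their raw form, such as "PRODUCTION VEGETALE" and "PRODUCTION VÉGÉTALE", are candidates for manual review rather than automatic merging.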


@@ -41,7 +41,7 @@ sys 2m7.289s
<meta property="article:published_time" content="2018-06-04T19:49:54-07:00"/>
<meta property="article:modified_time" content="2018-06-10T14:12:07&#43;03:00"/>
<meta property="article:modified_time" content="2018-06-10T19:32:12&#43;03:00"/>
@@ -93,9 +93,9 @@ sys 2m7.289s
"@type": "BlogPosting",
"headline": "June, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-06/",
"wordCount": "750",
"wordCount": "1025",
"datePublished": "2018-06-04T19:49:54-07:00",
"dateModified": "2018-06-10T14:12:07&#43;03:00",
"dateModified": "2018-06-10T19:32:12&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -285,6 +285,33 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
<li>I will have to tell IITA people to redo these entirely I think&hellip;</li>
</ul>
<h2 id="2018-06-11">2018-06-11</h2>
<ul>
<li>Sisay sent a new version of the last IITA records that he created from the original CSV from IITA</li>
<li>The 200 records are in the <a href="https://dspacetest.cgiar.org/handle/10568/95870">IITA_Junel_11 (<sup>10568</sup>&frasl;<sub>95870</sub>)</a> collection</li>
<li>Many errors:
<ul>
<li>Authorship types: &ldquo;CGIAR ans advanced research institute&rdquo;, &ldquo;CGAIR and advanced research institute&rdquo;, &ldquo;CGIAR and advanced research institutes&rdquo;, &ldquo;CGAIR single center&rdquo;</li>
<li>Lots of inconsistencies and misspellings in author affiliations:</li>
<li>&ldquo;Institut des Recherches Agricoles du Bénin&rdquo; and &ldquo;Institut National des Recherche Agricoles du Benin&rdquo; and &ldquo;National Agricultural Research Institute, Benin&rdquo;</li>
<li>International Insitute of Tropical Agriculture</li>
<li>Centro Internacional de Agricultura Tropical</li>
<li>&ldquo;Rivers State University of Science and Technology&rdquo; and &ldquo;Rivers State University&rdquo;</li>
<li>&ldquo;Institut de la Recherche Agronomique, Cameroon&rdquo; and &ldquo;Institut de Recherche Agronomique, Cameroon&rdquo;</li>
<li>Inconsistency in countries: &ldquo;COTE DIVOIRE&rdquo; and &ldquo;COTE D&rsquo;IVOIRE&rdquo;</li>
<li>A few DOIs with spaces or invalid characters</li>
<li>Inconsistency in IITA subjects, for example &ldquo;PRODUCTION VEGETALE&rdquo; and &ldquo;PRODUCTION VÉGÉTALE&rdquo; and several others</li>
<li>I ran <code>value.unescape('javascript')</code> on the abstract and citation fields because it looks like this data came from a SQL database and some characters were escaped</li>
</ul></li>
<li>It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede&rsquo;s original file it doesn&rsquo;t have all those corrections</li>
<li>So I told Sisay to re-create the collection using Abenet&rsquo;s XLS from last week (<code>Mercy1805_AY.xls</code>)</li>
<li>I was curious to see if I could write a GREL expression for use with a custom text facet in OpenRefine to find cells with two or more consecutive spaces</li>
<li>I always use the built-in trim and collapse transformations anyway, but this seems to work to find the offending cells: <code>isNotNull(value.match(/.*?\s{2,}.*?/))</code></li>
<li>I wonder if I should start checking for &ldquo;smart&rdquo; quotes like &rsquo; (hex 2019)</li>
</ul>


@@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-06/</loc>
<lastmod>2018-06-10T14:12:07+03:00</lastmod>
<lastmod>2018-06-10T19:32:12+03:00</lastmod>
</url>
<url>
@@ -169,7 +169,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-06-10T14:12:07+03:00</lastmod>
<lastmod>2018-06-10T19:32:12+03:00</lastmod>
<priority>0</priority>
</url>
@@ -180,7 +180,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-06-10T14:12:07+03:00</lastmod>
<lastmod>2018-06-10T19:32:12+03:00</lastmod>
<priority>0</priority>
</url>
@@ -192,13 +192,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2018-06-10T14:12:07+03:00</lastmod>
<lastmod>2018-06-10T19:32:12+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-06-10T14:12:07+03:00</lastmod>
<lastmod>2018-06-10T19:32:12+03:00</lastmod>
<priority>0</priority>
</url>