diff --git a/content/posts/2018-06.md b/content/posts/2018-06.md
index 627358944..abbec36dc 100644
--- a/content/posts/2018-06.md
+++ b/content/posts/2018-06.md
@@ -105,3 +105,25 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
   - dÕpassÕ
 - Also the abstracts have missing accents, ie "recherche sur le d veloppement"
 - I will have to tell IITA people to redo these entirely I think...
+
+## 2018-06-11
+
+- Sisay sent a new version of the last IITA records that he created from the original CSV from IITA
+- The 200 records are in the [IITA_Junel_11 (10568/95870)](https://dspacetest.cgiar.org/handle/10568/95870) collection
+- Many errors:
+  - Authorship types: "CGIAR ans advanced research institute", "CGAIR and advanced research institute", "CGIAR and advanced research institutes", "CGAIR single center"
+  - Lots of inconsistencies and mispellings in author affiliations:
+    - "Institut des Recherches Agricoles du Bénin" and "Institut National des Recherche Agricoles du Benin" and "National Agricultural Research Institute, Benin"
+    - International Insitute of Tropical Agriculture
+    - Centro Internacional de Agricultura Tropical
+    - "Rivers State University of Science and Technology" and "Rivers State University"
+    - "Institut de la Recherche Agronomique, Cameroon" and "Institut de Recherche Agronomique, Cameroon"
+  - Inconsistency in countries: "COTE D’IVOIRE" and "COTE D'IVOIRE"
+  - A few DOIs with spaces or invalid characters
+  - Inconsistency in IITA subjects, for example "PRODUCTION VEGETALE" and "PRODUCTION VÉGÉTALE" and several others
+  - I ran `value.unescape('javascript')` on the abstract and citation fields because it looks like this data came from a SQL database and some stuff was escaped
+- It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede's original file it doesn't have all those corrections
+- So I told Sisay to re-create the collection using Abenet's XLS from last week (`Mercy1805_AY.xls`)
+- I was curious to see if I could create a GREL for use with a custom text facet in Open Refine to find cells with two or more consecutive spaces
+- I always use the built-in trim and collapse transformations anyways, but this seems to work to find the offending cells: `isNotNull(value.match(/.*?\s{2,}.*?/))`
+- I wonder if I should start checking for "smart" quotes like ’ (hex 2019)
diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html
index e43e643b0..d01930120 100644
--- a/docs/2018-06/index.html
+++ b/docs/2018-06/index.html
@@ -41,7 +41,7 @@ sys 2m7.289s
-
+
@@ -93,9 +93,9 @@ sys 2m7.289s
   "@type": "BlogPosting",
   "headline": "June, 2018",
   "url": "https://alanorth.github.io/cgspace-notes/2018-06/",
-  "wordCount": "750",
+  "wordCount": "1025",
   "datePublished": "2018-06-04T19:49:54-07:00",
-  "dateModified": "2018-06-10T14:12:07+03:00",
+  "dateModified": "2018-06-10T19:32:12+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -285,6 +285,33 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
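The two checks mentioned in the 2018-06-11 notes (cells with two or more consecutive spaces, and "smart" quotes such as ’, U+2019) can also be sketched outside OpenRefine. The following is a minimal Python illustration, not the method used in the notes. The function names are hypothetical, and the set of quote characters is an assumption, since the notes name only U+2019. GREL's `match` has to match the whole string, which is why the original expression wraps `\s{2,}` in `.*?`; Python's `re.search` scans the string, so the wrapper is not needed here.

```python
import re

# Two or more consecutive whitespace characters anywhere in the value
MULTI_SPACE = re.compile(r"\s{2,}")

# Typographic quotes: U+2018, U+2019, U+201C, U+201D
# (assumed set; the notes only mention U+2019)
SMART_QUOTES = re.compile("[\u2018\u2019\u201c\u201d]")


def has_consecutive_spaces(value):
    """Return True if the cell contains a run of two or more whitespace characters."""
    return value is not None and MULTI_SPACE.search(value) is not None


def has_smart_quotes(value):
    """Return True if the cell contains any of the typographic quotes above."""
    return value is not None and SMART_QUOTES.search(value) is not None


if __name__ == "__main__":
    # Sample values modelled on the kinds of cells described in the notes
    cells = [
        "Rivers State  University",   # double space
        "Rivers State University",    # clean
        "COTE D\u2019IVOIRE",         # smart apostrophe
        "COTE D'IVOIRE",              # plain apostrophe
    ]
    for cell in cells:
        print(repr(cell), has_consecutive_spaces(cell), has_smart_quotes(cell))
```

Both helpers return a boolean, so they can be used to flag rows for review in a CSV export much as a custom text facet does in OpenRefine.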