Update notes for 2018-06-11

2025-01-27 05:49:12 +01:00 · 2018-06-11 15:21:14 +03:00
parent da85011fae
commit a8715c203c
3 changed files with 57 additions and 8 deletions
--- a/content/posts/2018-06.md
+++ b/content/posts/2018-06.md
@@ -105,3 +105,25 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
    - dÕpassÕ
  - Also the abstracts have missing accents, ie "recherche sur le d veloppement"
 - I will have to tell IITA people to redo these entirely I think...
+
+## 2018-06-11
+
+- Sisay sent a new version of the last IITA records that he created from the original CSV from IITA
+- The 200 records are in the [IITA_Junel_11 (10568/95870)](https://dspacetest.cgiar.org/handle/10568/95870) collection
+- Many errors:
+  - Authorship types: "CGIAR ans advanced research institute", "CGAIR and advanced research institute", "CGIAR and advanced research institutes", "CGAIR single center"
+  - Lots of inconsistencies and misspellings in author affiliations:
+    - "Institut des Recherches Agricoles du Bénin" and "Institut National des Recherche Agricoles du Benin" and "National Agricultural Research Institute, Benin"
+    - International Insitute of Tropical Agriculture
+    - Centro Internacional de Agricultura Tropical
+    - "Rivers State University of Science and Technology" and "Rivers State University"
+    - "Institut de la Recherche Agronomique, Cameroon" and "Institut de Recherche Agronomique, Cameroon"
+  - Inconsistency in countries: "COTE D’IVOIRE" and "COTE D'IVOIRE"
+  - A few DOIs with spaces or invalid characters
+  - Inconsistency in IITA subjects, for example "PRODUCTION VEGETALE" and "PRODUCTION VÉGÉTALE" and several others
+  - I ran `value.unescape('javascript')` on the abstract and citation fields because it looks like this data came from a SQL database and some characters were escaped
+- It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede's original file it doesn't have all those corrections
+- So I told Sisay to re-create the collection using Abenet's XLS from last week (`Mercy1805_AY.xls`)
+- I was curious to see if I could write a GREL expression for a custom text facet in OpenRefine to find cells with two or more consecutive spaces
+- I always use the built-in trim and collapse transformations anyway, but this seems to work to find the offending cells: `isNotNull(value.match(/.*?\s{2,}.*?/))`
+- I wonder if I should start checking for "smart" quotes like ’ (U+2019)
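
The two checks at the end of these notes (consecutive spaces and "smart" quotes) could also be run outside OpenRefine, for example on values read from a CSV export. The following is a minimal Python sketch, not part of the commit: the function names are hypothetical, and the set of quote characters (U+2018, U+2019, U+201C, U+201D) is an assumption about what counts as "smart". Unlike the GREL `match`, Python's `re.search` scans the whole string, so the surrounding `.*?` is not needed.

```python
import re

# Two or more whitespace characters in a row, mirroring \s{2,} in the GREL facet
MULTI_SPACE = re.compile(r"\s{2,}")

# Assumed set of "smart" quotes: U+2018, U+2019, U+201C, U+201D
SMART_QUOTES = re.compile("[\u2018\u2019\u201c\u201d]")

# Map each smart quote to its plain ASCII equivalent
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def has_consecutive_spaces(value):
    """Return True if the cell has two or more whitespace characters in a row."""
    return value is not None and MULTI_SPACE.search(value) is not None


def has_smart_quotes(value):
    """Return True if the cell contains any of the assumed smart quotes."""
    return value is not None and SMART_QUOTES.search(value) is not None


def straighten_quotes(value):
    """Replace smart quotes with their ASCII equivalents."""
    return value.translate(QUOTE_TABLE)
```

With these helpers, the two country spellings noted above ("COTE D’IVOIRE" and "COTE D'IVOIRE") become identical after `straighten_quotes`, and empty cells passed as `None` are simply reported as clean.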