diff --git a/content/post/2018-03.md b/content/post/2018-03.md
index 0dbac6c24..3c6841969 100644
--- a/content/post/2018-03.md
+++ b/content/post/2018-03.md
@@ -56,3 +56,65 @@ UPDATE 1
 - Help Sisay proof 200 IITA records on DSpace Test
 - Finally import Udana's 24 items to [IWMI Journal Articles](https://cgspace.cgiar.org/handle/10568/36185) on CGSpace
 - Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc
+
+## 2018-03-08
+
+- Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata
+- This makes the CSV have tons of columns, for example `dc.title`, `dc.title[]`, `dc.title[en]`, `dc.title[eng]`, `dc.title[en_US]` and so on!
+- I think I can fix, or at least normalize, them in the database:
+
+```
+dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
+ text_lang
+-----------
+
+ ethnob
+ en
+ spa
+ EN
+ En
+ en_
+ en_US
+ E.
+
+ EN_US
+ en_U
+ eng
+ fr
+ es_ES
+ es
+(16 rows)
+
+dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
+UPDATE 122227
+dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
+ text_lang
+-----------
+
+ ethnob
+ en_US
+ spa
+ E.
+
+ fr
+ es_ES
+ es
+(9 rows)
+```
+
+- In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine
+- Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the `cg.creator.id` field
+- For example, a GREL expression in a custom text facet to get all items with `dc.contributor.author[en_US]` of a certain author with several name variations (this is how you use a logical OR in OpenRefine):
+
+```
+or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
+```
+
+- Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:
+
+```
+if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
+```
+
+- One thing that bothers me is that this won't honor author order
+- It might be better to do batches of these in PostgreSQL with a script that takes the `place` column of an author into account when setting the `cg.creator.id`
diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html
index 9e6608031..ff5adae59 100644
--- a/docs/2018-03/index.html
+++ b/docs/2018-03/index.html
@@ -20,7 +20,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
-
+
@@ -51,9 +51,9 @@ Export a CSV of the IITA community metadata for Martin Mueller
  "@type": "BlogPosting",
  "headline": "March, 2018",
  "url": "https://alanorth.github.io/cgspace-notes/2018-03/",
- "wordCount": "323",
+ "wordCount": "599",
  "datePublished": "2018-03-02T16:07:54+02:00",
- "dateModified": "2018-03-07T12:29:24+02:00",
+ "dateModified": "2018-03-08T15:05:29+02:00",
  "author": {
  "@type": "Person",
  "name": "Alan Orth"
@@ -182,6 +182,73 @@ UPDATE 1
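
The note about author order can be made concrete with a small sketch. This is not the script the post has in mind and it does not touch PostgreSQL; it only illustrates the idea of building the `cg.creator.id` value from each author's `place` rather than appending to the end of an existing value. The `(place, name)` tuples, the function names, and the second author with an all-zero placeholder identifier are assumptions made up for the example.

```python
# Hypothetical sketch: the record layout, function names, and the second
# author ("Example, Jane" with an all-zero placeholder ORCID) are invented
# for illustration. Only the Ceballos variants and label come from the post.

ORCID_BY_VARIANT = {
    "Ceballos, Hern": "Hernan Ceballos: 0000-0002-8744-7918",
    "Hernández Ceballos": "Hernan Ceballos: 0000-0002-8744-7918",
    "Example, Jane": "Jane Example: 0000-0000-0000-0000",  # placeholder
}


def match_orcid(author_name):
    """Return the label of the first name variant contained in author_name.

    Mirrors the OpenRefine `or(value.contains(...), ...)` facet: any
    variant matching is enough. Returns None when nothing matches.
    """
    for variant, label in ORCID_BY_VARIANT.items():
        if variant in author_name:
            return label
    return None


def build_creator_id_value(authors):
    """Join ORCID labels with '||' in author order.

    authors is a list of (place, name) tuples for a single item, where
    place is the author's position. Authors without a known identifier
    are skipped, and a label is emitted only once per item.
    """
    labels = []
    for _place, name in sorted(authors):
        label = match_orcid(name)
        if label is not None and label not in labels:
            labels.append(label)
    return "||".join(labels)


if __name__ == "__main__":
    # Authors arrive out of order; the result follows their place values.
    item_authors = [(2, "Ceballos, Hernán"), (1, "Example, Jane"), (3, "Doe, John")]
    print(build_creator_id_value(item_authors))
```

Sorting by `place` before matching is what keeps the identifiers aligned with the author list, which the append-with-`||` conditional cannot guarantee.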