---
title: "March, 2018"
date: 2018-03-02T16:07:54+02:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2018-03-02

- Export a CSV of the IITA community metadata for Martin Mueller

<!--more-->

## 2018-03-06

- Add three new CCAFS project tags to `input-forms.xml` ([#357](https://github.com/ilri/DSpace/pull/357))
- Andrea from Macaroni Bros had sent me an email that CCAFS needs them
- Give Udana more feedback on his WLE records from last month
- There were some records using a non-breaking space in their AGROVOC subject field
- I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace

```
$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
```

- This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character
- Add new CRP subject "GRAIN LEGUMES AND DRYLAND CEREALS" to `input-forms.xml` ([#358](https://github.com/ilri/DSpace/pull/358))
- Merge the ORCID integration stuff in to `5_x-prod` for deployment on CGSpace soon ([#359](https://github.com/ilri/DSpace/pull/359))
- Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server
- Run all system updates on DSpace Test and reboot server
- I ran the [orcid-authority-to-item.py](https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c) script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata

```
$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
```

- I ran the DSpace cleanup script on CGSpace and it threw an error (as always):

```
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
```

- The solution is, as always:

```
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
UPDATE 1
```
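
- Since the fix has the same shape every time, the bitstream ID could be pulled out of the error text mechanically. A minimal sketch, assuming the error format shown above; `build_fix_sql` is a hypothetical helper and not part of DSpace:

```python
import re

# Matches the "Detail:" line PostgreSQL prints for the foreign key violation
KEY_PATTERN = re.compile(r"Key \(bitstream_id\)=\((\d+)\)")


def build_fix_sql(error_text):
    """Return an UPDATE that clears primary_bitstream_id for the IDs in error_text.

    Returns None when no bitstream ID is found.
    """
    # Deduplicate while keeping first-seen order
    ids = list(dict.fromkeys(KEY_PATTERN.findall(error_text)))
    if not ids:
        return None
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({','.join(ids)});"
    )


error = '''Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".'''

print(build_fix_sql(error))
```

- The pattern only captures digits, so nothing else from the error text can end up in the generated statement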

- Apply the proposed PostgreSQL indexes from DS-3636 (pull request [#1791](https://github.com/DSpace/DSpace/pull/1791/)) on CGSpace (linode18)

## 2018-03-07

- Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers ([#360](https://github.com/ilri/DSpace/pull/360))
- Help Sisay proof 200 IITA records on DSpace Test
- Finally import Udana's 24 items to [IWMI Journal Articles](https://cgspace.cgiar.org/handle/10568/36185) on CGSpace
- Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc

## 2018-03-08

- Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata
- This makes the CSV have tons of columns, for example `dc.title`, `dc.title[]`, `dc.title[en]`, `dc.title[eng]`, `dc.title[en_US]` and so on!
- I think I can fix — or at least normalize — them in the database:

```
dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang 
-----------
 
 ethnob
 en
 spa
 EN
 En
 en_
 en_US
 E.
 
 EN_US
 en_U
 eng
 fr
 es_ES
 es
(16 rows)

dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang
-----------

 ethnob
 en_US
 spa
 E.

 fr
 es_ES
 es
(9 rows)
```
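
- The effect of that `UPDATE` on the list of distinct values can be checked offline. A minimal sketch in Python, assuming the two blank rows in the first `SELECT` are a NULL and an empty string; `normalize_text_lang` is a hypothetical helper:

```python
# English variants targeted by the UPDATE above
ENGLISH_VARIANTS = {"en", "EN", "En", "en_", "EN_US", "en_U", "eng"}


def normalize_text_lang(value, variants=ENGLISH_VARIANTS):
    """Map known English variants to en_US and leave everything else alone."""
    return "en_US" if value in variants else value


# The 16 distinct values from the first SELECT; the two blank rows are
# assumed here to be NULL (None) and the empty string
before = [None, "", "ethnob", "en", "spa", "EN", "En", "en_", "en_US", "E.",
          "EN_US", "en_U", "eng", "fr", "es_ES", "es"]

after = {normalize_text_lang(v) for v in before}
print(len(before), len(after))
```

- Under that assumption the 16 distinct values collapse to 9, which agrees with the row count of the second `SELECT`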

- On second inspection it looks like `dc.description.provenance` fields use the `text_lang` "en", so that's probably why over 100,000 fields were changed...
- If I skip that, there are about 2,000, which seems like a more reasonable number of fields for users to have edited manually, or fucked up during CSV import, etc:

```
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
UPDATE 2309
```

- I will apply this on CGSpace right now
- In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine
- Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the `cg.creator.id` field
- For example, a GREL expression in a custom text facet to get all items with `dc.contributor.author[en_US]` of a certain author with several name variations (this is how you use a logical OR in OpenRefine):

```
or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
```

- Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:

```
if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
```
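
- For comparison, the same two steps can be sketched in Python. This is only a rough equivalent of the GREL expressions above; the function names are hypothetical, and treating whitespace-only cells as blank is an assumption about how `isBlank()` behaves:

```python
NAME_VARIANTS = ("Ceballos, Hern", "Hernández Ceballos")
ORCID_VALUE = "Hernan Ceballos: 0000-0002-8744-7918"


def matches_author(authors_cell, variants=NAME_VARIANTS):
    """Rough equivalent of or(value.contains(...), value.contains(...))."""
    return authors_cell is not None and any(v in authors_cell for v in variants)


def add_identifier(creator_id_cell, identifier=ORCID_VALUE):
    """Set the identifier on a blank cell, otherwise append it after '||'."""
    # Assumption: None and whitespace-only strings both count as blank
    if creator_id_cell is None or not creator_id_cell.strip():
        return identifier
    return creator_id_cell + "||" + identifier


# Hypothetical row: the author cell holds several names separated by '||'
row = {
    "dc.contributor.author[en_US]": "Ceballos, Hernán||Orth, A.",
    "cg.creator.id": "Alan S. Orth: 0000-0002-1735-7458",
}

if matches_author(row["dc.contributor.author[en_US]"]):
    row["cg.creator.id"] = add_identifier(row["cg.creator.id"])

print(row["cg.creator.id"])
```

- Like the GREL conditional, this appends the identifier a second time if it is run twice on the same row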

- One thing that bothers me is that this won't honor author order
- It might be better to do batches of these in PostgreSQL with a script that takes the `place` column of an author into account when setting the `cg.creator.id`
- I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching `cg.creator.id` fields: [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050)
- The CSV should have two columns: author name and ORCID identifier:

```
dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
```

- I didn't integrate the ORCID API lookup for author names in this script for now because I was only interested in "tagging" old items for a few given authors
- I added ORCID identifiers for 187 items by CIAT's Hernan Ceballos, because that is what Elizabeth was trying to do manually!
- Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well
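- The idea of reading the two-column CSV and honoring author order can be illustrated with a short sketch. It is not the linked script: the function names, the `(place, name)` tuples, and the "Ceballos, Hernán" row are assumptions made for the example, and it does not touch the database:

```python
import csv
import io

# Same two-column layout as the CSV described above; the Ceballos row is a
# hypothetical name variant added so the example has two identifiers
SAMPLE_CSV = '''dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
"Ceballos, Hernán",Hernan Ceballos: 0000-0002-8744-7918
'''


def load_identifiers(csv_file):
    """Read author name -> ORCID identifier pairs from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


def identifiers_in_author_order(authors, identifiers):
    """Return identifiers for one item's authors, sorted by the author's place.

    authors is a list of (place, name) tuples; names missing from the CSV
    are skipped.
    """
    return [identifiers[name] for _place, name in sorted(authors) if name in identifiers]


identifiers = load_identifiers(io.StringIO(SAMPLE_CSV))

# Hypothetical item whose authors arrive out of order
item_authors = [(2, "Orth, A."), (1, "Ceballos, Hernán"), (3, "Unknown, X.")]
print(identifiers_in_author_order(item_authors, identifiers))
```

- When reading a real file instead of a string, the `csv` module documentation recommends opening it with `newline=""`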