- Merge the ORCID integration stuff into `5_x-prod` for deployment on CGSpace soon ([#359](https://github.com/ilri/DSpace/pull/359))
- Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server
- Run all system updates on DSpace Test and reboot server
- I ran the [orcid-authority-to-item.py](https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c) script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata
- On second inspection it looks like `dc.description.provenance` fields use the text_lang "en" so that's probably why there are over 100,000 fields changed...
- If I skip that, there are about 2,000, which seems more like the number of fields users have edited manually, or fucked up during CSV import, etc:
```
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
```
- In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine
- Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the `cg.creator.id` field
- For example, a GREL expression in a custom text facet to get all items with `dc.contributor.author[en_US]` of a certain author with several name variations (this is how you use a logical OR in OpenRefine):
- Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:
```
if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
```
- One thing that bothers me is that this won't honor author order
- It might be better to do batches of these in PostgreSQL with a script that takes the `place` column of an author into account when setting the `cg.creator.id`
- I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching `cg.creator.id` fields: [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050)
- The CSV should have two columns: author name and ORCID identifier:
```
dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
```
- I didn't integrate the ORCID API lookup for author names in this script for now because I was only interested in "tagging" old items for a few given authors
- Create a pull request to disable ORCID authority integration for `dc.contributor.author` in the submission forms and XMLUI display ([#363](https://github.com/ilri/DSpace/pull/363))
- I created a new Linode server for DSpace Test (linode6623840) so I could try the block storage stuff, but when I went to add a 300GB volume it said that block storage capacity was exceeded in that datacenter (Newark, NJ)
- I deleted the Linode and created another one (linode6624164) in the Fremont, CA region
- After that I deployed the Ubuntu 16.04 image and attached a 300GB block storage volume to the image
- Magdalena wrote to ask why there was no Altmetric donut for an item on CGSpace, but there was one on the related CCAFS publication page
- It looks like the CCAFS publications page fetches the donut using its DOI, whereas CGSpace queries via Handle
- I will write to Altmetric support and ask them, as perhaps it's part of a larger issue
- ICT made the DNS updates for dspacetest.cgiar.org late last night
- I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164
- Looking at the CRP subjects on CGSpace I see there is one blank one so I'll just fix it:
```
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
```
- Copy all CRP subjects to a CSV to do the mass updates:
```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
COPY 21
```
- Once I prepare the new input forms ([#362](https://github.com/ilri/DSpace/issues/362)) I will need to do the batch corrections:
```
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
```
- I have no idea why it crashed
- I ran all system updates and rebooted it
- Abenet told me that one of Lance Robinson's ORCID iDs on CGSpace is incorrect
- I will remove it from the controlled vocabulary ([#367](https://github.com/ilri/DSpace/pull/367)) and update any items using the old one:
```
dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
UPDATE 1
```
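- As an illustration only (not one of the scripts mentioned in these notes), here is a minimal Python sketch of a sanity check that could be run on an ORCID iD before an update like the one above. It assumes the ISO 7064 MOD 11-2 check digit scheme that ORCID documents for its identifiers; the function name `orcid_checksum_ok` is hypothetical:

```python
import re

# Four groups of four characters; the last character may be a digit or "X".
ORCID_RE = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def orcid_checksum_ok(orcid: str) -> bool:
    """Return True if the iD is well formed and its check digit matches.

    Hypothetical helper: it only validates the format and the
    ISO 7064 MOD 11-2 check digit, it does not query the ORCID registry.
    """
    if not ORCID_RE.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    # Process the first 15 digits; the 16th character is the check digit.
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected
```

- Both the old and the new iD in the update above pass this check, so it would only catch typos, not an iD that belongs to a different person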
- Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits
- Looks like the indexing gets confused that there is still data in the `authority` column
- Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!
- Since we've migrated the ORCID identifiers associated with the authority data to the `cg.creator.id` field we can nullify the authorities remaining in the database:
```sql
dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
UPDATE 195463
```
- After this the indexing works as usual and item counts and facets are back to normal
- Send Peter a list of all authors to correct:
```sql
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
COPY 56156
```
- Afterwards we'll want to do some batch tagging of ORCID identifiers to these names
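- A rough Python sketch of one way to pre-screen an export like `/tmp/authors.csv` for names that differ only in spacing or letter case, which are easy candidates for correction. This is an assumption about how the list might be reviewed, not what was actually sent to Peter; `group_variants` is a hypothetical helper and it expects the `text_value,count` header produced by the `\copy` above:

```python
import csv
import io
from collections import defaultdict


def group_variants(csv_text: str) -> dict:
    """Group author names that are equal after whitespace/case folding.

    Returns only the groups with more than one spelling, keyed by the
    normalized name, with (original name, count) pairs as values.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["text_value"]
        # Collapse runs of whitespace and ignore case for comparison only.
        key = " ".join(name.split()).casefold()
        groups[key].append((name, int(row["count"])))
    return {key: names for key, names in groups.items() if len(names) > 1}
```

- Names that are real variations of each other (like "Orth, Alan" and "Orth, A.") are not caught by this and would still need manual review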
- I guess we need to give it more RAM because it now has CGSpace's large Solr core
- I will increase the memory from 3072m to 4096m
- Update [Ansible playbooks](https://github.com/ilri/rmg-ansible-public) to use [PostgreSQL JDBC driver](https://jdbc.postgresql.org/) 42.2.2
- Deploy the new JDBC driver on DSpace Test
- I'm also curious to see how long the `dspace index-discovery -b` takes on DSpace Test where the DSpace installation directory is on one of Linode's new block storage volumes
- But it's probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):
```
or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/))
)
```
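- For checking the same thing outside OpenRefine, a minimal Python sketch of the facet above. It flags the same characters (parentheses, pipe, U+FFFD, U+00A0, U+200A); the name `has_invalid_chars` is hypothetical:

```python
import re

# Same characters as the GREL facet: ( | ) plus the replacement character,
# non-breaking space, and hair space.
INVALID_CHARS = re.compile("[(|)\uFFFD\u00A0\u200A]")


def has_invalid_chars(value: str) -> bool:
    """Return True if the value contains any known-bad character."""
    return INVALID_CHARS.search(value) is not None
```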
- And here's one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it's time to add delete support to my `fix-metadata-values.py` script):
```
or(
isNotNull(value.match(/.*delete.*/i)),
isNotNull(value.match(/.*remove.*/i)),
isNotNull(value.match(/.*check.*/i))
)
```
- So I guess the routine in OpenRefine is:
- Transform: trim leading/trailing whitespace
- Transform: collapse consecutive whitespace
- Custom text facet for items to delete/check
- Custom text facet for illegal characters
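- The first three steps of that routine can be sketched in Python roughly like this. It is an approximation of the OpenRefine transforms, not a replacement for them; the function names are hypothetical, and `re.ASCII` is used so that `\s` only matches plain ASCII whitespace, leaving things like non-breaking spaces for the illegal character check:

```python
import re

# Same keywords as the delete/check facet, matched case-insensitively.
MARKERS = re.compile(r"delete|remove|check", re.IGNORECASE)
WHITESPACE = re.compile(r"\s+", re.ASCII)


def normalize_whitespace(value: str) -> str:
    """Collapse consecutive whitespace, then trim both ends."""
    return WHITESPACE.sub(" ", value).strip(" ")


def triage(values):
    """Split cleaned values into (keep, review) lists."""
    keep, review = [], []
    for raw in values:
        value = normalize_whitespace(raw)
        if MARKERS.search(value):
            review.append(value)  # marked for deletion or checking
        else:
            keep.append(value)
    return keep, review
```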
- Test the corrections and deletions locally, then run them on CGSpace:
- Atmire got back to me about the [Listings and Reports issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589) and said it's caused by items that have missing `dc.identifier.citation` fields
- They will send a fix
## 2018-03-27
- Atmire got back with an updated quote about the DSpace 5.8 compatibility so I've forwarded it to Peter
## 2018-03-28
- DSpace Test crashed due to heap space so I've increased it from 4096m to 5120m
```
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
Fixed 31 occurences of: HUMIDTROPICS
Fixed 21 occurences of: MAIZE
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
```
- That's weird because we just updated them last week...
- Create a pull request to enable searching by ORCID identifier (`cg.creator.id`) in Discovery and Listings and Reports ([#371](https://github.com/ilri/DSpace/pull/371))
- I will test it on DSpace Test first!
- Fix one missing XMLUI string for "Access Status" (cg.identifier.status)