CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2018

2018-03-02

  • Export a CSV of the IITA community metadata for Martin Mueller

2018-03-06

  • Add three new CCAFS project tags to input-forms.xml (#357)
  • Andrea from Macaroni Bros had emailed me to say that CCAFS needs them
  • Give Udana more feedback on his WLE records from last month
  • There were some records using a non-breaking space in their AGROVOC subject field
  • I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace
$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3      
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
  • This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character
  • Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to input-forms.xml (#358)
  • Merge the ORCID integration stuff into 5_x-prod for deployment on CGSpace soon (#359)
  • Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server
  • Run all system updates on DSpace Test and reboot server
  • I ran the orcid-authority-to-item.py script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata
$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
  • I ran the DSpace cleanup script on CGSpace and it threw an error (as always):
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
  • The solution is, as always:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
    UPDATE 1
    
  • Apply the proposed PostgreSQL indexes from DS-3636 (pull request #1791) on CGSpace (linode18)
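
Since the cleanup script's foreign key error above keeps recurring, the fix could be scripted. Below is a minimal Python sketch, assuming the error detail always has the form shown in the log. The function names are hypothetical, and the sketch only builds the SQL string; it does not connect to the database or run anything.

```python
import re

# Matches the "Detail:" line PostgreSQL prints for the foreign key violation
DETAIL_PATTERN = re.compile(
    r'Key \(bitstream_id\)=\((\d+)\) is still referenced from table "bundle"'
)


def extract_bitstream_ids(error_text):
    """Return the bitstream IDs named in the cleanup error output, in order."""
    return [int(match) for match in DETAIL_PATTERN.findall(error_text)]


def build_fix_sql(bitstream_ids):
    """Build the UPDATE that clears primary_bitstream_id for the given IDs."""
    if not bitstream_ids:
        raise ValueError("no bitstream IDs given")
    id_list = ", ".join(str(i) for i in bitstream_ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


# Error text copied from the cleanup run above
error_output = '''Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".'''

print(build_fix_sql(extract_bitstream_ids(error_output)))
```

The printed statement could then be reviewed and passed to psql by hand, as in the manual fix.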

2018-03-07

  • Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (#360)
  • Help Sisay proof 200 IITA records on DSpace Test
  • Finally import Udana’s 24 items to IWMI Journal Articles on CGSpace
  • Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc

2018-03-08

  • Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata
  • This makes the CSV have tons of columns, for example dc.title, dc.title[], dc.title[en], dc.title[eng], dc.title[en_US] and so on!
  • I think I can fix — or at least normalize — them in the database:
dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang 
-----------
 
 ethnob
 en
 spa
 EN
 En
 en_
 en_US
 E.
 
 EN_US
 en_U
 eng
 fr
 es_ES
 es
(16 rows)

dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang
-----------

 ethnob
 en_US
 spa
 E.

 fr
 es_ES
 es
(9 rows)
  • On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…
  • If I skip that, there are about 2,000, which seems like a more reasonable number of fields for users to have edited manually, or botched during CSV import, etc:
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
UPDATE 2309
  • I will apply this on CGSpace right now
  • In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine
  • Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the cg.creator.id field
  • For example, a GREL expression in a custom text facet to get all items with dc.contributor.author[en_US] of a certain author with several name variations (this is how you use a logical OR in OpenRefine):
or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
  • Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:
if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
  • One thing that bothers me is that this won’t honor author order
  • It might be better to do batches of these in PostgreSQL with a script that takes the place column of an author into account when setting the cg.creator.id
  • I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching cg.creator.id fields: add-orcid-identifiers-csv.py
  • The CSV should have two columns: author name and ORCID identifier:
dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
  • I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors
  • I added ORCID identifiers for 187 items by CIAT’s Hernan Ceballos, because that is what Elizabeth was trying to do manually!
  • Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well
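
To illustrate the two-column CSV format and the set-or-append logic from the GREL conditional above, here is a minimal Python sketch. It is not the add-orcid-identifiers-csv.py script itself: the function names are hypothetical, it works only on in-memory strings rather than the database, and skipping an identifier that is already present is an added assumption that the GREL expression does not make.

```python
import csv
import io

SEPARATOR = "||"  # multi-value separator, as used in the GREL expression


def load_orcid_map(csv_file):
    """Map each author name variant to its cg.creator.id value."""
    reader = csv.DictReader(csv_file)
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


def add_creator_id(existing, creator_id):
    """Set the value if the field is blank, otherwise append it.

    Unlike the GREL conditional, this skips identifiers already present.
    """
    if not existing or not existing.strip():
        return creator_id
    if creator_id in existing.split(SEPARATOR):
        return existing
    return existing + SEPARATOR + creator_id


# Sample input in the same format as the CSV shown above
sample = io.StringIO(
    "dc.contributor.author,cg.creator.id\n"
    '"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458\n'
    '"Orth, A.",Alan S. Orth: 0000-0002-1735-7458\n'
)
orcid_map = load_orcid_map(sample)

# Both name variations resolve to the same identifier
print(orcid_map["Orth, A."])
print(add_creator_id("", orcid_map["Orth, Alan"]))
```

Like the OpenRefine approach, this does not take author order into account; honoring the place column would still need to happen in the database.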

2018-03-09

  • Give James Stapleton input on Sisay’s KRAs
  • Create a pull request to disable ORCID authority integration for dc.contributor.author in the submission forms and XMLUI display (#363)

2018-03-11

  • Peter also wrote to say he is having issues with the Atmire Listings and Reports module
  • When I logged in to try it I got a blank white page after continuing, and I saw this in dspace.log.2018-03-11:
2018-03-11 11:38:15,592 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.org/jspui/listings-and-reports
-- Method: POST
-- Parameters were:
-- selected_admin_preset: "ilri authors2"
-- load: "normal"
-- next: "NEXT STEP >>"
-- step: "1"

org.apache.jasper.JasperException: java.lang.NullPointerException
  • Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn’t find them
  • I made a quick fix and it’s working now (#364)

2018-03-12

  • Increase upload size on CGSpace’s nginx config to 85MB so Sisay can upload some data