CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2018

2018-03-02

  • Export a CSV of the IITA community metadata for Martin Mueller

2018-03-06

  • Add three new CCAFS project tags to input-forms.xml (#357)
  • Andrea from Macaroni Bros had sent me an email that CCAFS needs them
  • Give Udana more feedback on his WLE records from last month
  • There were some records using a non-breaking space in their AGROVOC subject field
  • I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace

    $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3      
    $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
    
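  • For reference, fix-metadata-values.py is essentially a CSV-driven batch UPDATE on metadatavalue; here is a minimal untested sketch of the idea in Python (not the actual script, and the CSV column handling is an assumption):

    #!/usr/bin/env python
    # Untested sketch of the fix-metadata-values.py idea: read a CSV whose
    # columns match the -f (incorrect value) and -t (correct value) options,
    # then batch-update matching rows in metadatavalue. Column names and the
    # lack of a dry-run mode here are simplifications of the real script.
    import csv
    import psycopg2

    FIELD_ID = 3  # dc.contributor.author (-m 3)

    conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
    cursor = conn.cursor()

    with open('Correct-309-authors-2018-03-06.csv') as csvfile:
        for row in csv.DictReader(csvfile):
            cursor.execute(
                'UPDATE metadatavalue SET text_value=%s '
                'WHERE resource_type_id=2 AND metadata_field_id=%s AND text_value=%s',
                (row['correct'], FIELD_ID, row['dc.contributor.author'])
            )
    conn.commit()
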
  • This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character

  • Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to input-forms.xml (#358)

  • Merge the ORCID integration stuff into 5_x-prod for deployment on CGSpace soon (#359)

  • Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server

  • Run all system updates on DSpace Test and reboot server

  • I ran the orcid-authority-to-item.py script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata

    $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
    
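  • Roughly, the idea of that script is to pull ORCID iDs out of the Solr authority core and copy them into item metadata; a hedged, untested sketch of the idea (not the actual script, and the Solr field names are assumptions):

    # Untested sketch of the orcid-authority-to-item.py idea: query the Solr
    # authority core for records that have an ORCID iD, then find author
    # metadata rows referencing those authority records and add a matching
    # cg.creator.id value. The Solr field names ("id", "value", "orcid_id")
    # are assumptions.
    import psycopg2
    import requests

    solr = requests.get('http://localhost:8081/solr/authority/select',
                        params={'q': 'orcid_id:*', 'wt': 'json', 'rows': 10000}).json()

    conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
    cursor = conn.cursor()

    for doc in solr['response']['docs']:
        cursor.execute('SELECT resource_id FROM metadatavalue '
                       'WHERE resource_type_id=2 AND metadata_field_id=3 AND authority=%s',
                       (doc['id'],))
        for (resource_id,) in cursor.fetchall():
            # the real script would insert a cg.creator.id metadata value here
            print('Item {}: {}: {}'.format(resource_id, doc['value'], doc['orcid_id']))
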
  • I ran the DSpace cleanup script on CGSpace and it threw an error (as always):

    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
    
  • The solution is, as always:

    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
    UPDATE 1
    
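  • Since this error comes up every time, a tiny wrapper could pull the offending bitstream ID out of the error and nullify it before retrying; an untested sketch (re-running dspace cleanup afterwards is my assumption of the routine):

    # Untested sketch: run "dspace cleanup", and if it fails with the usual
    # bundle_primary_bitstream_id_fkey error, nullify the offending
    # primary_bitstream_id and run the cleanup again.
    import re
    import subprocess
    import psycopg2

    result = subprocess.run(['dspace', 'cleanup'], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    match = re.search(r'Key \(bitstream_id\)=\((\d+)\)', result.stdout)
    if match:
        conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
        cursor = conn.cursor()
        cursor.execute('UPDATE bundle SET primary_bitstream_id=NULL '
                       'WHERE primary_bitstream_id=%s', (match.group(1),))
        conn.commit()
        subprocess.run(['dspace', 'cleanup'])
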
  • Apply the proposed PostgreSQL indexes from DS-3636 (pull request #1791) on CGSpace (linode18)

2018-03-07

  • Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (#360)
  • Help Sisay proof 200 IITA records on DSpace Test
  • Finally import Udana’s 24 items to IWMI Journal Articles on CGSpace
  • Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc

2018-03-08

  • Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata
  • This makes the CSV have tons of columns, for example dc.title, dc.title[], dc.title[en], dc.title[eng], dc.title[en_US] and so on!
  • I think I can fix — or at least normalize — them in the database:

    dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
    text_lang 
    -----------
     
    ethnob
    en
    spa
    EN
    En
    en_
    en_US
    E.
     
    EN_US
    en_U
    eng
    fr
    es_ES
    es
    (16 rows)
    
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
    UPDATE 122227
    dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
    text_lang
    -----------
    
    ethnob
    en_US
    spa
    E.
    
    fr
    es_ES
    es
    (9 rows)
    
  • On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…

  • If I skip that, there are about 2,000, which seems like a more reasonable number of fields that users have edited manually or fucked up during CSV import, etc:

    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
    UPDATE 2309
    
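  • To double check the provenance theory I could count which metadata fields actually carry the “en” text_lang; a quick untested sketch with psycopg2:

    # Untested sketch: count which metadata fields use text_lang 'en', to
    # confirm that dc.description.provenance accounts for the 100,000+ rows.
    import psycopg2

    conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
    cursor = conn.cursor()
    cursor.execute("""
        SELECT r.element, r.qualifier, count(*) AS total
        FROM metadatavalue v
        JOIN metadatafieldregistry r ON v.metadata_field_id = r.metadata_field_id
        WHERE v.resource_type_id = 2 AND v.text_lang = 'en'
        GROUP BY r.element, r.qualifier
        ORDER BY total DESC
    """)
    for element, qualifier, total in cursor.fetchall():
        print('{}.{}: {}'.format(element, qualifier, total))
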
  • I will apply this on CGSpace right now

  • In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine

  • Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the cg.creator.id field

  • For example, a GREL expression in a custom text facet to get all items with dc.contributor.author[en_US] of a certain author with several name variations (this is how you use a logical OR in OpenRefine):

    or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
    
  • Then you can flag or star the matching items and use a conditional to either set the value directly or append it to an existing value:

    if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
    
  • One thing that bothers me is that this won’t honor author order

  • It might be better to do batches of these in PostgreSQL with a script that takes the place column of an author into account when setting the cg.creator.id

  • I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching cg.creator.id fields: add-orcid-identifiers-csv.py

  • The CSV should have two columns: author name and ORCID identifier:

    dc.contributor.author,cg.creator.id
    "Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
    "Orth, A.",Alan S. Orth: 0000-0002-1735-7458
    
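  • Very roughly, the script finds items with each author name and inserts a matching cg.creator.id value re-using the author’s place column; a simplified, untested sketch of the idea (not the actual script, and the CSV filename and metadatavalue schema details are assumptions):

    # Untested sketch of the add-orcid-identifiers-csv.py idea: for each
    # author name in the CSV, find items with that exact dc.contributor.author
    # value and insert a cg.creator.id value with the same place, so author
    # order is preserved. The CSV filename is a placeholder and the INSERT
    # column list is an approximation of the DSpace 5 metadatavalue schema.
    import csv
    import psycopg2

    AUTHOR_FIELD = 3        # dc.contributor.author
    CREATOR_ID_FIELD = 240  # cg.creator.id

    conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
    cursor = conn.cursor()

    with open('orcid-ids.csv') as csvfile:
        for row in csv.DictReader(csvfile):
            cursor.execute('SELECT resource_id, place FROM metadatavalue '
                           'WHERE resource_type_id=2 AND metadata_field_id=%s AND text_value=%s',
                           (AUTHOR_FIELD, row['dc.contributor.author']))
            for resource_id, place in cursor.fetchall():
                cursor.execute('INSERT INTO metadatavalue (metadata_value_id, resource_id, '
                               'metadata_field_id, text_value, place, confidence, resource_type_id) '
                               "VALUES (nextval('metadatavalue_seq'), %s, %s, %s, %s, -1, 2)",
                               (resource_id, CREATOR_ID_FIELD, row['cg.creator.id'], place))
    conn.commit()
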
  • I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors

  • I added ORCID identifiers for 187 items by CIAT’s Hernan Ceballos, because that is what Elizabeth was trying to do manually!

  • Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well

2018-03-09

  • Give James Stapleton input on Sisay’s KRAs
  • Create a pull request to disable ORCID authority integration for dc.contributor.author in the submission forms and XMLUI display (#363)

2018-03-11

  • Peter also wrote to say he is having issues with the Atmire Listings and Reports module
  • When I logged in to try it I got a blank white page after continuing, and I saw this in dspace.log.2018-03-11:

    2018-03-11 11:38:15,592 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.org/jspui/listings-and-reports
    -- Method: POST
    -- Parameters were:
    -- selected_admin_preset: "ilri authors2"
    -- load: "normal"
    -- next: "NEXT STEP >>"
    -- step: "1"
    
    org.apache.jasper.JasperException: java.lang.NullPointerException
    
  • Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn’t find them

  • I made a quick fix and it’s working now (#364)

2018-03-12

  • Increase upload size on CGSpace’s nginx config to 85MB so Sisay can upload some data

2018-03-13

  • I created a new Linode server for DSpace Test (linode6623840) so I could try the block storage stuff, but when I went to add a 300GB volume it said that block storage capacity was exceeded in that datacenter (Newark, NJ)
  • I deleted the Linode and created another one (linode6624164) in the Fremont, CA region
  • After that I deployed the Ubuntu 16.04 image and attached a 300GB block storage volume to the image
  • Magdalena wrote to ask why there was no Altmetric donut for an item on CGSpace, but there was one on the related CCAFS publication page
  • It looks like the CCAFS publications page fetches the donut using its DOI, whereas CGSpace queries via Handle
  • I will write to Altmetric support and ask them, as perhaps it’s part of a larger issue
  • CGSpace item: https://cgspace.cgiar.org/handle/10568/89643
  • CCAFS publication page: https://ccafs.cgiar.org/publications/can-scenario-planning-catalyse-transformational-change-evaluating-climate-change-policy
  • Peter tweeted the Handle link and now Altmetric shows the donut for both the DOI and the Handle
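  • For future reference, the two sides can be compared by hitting the Altmetric API directly by Handle and by DOI; a quick untested sketch (the endpoint paths and response fields are from memory, so treat them as assumptions):

    # Untested sketch: query the Altmetric API for the same output by Handle,
    # then cross-check the DOI it reports. Endpoint paths and response fields
    # are assumptions.
    import requests

    handle = '10568/89643'
    r = requests.get('https://api.altmetric.com/v1/handle/{}'.format(handle))
    if r.status_code == 200:
        data = r.json()
        print('Handle {}: score {}, DOI {}'.format(handle, data.get('score'), data.get('doi')))
        if data.get('doi'):
            r2 = requests.get('https://api.altmetric.com/v1/doi/{}'.format(data['doi']))
            print('DOI {}: HTTP {}'.format(data['doi'], r2.status_code))
    else:
        print('Handle {}: HTTP {} (no donut?)'.format(handle, r.status_code))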

2018-03-14

  • Help Abenet with a troublesome Listings and Report question for CIAT author Steve Beebe
  • Continue migrating DSpace Test to the new server (linode6624164)
  • I emailed ILRI service desk to update the DNS records for dspacetest.cgiar.org
  • Abenet was having problems saving Listings and Reports configurations or layouts but I tested it and it works

2018-03-15

  • Help Abenet troubleshoot the Listings and Reports issue again
  • It looks like it’s an issue with the layouts: it happens if you create a new layout that only has one type (dc.identifier.citation):

Listing and Reports layout

2018-03-16

  • ICT made the DNS updates for dspacetest.cgiar.org late last night
  • I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164
  • Looking at the CRP subjects on CGSpace I see there is one blank one so I’ll just fix it:

    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
    
  • Copy all CRP subjects to a CSV to do the mass updates:

    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
    COPY 21
    
  • Once I prepare the new input forms (#362) I will need to do the batch corrections:

    $ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
    
  • Create a pull request to update the input forms for the new CRP subject style (#366)

2018-03-19

  • Tezira has been having problems accessing CGSpace from the ILRI Nairobi campus since last week
  • She is getting an HTTPS error apparently
  • It works from outside the campus, and Ethiopian users seem to be having no issues, so I’ve asked ICT to have a look
  • CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat
  • Around that time there was an increase in SQL errors:

    2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
    ...
    2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
    
  • But I don’t even know what these errors mean, because a handful of them happen every day:

    $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
    dspace.log.2018-03-10:13
    dspace.log.2018-03-11:15
    dspace.log.2018-03-12:13
    dspace.log.2018-03-13:13
    dspace.log.2018-03-14:14
    dspace.log.2018-03-15:13
    dspace.log.2018-03-16:13
    dspace.log.2018-03-17:13
    dspace.log.2018-03-18:15
    dspace.log.2018-03-19:90
    
  • There wasn’t even a lot of traffic at the time (8–9 AM):

    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     92 40.77.167.197
     92 83.103.94.48
     96 40.77.167.175
    116 207.46.13.178
    122 66.249.66.153
    140 95.108.181.88
    196 213.55.99.121
    206 197.210.168.174
    207 104.196.152.243
    294 54.198.169.202
    
  • Well there is a hint in Tomcat’s catalina.out:

    Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
    Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
    
  • So someone was doing something heavy somehow… my guess is content and usage stats!

  • ICT responded that they “fixed” the CGSpace connectivity issue in Nairobi without telling me the problem

  • When I asked, Robert Okal said CGNET messed up when updating the DNS for cgspace.cgiar.org last week

  • I told him that my request last week was for dspacetest.cgiar.org, not cgspace.cgiar.org!

  • So they updated the wrong fucking DNS records

  • Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export

  • It appears to be this one: https://cgspace.cgiar.org/handle/10568/83473?show=full

  • The title is “Untitled” and there is some metadata but indeed the citation is missing

  • I don’t know what would cause that

2018-03-20

  • DSpace Test has been down for a few hours with SQL and memory errors starting this morning:

    2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
    ...
    2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
    org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    
  • I have no idea why it crashed

  • I ran all system updates and rebooted it

  • Abenet told me that one of Lance Robinson’s ORCID iDs on CGSpace is incorrect

  • I will remove it from the controlled vocabulary (#367) and update any items using the old one:

    dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
    UPDATE 1
    
  • Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits

  • Merge the changes to CRP names into the 5_x-prod branch and deploy on CGSpace (#363)

  • Run corrections for CRP names in the database:

    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
    
  • Run all system updates on CGSpace (linode18) and reboot the server

  • I started a full Discovery re-index on CGSpace because of the updated CRPs

  • I see this error in the DSpace log:

    2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for  field "dc_contributor_author".
    java.lang.IllegalArgumentException: No choices plugin was configured for  field "dc_contributor_author".
        at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
        at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
        at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
        at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
        at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
        at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
        at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
        at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
    
  • I have to figure that one out…

2018-03-21

  • Looks like the indexing gets confused when there is still data in the authority column
  • Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!
  • Since we’ve migrated the ORCID identifiers associated with the authority data to the cg.creator.id field we can nullify the authorities remaining in the database:

    dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
    UPDATE 195463
    
  • After this the indexing works as usual and item counts and facets are back to normal

  • Send Peter a list of all authors to correct:

    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
    COPY 56156
    
  • Afterwards we’ll want to do some batch tagging of ORCID identifiers to these names
  • CGSpace crashed again this afternoon; I’m not sure of the cause, but there are a lot of SQL errors in the DSpace log:

    2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
    java.sql.SQLException: Connection has already been closed.
    
  • I have no idea why so many connections were abandoned this afternoon:

    # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
    268
    
  • DSpace Test crashed again due to Java heap space, this is from the DSpace log:

    2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
    org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    
  • And this is from the Tomcat Catalina log:

    Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
    SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
    java.lang.OutOfMemoryError: Java heap space
    
  • But there are tons of heap space errors on DSpace Test actually:

    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    319
    
  • I guess we need to give it more RAM because it now has CGSpace’s large Solr core

  • I will increase the memory from 3072m to 4096m

  • Update Ansible playbooks to use PostgreSQL JDBC driver 42.2.2

  • Deploy the new JDBC driver on DSpace Test

  • I’m also curious to see how long the dspace index-discovery -b takes on DSpace Test where the DSpace installation directory is on one of Linode’s new block storage volumes

    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    
    real    208m19.155s
    user    8m39.138s
    sys     2m45.135s
    
  • So that’s about three times as long as it took on CGSpace this morning

  • I should also check the raw read speed with hdparm -tT /dev/sdc

  • Looking at Peter’s author corrections there are some mistakes due to Windows 1252 encoding

  • I need to find a way to filter these easily with OpenRefine

  • For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields

  • I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:

    isNotNull(value.match(/.*\ufffd.*/))
    
  • I need to add many more common characters, though, so that the expression is useful to copy and paste into a new project to find issues

2018-03-22

2018-03-24

  • More work on the Ubuntu 18.04 readiness stuff for the Ansible playbooks
  • The playbook now uses the system’s Ruby and Node.js so I don’t have to manually install RVM and NVM afterwards

2018-03-25

  • Looking at Peter’s author corrections and trying to work out a way to find errors in OpenRefine easily
  • I can find all names that have acceptable characters using a GREL expression like:

    isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
    
  • But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):

    or(
    isNotNull(value.match(/.*[(|)].*/)),
    isNotNull(value.match(/.*\uFFFD.*/)),
    isNotNull(value.match(/.*\u00A0.*/)),
    isNotNull(value.match(/.*\u200A.*/))
    )
    
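  • The same check can be run on a CSV export outside OpenRefine with a few lines of Python; an untested sketch against the /tmp/authors.csv dump:

    # Untested sketch: flag values in the authors CSV dump that contain the
    # same suspicious characters as the GREL expression above (parentheses,
    # pipe, replacement character, non-breaking space, hair space).
    import csv
    import re

    BAD = re.compile('[(|)\ufffd\u00a0\u200a]')

    with open('/tmp/authors.csv') as csvfile:
        for row in csv.DictReader(csvfile):
            if BAD.search(row['text_value']):
                print(repr(row['text_value']))
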
  • And here’s one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my fix-metadata-values.py script):

    or(
    isNotNull(value.match(/.*delete.*/i)),
    isNotNull(value.match(/.*remove.*/i)),
    isNotNull(value.match(/.*check.*/i))
    )
    
  • So I guess the routine in OpenRefine is:

    • Transform: trim leading/trailing whitespace
    • Transform: collapse consecutive whitespace
    • Custom text facet for items to delete/check
    • Custom text facet for illegal characters
  • Test the corrections and deletions locally, then run them on CGSpace:

    $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
    
  • Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test

  • CGSpace took 76m28.292s

  • DSpace Test took 194m56.048s

2018-03-26

  • Atmire got back to me about the Listings and Reports issue and said it’s caused by items that have missing dc.identifier.citation fields
  • They will send a fix

2018-03-27

  • Atmire got back with an updated quote about the DSpace 5.8 compatibility so I’ve forwarded it to Peter

2018-03-28

  • DSpace Test crashed due to heap space so I’ve increased it from 4096m to 5120m
  • The error in Tomcat’s catalina.out was:

    Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
    
  • Add ISI Journal (cg.isijournal) as an option in Atmire’s Listings and Reports layout (#370) for Abenet

  • I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:

    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
    Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
    Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
    Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
    Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
    Fixed 31 occurences of: HUMIDTROPICS
    Fixed 21 occurences of: MAIZE
    Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
    Fixed 28 occurences of: GRAIN LEGUMES
    Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
    Fixed 5 occurences of: GENEBANKS
    
  • That’s weird because we just updated them last week…

  • Create a pull request to enable searching by ORCID identifier (cg.creator.id) in Discovery and Listings and Reports (#371)

  • I will test it on DSpace Test first!

  • Fix one missing XMLUI string for “Access Status” (cg.identifier.status)

  • Run all system updates on DSpace Test and reboot the machine