cgspace-notes/content/post/2016-06.md

14 KiB

+++ date = "2016-06-01T10:53:00+03:00" author = "Alan Orth" title = "June, 2016" tags = ["notes"] image = "../images/bg.jpg"

+++

2016-06-01

dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
UPDATE 497
dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
UPDATE 14
  • Fix a few minor miscellaneous issues in dspace.cfg (#227)

2016-06-02

  • Testing the configuration and theme changes for the upcoming metadata migration and I found some issues with cg.coverage.admin-unit
  • Seems that the Browse configuration in dspace.cfg can't handle the '-' in the field name:
webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
  • But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error
  • I've sent a message to the DSpace mailing list to ask about the Browse index definition
  • A user was having problems with submission and from the stacktrace it looks like a Sherpa/Romeo issue
  • I found a thread on the mailing list talking about it and there is bug report and a patch: https://jira.duraspace.org/browse/DS-2740
  • The patch applies successfully on DSpace 5.1 so I will try it later

2016-06-03

  • Investigating the CCAFS authority issue, I exported the metadata for the Videos collection
  • The top two authors are:
CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600
  • So the only difference is the "confidence"
  • Ok, well THAT is interesting:
dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
 text_value |              authority               | confidence
------------+--------------------------------------+------------
 Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
 Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
 Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
 Orth, Alan |                                      |         -1
 Orth, Alan |                                      |         -1
 Orth, Alan |                                      |         -1
 Orth, Alan |                                      |         -1
 Orth, A.   | 05c2c622-d252-4efb-b9ed-95a07d3adf11 |         -1
 Orth, A.   | 05c2c622-d252-4efb-b9ed-95a07d3adf11 |         -1
 Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
 Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
 Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 |        600
 Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 |        600
(13 rows)
  • And now an actually relevent example:
dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
 count
-------
   707
(1 row)

dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
 count
-------
   253
(1 row)
  • Trying something experimental:
dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
  • And then re-indexing authority and Discovery...?
  • After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet
  • The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:
webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
  • That would only be for the "Browse by" function... so we'll have to see what effect that has later

2016-06-04

  • Re-sync DSpace Test with CGSpace and perform test of metadata migration again
  • Run phase two of metadata migrations on CGSpace (see the migration notes)
  • Run all system updates and reboot CGSpace server

2016-06-07

  • Figured out how to export a list of the unique values from a metadata field ordered by count:
dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
  • Identified the next round of fields to migrate:

    • dc.title.jtitle → dc.source
    • dc.crsubject.crpsubject → cg.contributor.crp
    • dc.contributor.affiliation → cg.contributor.affiliation
    • dc.Species → cg.species
    • dc.contributor.corporate → dc.contributor
    • dc.identifier.url → cg.identifier.url
    • dc.identifier.doi → cg.identifier.doi
    • dc.identifier.googleurl → cg.identifier.googleurl
    • dc.identifier.dataurl → cg.identifier.dataurl
  • Discuss pulling data from IFPRI's ContentDM with Ryan Miller

  • Looks like OAI is kinda obtuse for this, and if we use ContentDM's API we'll be able to access their internal field names (rather than trying to figure out how they stuffed them into various, repeated Dublin Core fields)

2016-06-08

$ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
  • Write to Atmire about the use of atmire.orcid.id to see if we can change it
  • Seems to be a virtual field that is queried from the authority cache... hmm
  • In other news, I found out that the About page that we haven't been using lives in dspace/config/about.xml, so now we can update the text
  • File bug about closed="true" attribute of controlled vocabularies not working: https://jira.duraspace.org/browse/DS-3238

2016-06-09

  • Atmire explained that the atmire.orcid.id field doesn't exist in the schema, as it actually comes from the authority cache during XMLUI run time
  • This means we don't see it when harvesting via OAI or REST, for example
  • They opened a feature ticket on the DSpace tracker to ask for support of this: https://jira.duraspace.org/browse/DS-3239

2016-06-10

  • Investigating authority confidences
  • It looks like the values are documented in Choices.java
  • Experiment with setting all 960 CCAFS author values to be 500:
dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';

dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
  • After the database edit, I did a full Discovery re-index
  • And now there are exactly 960 items in the authors facet for 'CGIAR Research Program on Climate Change, Agriculture and Food Security'
  • Now I ran the same on CGSpace
  • Merge controlled vocabulary functionality for animal breeds to 5_x-prod (#236)
  • Write python script to update metadata values in batch via PostgreSQL: fix-metadata-values.py
  • We need to use this to correct some pretty ugly values in fields like dc.description.sponsorship
  • Merge item display tweaks from earlier this week (#231)
  • Merge controlled vocabulary functionality for subregions (#238)

2016-06-11

  • Merge controlled vocabulary for sponsorship field (#239)
  • Fix character encoding issues for animal breed lookup that I merged yesterday

2016-06-17

  • Linode has free RAM upgrades for their 13th birthday so I migrated DSpace Test (4→8GB of RAM)

2016-06-18

  • Clean up titles and hints in input-forms.xml to use title/sentence case and a few more consistency things (#241)

  • The final list of fields to migrate in the third phase of metadata migrations is:

    • dc.title.jtitle → dc.source
    • dc.crsubject.crpsubject → cg.contributor.crp
    • dc.contributor.affiliation → cg.contributor.affiliation
    • dc.srplace.subregion → cg.coverage.subregion
    • dc.Species → cg.species
    • dc.contributor.corporate → dc.contributor
    • dc.identifier.url → cg.identifier.url
    • dc.identifier.doi → cg.identifier.doi
    • dc.identifier.googleurl → cg.identifier.googleurl
    • dc.identifier.dataurl → cg.identifier.dataurl
  • Interesting "Sunburst" visualization on a Digital Commons page: http://www.repository.law.indiana.edu/sunburst.html

  • Final testing on metadata fix/delete for dc.description.sponsorship cleanup

  • Need to run fix-metadata-values.py and then fix-metadata-values.py

2016-06-20

  • CGSpace's HTTPS certificate expired last night and I didn't notice, had to renew:
# /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
  • I really need to fix that cron job...

2016-06-24

  • Run the replacements/deletes for dc.description.sponsorship (investors) on CGSpace:
$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace

2016-06-28

  • Testing the cleanup of dc.contributor.corporate with 13 deletions and 121 replacements
  • There are still ~97 fields that weren't indicated to do anything
  • After the above deletions and replacements I regenerated a CSV and sent it to Peter et al to have a look
dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
  • Re-evaluate dc.contributor.corporate and it seems we will move it to dc.contributor.author as this is more in line with how editors are actually using it

2016-06-29

  • Test run of migrate-fields.sh with the following re-mappings:
72  55  #dc.source
86  230 #cg.contributor.crp
91  211 #cg.contributor.affiliation
94  212 #cg.species
107 231 #cg.coverage.subregion
126 3   #dc.contributor.author
73  219 #cg.identifier.url
74  220 #cg.identifier.doi
79  222 #cg.identifier.googleurl
89  223 #cg.identifier.dataurl
  • Run all cleanups and deletions of dc.contributor.corporate on CGSpace:
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
  • Re-deploy CGSpace and DSpace Test with latest June changes
  • Now the sharing and Altmetric bits are more prominent:

DSpace 5.1 XMLUI With Altmetric Badge

  • Run all system updates on the servers and reboot
  • Start working on config changes for phase three of the metadata migrations

2016-06-30

  • Wow, there are 95 authors in the database who have ',' at the end of their name:
# select text_value from  metadatavalue where metadata_field_id=3 and text_value like '%,';
  • We need to use something like this to fix them, need to write a proper regex later:
# update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';