CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2017

2017-03-01

  • Run the 279 CIAT author corrections on CGSpace

2017-03-02

  • Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
  • CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
  • They might come in at the top level in one “CGIAR System” community, or with several communities
  • I need to spend a bit of time looking at the multiple handle support in DSpace and see if new content can be minted in both handles, or just one?
  • Need to send Peter and Michael some notes about this in a few days
  • Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
  • Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
  • Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
  • This results in discolored thumbnails when compared to the original PDF, for example sRGB and CMYK:

Thumbnail in sRGB colorspace

Thumbnial in CMYK colorspace

  • I filed an issue for the color space thing: DS-3517

2017-03-03

$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
  • This reads the input file, applies the CMYK profile, applies the RGB profile, then writes the file
  • Note that you should set the first profile immediately after the input file
  • Also, it is better to use profiles than setting -colorspace
  • This is a great resource describing the color stuff: http://www.imagemagick.org/Usage/formats/#profiles
  • Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)
  • This is trivial with identify (even by the Java ImageMagick API):
$ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
DirectClass CMYK
$ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\]
DirectClass sRGB Alpha

2017-03-04

  • Spent more time looking at the ImageMagick CMYK issue
  • The default_cmyk.icc and default_rgb.icc files are both part of the Ghostscript GPL distribution, but according to DSpace’s LICENSES_THIRD_PARTY file, DSpace doesn’t allow distribution of dependencies that are licensed solely under the GPL
  • So this issue is kinda pointless now, as the ICC profiles are absolutely necessary to make a meaningful CMYK→sRGB conversion

2017-03-05

  • Look into helping developers from landportal.info with a query for items related to LAND on the REST API
  • They want something like the items that are returned by the general “LAND” query in the search interface, but we cannot do that
  • We can only return specific results for metadata fields, like:
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
# List any additional prefixes that need to be managed by this handle server
# (as for examle handle prefix coming from old dspace repository merged in
# that repository)
# handle.additional.prefixes = prefix1[, prefix2]
  • Because of this I noticed that our Handle server’s config.dct was potentially misconfigured!
  • We had some default values still present:
"300:0.NA/YOUR_NAMING_AUTHORITY"
  • I’ve changed them to the following and restarted the handle server:
"300:0.NA/10568"
  • In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk
  • From dspace/config/crosswalks/google-metadata.properties:
google.citation_doi = cg.identifier.doi
  • This works, and makes DSpace output the following metadata on the item view page:
<meta content="https://dx.doi.org/10.1186/s13059-017-1153-y" name="citation_doi">

2017-03-06

  • Someone on the mailing list said that handle.plugin.checknameauthority should be false if we’re using multiple handle prefixes

2017-03-07

  • I set up a top-level community as a test for the CGIAR Library and imported one item with the the 10947 handle prefix
  • When testing the Handle resolver locally it shows the item to be on the local repository
  • So this seems to work, with the following caveats:
    • New items will have the default handle
    • Communities and collections will have the default handle
    • Only items imported manually can have the other handles
  • I need to talk to Michael and Peter to share the news, and discuss the structure of their community(s) and try some actual test data
  • We’ll need to do some data cleaning to make sure they are using the same fields we are, like dc.type and cg.identifier.status
  • Another thing is that the import process creates new dc.date.accessioned and dc.date.available fields, so we end up with duplicates (is it important to preserve the originals for these?)
  • Report DS-3520 issue to Atmire

2017-03-08

  • Merge the author separator changes to 5_x-prod, as everyone has responded positively about it, and it’s the default in Mirage2 afterall!
  • Cherry pick the commons-collections patch from DSpace’s dspace-5_x branch to address DS-3520: https://jira.duraspace.org/browse/DS-3520

2017-03-09

  • Export list of sponsors so Peter can clean it up:
dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
COPY 285

2017-03-12

  • Test the sponsorship fixes and deletes from Peter:
$ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
$ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
  • Generate a new list of unique sponsors so we can update the controlled vocabulary:
dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;

Livestock CRP theme

2017-03-15

2017-03-16

  • Merge pull request for PABRA subjects: https://github.com/ilri/DSpace/pull/310
  • Abenet and Peter say we can add them to Discovery, Atmire modules, etc, but I might not have time to do it now
  • Help Sisay with RTB theme again
  • Remove ICARDA subject from Discovery sidebar facets: https://github.com/ilri/DSpace/pull/312
  • Remove ICARDA subject from Browse and item submission form: https://github.com/ilri/DSpace/pull/313
  • Merge the CCAFS Phase II changes but hold off on doing the flagship metadata updates until Macaroni Bros gets their importer updated
  • Deploy latest changes and investor fixes/deletions on CGSpace
  • Run system updates on CGSpace and reboot server

2017-03-20

2017-03-24

  • Still helping Sisay try to figure out how to create a theme for the RTB community

2017-03-28

  • CCAFS said they are ready for the flagship updates for Phase II to be run (cg.subject.ccafs), so I ran them on CGSpace:
$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
  • We’ve been waiting since February to run these
  • Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;

2017-03-29

  • Dump a list of fields in the DC and CG schemas to compare with CG Core:
dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
  • Ooh, a better one!
dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);

2017-03-30

  • Adjust the Linode CPU usage alerts for the CGSpace server from 150% to 200%, as generally the nightly Solr indexing causes a usage around 150–190%, so this should make the alerts less regular
  • Adjust the threshold for DSpace Test from 90 to 100%