CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

December, 2018

2018-12-01

  • Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
  • I manually installed OpenJDK, removed Oracle JDK, and then re-ran the Ansible playbook to update all configuration files, etc.
  • Then I ran all system updates and restarted the server

2018-12-02

  • This is the error I get when I try to manually run the media filter for one item from the command line:
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
        at org.im4java.core.Info.getBaseInfo(Info.java:360)
        at org.im4java.core.Info.<init>(Info.java:151)
        at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
        at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
        at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
  • A comment on a StackOverflow question from yesterday suggests it might be a bug with the pngalpha device in Ghostscript and links to an upstream bug
  • I think we need to wait for a fix from Ubuntu
  • For what it’s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
zsh: segmentation fault (core dumped)  gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
  • When I replace the pngalpha device with png16m as suggested in the StackOverflow comments it works:
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
  • Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (IITA_Dec_1_1997 aka Daniel1807)
    • One item missing the authorship type
    • Some invalid countries (smart quotes, misspellings)
    • Added countries to some items that mentioned research in particular countries in their abstracts
    • One item had “MADAGASCAR” for ISI Journal
    • Minor corrections in IITA subject (LIVELIHOOD→LIVELIHOODS)
    • Trim whitespace in abstract field
    • Fix some sponsors (though I’m not sure why some of them, like “Governments of Canada”, are plural)
    • Eighteen items had en||fr for the language, but the content was only in French, so I changed them to just fr
    • Six items had encoding errors in French text so I will ask Bosede to re-do them carefully
    • Correct and normalize a few AGROVOC subjects
  • Expand my “encoding error” detection GREL to include ~ (U+007E), as I saw a lot of those in some copy-pasted French text recently:
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/)),
  isNotNull(value.match(/.*\u007e.*/))
)
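  • For checking values outside OpenRefine, the same test can be sketched in Python. This is a rough equivalent under my own assumptions, not the GREL itself: the name has_encoding_suspect is hypothetical, and the character set is copied from the expression above.

```python
import re

# Same code points as the GREL expression above:
# U+FFFD replacement character, U+00A0 no-break space, U+200A hair space,
# U+2019 right single quotation mark, U+00B4 acute accent, U+007E tilde.
SUSPECT_CHARS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4\u007E]")


def has_encoding_suspect(value):
    """Return True if value contains any of the suspect characters.

    Hypothetical helper: None is treated as "nothing suspicious", which is
    an assumption made here for empty cells.
    """
    if value is None:
        return False
    return SUSPECT_CHARS.search(value) is not None
```

  • A value such as "Côte d’Ivoire" typed with a curly apostrophe (U+2019) would be flagged, while a plain ASCII subject like "LIVELIHOODS" would not. A flag only means the value deserves a look, since some of these characters can be legitimate.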

2018-12-03

  • I looked at the DSpace Ghostscript issue more and it seems to only affect certain PDFs…
  • I can successfully generate a thumbnail for another recent item (1056898394), but not for 1056898930
  • Even manually on my Arch Linux desktop with ghostscript 9.26-1 and the pngalpha device, I can generate a thumbnail for the first one (1056898394):
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
  • So it seems to be something about the PDFs themselves, perhaps related to alpha support?
  • The first item (1056898394) has the following information:
$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
  • And wow, I can’t even run ImageMagick’s identify on the first page of the second item (1056898930):
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
  • But with GraphicsMagick’s identify it works:
$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
DEBUG: FC_WEIGHT didn't match
Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
  • ImageMagick’s identify also works on this file when I don’t specify a page index:
$ identify Food\ safety\ Kenya\ fruits.pdf
Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
  • As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):
$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
zsh: abort (core dumped)  convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
$ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
DEBUG: FC_WEIGHT didn't match
  • I inspected the troublesome PDF using jhove and noticed that it declares the ISO PDF/A-1, Level B profile, while the other one doesn’t list a profile, though I don’t think this is relevant
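  • The manual testing above (pngalpha crashes on some PDFs, png16m works) could be automated with a small fallback wrapper. This is a minimal sketch, not how DSpace or ImageMagick invoke Ghostscript: the function names and the injectable run parameter are hypothetical, and the flags are copied from the gs commands in these notes.

```python
import subprocess


def gs_command(pdf_path, output_pattern, device):
    """Build the gs argument list used in the notes, varying only the device."""
    return [
        "gs", "-q", "-dQUIET", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dNOPROMPT",
        "-dMaxBitmap=500000000", "-dAlignToPixels=0", "-dGridFitTT=2",
        f"-sDEVICE={device}",
        "-dTextAlphaBits=4", "-dGraphicsAlphaBits=4", "-r72x72",
        "-dFirstPage=1", "-dLastPage=1",
        f"-sOutputFile={output_pattern}",
        f"-f{pdf_path}",
    ]


def crashed(returncode):
    """On POSIX, subprocess reports death by signal (e.g. SIGSEGV) as a negative code."""
    return returncode < 0


def render_first_page(pdf_path, output_pattern,
                      devices=("pngalpha", "png16m"), run=subprocess.run):
    """Try each device in order and return the first one that exits with 0.

    Returns None if every device fails. The run parameter exists only so the
    function can be exercised without Ghostscript installed (an assumption of
    this sketch, not something from the notes).
    """
    for device in devices:
        result = run(gs_command(pdf_path, output_pattern, device))
        if result.returncode == 0:
            return device
        if crashed(result.returncode):
            print(f"gs was killed by signal {-result.returncode} using {device}")
    return None
```

  • Calling render_first_page("Food safety Kenya fruits.pdf", "/tmp/out%d") would first try pngalpha and only fall back to png16m if that fails. The trade-off is that png16m output has no alpha channel.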