cgspace-notes/content/posts/2018-12.md

title: December, 2018
date: 2018-12-02T02:09:30+02:00
author: Alan Orth
tags: Notes

2018-12-01

  • Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
  • I manually installed OpenJDK, then removed Oracle JDK, then re-ran the Ansible playbook to update all configuration files, etc
  • Then I ran all system updates and restarted the server

2018-12-02

  • This is the error I get when I try to manually run the media filter for one item from the command line:
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
        at org.im4java.core.Info.getBaseInfo(Info.java:360)
        at org.im4java.core.Info.<init>(Info.java:151)
        at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
        at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
        at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
  • A comment on a StackOverflow question from yesterday suggests it might be a bug with the pngalpha device in Ghostscript and links to an upstream bug
  • I think we need to wait for a fix from Ubuntu
  • For what it's worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
zsh: segmentation fault (core dumped)  gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
  • When I replace the pngalpha device with png16m as suggested in the StackOverflow comments it works:
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
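  • The device swap above could be wrapped in a small fallback helper. This is only a sketch: the function name render_first_page is hypothetical, the flags are copied from the commands above, and it assumes that a non-zero exit status from gs (including a crash) is a good enough signal to retry with png16m:

```shell
#!/bin/sh
# render_first_page PDF OUTPUT_PREFIX   (hypothetical helper name)
# Try Ghostscript's pngalpha device first and fall back to png16m if gs
# exits with a non-zero status, for example after a segmentation fault.
# Prints the name of the device that succeeded.
render_first_page() {
    pdf=$1
    out=$2
    for device in pngalpha png16m; do
        if gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT \
              -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 \
              -sDEVICE="$device" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 \
              -r72x72 -dFirstPage=1 -dLastPage=1 \
              -sOutputFile="${out}%d" -f"$pdf" >/dev/null 2>&1; then
            echo "$device"
            return 0
        fi
    done
    # Neither device worked
    return 1
}
```

    Usage would look like `render_first_page "Food safety Kenya fruits.pdf" /tmp/out`. Note that png16m has no alpha channel, so the fallback output is not identical to what pngalpha would produce.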
  • Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (IITA_Dec_1_1997 aka Daniel1807)
    • One item missing the authorship type
    • Some invalid countries (smart quotes, misspellings)
    • Added countries to some items that mentioned research in particular countries in their abstracts
    • One item had "MADAGASCAR" for ISI Journal
    • Minor corrections in IITA subject (LIVELIHOOD→LIVELIHOODS)
    • Trim whitespace in abstract field
    • Fix some sponsors (though I'm not sure why some, like "Governments of Canada", are plural)
    • Eighteen items had en||fr for the language, but the content was only in French so I changed them to just fr
    • Six items had encoding errors in French text so I will ask Bosede to re-do them carefully
    • Correct and normalize a few AGROVOC subjects
  • Expand my "encoding error" detection GREL to include ~ as I saw a lot of that in some copy-pasted French text recently:
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/)),
  isNotNull(value.match(/.*\u007e.*/))
)
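  • For checking values outside OpenRefine, a rough shell counterpart of the GREL expression above might look like the following. It is a sketch under assumptions: the function name has_encoding_error is hypothetical, the input is assumed to be UTF-8, and the characters are matched as raw byte sequences rather than as code points:

```shell
#!/bin/sh
# has_encoding_error TEXT   (hypothetical helper name)
# Exits 0 if TEXT contains any of the suspicious characters from the GREL
# expression, matched as UTF-8 byte sequences (octal escapes) in the C locale.
has_encoding_error() {
    fffd=$(printf '\357\277\275')   # U+FFFD replacement character
    nbsp=$(printf '\302\240')       # U+00A0 no-break space
    hair=$(printf '\342\200\212')   # U+200A hair space
    rsquo=$(printf '\342\200\231')  # U+2019 right single quotation mark
    acute=$(printf '\302\264')      # U+00B4 acute accent
    # -F treats each pattern as a fixed string, -q only sets the exit status
    printf '%s\n' "$1" | LC_ALL=C grep -q -F \
        -e "$fffd" -e "$nbsp" -e "$hair" -e "$rsquo" -e "$acute" -e '~'
}
```

    For example, `has_encoding_error "$value" && echo "check: $value"` would flag a value for manual review.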

2018-12-03

  • I looked at the DSpace Ghostscript issue more and it seems to only affect certain PDFs...
  • I can successfully generate a thumbnail for another recent item (10568/98394), but not for 10568/98930
  • Even manually on my Arch Linux desktop with ghostscript 9.26-1 and the pngalpha device, I can generate a thumbnail for the first one (10568/98394):
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
  • So it seems to be something about the PDFs themselves, perhaps related to alpha support?
  • The first item (10568/98394) has the following information:
$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
  • And wow, I can't even run ImageMagick's identify on the first page of the second item (10568/98930):
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
  • But with GraphicsMagick's identify it works:
$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
DEBUG: FC_WEIGHT didn't match
Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
$ identify Food\ safety\ Kenya\ fruits.pdf
Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
  • As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):
$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
zsh: abort (core dumped)  convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
$ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
DEBUG: FC_WEIGHT didn't match
  • I inspected the troublesome PDF using jhove and noticed that it is using ISO PDF/A-1, Level B and the other one doesn't list a profile, though I don't think this is relevant
  • I found another item that fails when generating a thumbnail (10568/98391); DSpace complains:
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
        at org.im4java.core.Info.getBaseInfo(Info.java:360)
        at org.im4java.core.Info.<init>(Info.java:151)
        at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
        at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
        at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
        at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
        at org.im4java.core.Info.getBaseInfo(Info.java:342)
        ... 14 more
Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
        at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
        at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
        at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
        ... 15 more
  • And on my Arch Linux environment ImageMagick's convert also crashes:
$ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
zsh: abort (core dumped)  convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\]  x60
  • But GraphicsMagick's convert works:
$ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
  • So far the only thing that stands out is that the two files that don't work were created with Microsoft Office 2016:
$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
Creator:        Microsoft® Word 2016
Producer:       Microsoft® Word 2016
$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
Creator:        Microsoft® Word 2016
Producer:       Microsoft® Word 2016
  • And the one that works was created with Office 365:
$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
Creator:        Microsoft® Word for Office 365
Producer:       Microsoft® Word for Office 365
  • I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:
$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
$ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
  • I've tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
  • In the end I tried one last time to just apply without registering and it was apparently successful
  • Testing DSpace 5.8 (5_x-prod branch) in an Ubuntu 18.04 VM with Tomcat 8.5 and had some issues:
    • JSPUI shows an internal error (log shows something about tag cloud, though, so might be unrelated)
    • Atmire Listings and Reports, which uses JSPUI, asks you to log in again and then doesn't work
    • Content and Usage Analysis doesn't show up in the sidebar after logging in
    • I can navigate to /atmire/reporting-suite/usage-graph-editor, but it's only the Atmire theme and a "page not found" message
    • Related messages from dspace.log:
2018-12-03 15:44:00,030 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2018-12-03 15:44:03,390 ERROR com.atmire.app.webui.servlet.ExportServlet @ Error converter plugin not found: interface org.infoCon.ConverterPlugin
...
2018-12-03 15:45:01,667 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-listing-and-reports not found
  • I tested it on my local environment with Tomcat 8.5.34 and the JSPUI application still has an error (again, the logs show something about tag cloud, so it might be unrelated). Listings and Reports still asks you to log in again, despite already being logged in to XMLUI, but it does appear to work (I generated a report and exported a PDF)
  • I think the errors about missing Atmire components must be important; they appear here on my local machine as well (though not the one about atmire-listing-and-reports):
2018-12-03 16:44:00,009 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
  • This has got to be partly Ubuntu's Tomcat packaging and partly DSpace 5.x's readiness for Tomcat 8.5...?

2018-12-04

  • Last night Linode sent a message that the load on CGSpace (linode18) was too high. Here's a list of the top users at the time and throughout the day:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    225 40.77.167.142
    226 66.249.64.63
    232 46.101.86.248
    285 45.5.186.2
    333 54.70.40.11
    411 193.29.13.85
    476 34.218.226.147
    962 66.249.70.27
   1193 35.237.175.180
   1450 2a01:4f8:140:3192::2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1141 207.46.13.57
   1299 197.210.168.174
   1341 54.70.40.11
   1429 40.77.167.142
   1528 34.218.226.147
   1973 66.249.70.27
   2079 50.116.102.77
   2494 78.46.79.71
   3210 2a01:4f8:140:3192::2
   4190 35.237.175.180
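  • The two pipelines above differ only in the date pattern, so they could be folded into one helper. A minimal sketch, assuming the client address is the first field of each log line (as in the awk calls above); the name top_clients is hypothetical:

```shell
#!/bin/sh
# top_clients PATTERN [N]   (hypothetical helper name)
# Reads access-log lines on stdin, keeps those matching the extended regex
# PATTERN, and prints the N busiest client addresses (default 10), busiest last.
top_clients() {
    pattern=$1
    n=${2:-10}
    grep -E "$pattern" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n "$n"
}
```

    It would be fed the same way as above, for example `zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | top_clients '03/Dec/2018:1(5|6|7|8)'`.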
  • 35.237.175.180 is known to us (CCAFS?), and I've already added it to the list of bot IPs in nginx, which appears to be working:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
4772
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
630
  • I haven't seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
  • At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
5111
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
419
  • 78.46.79.71 is another host on Hetzner with the following user agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
  • This is not the first time a host on Hetzner has used a "normal" user agent to make thousands of requests
  • At least it is re-using its Tomcat sessions somehow:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
2044
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
1
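  • The request and session counts above repeat the same two grep commands for each address. A small helper could do both in one call. This is a sketch, not something from the original workflow: the name session_reuse is hypothetical, and unlike the commands above it escapes the dots in the address so that they only match literal dots:

```shell
#!/bin/sh
# session_reuse IP LOGFILE   (hypothetical helper name)
# Prints "<requests> <unique sessions>" for one client address in a DSpace log.
session_reuse() {
    ip=$1
    log=$2
    # Escape dots so "." in an IPv4 address is not treated as "any character"
    ip_re=$(printf '%s' "$ip" | sed 's/\./\\./g')
    pattern="session_id=[A-Z0-9]{32}:ip_addr=${ip_re}"
    # Number of log lines that mention this address with a session ID
    requests=$(grep -c -E "$pattern" "$log")
    # Number of distinct session IDs used by this address
    sessions=$(grep -o -E "$pattern" "$log" | sort -u | wc -l | tr -d ' ')
    echo "$requests $sessions"
}
```

    For example, `session_reuse 78.46.79.71 dspace.log.2018-12-03` prints the two numbers on one line; a low second number relative to the first suggests the client is re-using its sessions.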
  • In other news, it's good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):

[Figure: PostgreSQL connections (day)]