mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-04 19:24:54 +01:00
527 lines
28 KiB
Markdown
527 lines
28 KiB
Markdown
---
|
|
title: "December, 2018"
|
|
date: 2018-12-02T02:09:30+02:00
|
|
author: "Alan Orth"
|
|
categories: ["Notes"]
|
|
---
|
|
|
|
## 2018-12-01
|
|
|
|
- Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
|
|
- I manually installed OpenJDK, then removed Oracle JDK, then re-ran the [Ansible playbook](http://github.com/ilri/rmg-ansible-public) to update all configuration files, etc
|
|
- Then I ran all system updates and restarted the server
|
|
|
|
## 2018-12-02
|
|
|
|
- I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another [Ghostscript vulnerability last week](https://usn.ubuntu.com/3831-1/)
|
|
|
|
<!--more-->
|
|
|
|
- The error when I try to manually run the media filter for one item from the command line:
|
|
|
|
```
|
|
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
|
|
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
|
|
at org.im4java.core.Info.getBaseInfo(Info.java:360)
|
|
at org.im4java.core.Info.<init>(Info.java:151)
|
|
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
|
|
at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
|
|
at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
|
|
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
|
|
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
|
|
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
|
|
at java.lang.reflect.Method.invoke(Method.java:498)
|
|
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
|
|
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
|
|
```
|
|
|
|
- A comment on [StackOverflow question](https://stackoverflow.com/questions/53560755/ghostscript-9-26-update-breaks-imagick-readimage-for-multipage-pdf) from yesterday suggests it might be a bug with the `pngalpha` device in Ghostscript and [links to an upstream bug](https://bugs.ghostscript.com/show_bug.cgi?id=699815)
|
|
- I think we need to wait for a fix from Ubuntu
|
|
- For what it's worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
|
|
|
|
```
|
|
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
|
|
DEBUG: FC_WEIGHT didn't match
|
|
zsh: segmentation fault (core dumped) gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
|
|
```
|
|
|
|
- When I replace the `pngalpha` device with `png16m` as suggested in the StackOverflow comments it works:
|
|
|
|
```
|
|
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
|
|
DEBUG: FC_WEIGHT didn't match
|
|
```
|
|
|
|
- Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend ([IITA_Dec_1_1997 aka Daniel1807](https://dspacetest.cgiar.org/handle/10568/108298))
|
|
- One item missing the authorship type
|
|
- Some invalid countries (smart quotes, mispellings)
|
|
- Added countries to some items that mentioned research in particular countries in their abstracts
|
|
- One item had "MADAGASCAR" for ISI Journal
|
|
- Minor corrections in IITA subject (LIVELIHOOD→LIVELIHOODS)
|
|
- Trim whitespace in abstract field
|
|
- Fix some sponsors (though some with "Governments of Canada" etc I'm not sure why those are plural)
|
|
- Eighteen items had `en||fr` for the language, but the content was only in French so changed them to just `fr`
|
|
- Six items had encoding errors in French text so I will ask Bosede to re-do them carefully
|
|
- Correct and normalize a few AGROVOC subjects
|
|
- Expand my "encoding error" detection GREL to include `~` as I saw a lot of that in some copy pasted French text recently:
|
|
|
|
```
|
|
or(
|
|
isNotNull(value.match(/.*\uFFFD.*/)),
|
|
isNotNull(value.match(/.*\u00A0.*/)),
|
|
isNotNull(value.match(/.*\u200A.*/)),
|
|
isNotNull(value.match(/.*\u2019.*/)),
|
|
isNotNull(value.match(/.*\u00b4.*/)),
|
|
isNotNull(value.match(/.*\u007e.*/))
|
|
)
|
|
```
|
|
|
|
## 2018-12-03
|
|
|
|
- I looked at the DSpace Ghostscript issue more and it seems to only affect certain PDFs...
|
|
- I can successfully generate a thumbnail for another recent item ([10568/98394](https://hdl.handle.net/10568/98394)), but not for [10568/98930](https://hdl.handle.net/10568/98390)
|
|
- Even manually on my Arch Linux desktop with ghostscript 9.26-1 and the `pngalpha` device, I can generate a thumbnail for the first one (10568/98394):
|
|
|
|
```
|
|
$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
|
|
```
|
|
|
|
- So it seems to be something about the PDFs themselves, perhaps related to alpha support?
|
|
- The first item (10568/98394) has the following information:
|
|
|
|
```
|
|
$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
|
|
Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
|
|
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
|
|
```
|
|
|
|
- And wow, I can't even run ImageMagick's `identify` on the first page of the second item (10568/98930):
|
|
|
|
```
|
|
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
|
|
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
|
|
```
|
|
|
|
- But with GraphicsMagick's `identify` it works:
|
|
|
|
```
|
|
$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
|
|
DEBUG: FC_WEIGHT didn't match
|
|
Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
|
|
```
|
|
|
|
- Interesting that ImageMagick's `identify` *does* work if you do not specify a page, perhaps as [alluded to in the recent Ghostscript bug report](https://bugs.ghostscript.com/show_bug.cgi?id=699815):
|
|
|
|
```
|
|
$ identify Food\ safety\ Kenya\ fruits.pdf
|
|
Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
|
|
Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
|
|
Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
|
|
Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
|
|
Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
|
|
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
|
|
```
|
|
|
|
- As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):
|
|
|
|
```
|
|
$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
|
|
zsh: abort (core dumped) convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
|
|
$ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
|
|
DEBUG: FC_WEIGHT didn't match
|
|
```
|
|
|
|
- I inspected the troublesome PDF using [jhove](http://jhove.openpreservation.org/) and noticed that it is using `ISO PDF/A-1, Level B` and the other one doesn't list a profile, though I don't think this is relevant
|
|
- I found another item that fails when generating a thumbnail ([10568/98391](https://hdl.handle.net/10568/98391), DSpace complains:
|
|
|
|
```
|
|
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
|
|
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
|
|
at org.im4java.core.Info.getBaseInfo(Info.java:360)
|
|
at org.im4java.core.Info.<init>(Info.java:151)
|
|
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
|
|
at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:24)
|
|
at org.dspace.app.mediafilter.FormatFilter.processBitstream(FormatFilter.java:170)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:475)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:429)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:401)
|
|
at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:237)
|
|
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
|
|
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
|
|
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
|
|
at java.lang.reflect.Method.invoke(Method.java:498)
|
|
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
|
|
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
|
|
Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
|
|
at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
|
|
at org.im4java.core.Info.getBaseInfo(Info.java:342)
|
|
... 14 more
|
|
Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
|
|
at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
|
|
at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
|
|
at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
|
|
... 15 more
|
|
```
|
|
|
|
- And on my Arch Linux environment ImageMagick's `convert` also segfaults:
|
|
|
|
```
|
|
$ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
|
|
zsh: abort (core dumped) convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] x60
|
|
```
|
|
|
|
- But GraphicsMagick's `convert` works:
|
|
|
|
```
|
|
$ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
|
|
```
|
|
|
|
- So far the only thing that stands out is that the two files that don't work were created with Microsoft Office 2016:
|
|
|
|
```
|
|
$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
|
|
Creator: Microsoft® Word 2016
|
|
Producer: Microsoft® Word 2016
|
|
$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
|
|
Creator: Microsoft® Word 2016
|
|
Producer: Microsoft® Word 2016
|
|
```
|
|
|
|
- And the one that works was created with Office 365:
|
|
|
|
```
|
|
$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
|
|
Creator: Microsoft® Word for Office 365
|
|
Producer: Microsoft® Word for Office 365
|
|
```
|
|
|
|
- I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:
|
|
|
|
```
|
|
$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
|
|
$ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
|
|
```
|
|
|
|
- I've tried a few times this week to register for the [Ethiopian eVisa website](https://www.evisa.gov.et/), but it is never successful
|
|
- In the end I tried one last time to just apply without registering and it was apparently successful
|
|
- Testing DSpace 5.8 (`5_x-prod` branch) in an Ubuntu 18.04 VM with Tomcat 8.5 and had some issues:
|
|
- JSPUI shows an internal error (log shows something about tag cloud, though, so might be unrelated)
|
|
- Atmire Listings and Reports, which use JSPUI, asks you to log in again and then doesn't work
|
|
- Content and Usage Analysis doesn't show up in the sidebar after logging in
|
|
- I can navigate to [/atmire/reporting-suite/usage-graph-editor](https://dspacetest.cgiar.org/atmire/reporting-suite/usage-graph-editor), but it's only the Atmire theme and a "page not found" message
|
|
- Related messages from dspace.log:
|
|
|
|
```
|
|
2018-12-03 15:44:00,030 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
|
|
2018-12-03 15:44:03,390 ERROR com.atmire.app.webui.servlet.ExportServlet @ Error converter plugin not found: interface org.infoCon.ConverterPlugin
|
|
...
|
|
2018-12-03 15:45:01,667 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-listing-and-reports not found
|
|
```
|
|
|
|
- I tested it on my local environment with Tomcat 8.5.34 and the JSPUI application still has an error (again, the logs show something about tag cloud, so be unrelated), and the Listings and Reports still asks you to log in again, despite already being logged in in XMLUI, but does appear to work (I generated a report and exported a PDF)
|
|
- I think the errors about missing Atmire components must be important, here on my local machine as well (though not the one about atmire-listings-and-reports):
|
|
|
|
```
|
|
2018-12-03 16:44:00,009 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
|
|
```
|
|
|
|
- This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness...?
|
|
|
|
## 2018-12-04
|
|
|
|
- Last night Linode sent a message that the load on CGSpace (linode18) was too high, here's a list of the top users at the time and throughout the day:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
225 40.77.167.142
|
|
226 66.249.64.63
|
|
232 46.101.86.248
|
|
285 45.5.186.2
|
|
333 54.70.40.11
|
|
411 193.29.13.85
|
|
476 34.218.226.147
|
|
962 66.249.70.27
|
|
1193 35.237.175.180
|
|
1450 2a01:4f8:140:3192::2
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
1141 207.46.13.57
|
|
1299 197.210.168.174
|
|
1341 54.70.40.11
|
|
1429 40.77.167.142
|
|
1528 34.218.226.147
|
|
1973 66.249.70.27
|
|
2079 50.116.102.77
|
|
2494 78.46.79.71
|
|
3210 2a01:4f8:140:3192::2
|
|
4190 35.237.175.180
|
|
```
|
|
|
|
- `35.237.175.180` is known to us (CCAFS?), and I've already added it to the list of bot IPs in nginx, which appears to be working:
|
|
|
|
```
|
|
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
|
|
4772
|
|
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
|
|
630
|
|
```
|
|
|
|
- I haven't seen `2a01:4f8:140:3192::2` before. Its user agent is some new bot:
|
|
|
|
```
|
|
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
|
|
```
|
|
|
|
- At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:
|
|
|
|
```
|
|
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
|
|
5111
|
|
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
|
|
419
|
|
```
|
|
|
|
- `78.46.79.71` is another host on Hetzner with the following user agent:
|
|
|
|
```
|
|
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
|
|
```
|
|
|
|
- This is not the first time a host on Hetzner has used a "normal" user agent to make thousands of requests
|
|
- At least it is re-using its Tomcat sessions somehow:
|
|
|
|
```
|
|
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
|
|
2044
|
|
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
|
|
1
|
|
```
|
|
|
|
- In other news, it's good to see my re-work of the database connectivity in the [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) actually caused a reduction of persistent database connections (from 1 to 0, but still!):
|
|
|
|
![PostgreSQL connections day](/cgspace-notes/2018/12/postgres_connections_db-month.png)
|
|
|
|
## 2018-12-05
|
|
|
|
- Discuss RSS issues with IWMI and WLE people
|
|
|
|
## 2018-12-06
|
|
|
|
- Linode sent a message that the CPU usage of CGSpace (linode18) is too high last night
|
|
- I looked in the logs and there's nothing particular going on:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
1225 157.55.39.177
|
|
1240 207.46.13.12
|
|
1261 207.46.13.101
|
|
1411 207.46.13.157
|
|
1529 34.218.226.147
|
|
2085 50.116.102.77
|
|
3334 2a01:7e00::f03c:91ff:fe0a:d645
|
|
3733 66.249.70.27
|
|
3815 35.237.175.180
|
|
7669 54.70.40.11
|
|
```
|
|
|
|
- `54.70.40.11` is some new bot with the following user agent:
|
|
|
|
```
|
|
Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
|
|
```
|
|
|
|
- But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:
|
|
|
|
```
|
|
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
|
|
6980
|
|
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
|
|
1156
|
|
```
|
|
|
|
- `2a01:7e00::f03c:91ff:fe0a:d645` appears to be the CKM dev server where Danny is testing harvesting via Drupal
|
|
- It seems they are hitting the XMLUI's OpenSearch a bit, but mostly on the REST API so no issues here yet
|
|
- `Drupal` is already in the Tomcat Crawler Session Manager Valve's regex so that's good!
|
|
|
|
## 2018-12-10
|
|
|
|
- I ran into Mia Signs in Addis and we discussed Altmetric as well as RSS feeds again
|
|
- We came up with an [OpenSearch query for all items tagged with the WLE CRP subject](https://cgspace.cgiar.org/open-search/discover?query=crpsubject:Water,+Land+and+Ecosystems&sort_by=3&order=DESC) (where the `sort_by=3` parameter is the accession date, as configured in `dspace.cfg`)
|
|
- About Altmetric she was wondering why some low-ranking items of theirs do not have the Handle/DOI relationship, but high-ranking ones do
|
|
- It sounds kinda crazy, but she said when she talked to Altmetric about their Twitter harvesting they said their coverage is not perfect, so it might be some kinda prioritization thing where they only do it for popular items?
|
|
- I am testing this by [tweeting](https://twitter.com/mralanorth/status/1072153586342211584) one [WLE item from CGSpace](https://cgspace.cgiar.org/handle/10568/98380) that currently has no Altmetric score
|
|
- Interestingly, after about an hour I see it has already been [picked up by Altmetric](https://cgspace.altmetric.com/details/50160871/twitter) and has my tweet as well as some other tweet from over a month ago...
|
|
- I [tweeted a link to the item's DOI](https://twitter.com/mralanorth/status/1072198292182892545) to see if Altmetric will notice it, hopefully associated with the Handle I tweeted earlier
|
|
|
|
## 2018-12-11
|
|
|
|
- I checked the [latest tweet of the IWMI item with a DOI](https://twitter.com/mralanorth/status/1072198292182892545) and it was [picked up by Altmetric](https://cgspace.altmetric.com/details/50160871/twitter)
|
|
- There is one [curious other tweet](twitter.com/ArboNews/statuses/1055036747787223042) from another user where they used the NCBI link, and that is also associated with our Handle on Altmetric
|
|
- So that means Altmetric is picking up the DOI from the NCBI metadata and making the association properly
|
|
|
|
## 2018-12-13
|
|
|
|
- Oh this is very interesting: [WorldFish's repository is live now](https://digitalarchive.worldfishcenter.org)
|
|
- It's running DSpace 5.9-SNAPSHOT running on KnowledgeArc and the OAI and REST interfaces are active at least
|
|
- Also, I notice they ended up registering a Handle (they had been considering taking KnowledgeArc's advice to *not* use Handles!)
|
|
- Did some coordination work on the hotel bookings for the January AReS workshop in Amman
|
|
|
|
## 2018-12-17
|
|
|
|
- Linode alerted me twice today that the load on CGSpace (linode18) was very high
|
|
- Looking at the nginx logs I see a few new IPs in the top 10:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "17/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
927 157.55.39.81
|
|
975 54.70.40.11
|
|
2090 50.116.102.77
|
|
2121 66.249.66.219
|
|
3811 35.237.175.180
|
|
4590 205.186.128.185
|
|
4590 70.32.83.92
|
|
5436 2a01:4f8:173:1e85::2
|
|
5438 143.233.227.216
|
|
6706 94.71.244.172
|
|
```
|
|
|
|
- `94.71.244.172` and `143.233.227.216` are both in Greece and use the following user agent:
|
|
|
|
```
|
|
Mozilla/3.0 (compatible; Indy Library)
|
|
```
|
|
|
|
- I see that I added this bot to the Tomcat Crawler Session Manager valve in 2017-12 so its XMLUI sessions are getting re-used
|
|
- `2a01:4f8:173:1e85::2` is some new bot called `BLEXBot/1.0` which should be matching the existing "bot" pattern in the Tomcat Crawler Session Manager regex
|
|
|
|
## 2018-12-18
|
|
|
|
- Open a [ticket](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657) with Atmire to ask them to prepare the Metadata Quality Module for our DSpace 5.8 code
|
|
|
|
## 2018-12-19
|
|
|
|
- Update Atmire Listings and Reports to add the journal title (`dc.source`) to bibliography and update example bibliography values ([#405](https://github.com/ilri/DSpace/pull/405))
|
|
|
|
## 2018-12-20
|
|
|
|
- Testing compression of PostgreSQL backups with xz and gzip:
|
|
|
|
```
|
|
$ time xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz
|
|
xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz 48.29s user 0.19s system 99% cpu 48.579 total
|
|
$ time gzip -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.gz
|
|
gzip -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.gz 2.78s user 0.09s system 99% cpu 2.899 total
|
|
$ ls -lh cgspace_2018-12-19.backup*
|
|
-rw-r--r-- 1 aorth aorth 96M Dec 19 02:15 cgspace_2018-12-19.backup
|
|
-rw-r--r-- 1 aorth aorth 94M Dec 20 11:36 cgspace_2018-12-19.backup.gz
|
|
-rw-r--r-- 1 aorth aorth 93M Dec 20 11:35 cgspace_2018-12-19.backup.xz
|
|
```
|
|
|
|
- Looks like it's really not worth it...
|
|
- Peter pointed out that Discovery filters for CTA subjects on item pages were not working
|
|
- It looks like there were some mismatches in the Discovery index names and the XMLUI configuration, so I fixed them ([#406](https://github.com/ilri/DSpace/pull/406))
|
|
- Peter asked if we could create a controlled vocabulary for publishers (`dc.publisher`)
|
|
- I see we have about 3500 distinct publishers:
|
|
|
|
```
|
|
# SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
|
|
count
|
|
-------
|
|
3522
|
|
(1 row)
|
|
```
|
|
|
|
- I reverted the metadata changes related to "Unrestricted Access" and "Restricted Access" on DSpace Test because we're not pushing forward with the new status terms for now
|
|
- Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:
|
|
|
|
```
|
|
# dpkg -P oracle-java8-installer oracle-java8-set-default
|
|
```
|
|
|
|
- Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:
|
|
|
|
```
|
|
$ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
|
|
Connected to database.
|
|
Fixed 466 occurences of: Copyrighted; Any re-use allowed
|
|
```
|
|
|
|
- Upgrade PostgreSQL on CGSpace (linode18) from 9.5 to 9.6:
|
|
|
|
```
|
|
# apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
|
|
# pg_ctlcluster 9.5 main stop
|
|
# tar -cvzpf var-lib-postgresql-9.5.tar.gz /var/lib/postgresql/9.5
|
|
# tar -cvzpf etc-postgresql-9.5.tar.gz /etc/postgresql/9.5
|
|
# pg_ctlcluster 9.6 main stop
|
|
# pg_dropcluster 9.6 main
|
|
# pg_upgradecluster 9.5 main
|
|
# pg_dropcluster 9.5 main
|
|
# dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
|
|
```
|
|
|
|
- I've been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments
|
|
- Run all system updates on CGSpace (linode18) and restart the server
|
|
- Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:
|
|
|
|
```
|
|
$ dspace cleanup -v
|
|
- Deleting bitstream information (ID: 158227)
|
|
- Deleting bitstream record from database (ID: 158227)
|
|
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
|
|
Detail: Key (bitstream_id)=(158227) is still referenced from table "bundle".
|
|
...
|
|
```
|
|
|
|
- As always, the solution is to delete those IDs manually in PostgreSQL:
|
|
|
|
```
|
|
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
|
|
UPDATE 1
|
|
```
|
|
|
|
- After all that I started a full Discovery reindex to get the index name changes and rights updates
|
|
|
|
## 2018-12-29
|
|
|
|
- CGSpace went down today for a few minutes while I was at dinner and I quickly restarted Tomcat
|
|
- The top IP addresses as of this evening are:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
963 40.77.167.152
|
|
987 35.237.175.180
|
|
1062 40.77.167.55
|
|
1464 66.249.66.223
|
|
1660 34.218.226.147
|
|
1801 70.32.83.92
|
|
2005 50.116.102.77
|
|
3218 66.249.66.219
|
|
4608 205.186.128.185
|
|
5585 54.70.40.11
|
|
```
|
|
|
|
- And just around the time of the alert:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E "29/Dec/2018:1(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
115 66.249.66.223
|
|
118 207.46.13.14
|
|
123 34.218.226.147
|
|
133 95.108.181.88
|
|
137 35.237.175.180
|
|
164 66.249.66.219
|
|
260 157.55.39.59
|
|
291 40.77.167.55
|
|
312 207.46.13.129
|
|
1253 54.70.40.11
|
|
```
|
|
|
|
- All these look ok (`54.70.40.11` is known to us from earlier this month and should be reusing its Tomcat sessions)
|
|
- So I'm not sure what was going on last night...
|
|
|
|
<!-- vim: set sw=2 ts=2: -->
|