Add notes for 2022-10-19

2022-10-19 21:32:01 +03:00
parent 7713ecefa8
commit 46a9178bdb
33 changed files with 183 additions and 34 deletions

@@ -555,4 +555,74 @@ $ ./ilri/fix-metadata-values.py -i 2022-10-18-update-initiatives.csv -db dspace
- I created a new "TIP test" collection under Alliance's community and added the users accordingly
- I think I'll be able to just add these two submit/approve users to the Alliance Admins and Alliance Editors groups once we're ready
## 2022-10-19
- I submitted a [bug report for the two-page portrait layout of some PDF thumbnails](https://bugs.ghostscript.com/show_bug.cgi?id=705994) on Ghostscript's bug tracker
- For reference, the thumbnail for PDFs like in [10568/116598](https://hdl.handle.net/10568/116598) looks like this:
![gs thumbnail](/cgspace-notes/2022/10/gs-10568-116598.pdf.jpg)
- In other news, I see `pdftocairo` from the poppler package produces a similar, though slightly prettier version of the thumbnail of that PDF:
![pdftocairo thumbnail](/cgspace-notes/2022/10/pdftocairo-10568-116598.pdf.jpg)
- I used the command:
```console
$ pdftocairo -jpeg -singlefile -f 1 -l 1 -scale-to-x 640 -scale-to-y -1 10568-116598.pdf thumb
```
- The Ghostscript developers responded in a few minutes (!) and explained that PDFs can contain many different "boxes":
> PDF files can have multiple different 'Box' values; ArtBox, BleedBox, CropBox, MediaBox and TrimBox. The MediaBox is required the other boxes are optional, a given PDF page description must contain the MediaBox and may contain any or all of the others.
>
> By default Ghostscript uses the MediaBox to determine the size of the media. Other PDF consumers may exhibit other behaviours.
>
> The pages in your PDF file contain all of the Boxes. In the majority of cases the Boxes all contain the same values (which makes their inclusion pointless of course). But for page 1 they differ:
>
> /CropBox[594.375 0.0 1190.55 839.176]
> /MediaBox[0.0 0.0 1190.55 841.89]
>
> You can tell Ghostscript to use a different Box value for the media by using one of -dUseArtBox, -dUseBleedBox, -dUseCropBox, -dUseTrimBox. If I specify -dUseCropBox then the file is rendered as you expect.
- I confirm that adding `-define pdf:use-cropbox=true` to the ImageMagick command produces a better thumbnail in this case
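- As a rough sketch of where that define sits in a thumbnail command: the helper below only builds the argument list. It assumes ImageMagick 6's `convert` binary (ImageMagick 7 uses `magick`); the function name, file names, and the `600x600` size are hypothetical, and this is not necessarily the command DSpace itself runs:

```python
def cropbox_thumbnail_cmd(pdf_path, out_path, size="600x600"):
    """Build an ImageMagick argv that renders page 1 of a PDF using its CropBox."""
    return [
        "convert",  # assumption: ImageMagick 6; use "magick" on ImageMagick 7
        # The define must come before the input file so it applies when the PDF is read
        "-define", "pdf:use-cropbox=true",
        f"{pdf_path}[0]",  # [0] selects the first page only
        "-thumbnail", size,  # hypothetical bounding box for the thumbnail
        "-flatten",  # composite onto the background colour
        out_path,
    ]


if __name__ == "__main__":
    cmd = cropbox_thumbnail_cmd("10568-116598.pdf", "thumb.jpg")
    # Print the command; pass the list to subprocess.run(cmd, check=True) to execute it
    print(" ".join(cmd))
```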
- We can check the boxes in a PDF using `pdfinfo` from the poppler package:
```console
$ pdfinfo -box data/10568-116598.pdf
Creator: Adobe InDesign 17.0 (Macintosh)
Producer: Adobe PDF Library 16.0.3
CreationDate: Tue Dec 7 12:44:46 2021 EAT
ModDate: Tue Dec 7 15:37:58 2021 EAT
Custom Metadata: no
Metadata Stream: yes
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 17
Encrypted: no
Page size: 596.175 x 839.176 pts
Page rot: 0
MediaBox: 0.00 0.00 1190.55 841.89
CropBox: 594.38 0.00 1190.55 839.18
BleedBox: 594.38 0.00 1190.55 839.18
TrimBox: 594.38 0.00 1190.55 839.18
ArtBox: 594.38 0.00 1190.55 839.18
File size: 572600 bytes
Optimized: no
PDF version: 1.6
```
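- In this case the MediaBox (1190.55 pts wide) is about twice the width of the CropBox (596.175 pts wide, matching the reported page size), so Ghostscript renders a two-page spread by default and we should use the CropBox instead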
- I wonder if we can check that from DSpace...
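- Outside of DSpace, the comparison could be scripted against the `pdfinfo -box` output shown above. This is a minimal sketch, not something run on CGSpace: the function names and the one-point tolerance are assumptions, and `SAMPLE` is an excerpt of the output above rather than a live call to `pdfinfo`:

```python
import re

# Matches the box lines printed by `pdfinfo -box`
BOX_RE = re.compile(r"^(MediaBox|CropBox|BleedBox|TrimBox|ArtBox):\s+(.+)$")


def parse_boxes(pdfinfo_output):
    """Return {box name: (x0, y0, x1, y1)} parsed from `pdfinfo -box` text."""
    boxes = {}
    for line in pdfinfo_output.splitlines():
        match = BOX_RE.match(line.strip())
        if match:
            name, values = match.groups()
            boxes[name] = tuple(float(v) for v in values.split())
    return boxes


def cropbox_differs(boxes, tolerance=1.0):
    """True if both boxes exist and any coordinate differs by more than `tolerance` points."""
    media, crop = boxes.get("MediaBox"), boxes.get("CropBox")
    if media is None or crop is None:
        return False
    return any(abs(m - c) > tolerance for m, c in zip(media, crop))


# Excerpt of the pdfinfo output for 10568-116598.pdf shown above
SAMPLE = """\
Page size: 596.175 x 839.176 pts
MediaBox: 0.00 0.00 1190.55 841.89
CropBox: 594.38 0.00 1190.55 839.18
"""

if __name__ == "__main__":
    boxes = parse_boxes(SAMPLE)
    if cropbox_differs(boxes):
        print("CropBox differs from MediaBox; render with the CropBox")
```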
- Apply some corrections from Peter on CGSpace
- Meeting with Leroy, Daniel, Francesca, and Maria from Alliance to review their TIP tool and talk about next steps
- We asked them to do some real submissions (as opposed to "I like coffee" etc) to test the full breadth of the metadata and controlled vocabularies
- Minor work on the CG Core Types spreadsheet to clear up some of the actions and incorporate some of Peter's feedback
- After looking at the request patterns in nginx on CGSpace for the past few weeks I see nothing that would explain the high loads we see several times per week (especially Sundays!)
- So I suspect there is a noisy neighbor, and actually I do see some non-trivial amount of CPU steal in my Munin graphs and `iostat`
- I asked Linode to move the instance elsewhere
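- For reference, CPU steal can also be estimated directly from two samples of the aggregate `cpu` line in `/proc/stat`, which is the counter that `iostat` reports as `%steal`. This is only a sketch under assumptions: the function names and the sample counter values are made up for illustration, and it is not how the Munin graphs were produced:

```python
def parse_cpu_line(line):
    """Parse the aggregate "cpu" line of /proc/stat into a list of jiffy counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    # First eight counters: user nice system idle iowait irq softirq steal.
    # guest and guest_nice are already counted in user and nice, so skip them.
    deltas = [b - a for a, b in zip(first[:8], second[:8])]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical samples; on a real host, read the first line of /proc/stat
    # twice, a few seconds apart
    before = "cpu  1000 0 500 8000 100 0 50 350 0 0"
    after = "cpu  1600 0 800 8700 150 0 100 650 0 0"
    print(f"steal: {steal_percent(before, after):.1f}%")
```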
<!-- vim: set sw=2 ts=2: -->