Add notes for 2021-08-09

2021-08-09 17:10:45 +03:00
parent 3a1264b858
commit 46bdcb7a45
25 changed files with 94 additions and 30 deletions

@ -178,5 +178,37 @@ $ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],c
- Then in OpenRefine I merged all null, blank, and `en` fields into the `en_US` one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSNs/ISBNs themselves
- In total it was a few thousand metadata entries, so I had to split the CSV with `xsv split` in order to process it
- I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes...
- In total it was 1,195 changes to ISSN and ISBN metadata fields
## 2021-08-09
- Extract all unique ISSNs to look up on Sherpa Romeo and Crossref
```console
$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
```
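For reference, the extraction step in the first command above can be sketched in Python with only the standard library. This is a rough equivalent, not the pipeline I actually ran: the function name `unique_issns` and the sample rows are made up for illustration, and it keeps the same loose "starts with four digits" test as the `csvgrep` pattern rather than validating ISSNs properly.

```python
import csv
import io
import re

# Same loose pattern as the csvgrep call: the value starts with four digits
ISSN_PREFIX = re.compile(r"^[0-9]{4}")


def unique_issns(csv_file, column="cg.issn[en_US]"):
    """Return the sorted unique values of `column` that start with four digits."""
    reader = csv.DictReader(csv_file)
    values = {
        row[column]
        for row in reader
        if ISSN_PREFIX.match(row[column] or "")
    }
    return sorted(values)


# Made-up sample data standing in for the exported CSV
sample = io.StringIO(
    "id,cg.issn[en_US]\n"
    "1,1111-2222\n"
    "2,\n"
    "3,1111-2222\n"
    "4,0000-1111\n"
    "5,n/a\n"
)
print(unique_issns(sample))
```

Blank cells and values like `n/a` are dropped, and duplicates collapse because the values are collected in a set before sorting.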
- Then I updated the CSV headers for each and joined the CSVs on the issn column:
```console
$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
```
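The header rename and join above could also be done in one pass in Python. The sketch below assumes each lookup CSV has just an `issn` column and a `journal title` column (the real files may carry more), assumes ISSNs are unique within each file, and keeps only ISSNs present in both files, which is my understanding of `csvjoin`'s default behavior. The function name `join_journals` and the sample rows are hypothetical.

```python
import csv
import io


def join_journals(sherpa_file, crossref_file):
    """Inner-join two journal CSVs on `issn`, keeping the order of the first file."""
    # Index the second file by ISSN so each lookup is a dict access
    crossref = {
        row["issn"]: row["journal title"]
        for row in csv.DictReader(crossref_file)
    }
    joined = []
    for row in csv.DictReader(sherpa_file):
        issn = row["issn"]
        if issn in crossref:
            joined.append(
                {
                    "issn": issn,
                    "sherpa romeo journal title": row["journal title"],
                    "crossref journal title": crossref[issn],
                }
            )
    return joined


# Made-up rows standing in for the two lookup results
sherpa = io.StringIO("issn,journal title\n1111-2222,Journal A\n3333-4444,Journal B\n")
crossref = io.StringIO("issn,journal title\n1111-2222,Journal a\n5555-6666,Journal C\n")
for row in join_journals(sherpa, crossref):
    print(row)
```

Renaming the title columns while joining avoids the two separate `sed -i` edits, at the cost of hard-coding the column names.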
- In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:
```console
if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
```
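Outside of OpenRefine, the fill-blanks-then-compare step could look roughly like this in Python. It is only a sketch of the same logic: the output column name `status` and the function name `compare_titles` are my own inventions, and the comparison is an exact, case-sensitive string match like the GREL above.

```python
def compare_titles(row):
    """Fill a blank title from the other column, then flag whether they match."""
    sherpa = row.get("sherpa romeo journal title") or ""
    crossref = row.get("crossref journal title") or ""

    # Copy the value across when one side is blank
    sherpa = sherpa or crossref
    crossref = crossref or sherpa

    result = dict(row)
    result["sherpa romeo journal title"] = sherpa
    result["crossref journal title"] = crossref
    # "status" is a hypothetical column name for the new indicator column
    result["status"] = "same" if sherpa == crossref else "different"
    return result


# Made-up rows for illustration
rows = [
    {"issn": "1111-2222", "sherpa romeo journal title": "Journal A", "crossref journal title": "Journal a"},
    {"issn": "3333-4444", "sherpa romeo journal title": "", "crossref journal title": "Journal B"},
]
for row in rows:
    print(compare_titles(row))
```

Rows where one title was copied from the other always come out as "same", so the "different" rows are the ones where both sources returned a title and the two disagree.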
- Then I exported the list of journals that differ and sent it to Peter for comments and corrections
- I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it
- Convert my `generate-thumbnails.py` script to use libvips instead of GraphicsMagick
- It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
- One drawback is that libvips renders PDFs with Poppler instead of GraphicsMagick, which apparently means that it can't work in CMYK
- I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I'm not sure...
<!-- vim: set sw=2 ts=2: -->