Add notes for 2021-08-11

2021-08-11 16:19:26 +03:00
parent 33d5d98502
commit dba953ae9b
25 changed files with 118 additions and 31 deletions

@ -210,6 +210,7 @@ if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].
- It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
- One drawback is that libvips uses Poppler instead of GraphicsMagick, which apparently means that it can't work in CMYK
- I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I'm not sure...
- Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
- I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
```console
@ -225,5 +226,42 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
- The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
- libvips does use less time and memory... I should do more tests!
- I wonder if I can try to use these [unofficial Java bindings](https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java) in DSpace
- The authors of the JVips project wrote a nice blog post about libvips performance: https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55
- Ouch, JVips is Java 8 only as far as I can tell... that works now, but it's a non-starter going forward
<!-- vim: set sw=2 ts=2: -->
## 2021-08-11
- Peter got back to me about the journal title cleanup
- From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref's
- Anyway, I exported the originals that were the same in Sherpa Romeo and Crossref, as well as Peter's selections for the cases where Sherpa Romeo and Crossref differed:
```console
$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
```
- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
- I exported a list of all the journal titles we have in the `cg.journal` field:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
```
- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don't match, so I'd have to go check many of them manually before selecting a match or fixing them...
- I think it's better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
- Or instead of doing it via SQL I could use CSV and parse the values there...
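- A minimal sketch of the CSV approach, assuming a hypothetical export with `id`, `journal`, and `issn` columns and a hypothetical ISSN-to-title lookup built from the vocabulary (neither exists yet, so the names are placeholders):

```python
import csv
import io


def canonicalize_journals(rows, issn_to_title):
    """Yield each row with its journal title replaced by the canonical
    title for its ISSN. Rows with no ISSN or an unknown ISSN pass through
    unchanged so they can be reviewed manually later."""
    for row in rows:
        issn = (row.get("issn") or "").strip()
        canonical = issn_to_title.get(issn)
        if canonical:
            row = {**row, "journal": canonical}
        yield row


# Hypothetical lookup table; in practice this would be built from the
# controlled vocabulary, not hard-coded.
issn_to_title = {"0003-1305": "American Statistician"}

# Hypothetical sample standing in for a real metadata export.
sample = io.StringIO(
    "id,journal,issn\n"
    "1,The American statistician,0003-1305\n"
    "2,Some Unknown Journal,\n"
)

fixed = list(canonicalize_journals(csv.DictReader(sample), issn_to_title))
for row in fixed:
    print(row["id"], row["journal"])
```

- Rows without a usable ISSN are left alone on purpose, since those are the ones that would still need a manual check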
- A few more issues:
- Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org's web search (their API is invite only)
- Some titles are different across all three datasets, for example ISSN 0003-1305:
- [According to ISSN.org](https://portal.issn.org/resource/ISSN/0003-1305) this is "The American statistician"
- [According to Sherpa Romeo](https://v2.sherpa.ac.uk/id/publication/20807) this is "American Statistician"
- [According to Crossref](https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician) this is "The American Statistician"
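- The three variants for ISSN 0003-1305 only differ in case and the leading article, so a rough comparison key could flag them as the same journal. This is just a sketch with a hypothetical `title_key` helper, and the normalization rules are my assumption rather than anything these registries define:

```python
import re


def title_key(title):
    """Build a rough comparison key: case-insensitive, leading 'The'
    dropped, runs of whitespace collapsed to a single space."""
    key = title.casefold().strip()
    key = re.sub(r"^the\s+", "", key)
    key = re.sub(r"\s+", " ", key)
    return key


# The three titles recorded for ISSN 0003-1305 in the notes above.
titles = {
    "issn.org": "The American statistician",
    "Sherpa Romeo": "American Statistician",
    "Crossref": "The American Statistician",
}

keys = {title_key(t) for t in titles.values()}
print(keys)
```

- This only shows that the titles are equivalent, it doesn't decide which spelling should go in the vocabulary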
- I also realized that our previous controlled vocabulary came from CGSpace's top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
- Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
- I requested access to the issn.org OAI-PMH API so I can use their registry...
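- Merging the previous vocabulary with the new one could probably be scripted next time instead of removing duplicates by hand. A minimal sketch, assuming both vocabularies are plain lists of titles and that titles differing only in case or whitespace count as duplicates (the `merge_vocabularies` name is hypothetical):

```python
def merge_vocabularies(previous, new):
    """Merge two lists of journal titles, dropping duplicates that differ
    only in case or surrounding/internal whitespace. The first spelling
    seen wins, so entries from the previous vocabulary take precedence."""
    seen = set()
    merged = []
    for title in list(previous) + list(new):
        cleaned = " ".join(title.split())
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return sorted(merged, key=str.casefold)


# Hypothetical sample titles, not taken from the real vocabularies.
previous = ["Nature", "Food Policy"]
new = ["food policy", "Agricultural Systems ", "Nature"]

merged = merge_vocabularies(previous, new)
print(merged)
```

- Near-duplicates like "The American Statistician" versus "American Statistician" would still slip through this and need a manual look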
<!-- vim: set sw=2 ts=2: -->