From dba953ae9b09ea323a2d919d510b8b48aa899ab8 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Wed, 11 Aug 2021 16:19:26 +0300 Subject: [PATCH] Add notes for 2021-08-11 --- content/posts/2021-08.md | 40 +++++++++++++++++- docs/2021-08/index.html | 55 +++++++++++++++++++++++-- docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/categories/notes/page/5/index.html | 2 +- docs/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/page/7/index.html | 2 +- docs/page/8/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/posts/page/7/index.html | 2 +- docs/posts/page/8/index.html | 2 +- docs/sitemap.xml | 10 ++--- 25 files changed, 118 insertions(+), 31 deletions(-) diff --git a/content/posts/2021-08.md b/content/posts/2021-08.md index 4f20eeef4..a8e5a04b4 100644 --- a/content/posts/2021-08.md +++ b/content/posts/2021-08.md @@ -210,6 +210,7 @@ if(cells['sherpa romeo journal title'].value == cells['crossref journal title']. - It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs - One drawback is that libvips uses Poppler instead of Graphicsmagick, which apparently means that it can't work in CMYK - I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I'm not sure... 
+ - Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196 - I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick: ```console @@ -225,5 +226,42 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409 - The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail) - libvips does use less time and memory... I should do more tests! + - I wonder if I can try to use these [unofficial Java bindings](https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java) in DSpace + - The authors of the JVips project wrote a nice blog post about libvips performance: https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55 + - Ouch, JVips is Java 8 only as far as I can tell... that works now, but it's a non-starter going forward - +## 2021-08-11 + +- Peter got back to me about the journal title cleanup + - From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref's + - Anyway, I exported the originals that were the same in Sherpa Romeo and Crossref as well as Peter's selections for where Sherpa Romeo and Crossref differed: + +```console +$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt +$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt +$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l +1911 +``` + +- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine +- I exported a list of all the journal titles we have in the `cg.journal` field: + +```console +localhost/dspace63= > \COPY (SELECT 
DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV; +COPY 3245 +``` + +- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don't match, so I'd have to go check many of them manually before selecting a match or fixing them... + - I think it's better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way + - Or instead of doing it via SQL I could use CSV and parse the values there... +- A few more issues: + - Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org's web search (their API is invite only) + - Some titles are different across all three datasets, for example ISSN 0003-1305: + - [According to ISSN.org](https://portal.issn.org/resource/ISSN/0003-1305) this is "The American statistician" + - [According to Sherpa Romeo](https://v2.sherpa.ac.uk/id/publication/20807) this is "American Statistician" + - [According to Crossref](https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician) this is "The American Statistician" +- I also realized that our previous controlled vocabulary came from CGSpace's top 500 journals, so when I replaced it with the generated list earlier today we lost some journals + - Now I went back and merged the previous with the new, and manually removed duplicates (sigh) + - I requested access to the issn.org OAI-PMH API so I can use their registry... 
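A minimal sketch of the ISSN-based idea from the notes above: look up each ISSN and prefer the Sherpa Romeo title, falling back to Crossref, since most of Peter's choices matched Sherpa Romeo. The function name `pick_title` and the two in-memory dictionaries are assumptions for illustration only; a real script would first have to fill them from the exported CSV files. The ISSN `0000-0000` and its title are made up to show the fallback.

```python
from typing import Dict, Optional


def pick_title(
    issn: str,
    sherpa_romeo: Dict[str, str],
    crossref: Dict[str, str],
) -> Optional[str]:
    """Return the preferred journal title for an ISSN, or None if unknown.

    Sherpa Romeo is checked first, then Crossref (an assumed policy,
    based on the observation that most manual choices matched Sherpa Romeo).
    """
    for source in (sherpa_romeo, crossref):
        title = source.get(issn, "").strip()
        if title:
            return title
    return None


# Titles for 0003-1305 are the ones quoted in the notes;
# 0000-0000 is a hypothetical entry that only exists in Crossref here.
sherpa_romeo = {"0003-1305": "American Statistician"}
crossref = {
    "0003-1305": "The American Statistician",
    "0000-0000": "Hypothetical Journal",
}

for issn in ("0003-1305", "0000-0000", "9999-9999"):
    print(issn, pick_title(issn, sherpa_romeo, crossref))
```

ISSNs that resolve to `None` would still need a manual check, for example against the issn.org web search.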
+ + diff --git a/docs/2021-08/index.html b/docs/2021-08/index.html index 25bb33b0e..c4df4bb31 100644 --- a/docs/2021-08/index.html +++ b/docs/2021-08/index.html @@ -18,7 +18,7 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04 - + @@ -42,9 +42,9 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04 "@type": "BlogPosting", "headline": "August, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-08/", - "wordCount": "1636", + "wordCount": "2059", "datePublished": "2021-08-01T09:01:07+03:00", - "dateModified": "2021-08-09T17:10:45+03:00", + "dateModified": "2021-08-09T17:34:13+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -327,6 +327,7 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
  • It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
  • One drawback is that libvips uses Poppler instead of Graphicsmagick, which apparently means that it can’t work in CMYK
  • I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I’m not sure…
  • Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
  • I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
  • @@ -343,6 +344,54 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
  • The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail) +

    2021-08-11

    + +
    $ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
    +$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
    +$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
    +1911
    +
    +
    localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
    +COPY 3245
    +
    diff --git a/docs/categories/index.html b/docs/categories/index.html index 7d857cb46..48820312b 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 2537e50f1..83da7f2cd 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 3d39227c2..75a58b8de 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 3035845fc..4261eed2e 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 59a08e50d..535d0def9 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 1b797cf8c..60f14759c 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index 325293be1..a4623b7f2 100644 --- a/docs/index.html +++ b/docs/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index cab94a1c1..126ef116b 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 1e6811f4c..387e00484 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index fe6506033..b432ce9d3 100644 --- 
a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 8efa78ded..467904b4c 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 02dde15b5..3b55a5f4c 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index adc9a7bee..6d60aba0c 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index 67680ecf0..3a8611f67 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index b819cdac9..9be339522 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 65477cbfc..dcf44b56a 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 6e1255562..ad71b0aa7 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 08798ee40..24c495929 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 3117c601e..ab45b2c9b 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index d20f7ddba..270aedc91 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html 
b/docs/posts/page/7/index.html index dac59ff60..d01fb910d 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index fbc3c1258..0f15b4a8a 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 3d262bd81..4a098c1bb 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/2021-08/ - 2021-08-09T17:10:45+03:00 + 2021-08-09T17:34:13+03:00 https://alanorth.github.io/cgspace-notes/categories/ - 2021-08-09T17:10:45+03:00 + 2021-08-09T17:34:13+03:00 https://alanorth.github.io/cgspace-notes/ - 2021-08-09T17:10:45+03:00 + 2021-08-09T17:34:13+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2021-08-09T17:10:45+03:00 + 2021-08-09T17:34:13+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2021-08-09T17:10:45+03:00 + 2021-08-09T17:34:13+03:00 https://alanorth.github.io/cgspace-notes/2021-07/ 2021-08-01T16:19:05+03:00