diff --git a/content/posts/2021-08.md b/content/posts/2021-08.md
index 4f20eeef4..a8e5a04b4 100644
--- a/content/posts/2021-08.md
+++ b/content/posts/2021-08.md
@@ -210,6 +210,7 @@ if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].
- It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
- One drawback is that libvips uses Poppler instead of Graphicsmagick, which apparently means that it can't work in CMYK
- I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I'm not sure...
+ - Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
- I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
```console
@@ -225,5 +226,42 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
- The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
- libvips does use less time and memory... I should do more tests!
+ - I wonder if I can try to use these [unofficial Java bindings](https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java) in DSpace
+ - The authors of the JVips project wrote a nice blog post about libvips performance: https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55
+ - Ouch, JVips is Java 8 only as far as I can tell... that works now, but it's a non-starter going forward
-
+## 2021-08-11
+
+- Peter got back to me about the journal title cleanup
+ - From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref's
+ - Anyway, I exported the original titles that were the same in Sherpa Romeo and Crossref, as well as Peter's selections for the titles where Sherpa Romeo and Crossref differed:
+
+```console
+$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
+$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
+$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
+1911
+```
+
+- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
+- I exported a list of all the journal titles we have in the `cg.journal` field:
+
+```console
+localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
+COPY 3245
+```
+
+- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don't match, so I'd have to go check many of them manually before selecting a match or fixing them...
+ - I think it's better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
+ - Or instead of doing it via SQL I could use CSV and parse the values there...
+- A few more issues:
+ - Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org's web search (their API is invite only)
+ - Some titles are different across all three datasets, for example ISSN 0003-1305:
+ - [According to ISSN.org](https://portal.issn.org/resource/ISSN/0003-1305) this is "The American statistician"
+ - [According to Sherpa Romeo](https://v2.sherpa.ac.uk/id/publication/20807) this is "American Statistician"
+ - [According to Crossref](https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician) this is "The American Statistician"
+- I also realized that our previous controlled vocabulary came from CGSpace's top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
+ - Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
+ - I requested access to the issn.org OAI-PMH API so I can use their registry...
+
+
diff --git a/docs/2021-08/index.html b/docs/2021-08/index.html
index 25bb33b0e..c4df4bb31 100644
--- a/docs/2021-08/index.html
+++ b/docs/2021-08/index.html
@@ -18,7 +18,7 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
-
+
@@ -42,9 +42,9 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"@type": "BlogPosting",
"headline": "August, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-08/",
- "wordCount": "1636",
+ "wordCount": "2059",
"datePublished": "2021-08-01T09:01:07+03:00",
- "dateModified": "2021-08-09T17:10:45+03:00",
+ "dateModified": "2021-08-09T17:34:13+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -327,6 +327,7 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
One drawback is that libvips uses Poppler instead of Graphicsmagick, which apparently means that it can’t work in CMYK
I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I’m not sure…
+Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
@@ -343,6 +344,54 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
+
+
+2021-08-11
+
+- Peter got back to me about the journal title cleanup
+
+- From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref’s
+- Anyway, I exported the original titles that were the same in Sherpa Romeo and Crossref, as well as Peter’s selections for the titles where Sherpa Romeo and Crossref differed:
+
+
+
+$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
+$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
+$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
+1911
+
+- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
+- I exported a list of all the journal titles we have in the cg.journal field:
+
+localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
+COPY 3245
+
+- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
+
+- I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
+- Or instead of doing it via SQL I could use CSV and parse the values there…
+
+
+- A few more issues:
+
+- Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org’s web search (their API is invite only)
+- Some titles are different across all three datasets, for example ISSN 0003-1305:
+
+
+
+
+- I also realized that our previous controlled vocabulary came from CGSpace’s top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
+
+- Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
+- I requested access to the issn.org OAI-PMH API so I can use their registry…
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 7d857cb46..48820312b 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 2537e50f1..83da7f2cd 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 3d39227c2..75a58b8de 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 3035845fc..4261eed2e 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 59a08e50d..535d0def9 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 1b797cf8c..60f14759c 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 325293be1..a4623b7f2 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index cab94a1c1..126ef116b 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 1e6811f4c..387e00484 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index fe6506033..b432ce9d3 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 8efa78ded..467904b4c 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 02dde15b5..3b55a5f4c 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index adc9a7bee..6d60aba0c 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 67680ecf0..3a8611f67 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index b819cdac9..9be339522 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 65477cbfc..dcf44b56a 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 6e1255562..ad71b0aa7 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 08798ee40..24c495929 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 3117c601e..ab45b2c9b 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index d20f7ddba..270aedc91 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index dac59ff60..d01fb910d 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index fbc3c1258..0f15b4a8a 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 3d262bd81..4a098c1bb 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/2021-08/
- 2021-08-09T17:10:45+03:00
+ 2021-08-09T17:34:13+03:00
https://alanorth.github.io/cgspace-notes/categories/
- 2021-08-09T17:10:45+03:00
+ 2021-08-09T17:34:13+03:00
https://alanorth.github.io/cgspace-notes/
- 2021-08-09T17:10:45+03:00
+ 2021-08-09T17:34:13+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2021-08-09T17:10:45+03:00
+ 2021-08-09T17:34:13+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2021-08-09T17:10:45+03:00
+ 2021-08-09T17:34:13+03:00
https://alanorth.github.io/cgspace-notes/2021-07/
2021-08-01T16:19:05+03:00