diff --git a/content/posts/2023-04.md b/content/posts/2023-04.md
index 5121410c1..cc0d7c828 100644
--- a/content/posts/2023-04.md
+++ b/content/posts/2023-04.md
@@ -14,4 +14,81 @@ categories: ["Notes"]
+- I'm getting annoyed with my shell script for running ImageMagick tests and am looking to rewrite it in something object-oriented like Python
+ - There doesn't seem to be an official ImageMagick Python binding on pypi.org; perhaps I can use [Wand](https://docs.wand-py.org)?
+- Testing Wand in Python:
+
+```python
+from wand.image import Image
+
+with Image(filename='data/10568-103447.pdf[0]', resolution=144) as first_page:
+ print(first_page.height)
+```
+
+- I spent more time re-working my thumbnail scripts to compare the resized images and other minor changes
+ - I am realizing that generating the thumbnails directly from the source improves the ssimulacra2 score by 1 to 3 percentage points compared to DSpace's method of creating a lossy supersample followed by a lossy resized thumbnail
+
+## 2023-04-03
+
+- The harvest on AReS that I started yesterday never finished, and actually seems to have died...
+ - Also, Fabio and Patrizio from Alliance emailed me to ask if there is something wrong with the REST API because they are having problems
+ - I stopped the harvest and started the plugins to get the remaining items via the sitemap...
+
+## 2023-04-04
+
+- Presentation about CGSpace metadata, controlled vocabularies, and curation to Pooja's communications and development team at UNEP
+ - I uploaded the presentation to CGSpace here: https://hdl.handle.net/10568/129896
+- Someone from the system organization contacted me to ask how to download a few thousand PDFs from a spreadsheet with DOIs and Handles
+
+```console
+$ csvcut -c Handle ~/Downloads/2023-04-04-Donald.csv \
+ | sed \
+ -e 1d \
+ -e 's_https://hdl.handle.net/__' \
+ -e 's_https://cgspace.cgiar.org/handle/__' \
+ -e 's_http://hdl.handle.net/__' \
+ | sort -u > /tmp/handles.txt
+```
+
+- Then I used the `get_dspace_pdfs.py` script to download them
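
The handle clean-up above could also be done in Python. This is only a rough sketch: `normalize_handles` and `PREFIXES` are hypothetical names, and unlike the `sed` expressions it only strips a prefix at the start of a value and drops empty values.

```python
# Hypothetical helper mirroring the three sed substitutions above.
PREFIXES = (
    "https://hdl.handle.net/",
    "https://cgspace.cgiar.org/handle/",
    "http://hdl.handle.net/",
)


def normalize_handles(values):
    """Strip known resolver prefixes and return sorted, unique handles."""
    handles = set()
    for value in values:
        value = value.strip()
        for prefix in PREFIXES:
            if value.startswith(prefix):
                value = value[len(prefix):]
                break
        # Assumption: blank cells are not wanted in the output
        if value:
            handles.add(value)
    # Equivalent in spirit to `sort -u`, but ordered by code point
    return sorted(handles)
```

The values would come from the `Handle` column of the spreadsheet, for example via `csv.DictReader`.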
+
+## 2023-04-05
+
+- After some cleanup on Donald's DOIs I started the `get_scihub_pdfs.py` script
+
+## 2023-04-06
+
+- I did some more work to clean up and streamline my next generation of DSpace thumbnail testing scripts
+ - I think I found a bug in ImageMagick 7.1.1.5 where CMYK to sRGB conversion fails if we use options like `-density` or `-define` before reading the input file
+ - I started [a discussion on the ImageMagick GitHub](https://github.com/ImageMagick/ImageMagick/discussions/6234) to ask about it
+- Yesterday I started downloading the rest of the PDFs from Donald, those that had DOIs
+ - As a precaution, I extracted the list of DOIs and used my `crossref_doi_lookup.py` script to get their licenses from Crossref:
+
+```console
+$ ./ilri/crossref_doi_lookup.py -e xxxx@i.org -i /tmp/dois.txt -o /tmp/donald-crossref-dois.csv -d
+```
+
+- Then I did some CSV manipulation to extract the DOIs that were Creative Commons licensed, excluding any that were "No Derivatives", and re-formatting the DOIs:
+
+```console
+$ csvcut -c doi,license /tmp/donald-crossref-dois.csv \
+ | csvgrep -c license -m 'creativecommons' \
+ | csvgrep -c license -i -r 'by-(nd|nc-nd)' \
+ | sed -e 's_^10_https://doi.org/10_' \
+ -e 's/\(am\|tdm\|unspecified\|vor\): //' \
+ | tee /tmp/donald-open-dois.csv \
+ | wc -l
+4268
+```
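
The same filter could be written in Python, roughly as below. This is a sketch under assumptions: `open_dois` is a hypothetical name, rows are dicts with `doi` and `license` keys (as `csv.DictReader` would give), and the license values are assumed to carry a prefix such as `vor: `, which is inferred from the `sed` expression rather than from the Crossref data itself.

```python
import re

# Prefixes that the sed expression above strips (assumed to precede the license URL)
VERSION_PREFIX = re.compile(r"(am|tdm|unspecified|vor): ")
# Same pattern as the inverted csvgrep: "No Derivatives" license variants
NO_DERIVATIVES = re.compile(r"by-(nd|nc-nd)")


def open_dois(rows):
    """Return (doi_url, license) pairs for CC licenses that allow derivatives."""
    result = []
    for row in rows:
        license_value = row["license"]
        # Keep only Creative Commons licenses (case-sensitive, like csvgrep -m)
        if "creativecommons" not in license_value:
            continue
        # Drop the "No Derivatives" variants
        if NO_DERIVATIVES.search(license_value):
            continue
        doi = row["doi"]
        if doi.startswith("10"):
            doi = "https://doi.org/" + doi
        # Remove only the first matching prefix, as sed does without the g flag
        result.append((doi, VERSION_PREFIX.sub("", license_value, count=1)))
    return result
```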
+
+- From those I filtered for the DOIs for which I had downloaded PDFs (those with a non-empty `filename` column in the output of the Sci-Hub script) and copied them to a separate directory:
+
+```console
+$ csvjoin -c doi /tmp/donald-doi-pdfs.csv /tmp/donald-open-dois.csv \
+ | csvgrep -c filename -i -r '^$' \
+ | csvcut -c filename \
+ | sed 1d \
+ | while IFS= read -r file; do cp --reflink=always "$file" "creative-commons-licensed/$file"; done
+```
+
+- I used Btrfs copy-on-write via reflinks to make sure I didn't duplicate the files :-D
+- I ran out of time and had to stop the process after around 3,127 PDFs
+ - I zipped them up and sent them to the others, along with a CSV of the DOIs, PDF filenames, and licenses
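
The join-and-copy step above might look like this in Python. It is a sketch with hypothetical function names, and it assumes both CSVs use the same DOI format and that GNU `cp` is available on a filesystem that supports reflinks.

```python
import subprocess
from pathlib import Path


def downloaded_open_files(pdf_rows, open_rows):
    """Return filenames of downloaded PDFs whose DOI is in the open-licensed set."""
    open_set = {row["doi"] for row in open_rows}
    return [
        row["filename"]
        for row in pdf_rows
        # Skip DOIs with no downloaded PDF (empty filename column)
        if row["doi"] in open_set and row["filename"]
    ]


def reflink_copy(filename, dest_dir="creative-commons-licensed"):
    """Copy with GNU cp, failing instead of falling back to a full copy."""
    dest = Path(dest_dir) / filename
    subprocess.run(["cp", "--reflink=always", filename, str(dest)], check=True)
```

With `--reflink=always` the copy errors out if the filesystem cannot share extents, so a silent full copy never happens.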
+
diff --git a/docs/2023-03/index.html b/docs/2023-03/index.html
index d934d7f8f..6d005a82b 100644
--- a/docs/2023-03/index.html
+++ b/docs/2023-03/index.html
@@ -16,7 +16,7 @@ I finally got through with porting the input form from DSpace 6 to DSpace 7
-
+
@@ -40,7 +40,7 @@ I finally got through with porting the input form from DSpace 6 to DSpace 7
"url": "https://alanorth.github.io/cgspace-notes/2023-03/",
"wordCount": "4810",
"datePublished": "2023-03-01T07:58:36+03:00",
- "dateModified": "2023-03-30T16:59:20+03:00",
+ "dateModified": "2023-04-02T09:16:25+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
diff --git a/docs/2023-04/index.html b/docs/2023-04/index.html
index f91ca3073..1622024db 100644
--- a/docs/2023-04/index.html
+++ b/docs/2023-04/index.html
@@ -20,7 +20,7 @@ Start a harvest on AReS
-
+
@@ -46,9 +46,9 @@ Start a harvest on AReS
"@type": "BlogPosting",
"headline": "April, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-04/",
- "wordCount": "39",
+ "wordCount": "569",
"datePublished": "2023-04-02T08:19:36+03:00",
- "dateModified": "2023-04-02T08:19:36+03:00",
+ "dateModified": "2023-04-02T09:16:25+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -132,6 +132,95 @@ Start a harvest on AReS
Start a harvest on AReS
+
+- I’m getting annoyed with my shell script for running ImageMagick tests and am looking to rewrite it in something object-oriented like Python
+
+- There doesn’t seem to be an official ImageMagick Python binding on pypi.org; perhaps I can use Wand?
+
+
+- Testing Wand in Python:
+
+from wand.image import Image
+
+with Image(filename='data/10568-103447.pdf[0]', resolution=144) as first_page:
+ print(first_page.height)
+
+- I spent more time re-working my thumbnail scripts to compare the resized images and other minor changes
+
+- I am realizing that generating the thumbnails directly from the source improves the ssimulacra2 score by 1 to 3 percentage points compared to DSpace’s method of creating a lossy supersample followed by a lossy resized thumbnail
+
+
+
+2023-04-03
+
+- The harvest on AReS that I started yesterday never finished, and actually seems to have died…
+
+- Also, Fabio and Patrizio from Alliance emailed me to ask if there is something wrong with the REST API because they are having problems
+- I stopped the harvest and started the plugins to get the remaining items via the sitemap…
+
+
+
+2023-04-04
+
+- Presentation about CGSpace metadata, controlled vocabularies, and curation to Pooja’s communications and development team at UNEP
+
+
+- Someone from the system organization contacted me to ask how to download a few thousand PDFs from a spreadsheet with DOIs and Handles
+
+$ csvcut -c Handle ~/Downloads/2023-04-04-Donald.csv \
+ | sed \
+ -e 1d \
+ -e 's_https://hdl.handle.net/__' \
+ -e 's_https://cgspace.cgiar.org/handle/__' \
+ -e 's_http://hdl.handle.net/__' \
+ | sort -u > /tmp/handles.txt
+
+- Then I used the `get_dspace_pdfs.py` script to download them
+
+2023-04-05
+
+- After some cleanup on Donald’s DOIs I started the `get_scihub_pdfs.py` script
+
+2023-04-06
+
+- I did some more work to clean up and streamline my next generation of DSpace thumbnail testing scripts
+
+- I think I found a bug in ImageMagick 7.1.1.5 where CMYK to sRGB conversion fails if we use options like `-density` or `-define` before reading the input file
+- I started a discussion on the ImageMagick GitHub to ask about it
+
+
+- Yesterday I started downloading the rest of the PDFs from Donald, those that had DOIs
+
+- As a precaution, I extracted the list of DOIs and used my `crossref_doi_lookup.py` script to get their licenses from Crossref:
+
+
+
+$ ./ilri/crossref_doi_lookup.py -e xxxx@i.org -i /tmp/dois.txt -o /tmp/donald-crossref-dois.csv -d
+
+- Then I did some CSV manipulation to extract the DOIs that were Creative Commons licensed, excluding any that were “No Derivatives”, and re-formatting the DOIs:
+
+$ csvcut -c doi,license /tmp/donald-crossref-dois.csv \
+ | csvgrep -c license -m 'creativecommons' \
+ | csvgrep -c license -i -r 'by-(nd|nc-nd)' \
+ | sed -e 's_^10_https://doi.org/10_' \
+ -e 's/\(am\|tdm\|unspecified\|vor\): //' \
+ | tee /tmp/donald-open-dois.csv \
+ | wc -l
+4268
+
+- From those I filtered for the DOIs for which I had downloaded PDFs (those with a non-empty `filename` column in the output of the Sci-Hub script) and copied them to a separate directory:
+
+$ csvjoin -c doi /tmp/donald-doi-pdfs.csv /tmp/donald-open-dois.csv \
+ | csvgrep -c filename -i -r '^$' \
+ | csvcut -c filename \
+ | sed 1d \
+ | while IFS= read -r file; do cp --reflink=always "$file" "creative-commons-licensed/$file"; done
+
+- I used Btrfs copy-on-write via reflinks to make sure I didn’t duplicate the files :-D
+- I ran out of time and had to stop the process after around 3,127 PDFs
+
+- I zipped them up and sent them to the others, along with a CSV of the DOIs, PDF filenames, and licenses
+
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 6203d908e..347fd2342 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 946002447..36768688c 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index a3433aa81..a928ea819 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 90402531f..1f58b3ccb 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 042bf70c9..93c45b99d 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 6aca1a66a..7609daa4b 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index e7ebde51a..9872db1e0 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index f7230b0ee..bb5622c40 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index da3056848..32e81f152 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/10/index.html b/docs/page/10/index.html
index ccd289366..9fd088fc3 100644
--- a/docs/page/10/index.html
+++ b/docs/page/10/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 1cd1d530d..7e10b4941 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 00cd6090c..d9d747e10 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 39c96660e..8feedf52a 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index cb152f71a..0dc344d71 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 29d24961c..dc4e3e24a 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 77365d465..0efd9190f 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index f49af38b0..588cb6b18 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index 726fb15d4..9a607964b 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 1772c585b..1c9dda16d 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/10/index.html b/docs/posts/page/10/index.html
index 103c0f4a0..fa14c2a1c 100644
--- a/docs/posts/page/10/index.html
+++ b/docs/posts/page/10/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index adc42f90f..798af69fb 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 4f762a67d..4380065b9 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index c02d75f92..41eb442c5 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 3d03796d0..c99f3e429 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 9992b4750..6ea85845e 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 7f85b36f5..48b2e5723 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index f86a7f11e..0012cbfd4 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index 539e0bcdb..dbce30768 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index c0e5e263d..211105079 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,22 +3,22 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/2023-04/
- 2023-04-02T08:19:36+03:00
+ 2023-04-02T09:16:25+03:00
https://alanorth.github.io/cgspace-notes/categories/
- 2023-04-02T08:19:36+03:00
+ 2023-04-02T09:16:25+03:00
https://alanorth.github.io/cgspace-notes/
- 2023-04-02T08:19:36+03:00
+ 2023-04-02T09:16:25+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2023-04-02T08:19:36+03:00
+ 2023-04-02T09:16:25+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2023-04-02T08:19:36+03:00
+ 2023-04-02T09:16:25+03:00
https://alanorth.github.io/cgspace-notes/2023-03/
- 2023-03-30T16:59:20+03:00
+ 2023-04-02T09:16:25+03:00
https://alanorth.github.io/cgspace-notes/2023-02/
2023-03-01T08:30:25+03:00