<li>I’m starting to get annoyed at my shell script for doing ImageMagick tests and looking to re-write it in something object oriented like Python
<ul>
<li>There doesn’t seem to be an official ImageMagick Python binding on pypi.org, perhaps I can use <a href="https://docs.wand-py.org">Wand</a>?</li>
<li>I spent more time re-working my thumbnail scripts to compare the resized images and other minor changes
<ul>
<li>I am realizing that creating the thumbnails directly from the source improves the ssimulacra2 score by 1–3 points compared to DSpace’s method of creating a lossy supersample followed by a lossy resized thumbnail</li>
</ul>
</li>
</ul>
<h2 id="2023-04-03">2023-04-03</h2>
<ul>
<li>The harvest on AReS that I started yesterday never finished, and actually seems to have died…
<ul>
<li>Also, Fabio and Patrizio from Alliance emailed me to ask if there is something wrong with the REST API because they are having problems</li>
<li>I stopped the harvest and started the plugins to get the remaining items via the sitemap…</li>
</ul>
</li>
</ul>
<h2 id="2023-04-04">2023-04-04</h2>
<ul>
<li>Presentation about CGSpace metadata, controlled vocabularies, and curation to Pooja’s communications and development team at UNEP
<ul>
<li>I uploaded the presentation to CGSpace here: <a href="https://hdl.handle.net/10568/129896">https://hdl.handle.net/10568/129896</a></li>
</ul>
</li>
<li>Someone from the system organization contacted me to ask how to download a few thousand PDFs from a spreadsheet with DOIs and Handles</li>
<li>Then I used the <code>get_dspace_pdfs.py</code> script to download them</li>
</ul>
<h2 id="2023-04-05">2023-04-05</h2>
<ul>
<li>After some cleanup on Donald’s DOIs I started the <code>get_scihub_pdfs.py</code> script</li>
</ul>
<h2 id="2023-04-06">2023-04-06</h2>
<ul>
<li>I did some more work to clean up and streamline my next generation of DSpace thumbnail testing scripts
<ul>
<li>I think I found a bug in ImageMagick 7.1.1.5 where CMYK to sRGB conversion fails if we use image operations like <code>-density</code> or <code>-define</code> before reading the input file</li>
<li>I started <a href="https://github.com/ImageMagick/ImageMagick/discussions/6234">a discussion on the ImageMagick GitHub</a> to ask</li>
</ul>
</li>
<li>Yesterday I started downloading the rest of the PDFs from Donald, those that had DOIs
<ul>
<li>As a measure of caution, I extracted the list of DOIs and used my <code>crossref_doi_lookup.py</code> script to get their licenses from Crossref:</li>
<li>Then I did some CSV manipulation to extract the DOIs that were Creative Commons licensed, excluding any that were “No Derivatives”, and re-formatting the DOIs:</li>
<li>From those I filtered for the DOIs for which I had downloaded PDFs (those with a non-empty <code>filename</code> column in the output of the Sci-Hub script) and copied the files to a separate directory:</li>
</ul>
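<p>The Creative Commons filtering step above could be sketched in Python roughly as follows. This is only an illustration, not the CSV manipulation actually used: the column names <code>doi</code> and <code>license</code>, the assumption that licenses are recorded as URLs, and the sample rows are all hypothetical, and the DOI re-formatting is not covered.</p>

```python
import csv
import io
import re

# Hypothetical column names; the real output of crossref_doi_lookup.py may differ.
DOI_COLUMN = "doi"
LICENSE_COLUMN = "license"

# Captures the license code in a Creative Commons license URL, e.g. "by-nc-nd".
CC_LICENSE = re.compile(r"creativecommons\.org/licenses/([a-z-]+)/", re.IGNORECASE)


def is_open_cc_license(license_url: str) -> bool:
    """Return True for Creative Commons licenses that are not "No Derivatives"."""
    match = CC_LICENSE.search(license_url or "")
    if match is None:
        return False
    terms = match.group(1).lower().split("-")
    return "nd" not in terms


def filter_open_dois(rows):
    """Yield the DOI of every row whose license passes the check above."""
    for row in rows:
        if is_open_cc_license(row.get(LICENSE_COLUMN, "")):
            yield row[DOI_COLUMN]


# Made-up sample data, for illustration only.
SAMPLE = """doi,license
10.1234/example.1,https://creativecommons.org/licenses/by/4.0/
10.1234/example.2,https://creativecommons.org/licenses/by-nc-nd/4.0/
10.1234/example.3,https://www.elsevier.com/tdm/userlicense/1.0/
10.1234/example.4,http://creativecommons.org/licenses/by-nc-sa/3.0/igo/
"""

open_dois = list(filter_open_dois(csv.DictReader(io.StringIO(SAMPLE))))
print(open_dois)
```

<p>Splitting the license code on hyphens and checking for <code>nd</code> avoids accidental substring matches. Note that this sketch only recognizes URLs under <code>/licenses/</code>, so public domain dedications such as CC0 would need separate handling.</p>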
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> file in <span style="color:#66d9ef">$(</span>csvjoin -c doi /tmp/donald-doi-pdfs.csv /tmp/donald-open-dois.csv | csvgrep -c filename -i -r <span style="color:#e6db74">'^$'</span> | csvcut -c filename | sed 1d<span style="color:#66d9ef">)</span>; <span style="color:#66d9ef">do</span> cp --reflink<span style="color:#f92672">=</span>always <span style="color:#e6db74">"</span>$file<span style="color:#e6db74">"</span> <span style="color:#e6db74">"creative-commons-licensed/</span>$file<span style="color:#e6db74">"</span>; <span style="color:#66d9ef">done</span>
</span></span></code></pre></div><ul>
<li>I used Btrfs copy-on-write via reflinks to make sure I didn’t duplicate the files :-D</li>
<li>I ran out of time and had to stop the process around 3,127 PDFs
<ul>
<li>I zipped them up and sent them to the others, along with a CSV of the DOIs, PDF filenames, and licenses</li>