---
title: "April, 2023"
date: 2023-04-02T08:19:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2023-04-02

- Run all system updates on CGSpace and reboot it
- I exported CGSpace to CSV to check for any missing Initiative collection mappings
- I also did a check for missing country/region mappings with csv-metadata-quality
- Start a harvest on AReS

<!--more-->

- I'm starting to get annoyed at my shell script for doing ImageMagick tests and am looking to rewrite it in something object-oriented like Python
- There doesn't seem to be an official ImageMagick Python binding on pypi.org, perhaps I can use [Wand](https://docs.wand-py.org)?
- Testing Wand in Python:

```python
from wand.image import Image

# Read only the first page of the PDF ([0]) at 144 DPI
with Image(filename='data/10568-103447.pdf[0]', resolution=144) as first_page:
    print(first_page.height)
```

- I spent more time re-working my thumbnail scripts to compare the resized images, along with other minor changes
- I am realizing that doing the thumbnails directly from the source improves the ssimulacra2 score by 1-3 percentage points compared to DSpace's method of creating a lossy supersample followed by a lossy resized thumbnail

## 2023-04-03

- The harvest on AReS that I started yesterday never finished, and actually seems to have died...
- Also, Fabio and Patrizio from Alliance emailed me to ask if there is something wrong with the REST API because they are having problems
- I stopped the harvest and started the plugins to get the remaining items via the sitemap...

## 2023-04-04

- Presentation about CGSpace metadata, controlled vocabularies, and curation to Pooja's communications and development team at UNEP
- I uploaded the presentation to CGSpace here: https://hdl.handle.net/10568/129896
- Someone from the system organization contacted me to ask how to download a few thousand PDFs from a spreadsheet with DOIs and Handles

```console
$ csvcut -c Handle ~/Downloads/2023-04-04-Donald.csv \
    | sed \
        -e 1d \
        -e 's_https://hdl.handle.net/__' \
        -e 's_https://cgspace.cgiar.org/handle/__' \
        -e 's_http://hdl.handle.net/__' \
    | sort -u > /tmp/handles.txt
```

- Then I used the `get_dspace_pdfs.py` script to download them
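
For comparison, the handle extraction above could also be sketched in Python with only the standard library. This is an illustrative equivalent, not the script that was actually used: the `extract_handles` function and the sample rows are hypothetical, and unlike the `sed` pipeline it only strips a prefix at the start of the value and skips empty cells.

```python
import csv
import io

# The three URL prefixes stripped by the sed expressions above
HANDLE_PREFIXES = (
    "https://hdl.handle.net/",
    "https://cgspace.cgiar.org/handle/",
    "http://hdl.handle.net/",
)


def extract_handles(csv_file, column="Handle"):
    """Return sorted, de-duplicated bare handles from a CSV file object."""
    handles = set()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        for prefix in HANDLE_PREFIXES:
            if value.startswith(prefix):
                value = value[len(prefix):]
                break
        if value:  # assumption: empty cells are not useful handles
            handles.add(value)
    return sorted(handles)


# Hypothetical sample data standing in for the real spreadsheet
sample = io.StringIO(
    "Handle,DOI\n"
    "https://hdl.handle.net/10568/1,\n"
    "https://cgspace.cgiar.org/handle/10568/2,\n"
    "http://hdl.handle.net/10568/1,\n"
)
print(extract_handles(sample))  # ['10568/1', '10568/2']
```

Using a set mirrors the `sort -u` step, so the same handle reached through different URL forms is only listed once.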

## 2023-04-05

- After some cleanup on Donald's DOIs I started the `get_scihub_pdfs.py` script

## 2023-04-06

- I did some more work to clean up and streamline my next generation of DSpace thumbnail testing scripts
- I think I found a bug in ImageMagick 7.1.1.5 where CMYK to sRGB conversion fails if we use image operations like `-density` or `-define` before reading the input file
- I started [a discussion on the ImageMagick GitHub](https://github.com/ImageMagick/ImageMagick/discussions/6234) to ask about it
- Yesterday I started downloading the rest of the PDFs from Donald, those that had DOIs
- As a measure of caution, I extracted the list of DOIs and used my `crossref_doi_lookup.py` script to get their licenses from Crossref:

```console
$ ./ilri/crossref_doi_lookup.py -e xxxx@i.org -i /tmp/dois.txt -o /tmp/donald-crossref-dois.csv -d
```

- Then I did some CSV manipulation to extract the DOIs that were Creative Commons licensed, excluding any that were "No Derivatives", and re-formatting the DOIs:

```console
$ csvcut -c doi,license /tmp/donald-crossref-dois.csv \
    | csvgrep -c license -m 'creativecommons' \
    | csvgrep -c license -i -r 'by-(nd|nc-nd)' \
    | sed -e 's_^10_https://doi.org/10_' \
        -e 's/\(am\|tdm\|unspecified\|vor\): //' \
    | tee /tmp/donald-open-dois.csv \
    | wc -l
4268
```
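
The same license filter could be sketched in Python, which may be easier to extend than the pipeline. This is only an approximation under stated assumptions: `filter_open_dois` is a hypothetical name, the sample rows are invented, and the license values are assumed to look like `vor: http://creativecommons.org/...`, which is inferred from the `sed` expression rather than confirmed by the script's output.

```python
import csv
import io
import re

# Same pattern as the inverted csvgrep: "No Derivatives" license variants
NO_DERIVATIVES = re.compile(r"by-(nd|nc-nd)")
# Assumption: the content-version prefix sits at the start of the license field
LICENSE_PREFIX = re.compile(r"^(am|tdm|unspecified|vor): ")


def filter_open_dois(csv_file):
    """Keep Creative Commons rows without No Derivatives and tidy both columns."""
    rows = []
    for row in csv.DictReader(csv_file):
        license_value = row["license"]
        if "creativecommons" not in license_value:
            continue
        if NO_DERIVATIVES.search(license_value):
            continue
        doi = row["doi"]
        if doi.startswith("10"):
            doi = "https://doi.org/" + doi
        rows.append(
            {"doi": doi, "license": LICENSE_PREFIX.sub("", license_value, count=1)}
        )
    return rows


# Hypothetical sample rows, not real Crossref results
sample = io.StringIO(
    "doi,license\n"
    "10.1000/a,vor: http://creativecommons.org/licenses/by/4.0/\n"
    "10.1000/b,vor: http://creativecommons.org/licenses/by-nc-nd/4.0/\n"
    "10.1000/c,tdm: https://www.example.org/tdm-license\n"
)
for row in filter_open_dois(sample):
    print(row["doi"], row["license"])
```

With the sample above, only the first row survives: the second is excluded as No Derivatives and the third is not Creative Commons at all.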

- From those I filtered for the DOIs for which I had downloaded PDFs, using the `filename` column from the Sci-Hub script's output, and copied them to a separate directory:

```console
$ for file in $(csvjoin -c doi /tmp/donald-doi-pdfs.csv /tmp/donald-open-dois.csv | csvgrep -c filename -i -r '^$' | csvcut -c filename | sed 1d); do cp --reflink=always "$file" "creative-commons-licensed/$file"; done
```

- I used BTRFS copy-on-write via reflinks to make sure I didn't duplicate the files :-D
- I ran out of time and had to stop the process around 3,127 PDFs
- I zipped them up and sent them to the others, along with a CSV of the DOIs, PDF filenames, and licenses
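
The join-and-filter step that feeds the copy loop above could be sketched in Python as follows. The function name `filenames_for_open_dois` and the sample data are hypothetical, and the sketch assumes both CSVs have a `doi` column in the same format and that the PDF list has a `filename` column, as the command suggests. It stops at producing the list of filenames and leaves the reflink copy to `cp`.

```python
import csv
import io


def filenames_for_open_dois(pdfs_csv, open_dois_csv):
    """Return filenames of downloaded PDFs whose DOI is in the open-license list."""
    open_dois = {row["doi"] for row in csv.DictReader(open_dois_csv)}
    return [
        row["filename"]
        for row in csv.DictReader(pdfs_csv)
        # Skip DOIs without a downloaded PDF (empty filename), like csvgrep -i -r '^$'
        if row["doi"] in open_dois and row["filename"]
    ]


# Hypothetical sample data, not the real Sci-Hub script output
pdfs = io.StringIO(
    "doi,filename\n"
    "https://doi.org/10.1000/a,10.1000-a.pdf\n"
    "https://doi.org/10.1000/b,\n"
    "https://doi.org/10.1000/c,10.1000-c.pdf\n"
)
open_dois = io.StringIO(
    "doi,license\n"
    "https://doi.org/10.1000/a,http://creativecommons.org/licenses/by/4.0/\n"
    "https://doi.org/10.1000/b,http://creativecommons.org/licenses/by/4.0/\n"
)
print(filenames_for_open_dois(pdfs, open_dois))  # ['10.1000-a.pdf']
```

One difference from `csvjoin`: a set lookup never emits a filename twice when a DOI is repeated in the license list, whereas an inner join would produce one row per matching pair.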

<!-- vim: set sw=2 ts=2: -->