- Remove `cg.subject.wle` and `cg.identifier.wletheme` from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
- [iso-codes 4.13.0 was released](https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/CHANGELOG.md#4130-2023-02-28), which incorporates my changes to the common names for Iran, Laos, and Syria
- I can't put my finger on it, but the input form has to be formatted very particularly: for example, if your rows have more than two fields in them without a sufficient Bootstrap grid style, or if you use a `twobox`, etc., the entire form step appears blank
## 2023-03-02
- I did some experiments with the new [Pandas 2.0.0rc0 Apache Arrow support](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i)
- There is a change to the way nulls are handled and it causes my tests for `pd.isna(field)` to fail
- I think we need to consider blanks as null, but I'm not sure
- I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
- I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in `discovery.xml` since we are no longer using them in sidebar facets
## 2023-03-03
- Atmire merged one of my old pull requests into COUNTER-Robots:
- [COUNTER_Robots_list.json: Add new bots](https://github.com/atmire/COUNTER-Robots/pull/54)
- I will update the local ILRI overrides in our DSpace spider agents file
## 2023-03-04
- Submit a [pull request on pycountry to use iso-codes 4.13.0](https://github.com/flyingcircusio/pycountry/pull/156)
## 2023-03-05
- Start a harvest on AReS
## 2023-03-06
- Export CGSpace to do Initiative collection mappings
- There were thirty-three that needed updating
- Send Abenet and Sam a list of twenty-one CAS publications that had been marked as "multiple documents" that we uploaded as metadata-only items
- Goshu will download the PDFs for each and upload them to the items on CGSpace manually
- I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
- It seems there is a problem recognizing empty strings as na with `pd.isna()`
- If I do `pd.isna(field) or field == ""` then it works as expected, but that feels hacky
- I'm going to test again on the next release...
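- The workaround above can be wrapped in a small helper so the check lives in one place. This is only a sketch: `is_blank` is a hypothetical name, not a function from csv-metadata-quality, and it assumes scalar field values as in the `pd.isna(field)` tests:

```python
import pandas as pd


def is_blank(field) -> bool:
    """Return True for missing values (None, NaN, pd.NA) and for empty strings."""
    # Check isna first so that pd.NA is never compared with "",
    # because `pd.NA == ""` evaluates to pd.NA, which has no truth value
    if pd.isna(field):
        return True
    return isinstance(field, str) and field == ""
```

- Whether whitespace-only strings should also count as blank is a separate decision that this sketch does not make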
- Note that I had been setting both of these global options:
- Peter sent me a list of items that had ILRI affiliation on Altmetric, but that didn't have Handles
- I ran a duplicate check on them to find if they exist or if we can import them
- There were about ninety matches, but a few dozen of those were pre-prints!
- After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
- After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
- Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
- For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my `crossref-doi-lookup.py` script
- After spending some time cleaning the data in OpenRefine I realized we don't get access status from Crossref
- We can infer it if the item is Creative Commons, but otherwise I might be able to use [Unpaywall's API](https://unpaywall.org/products/api)
- I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA...
- During this process I updated my `crossref-doi-lookup.py` script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
- Then I created a GiST [index on the `metadatavalue` table to try to speed up the trgm similarity operations](https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); # \di+ shows index size is 795MB
```
- That took a few minutes to build... then the duplicate checker ran in 12 minutes: `0.07s user 0.02s system 0% cpu 12:43.08 total`
- On a hunch, I tried with a GIN index:
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); # \di+ shows index size is 274MB
```
- This ran in 19 minutes: `0.08s user 0.01s system 0% cpu 19:49.73 total`
- So clearly the GiST index is better for this task
- I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which will for sure increase the size taken):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); # \di+ shows index size is 716MB, which is less than the previous GiST index...
```
- This one finished in ten minutes: `0.07s user 0.02s system 0% cpu 10:04.04 total`
- I might also want to [increase my `work_mem`](https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm) (default 4MB):
```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
setting │ unit
─────────┼──────
4096 │ kB
(1 row)
```
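- For reference, the trigram similarity that these indexes speed up can be approximated in a few lines of Python. This is a sketch of the idea as described in the pg_trgm documentation (lowercase, split into words, pad each word with two leading spaces and one trailing space, then compare the sets of three-character sequences). The names `trigrams` and `similarity` are illustrative and this is not the duplicate checker's code:

```python
import re


def trigrams(text: str) -> set:
    """Return the set of trigrams for a string, roughly as pg_trgm does."""
    result = set()
    # Runs of non-alphanumeric characters act as word separators
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

- Without an index PostgreSQL has to compute this for every candidate row, which is why the choice of index matters so much for the run time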
- After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
- Wow, I found a [really cool way to fetch URLs in OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html)
- I used this to fetch the open access status for each DOI from Unpaywall
- First, create a new column called "url" based on the DOI that builds the request URL. I used a Jython expression:
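```python
# unpaywall_baseurl, doi, and email are assumed to be defined earlier in the expression (not shown)
request_url = unpaywall_baseurl + doi + '?email=' + email

return request_url
```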
- Then create a new column based on fetching the values in that column. I called it "unpaywall_status"
- Then you get a JSON blob in each cell and you can extract the Open Access status with a GREL expression like `value.parseJson()['is_oa']`
- I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones
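- Outside of OpenRefine the same two steps (build the URL, then read `is_oa` from the JSON) could look roughly like this in Python. It is a sketch only: the function names are hypothetical, the DOI and email in the usage note are placeholders, and the HTTP request itself is left out so the logic can be checked offline:

```python
import json
from urllib.parse import quote

# Unpaywall v2 endpoint; the DOI is appended and an email parameter is required
UNPAYWALL_BASEURL = "https://api.unpaywall.org/v2/"


def build_unpaywall_url(doi: str, email: str) -> str:
    """Build the request URL for one DOI."""
    # Accept either a bare DOI or a https://doi.org/ URL
    doi = doi.strip().removeprefix("https://doi.org/")
    return f"{UNPAYWALL_BASEURL}{quote(doi)}?email={quote(email)}"


def is_limited_access(response_text: str) -> bool:
    """Return True only when Unpaywall explicitly says the DOI is not open access."""
    return json.loads(response_text).get("is_oa") is False
```

- For example, `build_unpaywall_url("10.1234/example", "me@example.org")` gives the URL to fetch, and `is_limited_access()` only tags an item when `is_oa` is explicitly `false`, in line with trusting the limited access status more than the open access one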
- I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
- The syntax was hairy because it's marked up with tags like `<jats:p>`, but this got me most of the way there:
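- As a rough Python equivalent of that cleanup (not the GREL expression used in OpenRefine), the JATS tags can be stripped with a regular expression. `strip_jats` is a hypothetical helper and the sample abstract in the usage note is made up:

```python
import html
import re


def strip_jats(abstract: str) -> str:
    """Remove JATS markup such as <jats:p> from a Crossref abstract."""
    # Replace any opening or closing tag in the jats: namespace with a space
    text = re.sub(r"</?jats:[^>]+>", " ", abstract)
    # Decode entities like &amp; and collapse the leftover whitespace
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()
```

- For example, `strip_jats("<jats:p>Livestock &amp; <jats:italic>climate</jats:italic> change.</jats:p>")` returns `Livestock & climate change.`; a tag that sits directly before punctuation will leave a stray space, so this only gets most of the way there too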
- Meeting with FAO AGRIS team about how to detect duplicates
- They are currently using a sha256 hash on titles, which will work, but will only return exact matches
- I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches
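- A minimal sketch of what I mean by normalizing before hashing, in Python. The stop word list, the function names, and the choice to strip accents are all assumptions for illustration, and the ASCII-only filter would need rethinking for titles in non-Latin scripts:

```python
import hashlib
import re
import unicodedata

# Illustrative stop word list, not a recommendation
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}


def normalize_title(title: str) -> str:
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then turn anything that is not a-z or 0-9 into a space
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def title_hash(title: str) -> str:
    """SHA-256 of the normalized title, as a hex string."""
    return hashlib.sha256(normalize_title(title).encode("utf-8")).hexdigest()
```

- With this, "The Impact of Climate Change on Maize!" and "Impact of climate-change on maize" hash to the same value, though it is still exact matching on the normalized string and will not catch typos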
- Meeting with Abenet to discuss CGSpace issues
- She reminded me about needing a metadata field for first author when the affiliation is ILRI
- I said I prefer to write a small script for her that will check the first author and first affiliation... I could do it easily in Python, but would need to put a web frontend on it for her
- CKM is getting ready to launch their new website and they display CGSpace thumbnails at 255x362px
- Our thumbnails are 300px so they get up-scaled and look bad
- I realized that the last time we [increased the size of our thumbnails was in 2013](https://github.com/ilri/DSpace/commit/5de61e220124c1d0441c87cd7d36d18cb2293c03), from 94x130 to 300px
- I offered to CKM that we increase them again to 400 or 600px
- I did some tests to check the thumbnail file sizes for 300px, 400px, 500px, and 600px on [this item](https://hdl.handle.net/10568/126388):
```console
$ ls -lh 10568-126388-*
-rw-r--r-- 1 aorth aorth 31K Mar 10 12:42 10568-126388-300px.jpg
-rw-r--r-- 1 aorth aorth 52K Mar 10 12:41 10568-126388-400px.jpg
-rw-r--r-- 1 aorth aorth 76K Mar 10 12:43 10568-126388-500px.jpg
-rw-r--r-- 1 aorth aorth 106K Mar 10 12:44 10568-126388-600px.jpg
```
- The 600px thumbnail is about 3.4 times the file size of the 300px one, so maybe we should shoot for 400px or 500px
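- The size ratios relative to the current 300px thumbnail can be worked out from the listing above (sizes rounded to whole kilobytes as `ls -lh` reports them, so the ratios are approximate):

```python
# File sizes in K as reported by `ls -lh` for the four test thumbnails
sizes_k = {300: 31, 400: 52, 500: 76, 600: 106}

baseline = sizes_k[300]
ratios = {width: round(size / baseline, 2) for width, size in sizes_k.items()}

for width, ratio in ratios.items():
    print(f"{width}px: {ratio}x the size of the 300px thumbnail")
```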
- On that note, I also re-worked the XMLUI item display to show larger thumbnails (from a max-width of 128px to 200px)
- And now that I'm looking at thumbnails I am curious what it would take to get DSpace to generate WebP or AVIF thumbnails
- Peter sent me citations and ILRI subjects for the 350 new ILRI publications
- I guess he edited it in Excel because there are a bunch of encoding issues with accents
- I merged Peter's citations and subjects with the other metadata, ran one last duplicate check (and found one item!), then ran the items through csv-metadata-quality and uploaded them to CGSpace
- In the end it was only 348 items for some reason...
- Extract a list of DOIs from the Creative Commons licensed ILRI journal articles that I uploaded last week, skipping any that are "no derivatives" (ND):
```console
$ csvgrep -c 'dc.description.provenance[en]' -m 'Made available in DSpace on 2023-03-10' /tmp/ilri-articles.csv \
```
- Jawoo was asking about possibilities to harvest PDFs from CGSpace for some kind of AI chatbot integration
- I see we have 45,000 PDFs (format ID 2)
```console
localhost/dspacetest= ☘ SELECT COUNT(*) FROM bitstream WHERE NOT deleted AND bitstream_format_id=2;
count
───────
45281
(1 row)
```
- Rework some of my Python scripts to use a common `db_connect` function from util
- I reworked my `post_bitstreams.py` script to be able to overwrite bitstreams if requested
- The use case is to upload thumbnails for all the journal articles where we have these horrible pixelated journal covers
- I replaced JPEG thumbnails for ~896 ILRI publications by exporting a list of DOIs from the 10568/3 collection that were CC-BY, getting their PDFs from Sci-Hub, and then posting them with my new script
## 2023-03-16
- Continue working on the ILRI publication thumbnails
- There were about sixty-four that had existing PNG "journal cover" thumbnails that didn't get replaced because I only overwrote the JPEG ones yesterday
- Now I generated a list of those bitstream UUIDs and deleted them with a shell script via the REST API
- I made a [pull request on DSpace 7 to update the bitstream format registry for PNG, WebP, and AVIF](https://github.com/DSpace/DSpace/pull/8722)
- Export CGSpace to perform mappings to Initiatives collections
- I also used this export to find CC-BY items with DOIs that had JPEGs or PNGs in their provenance, meaning that the submitter likely submitted a low-quality "journal cover" for the item
- I found about 330 of them and got most of their PDFs from Sci-Hub and replaced the crappy thumbnails with real ones where Sci-Hub had them (~245)
- In related news, I realized you can get an [API key from Elsevier and download the PDFs from their API](https://stackoverflow.com/questions/59202176/python-download-papers-from-sciencedirect-by-doi-with-requests):
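```python
import requests

# request_url and headers (carrying the Elsevier API key) are assumed to be defined earlier; not shown here
with requests.get(request_url, stream=True, headers=headers) as r:
    if r.status_code == 200:
        # Stream the PDF to disk in 1 MiB chunks
        with open("article.pdf", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                f.write(chunk)
```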
- The question is, how do we know if a DOI is Elsevier or not...
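- One cheap heuristic would be to look at the DOI prefix. This sketch assumes that `10.1016` covers most Elsevier journals; Elsevier has other prefixes that would need to be added, and the `publisher` field in Crossref's works metadata would be a more reliable check. The function names are hypothetical:

```python
# Assumption: 10.1016 is the prefix for most Elsevier journals; this set is not complete
ELSEVIER_PREFIXES = {"10.1016"}


def doi_prefix(doi: str) -> str:
    """Return the registrant prefix of a DOI, e.g. 10.1016."""
    doi = doi.strip().lower()
    # Strip common resolver prefixes so bare DOIs and DOI URLs both work
    for resolver in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        doi = doi.removeprefix(resolver)
    return doi.split("/", 1)[0]


def looks_like_elsevier(doi: str) -> bool:
    return doi_prefix(doi) in ELSEVIER_PREFIXES
```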
- CGIAR Repositories Working Group meeting
- We discussed controlled vocabularies for funders
- I suggested checking our combined lists against Crossref and ROR
- Export a list of donors from `cg.contributor.donor` on CGSpace:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=248) to /tmp/2023-03-16-donors.txt;
COPY 1521
```
- Then resolve them against Crossref's funders API:
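- A rough Python sketch of how such a lookup could work against `https://api.crossref.org/funders`. The helper names are hypothetical, the HTTP request is left out so the matching logic can be checked offline, and the matching rule (exact, case-insensitive match on the name or an alternative name) is an assumption rather than what I actually ran:

```python
import json
from typing import Optional
from urllib.parse import urlencode

CROSSREF_FUNDERS_URL = "https://api.crossref.org/funders"


def build_funder_query_url(name: str, rows: int = 5) -> str:
    """Build a funders query URL for one donor name."""
    return f"{CROSSREF_FUNDERS_URL}?{urlencode({'query': name, 'rows': rows})}"


def match_funder(name: str, response_text: str) -> Optional[str]:
    """Return Crossref's funder name on an exact (case-insensitive) match, else None."""
    items = json.loads(response_text).get("message", {}).get("items", [])
    wanted = name.strip().lower()
    for item in items:
        # Compare against the primary name and any alternative names
        candidates = [item.get("name", "")] + item.get("alt-names", [])
        if any(wanted == candidate.lower() for candidate in candidates):
            return item.get("name")
    return None
```

- Donors that return `None` would still need a manual look, since many of our values are acronyms or spelling variants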
- I learned how to use Python's built-in `logging` module and it simplifies all my debug and info printing
- I re-factored a few scripts to use the new logging
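- A minimal sketch of the pattern, assuming a `debug` flag that would come from a command line option. `get_logger` and the log messages are illustrative and not copied from my scripts:

```python
import logging


def get_logger(name: str, debug: bool = False) -> logging.Logger:
    """Return a logger that prints INFO and above, or DEBUG and above if requested."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    # Avoid attaching a second handler if this is called again for the same name
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger


logger = get_logger("post_bitstreams", debug=True)
logger.info("Uploading bitstream")
logger.debug("Only shown when debug is enabled")
```

- This replaces the scattered `if debug: print(...)` style checks with a single level setting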
## 2023-03-18
- I applied changes for publishers on 16,000 items in batches of 5,000
- While working on my `post_bitstreams.py` script I realized the Tomcat Crawler Session Manager valve that groups bot user agents into sessions is causing my login to fail the first time, every time
- I've disabled it for now and will check the Munin session graphs after some time to see if it makes a difference
- In any case I have much better spider user agent lists in DSpace now than I did years ago when I started using the Crawler Session Manager valve
- Minor updates to a few of my DSpace Python scripts to fix the logging
- Minor updates to some records for Mazingira reported by Sonja
- Upgrade PostgreSQL on DSpace Test from version 12 to 14, the same way I did from 10 to 12 last year:
- First, I installed the new version of PostgreSQL via the Ansible playbook scripts
- Then I stopped Tomcat and all PostgreSQL clusters and used `pg_upgradecluster` to upgrade the old version:
```console
# systemctl stop tomcat7
# pg_ctlcluster 12 main stop
# tar -cvzpf var-lib-postgresql-12.tar.gz /var/lib/postgresql/12
# tar -cvzpf etc-postgresql-12.tar.gz /etc/postgresql/12
# pg_ctlcluster 14 main stop
# pg_dropcluster 14 main
# pg_upgradecluster 12 main
# pg_ctlcluster 14 main start
```
- After that I [re-indexed the database indexes using a query](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/):
- I spent some time trying to figure out how to get ssimulacra2 running so I could compare thumbnails in JPEG and WebP
- I realized that we can't directly compare JPEG to WebP, we need to convert to JPEG/WebP, then convert each to lossless PNG
- Also, we shouldn't be comparing the resulting images against each other, but rather against the original, so I also need a straight PDF to lossless PNG version
- After playing with WebP at Q82 and Q92, I see it has lower ssimulacra2 scores than JPEG Q92 for the dozen test files
- I updated csv-metadata-quality to use pandas 2.0.0rc1 and everything seems to work...?
- So the issues with nulls (isna) when I tried the first release candidate a few weeks ago were resolved?
- Meeting with Jawoo and others about a "ChatGPT-like" thing for CGIAR data using CGSpace documents and metadata
## 2023-03-23
- Add a missing IFPRI ORCID identifier to CGSpace and tag his items on CGSpace
- A super unscientific comparison between csv-metadata-quality's pytest regimen using Pandas 1.5.3 and Pandas 2.0.0rc1
- The data was gathered using [rusage](https://justine.lol/rusage), and these are the results of the last of three consecutive runs:
```
# Pandas 1.5.3
RL: took 1,585,999µs wall time
RL: ballooned to 272,380kb in size
RL: needed 2,093,947µs cpu (25% kernel)
RL: caused 55,856 page faults (100% memcpy)
RL: 699 context switches (1% consensual)
RL: performed 0 reads and 16 write i/o operations
# Pandas 2.0.0rc1
RL: took 1,625,718µs wall time
RL: ballooned to 262,116kb in size
RL: needed 2,148,425µs cpu (24% kernel)
RL: caused 63,934 page faults (100% memcpy)
RL: 461 context switches (2% consensual)
RL: performed 0 reads and 16 write i/o operations
```
- So it seems that Pandas 2.0.0rc1 took ten megabytes less RAM... interesting to see that the PyArrow-backed dtypes make a measurable difference even on my small test set
- I should try to compare runs of larger input files
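- The differences between the two runs, worked out from the figures above (a single pair of runs, so still unscientific):

```python
# Figures copied from the rusage output above (peak size in kB, wall time in µs)
runs = {
    "1.5.3": {"peak_kb": 272_380, "wall_us": 1_585_999},
    "2.0.0rc1": {"peak_kb": 262_116, "wall_us": 1_625_718},
}

old, new = runs["1.5.3"], runs["2.0.0rc1"]
memory_saved_kb = old["peak_kb"] - new["peak_kb"]
memory_saved_pct = round(100 * memory_saved_kb / old["peak_kb"], 1)
wall_slower_pct = round(100 * (new["wall_us"] - old["wall_us"]) / old["wall_us"], 1)

print(f"Peak memory: {memory_saved_kb} kB less ({memory_saved_pct}%)")
print(f"Wall time: {wall_slower_pct}% longer")
```

- So the memory saving is small in relative terms and the wall time was slightly longer, which is another reason to repeat this with larger input files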
## 2023-03-24
- I added a Flyway SQL migration for the PNG bitstream format registry changes on DSpace 7.6
- The way DSpace currently does its supersampling by exporting to a JPEG, then making a thumbnail of the JPEG, is a double lossy operation
- We should be exporting to something lossless like PNG, PPM, or MIFF, then making a thumbnail from that
- The PNG format is always lossless so the `-quality` setting controls compression and filtering, but has no effect on the appearance or signature of PNG images
- You can use `-quality n` with WebP's `-define webp:lossless=true`, but I'm not sure about the interaction between ImageMagick quality and WebP lossless...
- Also, if converting from a lossless format to WebP lossless in the same command, ImageMagick will ignore quality settings
- The MIFF format is useful for piping between ImageMagick commands, but it is also lossless and the quality setting is ignored
- You can use a format specifier when piping between ImageMagick commands without writing a file
- For example, I want to create a lossless PNG from a distorted JPEG for comparison:
- If I convert the JPEG to PNG directly it will ignore the quality setting, so I set the quality and the output format, then pipe it to ImageMagick again to convert to lossless PNG
- In an attempt to quantify the generation loss from DSpace's "JPG JPG" method of creating thumbnails I wrote a script called `generation-loss.sh` to test against a new "PNG JPG" method
- With my sample set of seventeen PDFs from CGSpace I found that _the "JPG JPG" method of thumbnailing results in scores an average of 1.6% lower than with the "PNG JPG" method_.
- The average file size with _the "PNG JPG" method was only 200 bytes larger_.