--- title: "March, 2023" date: 2023-03-01T07:58:36+03:00 author: "Alan Orth" categories: ["Notes"] --- ## 2023-03-01 - Remove `cg.subject.wle` and `cg.identifier.wletheme` from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021) - [iso-codes 4.13.0 was released](https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/CHANGELOG.md#4130-2023-02-28), which incorporates my changes to the common names for Iran, Laos, and Syria - I finally got through with porting the input form from DSpace 6 to DSpace 7 - I can't put my finger on it, but the input form has to be formatted very particularly, for example if your rows have more than two fields in them with out a sufficient Bootstrap grid style, or if you use a `twobox`, etc, the entire form step appears blank ## 2023-03-02 - I did some experiments with the new [Pandas 2.0.0rc0 Apache Arrow support](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) - There is a change to the way nulls are handled and it causes my tests for `pd.isna(field)` to fail - I think we need consider blanks as null, but I'm not sure - I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration - I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in `discovery.xml` since we are no longer using them in sidebar facets ## 2023-03-03 - Atmire merged one of my old pull requests into COUNTER-Robots: - [COUNTER_Robots_list.json: Add new bots](https://github.com/atmire/COUNTER-Robots/pull/54) - I will update the local ILRI overrides in our DSpace spider agents file ## 2023-03-04 - Submit a [pull request on pycountry to use iso-codes 4.13.0](https://github.com/flyingcircusio/pycountry/pull/156) ## 2023-03-05 - Start a harvest on AReS ## 2023-03-06 - Export CGSpace to do Initiative collection mappings - There were thirty-three that needed updating - Send Abenet and Sam a list of twenty-one CAS publications that had been marked as "multiple documents" that we uploaded as metadata-only items - Goshu will download the PDFs for each and upload them to the items on CGSpace manually - I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0 - It seems there is a problem recognizing empty strings as na with `pd.isna()` - If I do `pd.isna(field) or field == ""` then it works as expected, but that feels hacky - I'm going to test again on the next release... 
- Note that I had been setting both of these global options:

```
pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.nullable_dtypes = True
```

- Then reading the CSV like this:

```
df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]')
```

## 2023-03-07

- Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:

```console
$ podman pull docker.io/library/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
```

- Peter sent me a list of items that had an ILRI affiliation on Altmetric, but that didn't have Handles
  - I ran a duplicate check on them to see if they already exist or if we can import them
  - There were about ninety matches, but a few dozen of those were pre-prints!
  - After excluding those there were sixty-one items we already have on CGSpace, so I will add their DOIs to the existing items
  - After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
  - Surprisingly, some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
- For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my `crossref-doi-lookup.py` script
  - After spending some time cleaning the data in OpenRefine I realized we don't get access status from Crossref
  - We can infer it if the item is Creative Commons, but otherwise I might be able to use [Unpaywall's API](https://unpaywall.org/products/api)
  - I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA...
- During this process I updated my `crossref-doi-lookup.py` script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
- An unscientific comparison of duplicate checking Peter's file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
  - PostgreSQL 12: `0.11s user 0.04s system 0% cpu 19:24.65 total`
  - PostgreSQL 14: `0.12s user 0.04s system 0% cpu 18:13.47 total`

## 2023-03-08

- I am wondering how to speed up PostgreSQL trgm searches more
  - I see my local PostgreSQL is using a vanilla configuration and I should update some configs:

```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
 setting │ unit
─────────┼──────
 16384   │ 8kB
(1 row)
```

- I re-created my PostgreSQL 14 container with some extra memory settings:

```console
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
```

- Then I created a GiST [index on the `metadatavalue` table to try to speed up the trgm similarity operations](https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search):

```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64));
# \di+ shows index size is 795MB
```

- That took a few minutes to build... then the duplicate checker ran in 12 minutes: `0.07s user 0.02s system 0% cpu 12:43.08 total`
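- For context, the queries the index is meant to speed up are trgm similarity lookups on titles, roughly like this (a sketch with psycopg2 and a hypothetical title, not the actual duplicate checker):

```python
import psycopg2

# hypothetical connection settings for my local test database
conn = psycopg2.connect("dbname=dspacetest user=dspacetest password=dspacetest host=localhost")

title = "Climate change adaptation in smallholder systems"  # hypothetical title to check

with conn.cursor() as cursor:
    cursor.execute("SET pg_trgm.similarity_threshold = 0.6")
    # the % similarity operator is what the GiST/GIN trgm indexes can accelerate;
    # in practice you would also filter on the title field's metadata_field_id
    cursor.execute(
        "SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value %% %s",
        (title,),
    )
    for (text_value,) in cursor.fetchall():
        print(f"Possible duplicate: {text_value}")
```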
- On a hunch, I tried with a GIN index:

```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops);
# \di+ shows index size is 274MB
```

- This ran in 19 minutes: `0.08s user 0.01s system 0% cpu 19:49.73 total`
  - So clearly the GiST index is better for this task
- I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which will surely increase the size):

```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256));
# \di+ shows index size is 716MB, which is less than the previous GiST index...
```

- This one finished in ten minutes: `0.07s user 0.02s system 0% cpu 10:04.04 total`
- I might also want to [increase my `work_mem`](https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm) (default 4MB):

```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
 setting │ unit
─────────┼──────
 4096    │ kB
(1 row)
```

- After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
- Wow, I found a [really cool way to fetch URLs in OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html)
  - I used this to fetch the open access status for each DOI from Unpaywall
- First, create a new column called "url" based on the DOI that builds the request URL. I used a Jython expression:

```python
unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
email = "a.orth+unpaywall@cgiar.org"
doi = value.replace("https://doi.org/", "")
request_url = unpaywall_baseurl + doi + '?email=' + email

return request_url
```

- Then create a new column based on fetching the values in that column. I called it "unpaywall_status"
- Then you get a JSON blob in each cell and you can extract the Open Access status with a GREL expression like `value.parseJson()['is_oa']`
- I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access one, so I will just tag the limited access ones
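- Outside of OpenRefine the same lookup is easy to do in plain Python, something like this sketch (hypothetical DOI; Unpaywall only needs the DOI and an email address):

```python
import requests

email = "a.orth+unpaywall@cgiar.org"
doi = "10.1016/j.foodqual.2021.104362"  # hypothetical DOI to check

# Unpaywall returns a JSON blob with an is_oa boolean, among other things
r = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": email})
if r.status_code == 200:
    print(doi, r.json().get("is_oa"))
```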
I called it "unpaywall_status" - Then you get a JSON blob in each and you can extract the Open Access status with a GREL like `value.parseJson()['is_oa']` - I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones - I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts - The syntax was hairy because it's marked up with tags like ``, but this got me most of the way there: ```console value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml() value.replace("","").replace("", "") value.replace("","").replace("", "").replace("","").replace("", "") ``` - I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them - I exported a list of authors, affiliations, and funders from the new items to let Peter correct them: ```console $ csvcut -c dc.contributor.author /tmp/new-items.csv | sed -e 1d -e 's/"//g' -e 's/||/\n/g' | sort | uniq -c | sort -nr | awk '{$1=""; print $0}' | sed -e 's/^ //' > /tmp/new-authors.csv ``` - Meeting with FAO AGRIS team about how to detect duplicates - They are currently using a sha256 hash on titles, which will work, but will only return exact matches - I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches - Meeting with Abenet to discuss CGSpace issues - She reminded me about needing a metadata field for first author when the affiliation is ILRI - I said I prefer to write a small script for her that will check the first author and first affiliation... I could do it easily in Python, but would need to put a web frontend on it for her - Unless we could do that in AReS reports somehow ## 2023-03-09 - Apply a bunch of corrections to authors, affiliations, and donors on the new items on DSpace Test - Meeting with Peter and Abenet about future OpenRXV developments, DSpace 7, etc - I submitted an [issue on MEL asking them to add provenance metadata when submitting to CGSpace](https://github.com/CodeObia/MEL/issues/11173) ## 2023-03-10 - CKM is getting ready to launch their new website and they display CGSpace thumbnails at 255x362px - Our thumbnails are 300px so they get up-scaled and look bad - I realized that the last time we [increased the size of our thumbnails was in 2013](https://github.com/ilri/DSpace/commit/5de61e220124c1d0441c87cd7d36d18cb2293c03), from 94x130 to 300px - I offered to CKM that we increase them again to 400 or 600px - I did some tests to check the thumbnail file sizes for 300px, 400px, 500px, and 600px on [this item](https://hdl.handle.net/10568/126388): ```console $ ls -lh 10568-126388-* -rw-r--r-- 1 aorth aorth 31K Mar 10 12:42 10568-126388-300px.jpg -rw-r--r-- 1 aorth aorth 52K Mar 10 12:41 10568-126388-400px.jpg -rw-r--r-- 1 aorth aorth 76K Mar 10 12:43 10568-126388-500px.jpg -rw-r--r-- 1 aorth aorth 106K Mar 10 12:44 10568-126388-600px.jpg ``` - Seems like 600px is 3 to 4 times larger file size, so maybe we should shoot for 400px or 500px - I decided on 500px - I started re-generating new thumbnails for the ILRI Publications, CGIAR Initiatives, and other collections - On that note, I also re-worked the XMLUI item display to show larger thumbnails (from a max-width of 128px to 200px) - And now that I'm looking at thumbnails I am curious what it would take to get DSpace to generate WebP or AVIF 
- Peter sent me citations and ILRI subjects for the 350 new ILRI publications
  - I guess he edited it in Excel because there are a bunch of encoding issues with accents
- I merged Peter's citations and subjects with the other metadata, ran one last duplicate check (and found one item!), then ran the items through csv-metadata-quality and uploaded them to CGSpace
  - In the end it was only 348 items for some reason...

## 2023-03-12

- Start a harvest on AReS

## 2023-03-13

- Extract a list of DOIs from the Creative Commons licensed ILRI journal articles that I uploaded last week, skipping any that are "no derivatives" (ND):

```console
$ csvgrep -c 'dc.description.provenance[en]' -m 'Made available in DSpace on 2023-03-10' /tmp/ilri-articles.csv \
  | csvgrep -c 'dcterms.license[en_US]' -r 'CC(0|\-BY)' | csvgrep -c 'dcterms.license[en_US]' -i -r '\-ND\-' | csvcut -c 'id,cg.identifier.doi[en_US],dcterms.type[en_US]' > 2023-03-13-journal-articles.csv
```

- I want to write a script to download the PDFs and create thumbnails for them, then upload them to CGSpace
  - I wrote one based on `post_ciat_pdfs.py` but it seems there is an issue uploading anything other than a PDF
  - When I upload a JPG or a PNG the file begins with:

```console
Content-Disposition: form-data; name="file"; filename="10.1017-s0031182013001625.pdf.jpg"
```

- ... this means it is invalid...
  - I tried in both the `ORIGINAL` and `THUMBNAIL` bundle, and with different filenames
  - I tried manually on the command line with `http` and both PDF and PNG work... hmmmm
  - Hmm, this seems to have been due to some difference in behavior between the `files` and `data` parameters of `requests.get()`
- I finalized the `post_bitstreams.py` script and uploaded eighty-five PDF thumbnails
  - It seems Bizu uploaded covers for a handful, so I deleted them and ran them through the script to get proper thumbnails

## 2023-03-14

- Add twelve IFPRI authors to our controlled vocabulary for authors and ORCID identifiers
  - I also tagged their existing items on CGSpace
- Export all our ORCIDs and resolve their names to see if any have changed:

```console
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-03-14-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-03-14-orcids.txt -o /tmp/2023-03-14-orcids-names.txt -d
```

- Then update them in the database:

```console
$ ./ilri/update_orcids.py -i /tmp/2023-03-14-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```

## 2023-03-15

- Jawoo was asking about possibilities to harvest PDFs from CGSpace for some kind of AI chatbot integration
  - I see we have 45,000 PDFs (format ID 2):

```console
localhost/dspacetest= ☘ SELECT COUNT(*) FROM bitstream WHERE NOT deleted AND bitstream_format_id=2;
 count
───────
 45281
(1 row)
```

- Rework some of my Python scripts to use a common `db_connect` function from util
- I reworked my `post_bitstreams.py` script to be able to overwrite bitstreams if requested
  - The use case is to upload thumbnails for all the journal articles where we have these horrible pixelated journal covers
- I replaced JPEG thumbnails for ~896 ILRI publications by exporting a list of DOIs from the 10568/3 collection that were CC-BY, getting their PDFs from Sci-Hub, and then posting them with my new script

## 2023-03-16

- Continue working on the ILRI publication thumbnails
  - There were about sixty-four that had existing PNG "journal cover" thumbnails that didn't get replaced because I only overwrote the JPEG ones yesterday
  - Now I generated a list of those bitstream UUIDs and deleted them with a shell script via the REST API
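- A rough Python equivalent of that kind of deletion loop would be something like this (a sketch against the DSpace 6 REST API, not the actual shell script; it assumes a valid `JSESSIONID` from an earlier login and a hypothetical file of UUIDs):

```python
import requests

rest_url = "https://dspacetest.cgiar.org/rest"
session_id = "2B40C7C4E34CEFCF5AFAE4B75A8C52E2"  # hypothetical JSESSIONID from logging in

# bitstream-uuids.txt is a hypothetical file with one bitstream UUID per line
with open("bitstream-uuids.txt") as f:
    for uuid in (line.strip() for line in f if line.strip()):
        r = requests.delete(f"{rest_url}/bitstreams/{uuid}", cookies={"JSESSIONID": session_id})
        print(uuid, r.status_code)
```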
- I made a [pull request on DSpace 7 to update the bitstream format registry for PNG, WebP, and AVIF](https://github.com/DSpace/DSpace/pull/8722)
- Export CGSpace to perform mappings to Initiatives collections
  - I also used this export to find CC-BY items with DOIs that had JPEGs or PNGs in their provenance, meaning that the submitter likely submitted a low-quality "journal cover" for the item
  - I found about 330 of them, got most of their PDFs from Sci-Hub, and replaced the crappy thumbnails with real ones where Sci-Hub had them (~245)
- In related news, I realized you can get an [API key from Elsevier and download the PDFs from their API](https://stackoverflow.com/questions/59202176/python-download-papers-from-sciencedirect-by-doi-with-requests):

```python
import requests

api_key = 'fuuuuuuuuu'
doi = "10.1016/j.foodqual.2021.104362"
request_url = f'https://api.elsevier.com/content/article/doi:{doi}'

headers = {
    'X-ELS-APIKEY': api_key,
    'Accept': 'application/pdf'
}

with requests.get(request_url, stream=True, headers=headers) as r:
    if r.status_code == 200:
        with open("article.pdf", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                f.write(chunk)
```

- The question is, how do we know if a DOI is Elsevier or not...
- CGIAR Repositories Working Group meeting
  - We discussed controlled vocabularies for funders
  - I suggested checking our combined lists against Crossref and ROR
- Export a list of donors from `cg.contributor.donor` on CGSpace:

```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=248) to /tmp/2023-03-16-donors.txt;
COPY 1521
```

- Then resolve them against Crossref's funders API:

```console
$ ./ilri/crossref_funders_lookup.py -e fuuuu@cgiar.org -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv -d
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
472
$ sed 1d ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
1521
```

- That's a 31% hit rate, but I see some simple things like "Bill and Melinda Gates Foundation" instead of "Bill & Melinda Gates Foundation"
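- The underlying Crossref request is simple enough to sketch on its own (hypothetical donor name; this is the funder registry search that the lookup script automates in bulk):

```python
import requests

donor = "Bill and Melinda Gates Foundation"  # hypothetical donor name from CGSpace

# search the Crossref funder registry; including a mailto gets you into the polite pool
r = requests.get(
    "https://api.crossref.org/funders",
    params={"query": donor, "mailto": "fuuuu@cgiar.org"},
)
for funder in r.json()["message"]["items"][:3]:
    print(funder["name"], funder["uri"])
```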
open("/home/aorth/src/git/DSpace/publisher-doi-prefixes.json", "rb") as f: publishers = json.load(f) doi_prefix = value.split("/")[3] publisher = [publisher for publisher in publishers if doi_prefix in publisher['prefixes']] return publisher[0]['name'] ``` - ... though this is very slow and hung OpenRefine when I tried it - I added the ability to overwrite multiple bitstream formats at once in `post_bitstreams.py` ```console $ ./ilri/post_bitstreams.py -i test.csv -u https://dspacetest.cgiar.org/rest -e fuuu@example.com -p 'fffnjnjn' -d -s 2B40C7C4E34CEFCF5AFAE4B75A8C52E2 --overwrite JPEG --overwrite PNG -n Session valid: 2B40C7C4E34CEFCF5AFAE4B75A8C52E2 Opened test.csv 384142cb-58b9-4e64-bcdc-0a8cc34888b3: checking for existing bitstreams in THUMBNAIL bundle > (DRY RUN) Deleting bitstream: IFPRI Malawi_Maize Market Report_February_202_anonymous.pdf.jpg (16883cb0-1fc8-4786-a04f-32132e0617d4) > (DRY RUN) Deleting bitstream: AgroEcol_Newsletter_2.png (7e9cd434-45a6-4d55-8d56-4efa89d73813) > (DRY RUN) Uploading file: 10568-129666.pdf.jpg ``` - I learned how to use Python's built-in `logging` module and it simplifies all my debug and info printing - I re-factored a few scripts to use the new logging ## 2023-03-18 - I applied changes for publishers on 16,000 items in batches of 5,000 - While working on my `post_bitstreams.py` script I realized the Tomcat Crawler Session Manager valve that groups bot user agents into sessions is causing my login to fail the first time, every time - I've disabled it for now and will check the Munin session graphs after some time to see if it makes a difference - In any case I have much better spider user agent lists in DSpace now than I did years ago when I started using the Crawler Session Manager valve ## 2023-03-19 - Start a harvest on AReS ## 2023-03-20 - Minor updates to a few of my DSpace Python scripts to fix the logging - Minor updates to some records for Mazingira reported by Sonja - Upgrade PostgreSQL on DSpace Test from version 12 to 14, the same way I did from 10 to 12 last year: - First, I installed the new version of PostgreSQL via the Ansible playbook scripts - Then I stopped Tomcat and all PostgreSQL clusters and used `pg_upgrade` to upgrade the old version: ```console # systemctl stop tomcat7 # pg_ctlcluster 12 main stop # tar -cvzpf var-lib-postgresql-12.tar.gz /var/lib/postgresql/12 # tar -cvzpf etc-postgresql-12.tar.gz /etc/postgresql/12 # pg_ctlcluster 14 main stop # pg_dropcluster 14 main # pg_upgradecluster 12 main # pg_ctlcluster 14 main start ``` - After that I [re-indexed the database indexes using a query](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/): ```console $ su - postgres $ cat /tmp/generate-reindex.sql SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;' FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname = 'public' AND C.relkind = 'r' AND nspname !~ '^pg_toast' ORDER BY pg_total_relation_size(C.oid) ASC; $ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql $ $ psql dspace < /tmp/reindex.sql ``` - The index on `metadatavalue` shrunk by 90MB, and others a bit less - This is nice, but not as drastic as I noticed last year when upgrading to PostgreSQL 12 ## 2023-03-21 - Leigh sent me a list of IFPRI authors with ORCID identifiers so I combined them with our list and resolved all their names with `resolve_orcids.py` - It adds 154 new ORCID identifiers - I did a follow up to the 
- I did a follow-up on the publisher names from last week using the list from doi.org
  - Last week I only updated items with a DOI that had *no* publisher, but now I was curious to see how our existing publisher information compared
  - I checked a dozen or so manually and, other than CIFOR/ICRAF and CIAT/Alliance, the doi.org metadata was better than our existing data, so I overwrote them
- I spent some time trying to figure out how to get ssimulacra2 running so I could compare thumbnails in JPEG and WebP
  - I realized that we can't directly compare JPEG to WebP; we need to convert to JPEG/WebP, then convert each to lossless PNG
  - Also, we shouldn't be comparing the resulting images against each other, but rather against the original, so I need a straight PDF to lossless PNG version also
  - After playing with WebP at Q82 and Q92, I see it has lower ssimulacra2 scores than JPEG Q92 for the dozen test files
  - Could it just be something with ImageMagick?

## 2023-03-22

- I updated csv-metadata-quality to use pandas 2.0.0rc1 and everything seems to work...?
  - So the issues with nulls (isna) when I tried the first release candidate a few weeks ago were resolved?
- Meeting with Jawoo and others about a "ChatGPT-like" thing for CGIAR data using CGSpace documents and metadata

## 2023-03-23

- Add a missing IFPRI ORCID identifier to CGSpace and tag his items on CGSpace
- A super unscientific comparison of csv-metadata-quality's pytest regimen using Pandas 1.5.3 versus Pandas 2.0.0rc1
  - The data was gathered using [rusage](https://justine.lol/rusage), and these are the results of the last of three consecutive runs:

```
# Pandas 1.5.3
RL: took 1,585,999µs wall time
RL: ballooned to 272,380kb in size
RL: needed 2,093,947µs cpu (25% kernel)
RL: caused 55,856 page faults (100% memcpy)
RL: 699 context switches (1% consensual)
RL: performed 0 reads and 16 write i/o operations

# Pandas 2.0.0rc1
RL: took 1,625,718µs wall time
RL: ballooned to 262,116kb in size
RL: needed 2,148,425µs cpu (24% kernel)
RL: caused 63,934 page faults (100% memcpy)
RL: 461 context switches (2% consensual)
RL: performed 0 reads and 16 write i/o operations
```

- So it seems that Pandas 2.0.0rc1 took ten megabytes less RAM... interesting to see that the PyArrow-backed dtypes make a measurable difference even on my small test set
  - I should try to compare runs with larger input files

## 2023-03-24

- I added a Flyway SQL migration for the PNG bitstream format registry changes on DSpace 7.6

## 2023-03-26

- There seems to be a slightly high load on CGSpace
  - I don't see any locks in PostgreSQL, but there's some new bot I have never heard of:

```console
92.119.18.13 - - [26/Mar/2023:18:41:47 +0200] "GET /handle/10568/16500/discover?filtertype_0=impactarea&filter_relational_operator_0=equals&filter_0=Climate+adaptation+and+mitigation&filtertype=sdg&filter_relational_operator=equals&filter=SDG+11+-+Sustainable+cities+and+communities HTTP/2.0" 200 7856 "-" "colly - https://github.com/gocolly/colly"
```

- In the last week I see a handful of IPs making requests with this agent:

```console
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2,3,4,5,6,7}.gz | grep gocolly | awk '{print $1}' | sort | uniq -c | sort -h
      2 194.233.95.37
   4304 92.119.18.142
   9496 5.180.208.152
  27477 92.119.18.13
```

- Most of these come from Packethub S.A. / ASN 62240 (CLOUVIDER Clouvider - Global ASN, GB)
- Oh, I've apparently seen this user agent before, as it is in our ILRI spider user agent overrides
- I exported CGSpace to check for missing Initiative collection mappings
- Start a harvest on AReS

## 2023-03-27

- The harvest on AReS was incredibly slow and I stopped it about halfway through, twelve hours later
  - Then I relied on the plugins to get the missing items, which caused a high load on the server but actually worked fine
- Continue working on thumbnails on DSpace

## 2023-03-28

- Regarding ImageMagick, there are a few things I've learned:
  - The `-quality` setting does different things for different output formats, see: https://imagemagick.org/script/command-line-options.php#quality
  - The `-compress` setting controls the compression algorithm for image data, and is unrelated to lossless/lossy
    - On that note, `-compress lossless` for JPEGs refers to Lossless JPEG, which is not well defined or supported and should be avoided
    - See: https://imagemagick.org/script/command-line-options.php#compress
  - The way DSpace currently does its supersampling, by exporting to a JPEG and then making a thumbnail of the JPEG, is a double lossy operation
    - We should be exporting to something lossless like PNG, PPM, or MIFF, then making a thumbnail from that
  - The PNG format is always lossless, so the `-quality` setting controls compression and filtering, but has no effect on the appearance or signature of PNG images
  - You can use `-quality n` with WebP's `-define webp:lossless=true`, but I'm not sure about the interaction between ImageMagick quality and WebP lossless...
    - Also, if converting from a lossless format to WebP lossless in the same command, ImageMagick will ignore quality settings
  - The MIFF format is useful for piping between ImageMagick commands, but it is also lossless and the quality setting is ignored
- You can use a format specifier when piping between ImageMagick commands without writing a file
  - For example, I want to create a lossless PNG from a distorted JPEG for comparison:

```console
$ magick convert reference.jpg -quality 85 jpg:- | convert - distorted-lossless.png
```

- If I convert the JPEG to PNG directly it will ignore the quality setting, so I set the quality and the output format, then pipe it to ImageMagick again to convert to lossless PNG
- In an attempt to quantify the generation loss from DSpace's "JPG → JPG" method of creating thumbnails I wrote a script called `generation-loss.sh` to test it against a new "PNG → JPG" method
  - With my sample set of seventeen PDFs from CGSpace I found that _the "JPG → JPG" method of thumbnailing results in scores an average of 1.6% lower than with the "PNG → JPG" method_
  - _The average file size with the "PNG → JPG" method was only 200 bytes larger_
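- The idea behind `generation-loss.sh` is roughly the following (a Python sketch of the approach, not the script itself; the 600px width, density, and quality values are assumptions, and the sample filename is hypothetical):

```python
import subprocess

def run(*args):
    subprocess.run(args, check=True)

pdf = "10568-103447.pdf[0]"  # first page of a hypothetical sample PDF

# reference: straight PDF to lossless PNG at the final thumbnail size
run("magick", "convert", "-density", "144", pdf, "-thumbnail", "600x", "reference.png")

# "JPG → JPG": lossy supersample, then lossy thumbnail (what DSpace does now)
run("magick", "convert", "-density", "144", pdf, "-quality", "92", "supersample.jpg")
run("magick", "convert", "supersample.jpg", "-thumbnail", "600x", "-quality", "92", "thumb-jpg-jpg.jpg")

# "PNG → JPG": lossless supersample, so only one lossy step at the end
run("magick", "convert", "-density", "144", pdf, "supersample.png")
run("magick", "convert", "supersample.png", "-thumbnail", "600x", "-quality", "92", "thumb-png-jpg.jpg")

# ssimulacra2 wants lossless inputs, so convert each thumbnail to PNG before scoring
for name in ("thumb-jpg-jpg", "thumb-png-jpg"):
    run("magick", "convert", f"{name}.jpg", f"{name}.png")
    run("ssimulacra2", "reference.png", f"{name}.png")
```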
- In my brief testing, the relationship between ImageMagick's `-quality` setting and WebP's `-define webp:lossless=true` setting is completely unpredictable:

```console
$ magick convert img/10568-103447.pdf.png /tmp/10568-103447.webp
$ magick convert img/10568-103447.pdf.png -define webp:lossless=true /tmp/10568-103447-lossless.webp
$ magick convert img/10568-103447.pdf.png -define webp:lossless=true -quality 50 /tmp/10568-103447-lossless-q50.webp
$ magick convert img/10568-103447.pdf.png -quality 10 -define webp:lossless=true /tmp/10568-103447-lossless-q10.webp
$ magick convert img/10568-103447.pdf.png -quality 90 -define webp:lossless=true /tmp/10568-103447-lossless-q90.webp
$ ls -l /tmp/10568-103447*
-rw-r--r-- 1 aorth aorth 359258 Mar 28 21:16 /tmp/10568-103447-lossless-q10.webp
-rw-r--r-- 1 aorth aorth 303850 Mar 28 21:15 /tmp/10568-103447-lossless-q50.webp
-rw-r--r-- 1 aorth aorth 296832 Mar 28 21:16 /tmp/10568-103447-lossless-q90.webp
-rw-r--r-- 1 aorth aorth 299566 Mar 28 21:13 /tmp/10568-103447-lossless.webp
-rw-r--r-- 1 aorth aorth 190718 Mar 28 21:13 /tmp/10568-103447.webp
```

- I'm curious to see a comparison between the ImageMagick `-define webp:emulate-jpeg-size=true` option (aka `-jpeg_like` in cwebp) and normal lossy WebP quality:

```console
$ for q in 70 80 90; do magick convert img/10568-103447.pdf.png -quality $q -define webp:emulate-jpeg-size=true /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp; done
$ for q in 70 80 90; do magick convert /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp.png; done
$ for q in 70 80 90; do ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-emulate-jpeg-q${q}.webp.png 2>/dev/null; done
81.29082887
84.42134524
85.84458964
$ for q in 70 80 90; do magick convert img/10568-103447.pdf.png -quality $q /tmp/10568-103447-lossy-q${q}.webp; done
$ for q in 70 80 90; do magick convert /tmp/10568-103447-lossy-q${q}.webp /tmp/10568-103447-lossy-q${q}.webp.png; done
$ for q in 70 80 90; do ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-q${q}.webp.png 2>/dev/null; done
77.25789006
80.79140936
84.79108246
```

- Using `-define webp:method=6` (versus the default of 4) gets a ~0.5% increase in the ssimulacra2 score