March, 2023

Wed Mar 01, 2023 in Notes

2023-03-01

Remove cg.subject.wle and cg.identifier.wletheme from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
iso-codes 4.13.0 was released, which incorporates my changes to the common names for Iran, Laos, and Syria
I finally got through with porting the input form from DSpace 6 to DSpace 7

I can’t put my finger on it, but the input form has to be formatted very particularly: for example, if your rows have more than two fields in them without a sufficient Bootstrap grid style, or if you use a twobox, etc., the entire form step appears blank

2023-03-02

I did some experiments with the new Pandas 2.0.0rc0 Apache Arrow support
- There is a change to the way nulls are handled and it causes my tests for pd.isna(field) to fail
- I think we need to consider blanks as null, but I’m not sure
I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
- I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in discovery.xml since we are no longer using them in sidebar facets

2023-03-03

Atmire merged one of my old pull requests into COUNTER-Robots:
- COUNTER_Robots_list.json: Add new bots
I will update the local ILRI overrides in our DSpace spider agents file

2023-03-04

Submit a pull request on pycountry to use iso-codes 4.13.0

2023-03-05

Start a harvest on AReS

2023-03-06

Export CGSpace to do Initiative collection mappings
- There were thirty-three that needed updating
Send Abenet and Sam a list of twenty-one CAS publications that had been marked as “multiple documents” that we uploaded as metadata-only items
- Goshu will download the PDFs for each and upload them to the items on CGSpace manually
I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
- It seems there is a problem recognizing empty strings as na with pd.isna()
- If I do pd.isna(field) or field == "" then it works as expected, but that feels hacky
- I’m going to test again on the next release…
- Note that I had been setting both of these global options:

pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.nullable_dtypes = True

Then reading the CSV like this:

df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]')
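The workaround described above can be wrapped in a small helper. This is only a sketch: `is_blank` is a hypothetical name (not something from csv-metadata-quality), and treating whitespace-only strings as blank is my own assumption.

```python
import pandas as pd


def is_blank(field) -> bool:
    """Return True for missing values and for empty strings."""
    # pd.isna() covers None, NaN, and pd.NA, but not the empty string
    if pd.isna(field):
        return True
    # Assumption: whitespace-only strings count as blank too
    return isinstance(field, str) and field.strip() == ""


print(is_blank(""), is_blank(None), is_blank("Kenya"))
```

Keeping the check in one place would at least confine the hack to a single function until the behavior is clearer in a later release.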

2023-03-07

Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:

$ podman pull docker.io/library/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest

Peter sent me a list of items that had ILRI affiliation on Altmetric, but that didn’t have Handles
- I ran a duplicate check on them to find if they exist or if we can import them
- There were about ninety matches, but a few dozen of those were pre-prints!
- After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
  - After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
  - Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
- For the remaining ~359 items I extracted their DOIs and looked up the metadata on Crossref using my crossref-doi-lookup.py script
  - After spending some time cleaning the data in OpenRefine I realized we don’t get access status from Crossref
  - We can infer it if the item is Creative Commons, but otherwise I might be able to use Unpaywall’s API
  - I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA…
During this process I updated my crossref-doi-lookup.py script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
An unscientific comparison of duplicate checking Peter’s file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
- PostgreSQL 12: 0.11s user 0.04s system 0% cpu 19:24.65 total
- PostgreSQL 14: 0.12s user 0.04s system 0% cpu 18:13.47 total

2023-03-08

I am wondering how to speed up PostgreSQL trgm searches more
- I see my local PostgreSQL is using vanilla configuration and I should update some configs:

localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
 setting │ unit 
─────────┼──────
 16384   │ 8kB
(1 row)
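The raw value is easier to read once converted: shared_buffers is counted in 8kB blocks. A quick sketch of the arithmetic, where `pg_setting_to_mib` is a hypothetical helper and only a few of the possible pg_settings units are handled:

```python
def pg_setting_to_mib(setting: int, unit: str) -> float:
    """Convert a pg_settings value and its unit (e.g. '8kB') to MiB."""
    # Size of each unit expressed in kB; only the units seen here are covered
    unit_in_kb = {"kB": 1, "8kB": 8, "MB": 1024}
    return setting * unit_in_kb[unit] / 1024


# 16384 blocks of 8kB each
print(pg_setting_to_mib(16384, "8kB"))  # → 128.0
```

So the container was running with 128MB of shared buffers, which is PostgreSQL's stock default.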

I re-created my PostgreSQL 14 container with some extra memory settings:

$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1

Then I created a GiST index on the metadatavalue table to try to speed up the trgm similarity operations:

localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); -- \di+ shows index size is 795MB

That took a few minutes to build… then the duplicate checker ran in 12 minutes: 0.07s user 0.02s system 0% cpu 12:43.08 total
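For reference, the trigram similarity that these indexes accelerate can be approximated in a few lines of Python. This is a rough sketch of how pg_trgm is documented to work (lowercase, ignore non-alphanumeric characters, pad each word with two leading spaces and one trailing space, then compare the sets of trigrams); it is not the extension's actual code, and the function names are my own.

```python
import re


def trigrams(text: str) -> set:
    """Approximate pg_trgm's trigram extraction for a string."""
    result = set()
    # Lowercase and split into runs of alphanumeric characters
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("cat")))
print(similarity("Livestock feed", "Livestock feeds"))
```

Because every title has to be broken into trigrams and compared against every text_value, an index that can prune candidates early makes a large difference.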
On a hunch, I tried with a GIN index:

localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); -- \di+ shows index size is 274MB

This ran in 19 minutes: 0.08s user 0.01s system 0% cpu 19:49.73 total
- So clearly the GiST index is better for this task
- I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which I expect will increase the index size):

localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); -- \di+ shows index size is 716MB, which is less than the previous GiST index...

This one finished in ten minutes: 0.07s user 0.02s system 0% cpu 10:04.04 total
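To put the wall-clock times side by side, here is a small sketch that converts them to seconds and compares each run against the plain PostgreSQL 14 run. The labels are my own shorthand, and the comparison is as unscientific as the measurements: the runs differ in memory settings as well as indexes.

```python
def to_seconds(elapsed: str) -> float:
    """Convert a 'MM:SS.ss' wall-clock time to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


# Total times copied from the duplicate checker runs in these notes
runs = {
    "PostgreSQL 12": "19:24.65",
    "PostgreSQL 14": "18:13.47",
    "GiST siglen=64": "12:43.08",
    "GIN": "19:49.73",
    "GiST siglen=256": "10:04.04",
}

baseline = to_seconds(runs["PostgreSQL 14"])
for label, elapsed in runs.items():
    seconds = to_seconds(elapsed)
    change = (seconds - baseline) / baseline * 100
    print(f"{label}: {seconds:.2f}s ({change:+.1f}% vs PostgreSQL 14)")
```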
I might also want to increase my work_mem (default 4MB):

localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
 setting │ unit 
─────────┼──────
 4096    │ kB
(1 row)

After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
Wow, I found a really cool way to fetch URLs in OpenRefine
- I used this to fetch the open access status for each DOI from Unpaywall
First, create a new column called “url” based on the DOI that builds the request URL. I used a Jython expression:

unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
email = "a.orth+unpaywall@cgiar.org"
doi = value.replace("https://doi.org/", "")
request_url = unpaywall_baseurl + doi + '?email=' + email

return request_url

Then create a new column based on fetching the values in that column. I called it “unpaywall_status”
Then you get a JSON blob in each and you can extract the Open Access status with a GREL like value.parseJson()['is_oa']
- I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones
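The same two steps (build the request URL, then pull is_oa out of the JSON) can be sketched outside OpenRefine in plain Python. The function names are hypothetical, the DOI and e-mail are placeholders, and the response is a made-up, trimmed-down blob for illustration; no network request is made here.

```python
import json
from urllib.parse import urlencode

UNPAYWALL_BASEURL = "https://api.unpaywall.org/v2/"


def build_unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall request URL for a DOI given as a doi.org link."""
    bare_doi = doi.replace("https://doi.org/", "")
    # urlencode() escapes characters like "+" and "@" in the e-mail address
    return UNPAYWALL_BASEURL + bare_doi + "?" + urlencode({"email": email})


def is_open_access(response_text: str) -> bool:
    """Extract the is_oa flag from an Unpaywall JSON response."""
    return bool(json.loads(response_text)["is_oa"])


print(build_unpaywall_url("https://doi.org/10.1234/example", "you@example.org"))
print(is_open_access('{"doi": "10.1234/example", "is_oa": false}'))
```

Escaping the e-mail matters for addresses containing a plus sign, which would otherwise be read as a space in the query string.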
I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
- The syntax was hairy because the abstracts are marked up with tags like <jats:p>, but this got me most of the way there:

value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
value.replace("<jats:italic>","").replace("</jats:italic>", "")
value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
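Outside OpenRefine, roughly the same cleanup could be done with regular expressions. A minimal sketch, assuming the abstract's text sits in the first <jats:p> element and that only the italic, sub, and sup inline tags need stripping; `clean_jats_abstract` is a hypothetical name and the sample string is invented.

```python
import re


def clean_jats_abstract(abstract: str) -> str:
    """Return the text of the first <jats:p>, without inline JATS tags."""
    # Keep the body of the first <jats:p> element, if there is one
    match = re.search(r"<jats:p>(.*?)</jats:p>", abstract, flags=re.DOTALL)
    text = match.group(1) if match else abstract
    # Drop italic, subscript, and superscript tags but keep their text
    text = re.sub(r"</?jats:(italic|sub|sup)>", "", text)
    return text.strip()


sample = "<jats:p>Effects of CO<jats:sub>2</jats:sub> on <jats:italic>Zea mays</jats:italic>.</jats:p>"
print(clean_jats_abstract(sample))  # → Effects of CO2 on Zea mays.
```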

I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them
I exported a list of authors, affiliations, and funders from the new items to let Peter correct them:

$ csvcut -c dc.contributor.author /tmp/new-items.csv | sed -e 1d -e 's/"//g' -e 's/||/\n/g' | sort | uniq -c | sort -nr | awk '{$1=""; print $0}' | sed -e 's/^ //' > /tmp/new-authors.csv
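For comparison, a rough Python equivalent of that pipeline using only the standard library. This is a sketch: `unique_values` is a hypothetical helper, the sample names are invented, and the order of values with equal counts may differ from what sort and uniq produce.

```python
import csv
import io
from collections import Counter


def unique_values(csv_file, column: str) -> list:
    """List the distinct values in a multi-value column, most frequent first."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Multiple values in one cell are separated by "||"
        for value in row[column].split("||"):
            value = value.strip()
            if value:
                counts[value] += 1
    return [value for value, _ in counts.most_common()]


sample = io.StringIO(
    'dc.contributor.author\n'
    '"Doe, Jane||Smith, John"\n'
    '"Doe, Jane"\n'
)
print(unique_values(sample, "dc.contributor.author"))  # → ['Doe, Jane', 'Smith, John']
```

The csv module handles the quoting, so there is no need to strip double quotes by hand as the sed step does.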

Meeting with FAO AGRIS team about how to detect duplicates
- They are currently using a sha256 hash on titles, which will work, but will only return exact matches
- I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches
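A minimal sketch of what that normalization could look like before hashing. The stop word list and the exact normalization steps are my own assumptions for illustration, not what the AGRIS team uses.

```python
import hashlib
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def title_hash(title: str) -> str:
    """Hash a title after normalizing case, punctuation, and stop words."""
    text = unicodedata.normalize("NFKC", title).casefold()
    # Keep only alphanumeric words, which drops punctuation and extra spaces
    words = [w for w in re.findall(r"[^\W_]+", text) if w not in STOP_WORDS]
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


a = title_hash("The Impact of Climate Change on Livestock")
b = title_hash("Impact of climate change on livestock.")
print(a == b)  # → True
```

This still only catches titles that are identical after normalization; typos or reworded titles would need a fuzzy measure such as trigram similarity.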
Meeting with Abenet to discuss CGSpace issues
- She reminded me about needing a metadata field for first author when the affiliation is ILRI
- I said I prefer to write a small script for her that will check the first author and first affiliation… I could do it easily in Python, but would need to put a web frontend on it for her
- Unless we could do that in AReS reports somehow
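The core of such a script would be small. A sketch under loud assumptions: authors and affiliations arrive as "||"-separated strings (as in CGSpace CSV exports), the first affiliation is taken to belong to the first author, and the function name and sample values are hypothetical.

```python
ILRI = "International Livestock Research Institute"


def ilri_first_author(authors: str, affiliations: str):
    """Return the first author if the first affiliation is ILRI, else None."""
    first_author = authors.split("||")[0].strip()
    first_affiliation = affiliations.split("||")[0].strip()
    # Assumption: the first affiliation corresponds to the first author
    if first_author and first_affiliation == ILRI:
        return first_author
    return None


print(ilri_first_author(
    "Doe, Jane||Smith, John",
    "International Livestock Research Institute||Example University",
))  # → Doe, Jane
```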

2023-03-09

Apply a bunch of corrections to authors, affiliations, and donors on the new items on DSpace Test
Meeting with Peter and Abenet about future OpenRXV developments, DSpace 7, etc
- I submitted an issue on MEL asking them to add provenance metadata when submitting to CGSpace

2023-03-10

CKM is getting ready to launch their new website and they display CGSpace thumbnails at 255x362px
- Our thumbnails are 300px so they get up-scaled and look bad
- I realized that the last time we increased the size of our thumbnails was in 2013, from 94x130 to 300px
- I offered to CKM that we increase them again to 400 or 600px
- I did some tests to check the thumbnail file sizes for 300px, 400px, 500px, and 600px on this item:

$ ls -lh 10568-126388-*
-rw-r--r-- 1 aorth aorth  31K Mar 10 12:42 10568-126388-300px.jpg
-rw-r--r-- 1 aorth aorth  52K Mar 10 12:41 10568-126388-400px.jpg
-rw-r--r-- 1 aorth aorth  76K Mar 10 12:43 10568-126388-500px.jpg
-rw-r--r-- 1 aorth aorth 106K Mar 10 12:44 10568-126388-600px.jpg

Seems like the 600px file is 3 to 4 times the size of the 300px one, so maybe we should shoot for 400px or 500px
- I decided on 500px
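The size trade-off is quick to check from the ls output above. A small sketch of the arithmetic, with file sizes in kilobytes as reported for that single test item:

```python
# Thumbnail width in pixels mapped to file size in K, from the ls output
sizes_k = {300: 31, 400: 52, 500: 76, 600: 106}

baseline = sizes_k[300]
ratios = {width: round(size / baseline, 2) for width, size in sizes_k.items()}

for width, ratio in ratios.items():
    print(f"{width}px: {sizes_k[width]}K, {ratio}x the 300px file")
```

For this item, 500px comes out at roughly two and a half times the current size, compared to about 3.4 times for 600px.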
On that note, I also re-worked the XMLUI item display to show larger thumbnails (from a max-width of 128px to 200px)
And now that I’m looking at thumbnails I am curious what it would take to get DSpace to generate WebP or AVIF thumbnails
Peter sent me citations and ILRI subjects for the 350 new ILRI publications
- I guess he edited it in Excel because there are a bunch of encoding issues with accents
- I merged Peter’s citations and subjects with the other metadata, ran one last duplicate check (and found one item!), then ran the items through csv-metadata-quality and uploaded them to CGSpace
- In the end it was only 348 items for some reason…