CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2023

2023-03-01

  • Remove cg.subject.wle and cg.identifier.wletheme from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
  • iso-codes 4.13.0 was released, which incorporates my changes to the common names for Iran, Laos, and Syria
  • I finally got through with porting the input form from DSpace 6 to DSpace 7
  • I can’t put my finger on it, but the input form has to be formatted very particularly, for example if your rows have more than two fields in them with out a sufficient Bootstrap grid style, or if you use a twobox, etc, the entire form step appears blank

2023-03-02

  • I did some experiments with the new Pandas 2.0.0rc0 Apache Arrow support
    • There is a change to the way nulls are handled and it causes my tests for pd.isna(field) to fail
    • I think we need consider blanks as null, but I’m not sure
  • I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
    • I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in discovery.xml since we are no longer using them in sidebar facets

2023-03-03

2023-03-04

2023-03-05

  • Start a harvest on AReS

2023-03-06

  • Export CGSpace to do Initiative collection mappings
    • There were thirty-three that needed updating
  • Send Abenet and Sam a list of twenty-one CAS publications that had been marked as “multiple documents” that we uploaded as metadata-only items
    • Goshu will download the PDFs for each and upload them to the items on CGSpace manually
  • I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
    • It seems there is a problem recognizing empty strings as na with pd.isna()
    • If I do pd.isna(field) or field == "" then it works as expected, but that feels hacky
    • I’m going to test again on the next release…
    • Note that I had been setting both of these global options:
pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.nullable_dtypes = True
  • Then reading the CSV like this:
df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]'

2023-03-07

  • Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:
$ podman pull docker.io/library/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
  • Peter sent me a list of items that had ILRI affiation on Altmetric, but that didn’t have Handles
    • I ran a duplicate check on them to find if they exist or if we can import them
    • There were about ninety matches, but a few dozen of those were pre-prints!
    • After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
      • After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
      • Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
    • For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my crossref-doi-lookup.py script
      • After spending some time cleaning the data in OpenRefine I realized we don’t get access status from Crossref
      • We can imply it if the item is Creative Commons, but otherwise I might be able to use Unpaywall’s API
      • I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA…
  • During this process I updated my crossref-doi-lookup.py script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
  • An unscientific comparison of duplicate checking Peter’s file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
    • PostgreSQL 12: 0.11s user 0.04s system 0% cpu 19:24.65 total
    • PostgreSQL 14: 0.12s user 0.04s system 0% cpu 18:13.47 total

2023-03-08

  • I am wondering how to speed up PostgreSQL trgm searches more
    • I see my local PostgreSQL is using vanilla configuration and I should update some configs:
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
 setting │ unit 
─────────┼──────
 16384   │ 8kB
(1 row)
  • I re-created my PostgreSQL 14 container with some extra memory settings:
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); # \di+ shows index size is 795MB
  • That took a few minutes to build… then the duplicate checker ran in 12 minutes: 0.07s user 0.02s system 0% cpu 12:43.08 total
  • On a hunch, I tried with a GIN index:
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); # \di+ shows index size is 274MB
  • This ran in 19 minutes: 0.08s user 0.01s system 0% cpu 19:49.73 total
    • So clearly the GiST index is better for this task
    • I am curious if I increase the signature length in the GiST index from 64 to 256 (which will for sure increase the size taken):
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); # \di+ shows index size is 716MB, which is less than the previous GiST index...
  • This one finished in ten minutes: 0.07s user 0.02s system 0% cpu 10:04.04 total
  • I might also want to increase my work_mem (default 4MB):
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
 setting │ unit 
─────────┼──────
 4096    │ kB
(1 row)
  • After updating my Crossref lookup script and checking the remaining ~359 items I found a eight more duplicates already existing on CGSpace
  • Wow, I found a really cool way to fetch URLs in OpenRefine
    • I used this to fetch the open access status for each DOI from Unpaywall
  • First, create a new column called “url” based on the DOI that builds the request URL. I used a Jython expression:
unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
email = "a.orth+unpaywall@cgiar.org"
doi = value.replace("https://doi.org/", "")
request_url = unpaywall_baseurl + doi + '?email=' + email

return request_url
  • Then create a new column based on fetching the values in that column. I called it “unpaywall_status”
  • Then you get a JSON blob in each and you can extract the Open Access status with a GREL like value.parseJson()['is_oa']
    • I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones
  • I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
    • The syntax was hairy because it’s marked up with tags like <jats:p>, but this got me most of the way there:
value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
value.replace("<jats:italic>","").replace("</jats:italic>", "")
value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
  • I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them
  • I exported a list of authors, affiliations, and funders from the new items to let Peter correct them:
$ csvcut -c dc.contributor.author /tmp/new-items.csv | sed -e 1d -e 's/"//g' -e 's/||/\n/g' | sort | uniq -c | sort -nr | awk '{$1=""; print $0}' | sed -e 's/^ //' > /tmp/new-authors.csv
  • Meeting with FAO AGRIS team about how to detect duplicates
    • They are currently using a sha256 hash on titles, which will work, but will only return exact matches
    • I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches
  • Meeting with Abenet to discuss CGSpace issues
    • She reminded me about needing a metadata field for first author when the affiliation is ILRI
    • I said I prefer to write a small script for her that will check the first author and first affiliation… I could do it easily in Python, but would need to put a web frontend on it for her
    • Unless we could do that in AReS reports somehow

2023-03-09

2023-03-10

  • CKM is getting ready to launch their new website and they display CGSpace thumbnails at 255x362px
    • Our thumbnails are 300px so they get up-scaled and look bad
    • I realized that the last time we increased the size of our thumbnails was in 2013, from 94x130 to 300px
    • I offered to CKM that we increase them again to 400 or 600px
    • I did some tests to check the thumbnail file sizes for 300px, 400px, 500px, and 600px on this item:
$ ls -lh 10568-126388-*
-rw-r--r-- 1 aorth aorth  31K Mar 10 12:42 10568-126388-300px.jpg
-rw-r--r-- 1 aorth aorth  52K Mar 10 12:41 10568-126388-400px.jpg
-rw-r--r-- 1 aorth aorth  76K Mar 10 12:43 10568-126388-500px.jpg
-rw-r--r-- 1 aorth aorth 106K Mar 10 12:44 10568-126388-600px.jpg
  • Seems like 600px is 3 to 4 times larger file size, so maybe we should shoot for 400px or 500px
    • I decided on 500px
    • I started re-generating new thumbnails for the ILRI Publications, CGIAR Initiatives, and other collections
  • On that note, I also re-worked the XMLUI item display to show larger thumbnails (from a max-width of 128px to 200px)
  • And now that I’m looking at thumbnails I am curious what it would take to get DSpace to generate WebP or AVIF thumbnails
  • Peter sent me citations and ILRI subjects for the 350 new ILRI publications
    • I guess he edited it in Excel because there are a bunch of encoding issues with accents
    • I merged Peter’s citations and subjects with the other metadata, ran one last duplicate check (and found one item!), then ran the items through csv-metadata-quality and uploaded them to CGSpace
    • In the end it was only 348 items for some reason…

2023-03-12

  • Start a harvest on AReS

2023-03-13

  • Extract a list of DOIs from the Creative Commons licensed ILRI journal articles that I uploaded last week, skipping any that are “no derivatives” (ND):
$ csvgrep -c 'dc.description.provenance[en]' -m 'Made available in DSpace on 2023-03-10' /tmp/ilri-articles.csv \
    | csvgrep -c 'dcterms.license[en_US]' -r 'CC(0|\-BY)'
    | csvgrep -c 'dcterms.license[en_US]' -i -r '\-ND\-'
    | csvcut -c 'id,cg.identifier.doi[en_US],dcterms.type[en_US]' > 2023-03-13-journal-articles.csv
  • I want to write a script to download the PDFs and create thumbnails for them, then upload to CGSpace
    • I wrote one based on post_ciat_pdfs.py but it seems there is an issue uploading anything other than a PDF
    • When I upload a JPG or a PNG the file begins with:
Content-Disposition: form-data; name="file"; filename="10.1017-s0031182013001625.pdf.jpg"
  • … this means it is invalid…
    • I tried in both the ORIGINAL and THUMBNAIL bundle, and with different filenames
    • I tried manually on the command line with http and both PDF and PNG work… hmmmm
    • Hmm, this seems to have been due to some difference in behavior between the files and data parameters of requests.get()
    • I finalized the post_bitstreams.py script and uploaded eighty-five PDF thumbnails
  • It seems Bizu uploaded covers for a handful so I deleted them and ran them through the script to get proper thumbnails

2023-03-14

  • Add twelve IFPRI authors to our controlled vocabulary for authors and ORCID identifiers
    • I also tagged their existing items on CGSpace
  • Export all our ORCIDs and resolve their names to see if any have changed:
$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-03-14-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2023-03-14-orcids.txt -o /tmp/2023-03-14-orcids-names.txt -d
  • Then update them in the database:
$ ./ilri/update_orcids.py -i /tmp/2023-03-14-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247

2023-03-15

  • Jawoo was asking about possibilities to harvest PDFs from CGSpace for some kind of AI chatbot integration
    • I see we have 45,000 PDFs (format ID 2)
localhost/dspacetest= ☘ SELECT COUNT(*) FROM bitstream WHERE NOT deleted AND bitstream_format_id=2;
 count 
───────
 45281
(1 row)
  • Rework some of my Python scripts to use a common db_connect function from util
  • I reworked my post_bitstreams.py script to be able to overwrite bitstreams if requested
    • The use case is to upload thumbnails for all the journal articles where we have these horrible pixelated journal covers
    • I replaced JPEG thumbnails for ~896 ILRI publications by exporting a list of DOIs from the 10568/3 collection that were CC-BY, getting their PDFs from Sci-Hub, and then posting them with my new script

2023-03-16

  • Continue working on the ILRI publication thumbnails
    • There were about sixty-four that had existing PNG “journal cover” thumbnails that didn’t get replaced because I only overwrote the JPEG ones yesterday
    • Now I generated a list of those bitstream UUIDs and deleted them with a shell script via the REST API
  • I made a pull request on DSpace 7 to update the bitstream format registry for PNG, WebP, and AVIF
  • Export CGSpace to perform mappings to Initiatives collections
  • I also used this export to find CC-BY items with DOIs that had JPEGs or PNGs in their provenance, meaning that the submitter likely submitted a low-quality “journal cover” for the item
    • I found about 330 of them and got most of their PDFs from Sci-Hub and replaced the crappy thumbnails with real ones where Sci-Hub had them (~245)
  • In related news, I realized you can get an API key from Elsevier and download the PDFs from their API:
import requests

api_key = 'fuuuuuuuuu'
doi = "10.1016/j.foodqual.2021.104362"
request_url = f'https://api.elsevier.com/content/article/doi:{doi}'

headers = {
    'X-ELS-APIKEY': api_key,
    'Accept': 'application/pdf'
}

with requests.get(request_url, stream=True, headers=headers) as r:
    if r.status_code == 200:
        with open("article.pdf", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                f.write(chunk)
  • The question is, how do we know if a DOI is Elsevier or not…
  • CGIAR Repositories Working Group meeting
    • We discussed controlled vocabularies for funders
    • I suggested checking our combined lists against Crossref and ROR
  • Export a list of donors from cg.contributor.donor on CGSpace:
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=248) to /tmp/2023-03-16-donors.txt;
COPY 1521
  • Then resolve them against Crossref’s funders API:
$ ./ilri/crossref_funders_lookup.py -e fuuuu@cgiar.org -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv -d
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
472
$ sed 1d ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l 
1521
  • That’s a 31% hit rate, but I see some simple things like “Bill and Melinda Gates Foundation” instead of “Bill & Melinda Gates Foundation”

2023-03-17

  • I did the same lookup of CGSpace donors on ROR’s 2022-12-01 data dump:
$ ./ilri/ror_lookup.py -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l                                            
407
$ sed 1d ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
1521
  • That’s a 26.7% hit rate
  • As for the number of funders in each dataset
    • Crossref has about 34,000
    • ROR has 15,000 if “FundRef” data is a proxy for that:
$ grep -c -rsI FundRef v1.15-2022-12-01-ror-data.json    
15162
  • On a related note, I remembered that DOI.org has a list of DOI prefixes and publishers: https://doi.crossref.org/getPrefixPublisher
    • In Python I can look up publishers by prefix easily, here with a nested list comprehension:
In [10]: [publisher for publisher in publishers if '10.3390' in publisher['prefixes']]
Out[10]: 
[{'prefixes': ['10.1989', '10.32545', '10.20944', '10.3390', '10.35995'],
  'name': 'MDPI AG',
  'memberId': 1968}]
  • And in OpenRefine, if I create a new column based on the DOI using Jython:
import json

with open("/home/aorth/src/git/DSpace/publisher-doi-prefixes.json", "rb") as f:
    publishers = json.load(f)

doi_prefix = value.split("/")[3]

publisher = [publisher for publisher in publishers if doi_prefix in publisher['prefixes']]

return publisher[0]['name']
  • … though this is very slow and hung OpenRefine when I tried it
  • I added the ability to overwrite multiple bitstream formats at once in post_bitstreams.py
$ ./ilri/post_bitstreams.py -i test.csv -u https://dspacetest.cgiar.org/rest -e fuuu@example.com -p 'fffnjnjn' -d -s 2B40C7C4E34CEFCF5AFAE4B75A8C52E2 --overwrite JPEG --overwrite PNG -n
Session valid: 2B40C7C4E34CEFCF5AFAE4B75A8C52E2
Opened test.csv
384142cb-58b9-4e64-bcdc-0a8cc34888b3: checking for existing bitstreams in THUMBNAIL bundle
> (DRY RUN) Deleting bitstream: IFPRI Malawi_Maize Market Report_February_202_anonymous.pdf.jpg (16883cb0-1fc8-4786-a04f-32132e0617d4)
> (DRY RUN) Deleting bitstream: AgroEcol_Newsletter_2.png (7e9cd434-45a6-4d55-8d56-4efa89d73813)
> (DRY RUN) Uploading file: 10568-129666.pdf.jpg
  • I learned how to use Python’s built-in logging module and it simplifies all my debug and info printing
    • I re-factored a few scripts to use the new logging

2023-03-18

  • I applied changes for publishers on 16,000 items in batches of 5,000
  • While working on my post_bitstreams.py script I realized the Tomcat Crawler Session Manager valve that groups bot user agents into sessions is causing my login to fail the first time, every time
    • I’ve disabled it for now and will check the Munin session graphs after some time to see if it makes a difference
    • In any case I have much better spider user agent lists in DSpace now than I did years ago when I started using the Crawler Session Manager valve

2023-03-19

  • Start a harvest on AReS

2023-03-20

  • Minor updates to a few of my DSpace Python scripts to fix the logging
  • Minor updates to some records for Mazingira reported by Sonja
  • Upgrade PostgreSQL on DSpace Test from version 12 to 14, the same way I did from 10 to 12 last year:
    • First, I installed the new version of PostgreSQL via the Ansible playbook scripts
    • Then I stopped Tomcat and all PostgreSQL clusters and used pg_upgrade to upgrade the old version:
# systemctl stop tomcat7
# pg_ctlcluster 12 main stop
# tar -cvzpf var-lib-postgresql-12.tar.gz /var/lib/postgresql/12
# tar -cvzpf etc-postgresql-12.tar.gz /etc/postgresql/12
# pg_ctlcluster 14 main stop
# pg_dropcluster 14 main
# pg_upgradecluster 12 main
# pg_ctlcluster 14 main start
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
  AND C.relkind = 'r'
  AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
  • The index on metadatavalue shrunk by 90MB, and others a bit less
    • This is nice, but not as drastic as I noticed last year when upgrading to PostgreSQL 12

2023-03-21

  • Leigh sent me a list of IFPRI authors with ORCID identifiers so I combined them with our list and resolved all their names with resolve_orcids.py
    • It adds 154 new ORCID identifiers
  • I did a follow up to the publisher names from last week using the list from doi.org
    • Last week I only updated items with a DOI that had no publisher, but now I was curious to see how our existing publisher information compared
    • I checked a dozen or so manually and, other than CIFOR/ICRAF and CIAT/Alliance, the metadata was better than our existing data, so I overwrote them
  • I spent some time trying to figure out how to get ssimulacra2 running so I could compare thumbnails in JPEG and WebP
    • I realized that we can’t directly compare JPEG to WebP, we need to convert to JPEG/WebP, then convert each to lossless PNG
    • Also, we shouldn’t be comparing the resulting images against each other, but rather the original, so I need to a straight PDF to lossless PNG version also
    • After playing with WebP at Q82 and Q92, I see it has lower ssimulacra2 scores than JPEG Q92 for the dozen test files
    • Could it just be something with ImageMagick?