CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2023

2023-03-01

  • Remove cg.subject.wle and cg.identifier.wletheme from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
  • iso-codes 4.13.0 was released, which incorporates my changes to the common names for Iran, Laos, and Syria
  • I finally finished porting the input form from DSpace 6 to DSpace 7
  • I can’t put my finger on it, but the input form has to be formatted very particularly: for example, if your rows have more than two fields in them without a sufficient Bootstrap grid style, or if you use a twobox, etc., the entire form step appears blank

2023-03-02

  • I did some experiments with the new Pandas 2.0.0rc0 Apache Arrow support
    • There is a change to the way nulls are handled and it causes my tests for pd.isna(field) to fail
    • I think we need to consider blanks as null, but I’m not sure
  • I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
    • I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in discovery.xml since we are no longer using them in sidebar facets

2023-03-03

2023-03-04

2023-03-05

  • Start a harvest on AReS

2023-03-06

  • Export CGSpace to do Initiative collection mappings
    • There were thirty-three that needed updating
  • Send Abenet and Sam a list of twenty-one CAS publications that had been marked as “multiple documents” that we uploaded as metadata-only items
    • Goshu will download the PDFs for each and upload them to the items on CGSpace manually
  • I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
    • It seems there is a problem recognizing empty strings as NA with pd.isna()
    • If I do pd.isna(field) or field == "" then it works as expected, but that feels hacky
    • I’m going to test again on the next release…
    • Note that I had been setting both of these global options:
pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.nullable_dtypes = True
  • Then reading the CSV like this:
df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]')
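
The "pd.isna(field) or field == """ workaround mentioned above could be wrapped in one small helper so the blank-string check lives in a single place. This is only a sketch: the name is_missing is hypothetical and is not part of csv-metadata-quality, and it assumes the fields being checked are scalars (one cell at a time), not whole columns.

```python
import pandas as pd


def is_missing(field) -> bool:
    """Return True for real nulls (None, NaN, pd.NA) and for empty strings.

    Hypothetical helper, not part of csv-metadata-quality.
    """
    # pd.isna() catches None, NaN, and pd.NA, but not the empty string
    if pd.isna(field):
        return True

    # Treat a zero-length string as missing too
    return isinstance(field, str) and field == ""


print(is_missing(pd.NA))    # True
print(is_missing(""))       # True
print(is_missing("Kenya"))  # False
```

Checking pd.isna() first keeps the comparison with "" from ever being evaluated against pd.NA, which would itself return pd.NA and not a boolean.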

2023-03-07

  • Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:
$ podman pull docker.io/library/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
  • Peter sent me a list of items that had ILRI affiliation on Altmetric, but that didn’t have Handles
    • I ran a duplicate check on them to see whether they already exist or whether we can import them
    • There were about ninety matches, but a few dozen of those were pre-prints!
    • After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
      • After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
      • Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
  • An unscientific comparison of duplicate checking Peter’s file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
    • PostgreSQL 12: 0.11s user 0.04s system 0% cpu 19:24.65 total
    • PostgreSQL 14: 0.12s user 0.04s system 0% cpu 18:13.47 total
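
To put a number on the difference between the two runs above, the "total" wall-clock values can be converted to seconds and compared. This is a minimal sketch: elapsed_seconds is a hypothetical helper, it assumes the totals are in the minutes:seconds (or hours:minutes:seconds) form shown above, and a single run per version is not a real benchmark.

```python
def elapsed_seconds(total: str) -> float:
    """Convert a wall-clock total such as '19:24.65' to seconds."""
    seconds = 0.0
    # Work left to right: each colon-separated part is worth 60x the next one
    for part in total.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


pg12 = elapsed_seconds("19:24.65")  # PostgreSQL 12 total from the notes
pg14 = elapsed_seconds("18:13.47")  # PostgreSQL 14 total from the notes
saved = pg12 - pg14

print(f"PostgreSQL 14 was {saved:.2f}s faster ({saved / pg12:.1%})")
# prints "PostgreSQL 14 was 71.18s faster (6.1%)"
```

With user and system time near zero in both runs, almost all of the elapsed time is spent waiting on the database, so the gap of about 71 seconds is mostly a difference on the PostgreSQL side.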