---
title: "March, 2023"
date: 2023-03-01T07:58:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-03-01
- Remove `cg.subject.wle` and `cg.identifier.wletheme` from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
- [iso-codes 4.13.0 was released](https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/CHANGELOG.md#4130-2023-02-28), which incorporates my changes to the common names for Iran, Laos, and Syria
- I finally finished porting the input form from DSpace 6 to DSpace 7
<!--more-->
- I can't put my finger on it, but the input form has to be formatted very particularly: for example, if a row has more than two fields without a sufficient Bootstrap grid style, or if you use a `twobox`, etc., the entire form step appears blank
## 2023-03-02
- I did some experiments with the new [Pandas 2.0.0rc0 Apache Arrow support](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i)
- There is a change to the way nulls are handled and it causes my tests for `pd.isna(field)` to fail
- I think we need to consider blanks as null, but I'm not sure
- I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
- I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in `discovery.xml` since we are no longer using them in sidebar facets
## 2023-03-03
- Atmire merged one of my old pull requests into COUNTER-Robots:
- [COUNTER_Robots_list.json: Add new bots](https://github.com/atmire/COUNTER-Robots/pull/54)
- I will update the local ILRI overrides in our DSpace spider agents file
## 2023-03-04
- Submit a [pull request on pycountry to use iso-codes 4.13.0](https://github.com/flyingcircusio/pycountry/pull/156)
## 2023-03-05
- Start a harvest on AReS
## 2023-03-06
- Export CGSpace to do Initiative collection mappings
- There were thirty-three that needed updating
- Send Abenet and Sam a list of twenty-one CAS publications that had been marked as "multiple documents" and that we uploaded as metadata-only items
- Goshu will download the PDFs for each and upload them to the items on CGSpace manually
- I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
- It seems there is a problem recognizing empty strings as NA with `pd.isna()`
- If I do `pd.isna(field) or field == ""` then it works as expected, but that feels hacky (see the sketch below)
- I'm going to test again on the next release...
- Note that I had been setting both of these global options:
```python
pd.options.mode.dtype_backend = 'pyarrow'
pd.options.mode.nullable_dtypes = True
```
- Then reading the CSV like this:
```python
df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]')
```
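- For reference, the hacky check wrapped into a helper looks something like this (a minimal sketch; the input file and column name are made up):
```python
import pandas as pd

def is_nan_or_blank(field) -> bool:
    # The pd.isna() check must come first: with the Arrow-backed string
    # dtype a missing value is pd.NA, and pd.NA == "" does not evaluate
    # to a plain boolean
    return pd.isna(field) or field == ""

df = pd.read_csv("input.csv", engine="pyarrow", dtype="string[pyarrow]")
for field in df["dc.title"]:
    if is_nan_or_blank(field):
        print("Missing value!")
```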
## 2023-03-07
- Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:
```console
$ podman pull docker.io/library/postgres:14-alpine
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine
$ createuser -h localhost -p 5432 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dspacetest
```
- Peter sent me a list of items that had ILRI affiliation on Altmetric, but that didn't have Handles
- I ran a duplicate check on them to find if they exist or if we can import them
- There were about ninety matches, but a few dozen of those were pre-prints!
- After excluding those, there were sixty-one items we already have on CGSpace, so I will add their DOIs to the existing items
- After joining these with the records from CGSpace and inspecting the DOIs (sketched below), I found that only forty-four were new DOIs
- Surprisingly, some of the DOIs on Altmetric were not working, though some of ours were broken too (specifically, the Journal of Agricultural Economics seems to have reassigned DOIs)
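- The join itself is simple once the DOIs are normalized; roughly (a sketch with hypothetical file and column names):
```python
import pandas as pd

def normalize_doi(doi: str) -> str:
    # Strip whitespace, lowercase, and drop the resolver prefix so
    # DOIs from different sources compare consistently
    return doi.strip().lower().replace("https://doi.org/", "")

altmetric = pd.read_csv("altmetric-items.csv")
cgspace = pd.read_csv("cgspace-export.csv")

altmetric_dois = set(altmetric["doi"].dropna().map(normalize_doi))
cgspace_dois = set(cgspace["cg.identifier.doi"].dropna().map(normalize_doi))

print(f"{len(altmetric_dois - cgspace_dois)} DOIs are not yet on CGSpace")
```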
- For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my `crossref-doi-lookup.py` script
- After spending some time cleaning the data in OpenRefine I realized we don't get access status from Crossref
- We can imply it if the item is Creative Commons, but otherwise I might be able to use [Unpaywall's API](https://unpaywall.org/products/api)
- I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA... (see the API sketch below)
- During this process I updated my `crossref-doi-lookup.py` script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
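- For reference, the Unpaywall lookup is just a GET per DOI with a contact email, and the JSON response includes an `is_oa` boolean; a minimal sketch (the DOI and email are placeholders):
```python
import requests

def unpaywall_is_oa(doi: str, email: str):
    # Unpaywall takes the bare DOI in the URL path and a contact
    # email as a query parameter
    url = f"https://api.unpaywall.org/v2/{doi}"
    response = requests.get(url, params={"email": email})
    if response.status_code != 200:
        return None
    return response.json().get("is_oa")

print(unpaywall_is_oa("10.1000/example", "you@example.org"))
```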
- An unscientific comparison of duplicate checking Peter's file of ~500 titles on PostgreSQL 12 versus PostgreSQL 14:
- PostgreSQL 12: `0.11s user 0.04s system 0% cpu 19:24.65 total`
- PostgreSQL 14: `0.12s user 0.04s system 0% cpu 18:13.47 total`
## 2023-03-08
- I am wondering how to speed up PostgreSQL trgm searches further
- I see my local PostgreSQL is using the vanilla configuration, so I should update some settings:
```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
setting │ unit
─────────┼──────
16384 │ 8kB
(1 row)
```
- I re-created my PostgreSQL 14 container with some extra memory settings:
```console
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
```
- Then I created a GiST [index on the `metadatavalue` table to try to speed up the trgm similarity operations](https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); -- \di+ shows index size is 795MB
```
- That took a few minutes to build... then the duplicate checker ran in 12 minutes: `0.07s user 0.02s system 0% cpu 12:43.08 total`
- On a hunch, I tried with a GIN index:
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); -- \di+ shows index size is 274MB
```
- This ran in 19 minutes: `0.08s user 0.01s system 0% cpu 19:49.73 total`
- So clearly the GiST index is better for this task
- I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which I assumed would increase the index size):
```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); -- \di+ shows index size is 716MB, which is less than the previous GiST index...
```
- This one finished in ten minutes: `0.07s user 0.02s system 0% cpu 10:04.04 total`
- I might also want to [increase my `work_mem`](https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm) (default 4MB):
```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
setting │ unit
─────────┼──────
4096 │ kB
(1 row)
```
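- For context, these indexes exist to serve trigram similarity queries against `metadatavalue`; this is roughly the kind of query the duplicate checker runs, via psycopg2 (connection details, threshold, and title are illustrative):
```python
import psycopg2

title = "Climate change adaptation in dryland agriculture"  # hypothetical

connection = psycopg2.connect("dbname=dspacetest user=dspacetest host=localhost")
with connection.cursor() as cursor:
    cursor.execute("SET pg_trgm.similarity_threshold = 0.9")
    # The % operator matches rows whose trigram similarity exceeds the
    # threshold and can use the trgm indexes; psycopg2 requires the
    # literal % to be escaped as %%
    cursor.execute(
        "SELECT text_value FROM metadatavalue WHERE text_value %% %s",
        (title,),
    )
    for (text_value,) in cursor.fetchall():
        print(text_value)
connection.close()
```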
- After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
- Wow, I found a [really cool way to fetch URLs in OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html)
- I used this to fetch the open access status for each DOI from Unpaywall
- First, create a new column called "url" based on the DOI that builds the request URL. I used a Jython expression:
```python
unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
email = "a.orth+unpaywall@cgiar.org"
doi = value.replace("https://doi.org/", "")
request_url = unpaywall_baseurl + doi + '?email=' + email
return request_url
```
- Then create a new column based on fetching the values in that column. I called it "unpaywall_status"
- Then you get a JSON blob in each and you can extract the Open Access status with a GREL like `value.parseJson()['is_oa']`
- I checked a handful of results manually and found that Unpaywall's limited access status was more trustworthy than its open access status, so I will only tag the limited access ones
- I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
- The syntax was hairy because it's marked up with tags like `<jats:p>`, but this got me most of the way there:
```console
value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
value.replace("<jats:italic>","").replace("</jats:italic>", "")
value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
```
- I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them
- I exported a list of authors, affiliations, and funders from the new items to let Peter correct them:
```console
$ csvcut -c dc.contributor.author /tmp/new-items.csv | sed -e 1d -e 's/"//g' -e 's/||/\n/g' | sort | uniq -c | sort -nr | awk '{$1=""; print $0}' | sed -e 's/^ //' > /tmp/new-authors.csv
```
- Meeting with FAO AGRIS team about how to detect duplicates
- They are currently using a sha256 hash on titles, which will work, but will only return exact matches
- I told them to try normalizing the strings, dropping stop words, etc. to increase the chance that hashes of near-identical titles match (something like the sketch below)
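- A minimal sketch of that idea (the stop word list is just an example):
```python
import hashlib
import re

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the"}  # example list

def title_hash(title: str) -> str:
    # Lowercase, strip punctuation, and drop stop words before hashing
    # so near-identical titles produce the same digest
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    normalized = " ".join(word for word in words if word not in STOP_WORDS)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

print(title_hash("The Effects of Climate Change on Maize"))
print(title_hash("Effects of climate change on maize."))  # same digest
```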
- Meeting with Abenet to discuss CGSpace issues
- She reminded me about needing a metadata field for first author when the affiliation is ILRI
- I said I prefer to write a small script for her that will check the first author and first affiliation... I could do it easily in Python (see the sketch below), but would need to put a web frontend on it for her
- Unless we could do that in AReS reports somehow
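- A minimal sketch of what that script could look like, reading a CGSpace CSV export (the field names follow CGSpace conventions, but this is illustrative, not a finished tool):
```python
import csv

ILRI = "International Livestock Research Institute"

def first_affiliation_is_ilri(row: dict) -> bool:
    # Multi-valued fields in CGSpace CSV exports are joined with "||";
    # this assumes the first affiliation lines up with the first author
    affiliations = row.get("cg.contributor.affiliation", "").split("||")
    return affiliations[0].strip() == ILRI

with open("cgspace-export.csv", newline="") as f:  # hypothetical export
    for row in csv.DictReader(f):
        if first_affiliation_is_ilri(row):
            print(row["dc.title"])
```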
## 2023-03-09
- Apply a bunch of corrections to authors, affiliations, and donors on the new items on DSpace Test
- Meeting with Peter and Abenet about future OpenRXV developments, DSpace 7, etc
- I submitted an [issue on MEL asking them to add provenance metadata when submitting to CGSpace](https://github.com/CodeObia/MEL/issues/11173)
<!-- vim: set sw=2 ts=2: -->