diff --git a/content/posts/2023-03.md b/content/posts/2023-03.md
index 57a6d8cb8..e8a65d7c9 100644
--- a/content/posts/2023-03.md
+++ b/content/posts/2023-03.md
@@ -77,8 +77,99 @@ $ createdb -h localhost -p 5432 -U postgres -O dspacetest --encoding=UNICODE dsp
   - After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
   - After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
   - Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
+  - For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my `crossref-doi-lookup.py` script
+  - After spending some time cleaning the data in OpenRefine I realized we don't get access status from Crossref
+  - We can infer it if the item is Creative Commons, but otherwise I might be able to use [Unpaywall's API](https://unpaywall.org/products/api)
+  - I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA...
+- During this process I updated my `crossref-doi-lookup.py` script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
 - An unscientific comparison of duplicate checking Peter's file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
   - PostgreSQL 12: `0.11s user 0.04s system 0% cpu 19:24.65 total`
   - PostgreSQL 14: `0.12s user 0.04s system 0% cpu 18:13.47 total`
+## 2023-03-08
+
+- I am wondering how to speed up PostgreSQL trgm searches more
+  - I see my local PostgreSQL is using vanilla configuration and I should update some configs:
+
+```console
+localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
+ setting │ unit 
+─────────┼──────
+ 16384   │ 8kB
+(1 row)
+```
+
+- I re-created my PostgreSQL 14 container with some extra memory settings:
+
+```console
+$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
+```
+
+- Then I created a GiST [index on the `metadatavalue` table to try to speed up the trgm similarity operations](https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search):
+
+```console
+localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); # \di+ shows index size is 795MB
+```
+
+- That took a few minutes to build... then the duplicate checker ran in 12 minutes: `0.07s user 0.02s system 0% cpu 12:43.08 total`
+- On a hunch, I tried with a GIN index:
+
+```console
+localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); # \di+ shows index size is 274MB
+```
+
+- This ran in 19 minutes: `0.08s user 0.01s system 0% cpu 19:49.73 total`
+  - So clearly the GiST index is better for this task
+  - I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which will for sure increase the size taken):
+
+```console
+localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); # \di+ shows index size is 716MB, which is less than the previous GiST index...
+```
+
+- This one finished in ten minutes: `0.07s user 0.02s system 0% cpu 10:04.04 total`
+- I might also want to [increase my `work_mem`](https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm) (default 4MB):
+
+```console
+localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
+ setting │ unit 
+─────────┼──────
+ 4096    │ kB
+(1 row)
+```
+
+- After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
+- Wow, I found a [really cool way to fetch URLs in OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html)
+  - I used this to fetch the open access status for each DOI from Unpaywall
+- First, create a new column called "url" based on the DOI that builds the request URL. I used a Jython expression:
+
+```python
+unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
+email = "a.orth+unpaywall@cgiar.org"
+doi = value.replace("https://doi.org/", "")
+request_url = unpaywall_baseurl + doi + '?email=' + email
+
+return request_url
+```
+
+- Then create a new column based on fetching the values in that column. I called it "unpaywall_status"
+- Then you get a JSON blob in each and you can extract the Open Access status with a GREL like `value.parseJson()['is_oa']`
+  - I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones
+- I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
+  - The syntax was hairy because it's marked up with tags like `<jats:p>`, but this got me most of the way there:
+
+```console
+value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
+value.replace("<jats:italic>","").replace("</jats:italic>", "")
+value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
+```
+
+- I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them
+- Meeting with FAO AGRIS team about how to detect duplicates
+  - They are currently using a sha256 hash on titles, which will work, but will only return exact matches
+  - I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches
+- Meeting with Abenet to discuss CGSpace issues
+  - She reminded me about needing a metadata field for first author when the affiliation is ILRI
+  - I said I prefer to write a small script for her that will check the first author and first affiliation... I could do it easily in Python, but would need to put a web frontend on it for her
+  - Unless we could do that in AReS reports somehow
+
diff --git a/docs/2023-03/index.html b/docs/2023-03/index.html
index 11f4339ef..aff01e3ec 100644
--- a/docs/2023-03/index.html
+++ b/docs/2023-03/index.html
@@ -16,7 +16,7 @@ I finally got through with porting the input form from DSpace 6 to DSpace 7
-
+
@@ -38,9 +38,9 @@ I finally got through with porting the input form from DSpace 6 to DSpace 7
   "@type": "BlogPosting",
   "headline": "March, 2023",
   "url": "https://alanorth.github.io/cgspace-notes/2023-03/",
-  "wordCount": "601",
+  "wordCount": "1336",
   "datePublished": "2023-03-01T07:58:36+03:00",
-  "dateModified": "2023-03-07T10:05:12+03:00",
+  "dateModified": "2023-03-07T17:15:26+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -200,8 +200,16 @@ pd.options.mode.nullable_dtypes = True
  • Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)
  • +
  • For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my crossref-doi-lookup.py script +
  • + + +
  • During this process I updated my crossref-doi-lookup.py script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
  • An unscientific comparison of duplicate checking Peter’s file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
  • +

    2023-03-08

    + +
    localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
    + setting │ unit 
    +─────────┼──────
    + 16384   │ 8kB
    +(1 row)
    +
    +
    $ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
    +
    +
    localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); # \di+ shows index size is 795MB
    +
    +
    localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); # \di+ shows index size is 274MB
    +
    +
    localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); # \di+ shows index size is 716MB, which is less than the previous GiST index...
    +
    +
    localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
    + setting │ unit 
    +─────────┼──────
    + 4096    │ kB
    +(1 row)
    +
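For readers unfamiliar with what the trgm similarity operations in the index experiments above actually compute, here is a rough pure-Python sketch. It assumes the behaviour described in the `pg_trgm` documentation (lowercase the text, split it into words, pad each word with two leading spaces and one trailing space, then divide shared trigrams by total distinct trigrams). The function names are hypothetical, and the ASCII-only word splitting is a simplification of what PostgreSQL does.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of trigrams for text, roughly mimicking pg_trgm."""
    grams = set()
    # Simplification: only ASCII letters and digits count as word characters
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # pg_trgm pads each word with two spaces before and one after
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # "word" shares 4 of 11 distinct trigrams with "two words"
    print(round(similarity("word", "two words"), 4))
```

Because every title has to be compared this way against a large column of text values, the choice of index and its signature length has a visible effect on the run times recorded above.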
    +
    unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
    +email = "a.orth+unpaywall@cgiar.org"
    +doi = value.replace("https://doi.org/", "")
    +request_url = unpaywall_baseurl + doi + '?email=' + email
    +
    +return request_url
    +
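Outside OpenRefine, the same two steps (build the request URL, then read `is_oa` from the JSON blob) could be sketched in plain Python as follows. This is only an illustration under assumptions: the function names, the placeholder email, and the "Limited Access" label are hypothetical, the percent-encoding of the email is an addition that the Jython expression does not do, and no network request is made here.

```python
import json
from urllib.parse import quote

UNPAYWALL_BASEURL = "https://api.unpaywall.org/v2/"


def build_unpaywall_url(doi_url: str, email: str) -> str:
    """Build the request URL from a DOI given as a https://doi.org/ link."""
    # Strip the resolver prefix, as the Jython expression does
    doi = doi_url.replace("https://doi.org/", "")
    # Assumption: encode the email so a "+" is not read as a space
    return UNPAYWALL_BASEURL + doi + "?email=" + quote(email, safe="@")


def limited_access_only(response_body: str):
    """Return a label only when the response says is_oa is false, else None."""
    record = json.loads(response_body)
    # Only the "not OA" answer is treated as trustworthy, per the notes
    return "Limited Access" if record.get("is_oa") is False else None


if __name__ == "__main__":
    print(build_unpaywall_url("https://doi.org/10.1234/example", "you@example.org"))
    print(limited_access_only('{"is_oa": false}'))
```

Returning `None` for open access or missing values mirrors the decision to tag only the limited access items and leave the rest for manual review.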
    +
    value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
    +value.replace("<jats:italic>","").replace("</jats:italic>", "")
    +value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
    +
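The three GREL expressions above (take the first `jats:p` paragraph, then drop the italic, sub, and sup tags) could be approximated in Python roughly like this. It is a sketch, not the method used in OpenRefine: the function name is hypothetical, and it assumes the JATS tags carry no attributes, so a regular expression is enough and no XML parser is involved.

```python
import re

# Inline JATS tags removed by the GREL expressions
JATS_INLINE_TAGS = ("italic", "sub", "sup")


def first_jats_paragraph(abstract: str) -> str:
    """Return the text of the first <jats:p> element without inline JATS tags."""
    match = re.search(r"<jats:p>(.*?)</jats:p>", abstract, flags=re.DOTALL)
    # Fall back to the whole string when no paragraph wrapper is present
    text = match.group(1) if match else abstract
    for tag in JATS_INLINE_TAGS:
        text = text.replace(f"<jats:{tag}>", "").replace(f"</jats:{tag}>", "")
    return text.strip()


if __name__ == "__main__":
    sample = "<jats:p>Yield of <jats:italic>Zea mays</jats:italic> under CO<jats:sub>2</jats:sub></jats:p>"
    print(first_jats_paragraph(sample))
```

Like the GREL version, this keeps only the first paragraph, so abstracts with several `jats:p` elements would still need a manual check.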
    diff --git a/docs/categories/index.html b/docs/categories/index.html index 130efb12a..2e65b06c1 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 9574ed26c..9433eb25c 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 49825a2c3..770676fcd 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 298da6b64..a222b9973 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 1263200bc..d80601517 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 5d0d0a024..b2ae27660 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index 13dfcaa74..1dc73c6f3 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html index f37efbbb7..fcf1e9881 100644 --- a/docs/categories/notes/page/7/index.html +++ b/docs/categories/notes/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index c823b9f7a..ae2f7c725 100644 --- a/docs/index.html +++ b/docs/index.html @@ 
-10,7 +10,7 @@ - + diff --git a/docs/page/10/index.html b/docs/page/10/index.html index e999f5eb9..60d974a46 100644 --- a/docs/page/10/index.html +++ b/docs/page/10/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 9d80e8f3a..6545f0222 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 5c5e28097..221e5a8c0 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index 850f19625..eeba2729f 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 1985b9d43..6c69e5c65 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index c2351e236..56d88998b 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 4f7425333..4e00eec9c 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index f0e9101c2..9d6995b78 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index 234e69657..d776c18a3 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index 2bc6197b7..fc7ffca33 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/10/index.html b/docs/posts/page/10/index.html index ebff017e5..a8b1753e0 100644 --- a/docs/posts/page/10/index.html +++ b/docs/posts/page/10/index.html @@ -10,7 +10,7 @@ - + diff --git 
a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 728d4e9da..634f9ca16 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index d9aab2c6e..a76f8cdf8 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 7e69c0eaa..171df0687 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index d2726db30..fdd1c2b36 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 333611573..495c3a4fa 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index 517a80546..89c72b79f 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index c890bfc24..f2922bdd0 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index 54627487c..e95b1e5b4 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index f0c2deb1d..49989e83f 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2023-03-07T10:05:12+03:00 + 2023-03-07T17:15:26+03:00 https://alanorth.github.io/cgspace-notes/ - 2023-03-07T10:05:12+03:00 + 2023-03-07T17:15:26+03:00 
https://alanorth.github.io/cgspace-notes/2023-03/ - 2023-03-07T10:05:12+03:00 + 2023-03-07T17:15:26+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2023-03-07T10:05:12+03:00 + 2023-03-07T17:15:26+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2023-03-07T10:05:12+03:00 + 2023-03-07T17:15:26+03:00 https://alanorth.github.io/cgspace-notes/2023-02/ 2023-03-01T08:30:25+03:00