- After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
- After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs
- Surprisingly, some of the DOIs from Altmetric were not working, though some of our existing ones were not working either (specifically, the Journal of Agricultural Economics seems to have reassigned DOIs)
- For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my `crossref-doi-lookup.py` script (the sketch after this list shows the basic API call)
- After spending some time cleaning the data in OpenRefine I realized we don't get access status from Crossref
- We can infer it if the item is Creative Commons, but otherwise I might be able to use [Unpaywall's API](https://unpaywall.org/products/api)
- I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA...
- During this process I updated my `crossref-doi-lookup.py` script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects
- An unscientific comparison of duplicate checking on Peter's file of ~500 titles, on PostgreSQL 12 versus PostgreSQL 14:
  - PostgreSQL 12: `0.11s user 0.04s system 0% cpu 19:24.65 total`
  - PostgreSQL 14: `0.12s user 0.04s system 0% cpu 18:13.47 total`
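
At its core such a lookup is one request per DOI to the Crossref REST API. A minimal sketch of the idea, assuming the `requests` library and a made-up DOI (the real `crossref-doi-lookup.py` certainly does more):

```python
import requests

# A made-up DOI for illustration; the script reads real ones from a file
doi = "10.1111/1477-9552.12345"

response = requests.get(f"https://api.crossref.org/works/{doi}")
response.raise_for_status()

# Crossref wraps the metadata in a "message" object
message = response.json()["message"]
print(message.get("title"))            # list of titles
print(message.get("ISSN"))             # list of ISSNs
print(message.get("container-title"))  # full journal title
print(message.get("subject"))          # list of subjects
```
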
## 2023-03-08

- I am wondering how to speed up PostgreSQL trgm searches more
- I see my local PostgreSQL is using vanilla configuration and I should update some configs:

```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
 setting │ unit
─────────┼──────
 16384   │ 8kB
(1 row)
```

- I re-created my PostgreSQL 14 container with some extra memory settings:

```console
$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers=1024MB -c random_page_cost=1.1
```

- Then I created a GiST [index on the `metadatavalue` table to try to speed up the trgm similarity operations](https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search):

```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); -- \di+ shows index size is 795MB
```
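
For context, the expensive operations here are trigram similarity comparisons on `text_value`; the duplicate checker's queries are something along these lines (my approximation; the title is made up):

```console
localhost/dspacetest= ☘ SELECT text_value, similarity(text_value, 'Climate change adaptation in East Africa') AS sim FROM metadatavalue WHERE text_value % 'Climate change adaptation in East Africa' ORDER BY sim DESC LIMIT 5;
```
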
- That took a few minutes to build... then the duplicate checker ran in 12 minutes: `0.07s user 0.02s system 0% cpu 12:43.08 total`
- On a hunch, I tried with a GIN index:

```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); -- \di+ shows index size is 274MB
```

- This ran in 19 minutes: `0.08s user 0.01s system 0% cpu 19:49.73 total`
- So clearly the GiST index is better for this task
- I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which will surely increase the index size):

```console
localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); -- \di+ shows index size is 716MB, which is less than the previous GiST index...
```

- This one finished in ten minutes: `0.07s user 0.02s system 0% cpu 10:04.04 total`
- I might also want to [increase my `work_mem`](https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm) (default 4MB):

```console
localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
 setting │ unit
─────────┼──────
 4096    │ kB
(1 row)
```
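
If I do experiment with that it should be enough to override it per session first, rather than re-creating the container again:

```console
localhost/dspacetest= ☘ SET work_mem = '64MB';
SET
```
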
- After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace
- Wow, I found a [really cool way to fetch URLs in OpenRefine](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html)
- I used this to fetch the open access status for each DOI from Unpaywall
- First, create a new column called "url" based on the DOI that builds the request URL. I used a Jython expression:

```python
unpaywall_baseurl = 'https://api.unpaywall.org/v2/'
email = "a.orth+unpaywall@cgiar.org"
doi = value.replace("https://doi.org/", "")
request_url = unpaywall_baseurl + doi + '?email=' + email

return request_url
```

- Then create a new column based on fetching the values in that column. I called it "unpaywall_status"
- Then you get a JSON blob in each cell and can extract the open access status with a GREL expression like `value.parseJson()['is_oa']`
- I checked a handful of results manually and found that Unpaywall's limited access status was more trustworthy than its open access status, so I will only tag the limited access ones
- I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
- The syntax was hairy because the abstracts are marked up with tags like `<jats:p>`, but this got me most of the way there:

```console
value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
value.replace("<jats:italic>","").replace("</jats:italic>", "")
value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
```

- I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them
- Meeting with FAO AGRIS team about how to detect duplicates
  - They are currently using a SHA-256 hash of titles, which will work, but will only catch exact matches
  - I told them to try to normalize the string, drop stop words, etc. to increase the possibility that the hash matches (a rough sketch of the idea follows below)
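
A rough Python sketch of the kind of normalization I mean (my own illustration, not what AGRIS actually runs; the stop word list is hypothetical):

```python
import hashlib
import re

# A tiny, hypothetical stop word list; a real one would be longer and language-aware
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}

def title_hash(title: str) -> str:
    # Lowercase and strip punctuation so trivial differences don't change the hash
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    normalized = " ".join(w for w in words if w not in STOP_WORDS)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# These two now hash identically despite different casing and punctuation
print(title_hash("The Effects of Climate Change on Agriculture"))
print(title_hash("Effects of climate change on agriculture!"))
```
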
- Meeting with Abenet to discuss CGSpace issues
  - She reminded me about needing a metadata field for first author when the affiliation is ILRI
  - I said I prefer to write a small script for her that will check the first author and first affiliation (a sketch follows below)... I could do it easily in Python, but would need to put a web frontend on it for her
    - Unless we could do that in AReS reports somehow
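
A minimal sketch of the check I have in mind, assuming a DSpace metadata CSV export where multiple values are separated with `||` (the column names and input file here are hypothetical):

```python
import csv

with open("items.csv", newline="") as f:
    for row in csv.DictReader(f):
        # DSpace CSV exports separate multiple values with "||"
        first_author = row["dc.contributor.author"].split("||")[0]
        first_affiliation = row["cg.contributor.affiliation"].split("||")[0]

        # Flag items whose first affiliation is ILRI
        if first_affiliation == "International Livestock Research Institute":
            print(f"{row['id']}: {first_author}")
```
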
<!-- vim: set sw=2 ts=2: -->