diff --git a/content/posts/2021-10.md b/content/posts/2021-10.md
index b0d2601bb..dea2f3ee2 100644
--- a/content/posts/2021-10.md
+++ b/content/posts/2021-10.md
@@ -367,5 +367,105 @@ $ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
 - I restarted PostgreSQL (instead of restarting Tomcat), so let's see if that helps
 - I filed [a bug for the DSpace 6/7 duplicate values metadata import issue](https://github.com/DSpace/DSpace/issues/7989)
 - I tested the two patches for removing abandoned submissions from the workflow but unfortunately it seems that they are for the configurable aka XML workflow, and we are using the basic workflow
+- I discussed PostgreSQL issues with some people on the DSpace Slack
+  - Looking at postgresqltuner.pl and https://pgtune.leopard.in.ua I realized that there were some settings that I hadn't changed in a few years that I probably need to re-evaluate
+  - For example, `random_page_cost` is recommended to be 1.1 in the PostgreSQL 10 docs (default is 4.0, but we have used 1 since 2017, when it came up on Hacker News)
+  - Also, `effective_io_concurrency` is recommended to be "hundreds" if you are using an SSD (default is 1)
+- I also enabled the `pg_stat_statements` extension to try to understand what queries are being run the most often, and how long they take
+
+## 2021-10-12
+
+- I looked again at the duplicate items query I was doing with trigrams recently and found a few new things
+  - Looking at the `EXPLAIN ANALYZE` plan for the query I noticed it wasn't using any indexes
+  - I [read on StackExchange](https://dba.stackexchange.com/questions/103821/best-index-for-similarity-function/103823) that, if we want to make use of indexes, we need to use the similarity operator (`%`), not the function `similarity()` because "index support is bound to operators in Postgres, not to functions"
+  - A note about the query plan output is that we need to read it from the bottom up!
+  - So with the similarity operator we need to set the threshold like this now:
+
+```console
+localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
+```
+
+- Next I experimented with using GIN or GiST indexes on `metadatavalue`, but they were slower than the existing DSpace indexes
+  - I tested a few variations of the query I had been using and found it's _much_ faster if I use the similarity operator and keep the condition that object IDs are in the item table...
+
+```console
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
+                                           text_value                                           │           dspace_object_id           
+────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+ Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
+(1 row)
+
+Time: 739.948 ms
+```
+
+- Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!
+- I still don't understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate
+- So to summarize, here are the queries from best to worst, all returning the same result:
+
+```console
+localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
+                                           text_value                                           │           dspace_object_id           
+────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+ Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
+(1 row)
+
+Time: 683.165 ms
+Time: 635.364 ms
+Time: 674.666 ms
+
+localhost/dspace= > DISCARD ALL;
+localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
+                                           text_value                                           │           dspace_object_id           
+────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+ Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
+(1 row)
+
+Time: 1584.765 ms (00:01.585)
+Time: 1665.594 ms (00:01.666)
+Time: 1623.726 ms (00:01.624)
+
+localhost/dspace= > DISCARD ALL;
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
+                                           text_value                                           │           dspace_object_id           
+────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+ Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
+(1 row)
+
+Time: 4028.939 ms (00:04.029)
+Time: 4022.239 ms (00:04.022)
+Time: 4061.820 ms (00:04.062)
+
+localhost/dspace= > DISCARD ALL;
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
+                                           text_value                                           │           dspace_object_id           
+────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
+ Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
+(1 row)
+
+Time: 4358.713 ms (00:04.359)
+Time: 4301.248 ms (00:04.301)
+Time: 4417.909 ms (00:04.418)
+```
+
+## 2021-10-13
+
+- I looked into the [REST API issue where fields without qualifiers throw an HTTP 500](https://github.com/DSpace/DSpace/issues/7946)
+  - The fix is to check if the qualifier is not null AND not empty in dspace-api
+  - I submitted a fix: https://github.com/DSpace/DSpace/pull/7993
+
+## 2021-10-14
+
+- Someone in the DSpace community already posted a fix for the DSpace 6/7 duplicate items export bug!
+  - I tested it and it works so I left feedback: https://github.com/DSpace/DSpace/pull/7995
+- Altmetric support got back to us about the missing DOI–Handle link and said it was due to the TLS certificate chain on CGSpace
+  - I checked and everything is actually working fine, so it could be their backend servers are old and don't support the new Let's Encrypt trust path
+  - I asked them to put me in touch with their backend developers directly
+
+## 2021-10-17
+
+- Revert the ssl-cert change on the Ansible infrastructure scripts so that nginx uses a manually generated "snakeoil" TLS certificate
+  - The ssl-cert one is easier because it's automatic, but it includes the hostname in the bogus cert, so it's an unnecessary leak of information
diff --git a/docs/2021-10/index.html b/docs/2021-10/index.html
index ca8366f15..f8c47f6dd 100644
--- a/docs/2021-10/index.html
+++ b/docs/2021-10/index.html
@@ -25,7 +25,7 @@ So we have 1879/7100 (26.46%) matching already
-
+
@@ -56,9 +56,9 @@ So we have 1879/7100 (26.46%) matching already
   "@type": "BlogPosting",
   "headline": "October, 2021",
   "url": "https://alanorth.github.io/cgspace-notes/2021-10/",
-  "wordCount": "2424",
+  "wordCount": "3240",
   "datePublished": "2021-10-01T11:14:07+03:00",
-  "dateModified": "2021-10-10T16:01:27+03:00",
+  "dateModified": "2021-10-11T20:06:42+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -496,6 +496,120 @@ $ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.p
  • I restarted PostgreSQL (instead of restarting Tomcat), so let’s see if that helps
  • I filed a bug for the DSpace 6/7 duplicate values metadata import issue
  • I tested the two patches for removing abandoned submissions from the workflow but unfortunately it seems that they are for the configurable aka XML workflow, and we are using the basic workflow
  • +
  • I discussed PostgreSQL issues with some people on the DSpace Slack + +
  • +
  • I also enabled the pg_stat_statements extension to try to understand what queries are being run the most often, and how long they take
  • + +

    2021-10-12

    + +
    localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
    +
    +
    localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
    +                                           text_value                                           │           dspace_object_id           
    +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
    + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
    +(1 row)
    +
    +Time: 739.948 ms
    +
    +
    localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
    +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
    +                                           text_value                                           │           dspace_object_id           
    +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
    + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
    +(1 row)
    +
    +Time: 683.165 ms
    +Time: 635.364 ms
    +Time: 674.666 ms
    +
    +localhost/dspace= > DISCARD ALL;
    +localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
    +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
    +                                           text_value                                           │           dspace_object_id           
    +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
    + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
    +(1 row)
    +
    +Time: 1584.765 ms (00:01.585)
    +Time: 1665.594 ms (00:01.666)
    +Time: 1623.726 ms (00:01.624)
    +
    +localhost/dspace= > DISCARD ALL;
    +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
    +                                           text_value                                           │           dspace_object_id           
    +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
    + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
    +(1 row)
    +
    +Time: 4028.939 ms (00:04.029)
    +Time: 4022.239 ms (00:04.022)
    +Time: 4061.820 ms (00:04.062)
    +
    +localhost/dspace= > DISCARD ALL;
    +localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
    +                                           text_value                                           │           dspace_object_id           
    +────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
    + Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
    +(1 row)
    +
    +Time: 4358.713 ms (00:04.359)
    +Time: 4301.248 ms (00:04.301)
    +Time: 4417.909 ms (00:04.418)
    +
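To make the 0.6 threshold in the queries above a bit more concrete, here is a rough Python sketch of how a trigram similarity score can be computed. It approximates the behaviour described in the pg_trgm documentation: lowercase the text, split it into words on non-alphanumeric characters, pad each word with two leading spaces and one trailing space, then divide the number of shared trigrams by the number of distinct trigrams in either string. The function names `trigrams` and `similarity` are my own, and this is not the extension's actual implementation, so scores may differ from PostgreSQL's, especially for non-ASCII text.

```python
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction for plain ASCII text (assumption)."""
    grams = set()
    # Lowercase and keep only alphanumeric runs as words
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    title = ("Traditional knowledge affects soil management ability "
             "of smallholder farmers in marginal areas")
    # An identical title scores 1.0; any edit lowers the score
    print(similarity(title, title))
    print(similarity(title, title.replace("marginal", "remote")))
```

A candidate title would then count as a likely duplicate when its score clears the configured `pg_trgm.similarity_threshold`; the exact behaviour at the boundary is best checked against the pg_trgm documentation rather than this sketch.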

    2021-10-13

    + +

    2021-10-14

    + +

    2021-10-17

+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 6cf147913..e3d123e44 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 1806b0452..1b49306f6 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 25a8903bf..f68933518 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 2e1427f71..3996da070 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index be9f3308f..e234084dd 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 018e4fa79..2de753276 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index 31ed05c36..2c88b9827 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 17ad1478a..9b9486154 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 6987e6272..55d871ac8 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index c8322fbaf..c76654ed0 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index e4b7c7384..9580c5718 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 30b67000a..a087be4f3 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 43617e40a..e3644f3b3 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 96c78b516..c005856fe 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index eef8a0513..4c7d6dc22 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 270f2bba0..46a789bfe 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 47c0e1f4b..897f2ba1a 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index c6f2939a9..4fc365861 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index ca0fea8ff..c96497759 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index e3c5c8854..ffab34e83 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 5b434edaf..0dbcba01d 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 717cec791..85a6aa14a 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index e52a682f8..786de06b8 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index b6921ff9f..014a1698d 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
- 2021-10-10T16:01:27+03:00
+ 2021-10-11T20:06:42+03:00
 https://alanorth.github.io/cgspace-notes/
- 2021-10-10T16:01:27+03:00
+ 2021-10-11T20:06:42+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2021-10-10T16:01:27+03:00
+ 2021-10-11T20:06:42+03:00
 https://alanorth.github.io/cgspace-notes/2021-10/
- 2021-10-10T16:01:27+03:00
+ 2021-10-11T20:06:42+03:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2021-10-10T16:01:27+03:00
+ 2021-10-11T20:06:42+03:00
 https://alanorth.github.io/cgspace-notes/2021-09/
 2021-10-04T11:10:54+03:00