From 403fa49d46a994c050549681900764a13bec041b Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Mon, 14 Dec 2020 19:49:25 +0200
Subject: [PATCH] Update notes for 2020-12-14

---
 content/posts/2020-12.md                | 110 ++++++++++++++++++++
 docs/2020-12/index.html                 | 132 +++++++++++++++++++++++-
 docs/categories/index.html              |   2 +-
 docs/categories/notes/index.html        |   2 +-
 docs/categories/notes/page/2/index.html |   2 +-
 docs/categories/notes/page/3/index.html |   2 +-
 docs/categories/notes/page/4/index.html |   2 +-
 docs/categories/notes/page/5/index.html |   2 +-
 docs/index.html                         |   2 +-
 docs/page/2/index.html                  |   2 +-
 docs/page/3/index.html                  |   2 +-
 docs/page/4/index.html                  |   2 +-
 docs/page/5/index.html                  |   2 +-
 docs/page/6/index.html                  |   2 +-
 docs/page/7/index.html                  |   2 +-
 docs/posts/index.html                   |   2 +-
 docs/posts/page/2/index.html            |   2 +-
 docs/posts/page/3/index.html            |   2 +-
 docs/posts/page/4/index.html            |   2 +-
 docs/posts/page/5/index.html            |   2 +-
 docs/posts/page/6/index.html            |   2 +-
 docs/posts/page/7/index.html            |   2 +-
 docs/sitemap.xml                        |  10 +-
 23 files changed, 264 insertions(+), 28 deletions(-)

diff --git a/content/posts/2020-12.md b/content/posts/2020-12.md
index 8ec3491e2..15ad3fafb 100644
--- a/content/posts/2020-12.md
+++ b/content/posts/2020-12.md
@@ -241,4 +241,114 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-final
 $ curl -XDELETE http://localhost:9200/openrxv-items-temp
 ```
 
+- Peter asked me for a list of all submitters and approvers that were active recently on CGSpace
+  - I can probably extract that from the `dc.description.provenance` field, for example any that contains a 2020 date:
+
+```console
+localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
+```
+
+## 2020-12-14
+
+- The re-harvesting finished last night on AReS but there are no records in the `openrxv-items-final` index
+  - Strangely, there are 99,000 items in the temp index:
+
+```console
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
+{
+   "count" : 99992,
+   "_shards" : {
+      "skipped" : 0,
+      "total" : 1,
+      "failed" : 0,
+      "successful" : 1
+   }
+}
+```
+
+- I'm going to try to [clone](https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-clone-index.html) the temp index to the final one...
+  - First, set the `openrxv-items-temp` index to block writes (read only) and then clone it to `openrxv-items-final`:
+
+```console
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
+{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+```
+
+- Now I see that the `openrxv-items-final` index has items, but there are still none in AReS Explorer UI!
+
+```console
+$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
+{
+  "count" : 99992,
+  "_shards" : {
+    "total" : 1,
+    "successful" : 1,
+    "skipped" : 0,
+    "failed" : 0
+  }
+}
+```
+
+- The api logs show this from last night after the harvesting:
+
+```console
+[Nest] 92   - 12/13/2020, 1:58:52 PM   [HarvesterService] Starting Harvest
+[Nest] 92   - 12/13/2020, 10:50:20 PM   [FetchConsumer] OnGlobalQueueDrained
+[Nest] 92   - 12/13/2020, 11:00:20 PM   [PluginsConsumer] OnGlobalQueueDrained
+[Nest] 92   - 12/13/2020, 11:00:20 PM   [HarvesterService] reindex function is called
+(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
+    at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
+    at IncomingMessage.emit (events.js:326:22)
+    at endReadableNT (_stream_readable.js:1223:12)
+    at processTicksAndRejections (internal/process/task_queues.js:84:21)
+```
+
+- But I'm not sure why the frontend doesn't show any data despite there being documents in the index...
+- I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the `openrxv-items` index
+- I cloned the `openrxv-items-final` index to `openrxv-items` index and now I see items in the explorer UI
+- The PDF report was broken and I looked in the API logs and saw this:
+
+```console
+(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
+    at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
+    at processTicksAndRejections (internal/process/task_queues.js:97:5)
+```
+
+- I installed `unoconv` in the backend api container and now it works... but I wonder why this changed...
+- Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
+  - Peter noticed that [this item](https://hdl.handle.net/10568/110133) from the [ILRI policy and research briefs](https://cgspace.cgiar.org/handle/10568/24450) collection is missing in AReS, despite it being added one month ago in CGSpace and me harvesting on AReS last night
+    - The item appears fine in the REST API when I check the items in that collection
+  - Peter also noticed that [this item](https://hdl.handle.net/10568/110447) appears twice in AReS
+    - The item is _not_ duplicated on CGSpace or in the REST API
+  - We noticed that there are 136 items in the ILRI policy and research briefs collection according to AReS, yet on CGSpace there are only 132
+    - This is confirmed in the REST API (using [query-json](https://github.com/davesnx/query-json)):
+
+```
+$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
+$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
+$ query-json '.items | length' /tmp/policy1.json
+100
+$ query-json '.items | length' /tmp/policy2.json
+32
+```
+
+- I realized that the issue of missing/duplicate items in AReS might be because of this [REST API bug that causes /items to return items in non-deterministic order](https://jira.lyrasis.org/browse/DS-3849)
+- I decided to cherry-pick the following two patches from DSpace 6.4 into our `6_x-prod` (6.3) branch:
+  - High CPU usage when calling the collection_id/items REST endpoint
+    - Jira: https://jira.lyrasis.org/browse/DS-4342
+    - c2e6719fa763e291b81b2d61da2f8c758fe38ff3
+  - REST API items resource returns items in non-deterministic order
+    - Jira: https://jira.lyrasis.org/browse/DS-3849
+    - 2a2ea0cb5d03e6da9355a2eff12aad667e465433
+- After deploying the REST API fixes I decided to harvest from AReS again to see if the missing and duplicate items get fixed
+  - I made a backup of the current `openrxv-items-temp` index just in case:
+
+```console
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+```
+
diff --git a/docs/2020-12/index.html b/docs/2020-12/index.html
index 8b901386b..427762f51 100644
--- a/docs/2020-12/index.html
+++ b/docs/2020-12/index.html
@@ -20,7 +20,7 @@ I started processing those (about 411,000 records):
-
+
@@ -46,9 +46,9 @@ I started processing those (about 411,000 records):
   "@type": "BlogPosting",
   "headline": "December, 2020",
   "url": "https://alanorth.github.io/cgspace-notes/2020-12/",
-  "wordCount": "1378",
+  "wordCount": "2037",
   "datePublished": "2020-12-01T11:32:54+02:00",
-  "dateModified": "2020-12-10T23:43:09+02:00",
+  "dateModified": "2020-12-13T16:16:10+02:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -364,6 +364,132 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
$ curl -XDELETE http://localhost:9200/openrxv-items-final
 $ curl -XDELETE http://localhost:9200/openrxv-items-temp
+
+
localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
+
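A possible follow-up to the query above, as a minimal sketch: once the matching `text_value` rows are exported one per line, the names can be reduced to a distinct list. This assumes the provenance wording is `Submitted by NAME (EMAIL) on DATE` or `Approved for entry into archive by NAME (EMAIL) on DATE`. The function name `extract_people` is hypothetical, and real provenance values may span several lines, which this does not handle.

```shell
#!/bin/sh
# Sketch only: list distinct submitters/approvers for a given year.
# Assumption: each input line is one provenance value that starts with
# "Submitted by ..." or "Approved for entry into archive by ...".
extract_people() {
  # $1 = four-digit year, for example 2020
  year="$1"
  grep -E "^(Submitted by|Approved for entry into archive by) .* on ${year}-[0-9]{2}-[0-9]{2}" \
    | sed -E "s/^(Submitted by|Approved for entry into archive by) (.*) on ${year}-[0-9]{2}-[0-9]{2}.*\$/\2/" \
    | sort -u
}
```

Usage would look like `extract_people 2020 < /tmp/provenance.txt`, where the input file is a hypothetical export of the query results.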

2020-12-14

+ +
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
+{
+   "count" : 99992,
+   "_shards" : {
+      "skipped" : 0,
+      "total" : 1,
+      "failed" : 0,
+      "successful" : 1
+   }
+}
+
+
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
+{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+
+
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
+{
+  "count" : 99992,
+  "_shards" : {
+    "total" : 1,
+    "successful" : 1,
+    "skipped" : 0,
+    "failed" : 0
+  }
+}
+
+
[Nest] 92   - 12/13/2020, 1:58:52 PM   [HarvesterService] Starting Harvest
+[Nest] 92   - 12/13/2020, 10:50:20 PM   [FetchConsumer] OnGlobalQueueDrained
+[Nest] 92   - 12/13/2020, 11:00:20 PM   [PluginsConsumer] OnGlobalQueueDrained
+[Nest] 92   - 12/13/2020, 11:00:20 PM   [HarvesterService] reindex function is called
+(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
+    at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
+    at IncomingMessage.emit (events.js:326:22)
+    at endReadableNT (_stream_readable.js:1223:12)
+    at processTicksAndRejections (internal/process/task_queues.js:84:21)
+
+
(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
+    at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
+    at processTicksAndRejections (internal/process/task_queues.js:97:5)
+
+
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
+$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
+$ query-json '.items | length' /tmp/policy1.json
+100
+$ query-json '.items | length' /tmp/policy2.json
+32
+
+
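The per-page counts above cannot show whether an item appears on both pages, which is what a non-deterministic ordering could cause with offset paging. A minimal sketch of a check, assuming the item identifiers from each page have already been written one per line to files (the file layout and the function names `duplicate_ids` and `distinct_count` are hypothetical):

```shell
#!/bin/sh
# Sketch only: compare pages of item identifiers, one identifier per line.

# Print every identifier that occurs more than once across the given files.
duplicate_ids() {
  cat "$@" | sort | uniq -d
}

# Print how many distinct identifiers the given files contain in total.
distinct_count() {
  cat "$@" | sort -u | wc -l | tr -d ' '
}
```

If `duplicate_ids` prints anything, or `distinct_count` is lower than the sum of the page lengths, some item was returned on more than one page and another was probably skipped.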
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
 
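The block-writes, clone, unblock sequence is used twice in these notes, so it could be wrapped in a small helper. This is a sketch under assumptions, not part of the original notes: the function names, the `ES_URL` variable and the `CURL` override (handy for a dry run with `CURL=echo`) are all hypothetical. It relies only on the `_settings` and `_clone` endpoints already shown above.

```shell
#!/bin/sh
# Sketch only: clone an Elasticsearch index and restore write access afterwards.
# ES_URL and CURL are assumed, overridable settings (not from the original notes).
ES_URL="${ES_URL:-http://localhost:9200}"
CURL="${CURL:-curl}"

set_write_block() {
  # $1 = index name, $2 = true or false
  $CURL -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "{\"settings\": {\"index.blocks.write\": $2}}"
}

clone_index() {
  # $1 = source index, $2 = target index (the target must not exist yet)
  set_write_block "$1" true || return 1
  $CURL -s -X POST "$ES_URL/$1/_clone/$2"
  # Note: without curl's -f flag this status reflects transport errors only,
  # not HTTP error responses from Elasticsearch.
  status=$?
  # Always lift the write block again, even if the clone request failed.
  set_write_block "$1" false
  return "$status"
}
```

For example, `clone_index openrxv-items-temp openrxv-items-2020-12-14` would repeat the backup above, and running it with `CURL=echo` prints the three requests instead of sending them.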
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 20f3c1e0f..dc5fd8090 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index b1ed36e00..13f16b0e8 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 60de3e9c0..ad9f0bf3f 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 9f3b2d367..c30ef4942 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index ce0aff192..84acd820d 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index ec342e1b5..13ff454c8 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 0b6a514a6..b0927d9b1 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 077253f2c..db2a86f3d 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 1521b4c8e..1f1cc8983 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index ffee9787d..dba8579df 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 42b731c7f..362ba51fd 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 5e515503e..668449f25 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 0b9f4033d..5f74603cf 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 7e466f22d..9447796f0 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 9b64074b6..86df01283 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 0bf8e3c9a..396756935 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 9fc289170..b2f8e56cd 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 83d079132..73c23112c 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 2605d2062..452eb9da4 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 696cf2586..aeca6cfc4 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 7a2fe863b..81b5da041 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/categories/
-2020-12-10T23:43:09+02:00
+2020-12-13T16:16:10+02:00
 https://alanorth.github.io/cgspace-notes/
-2020-12-10T23:43:09+02:00
+2020-12-13T16:16:10+02:00
 https://alanorth.github.io/cgspace-notes/2020-12/
-2020-12-10T23:43:09+02:00
+2020-12-13T16:16:10+02:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2020-12-10T23:43:09+02:00
+2020-12-13T16:16:10+02:00
 https://alanorth.github.io/cgspace-notes/posts/
-2020-12-10T23:43:09+02:00
+2020-12-13T16:16:10+02:00