Update notes for 2020-12-14

This commit is contained in:
2020-12-14 19:49:25 +02:00
parent 20f00d1279
commit 403fa49d46
23 changed files with 264 additions and 28 deletions

View File

@ -241,4 +241,114 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-final
$ curl -XDELETE http://localhost:9200/openrxv-items-temp
```
- Peter asked me for a list of all submitters and approvers that were active recently on CGSpace
- I can probably extract that from the `dc.description.provenance` field, for example any that contains a 2020 date:
```console
localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
```
## 2020-12-14
- The re-harvesting finished last night on AReS but there are no records in the `openrxv-items-final` index
- Strangely, there are 99,000 items in the temp index:
```console
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
{
"count" : 99992,
"_shards" : {
"skipped" : 0,
"total" : 1,
"failed" : 0,
"successful" : 1
}
}
```
- I'm going to try to [clone](https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-clone-index.html) the temp index to the final one...
- First, set the `openrxv-items-temp` index to block writes (read only) and then clone it to `openrxv-items-final`:
```console
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```
- Now I see that the `openrxv-items-final` index has items, but there are still none in AReS Explorer UI!
```console
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
"count" : 99992,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
```
- The api logs show this from last night after the harvesting:
```console
[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
```
- But I'm not sure why the frontend doesn't show any data despite there being documents in the index...
- I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the `openrxv-items` index
- I cloned the `openrxv-items-final` index to `openrxv-items` index and now I see items in the explorer UI
- The PDF report was broken and I looked in the API logs and saw this:
```console
(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
```
- I installed `unoconv` in the backend api container and now it works... but I wonder why this changed...
- Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
- Peter noticed that [this item](https://hdl.handle.net/10568/110133) from the [ILRI policy and research briefs](https://cgspace.cgiar.org/handle/10568/24450) collection is missing in AReS, despite it being added one month ago in CGSpace and me harvesting on AReS last night
- The item appears fine in the REST API when I check the items in that collection
- Peter also noticed that [this item](https://hdl.handle.net/10568/110447) appears twice in AReS
- The item is _not_ duplicated on CGSpace or in the REST API
- We noticed that there are 136 items in the ILRI policy and research briefs collection according to AReS, yet on CGSpace there are only 132
- This is confirmed in the REST API (using [query-json](https://github.com/davesnx/query-json)):
```
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
$ query-json '.items | length' /tmp/policy1.json
100
$ query-json '.items | length' /tmp/policy2.json
32
```
- I realized that the issue of missing/duplicate items in AReS might be because of this [REST API bug that causes /items to return items in non-deterministic order](https://jira.lyrasis.org/browse/DS-3849)
- I decided to cherry-pick the following two patches from DSpace 6.4 into our `6_x-prod` (6.3) branch:
- High CPU usage when calling the collection_id/items REST endpoint
- Jira: https://jira.lyrasis.org/browse/DS-4342
- c2e6719fa763e291b81b2d61da2f8c758fe38ff3
- REST API items resource returns items in non-deterministic order
- Jira: https://jira.lyrasis.org/browse/DS-3849
- 2a2ea0cb5d03e6da9355a2eff12aad667e465433
- After deploying the REST API fixes I decided to harvest from AReS again to see if the missing and duplicate items get fixed
- I made a backup of the current `openrxv-items-temp` index just in case:
```console
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```
<!-- vim: set sw=2 ts=2: -->