Add notes for 2021-03-22

2021-03-23 09:34:40 +02:00
parent ba9cad82a1
commit f135383609
96 changed files with 379 additions and 123 deletions

@ -367,4 +367,137 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
- I also made some minor optimizations in the Pandas code
- I [tagged version 0.4.7 of csv-metadata-quality on GitHub](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.7)
## 2021-03-18
- I added the ability to check for, and fix, "mojibake" characters in csv-metadata-quality
## 2021-03-21
- Last week Atmire asked me which browser I was using to test the duplicate checker, which I had [reported](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934) as not loading
- I tried to load it in Chrome and it works... hmmm
- Back up the current `openrxv-items-final` index to start a fresh AReS Harvest:
```console
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```
- Then start harvesting in the AReS Explorer admin UI
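
The three backup requests above could be wrapped in a small helper so the write block is always lifted again afterwards. This is only a sketch: `ES_URL`, `write_block_payload`, `set_write_block` and `backup_index` are names made up for illustration, and the requests themselves are the same ones used above.

```shell
#!/bin/sh
# Sketch only: ES_URL and all function names here are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the settings payload that turns the index write block on or off.
# $1 is "true" or "false".
write_block_payload() {
  printf '{"settings": {"index.blocks.write": %s}}' "$1"
}

# Apply the write block setting to an index.
# $1 = index name, $2 = true|false
set_write_block() {
  curl -s -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(write_block_payload "$2")"
}

# Clone index $1 to $2, then re-enable writes on $1 regardless of how
# the clone request went. Note that curl without --fail exits 0 even on
# an HTTP error, so the response body still needs to be read.
backup_index() {
  src="$1"
  dst="$2"
  set_write_block "$src" true || return 1
  curl -s -X POST "$ES_URL/$src/_clone/$dst"
  status=$?
  set_write_block "$src" false
  return "$status"
}
```

For example, `backup_index openrxv-items-final openrxv-items-final-2021-03-21` would repeat the backup above.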
## 2021-03-22
- The harvesting on AReS yesterday completed, but somehow I have twice the number of items:
```console
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 206204,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```
- Hmmm and even my backup index has a strange number of items:
```console
$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
{
  "count" : 844,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```
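
To compare the counts of several indexes side by side, the number can be pulled out of each `_count` response. A minimal sketch, assuming `python3` is available on the host; `ES_URL`, `parse_count` and `print_counts` are hypothetical names:

```shell
#!/bin/sh
# Sketch only: ES_URL and the function names are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Read a _count response on stdin and print only the number.
parse_count() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["count"])'
}

# Print "<index> <count>" for each index name given as an argument.
print_counts() {
  for index in "$@"; do
    printf '%s %s\n' "$index" "$(curl -s "$ES_URL/$index/_count" | parse_count)"
  done
}
```

For example: `print_counts openrxv-items-final openrxv-items-final-2021-03-21`.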
- I deleted all indexes and re-created the openrxv-items alias:
```console
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-temp": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    }
```
- Then I started a new harvest
- I switched the Node.js in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) to v12 since v10 will cease to be supported soon
- I re-deployed DSpace Test (linode26) with Node.js 12 and restarted the server
- The AReS harvest finally finished, with 1047 pages of items, but the `openrxv-items-final` index is empty and the `openrxv-items-temp` index has about 103,000 items:
```console
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 103162,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```
- I tried to clone the temp index to the final, but got an error:
```console
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}%
```
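
The clone fails because the target index already exists, even though it is empty. A script could catch this case either by checking for the target first or by reading the error type from the response. A rough sketch, assuming `python3` on the host; `ES_URL`, `index_exists` and `es_error_type` are hypothetical names:

```shell
#!/bin/sh
# Sketch only: ES_URL and the function names are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Succeed (exit 0) only if a HEAD request for the index returns HTTP 200.
index_exists() {
  code=$(curl -s -o /dev/null -I -w '%{http_code}' "$ES_URL/$1")
  [ "$code" = "200" ]
}

# Read an Elasticsearch error response on stdin and print its top-level
# error type, for example resource_already_exists_exception.
es_error_type() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["error"]["type"])'
}
```

For example, `index_exists openrxv-items-final && echo "target exists, clone would fail"` could run before attempting the clone.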
- I looked in the Docker logs for Elasticsearch and saw a few memory errors:
```console
java.lang.OutOfMemoryError: Java heap space
```
- According to `/usr/share/elasticsearch/config/jvm.options` in the Elasticsearch container the default JVM heap is 1g
- I see the running Java process has `-Xms1g -Xmx1g` in its process invocation, so it must indeed be using 1g
- We can [change the heap size with the ES_JAVA_OPTS environment variable](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html)
- Or perhaps better, we should [use a jvm.options.d file](https://www.elastic.co/guide/en/elasticsearch/reference/master/jvm-options.html) because if you use the environment variable it overrides all other JVM options from the default `jvm.options`
- I tried to set memory to 1536m by binding an options file and restarting the container, but it didn't seem to work
- Nevertheless, after restarting I see 103,000 items in the Explorer...
- But the indexes are still kinda messed up... `openrxv-items` is an alias of the wrong index!
```console
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
```
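
One way to point the alias back at the right index is a single `_aliases` request with a `remove` and an `add` action, which Elasticsearch applies atomically. This is a sketch of that approach, not necessarily the exact commands used here; `ES_URL`, `alias_swap_payload` and `move_alias` are hypothetical names:

```shell
#!/bin/sh
# Sketch only: ES_URL and the function names are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build an _aliases payload that moves alias $1 from index $2 to index $3.
alias_swap_payload() {
  printf '{"actions": [{"remove": {"index": "%s", "alias": "%s"}}, {"add": {"index": "%s", "alias": "%s"}}]}' \
    "$2" "$1" "$3" "$1"
}

# Send the payload to the _aliases endpoint.
# $1 = alias, $2 = index that currently has it, $3 = index that should have it
move_alias() {
  curl -s -X POST "$ES_URL/_aliases" \
    -H 'Content-Type: application/json' \
    -d "$(alias_swap_payload "$1" "$2" "$3")"
}
```

For example: `move_alias openrxv-items openrxv-items-temp openrxv-items-final`.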
## 2021-03-23
- For reference you can also get the Elasticsearch JVM stats from the API:
```console
$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
```
- I re-deployed AReS with 1.5GB of heap using the `ES_JAVA_OPTS` environment variable
- It turns out that this *is* the recommended way to set the heap: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html
- Then I fixed the aliases to make sure `openrxv-items` was an alias of `openrxv-items-final`, similar to what I did a few weeks ago
- I re-created the temp index:
```console
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
```
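
The heap change above can be double-checked from the JVM stats. A minimal sketch, assuming `python3` on the host and that the `_nodes/jvm` response lists each node's maximum heap under `jvm.mem.heap_max_in_bytes`; `heap_opts` and `parse_heap_max` are hypothetical names:

```shell
#!/bin/sh
# Sketch only: the function names are assumptions.

# Build an ES_JAVA_OPTS value that sets min and max heap to the same size.
# $1 is a JVM size such as 1536m.
heap_opts() {
  printf '%s' "-Xms$1 -Xmx$1"
}

# Read a _nodes/jvm response on stdin and print each node's max heap in bytes.
parse_heap_max() {
  python3 -c '
import json, sys
for node in json.load(sys.stdin)["nodes"].values():
    print(node["jvm"]["mem"]["heap_max_in_bytes"])
'
}
```

The value from `heap_opts 1536m` would go into the container's `ES_JAVA_OPTS` environment variable, and after a restart `curl -s 'http://localhost:9200/_nodes/jvm' | parse_heap_max` should print a figure near 1.5 GiB for each node if the setting took effect.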
<!-- vim: set sw=2 ts=2: -->