cgspace-notes/content/posts/2020-12.md

397 lines
21 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "December, 2020"
date: 2020-12-01T11:32:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2020-12-01
- Atmire responded about the issue with duplicate data in our Solr statistics
- They noticed that some records in the statistics-2015 core haven't been migrated with the AtomicStatisticsUpdateCLI tool yet and assumed that I haven't migrated any of the records yet
- That's strange, as I checked all ten cores and 2015 is the only one with some unmigrated documents, as according to the `cua_version` field
- I started processing those (about 411,000 records):
<!--more-->
```console
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
```
- AReS went down when the `renew-letsencrypt` service stopped the `angular_nginx` container in the pre-update hook and failed to bring it back up
- I ran all system updates on the host and rebooted it and AReS came back up OK
## 2020-12-02
- Udana emailed me yesterday to ask why the CGSpace usage statistics were showing "No Data"
- I noticed a message in the Solr Admin UI that one of the statistics cores failed to load, but it is up and I can query it...
- Nevertheless, I restarted Tomcat a few times to see if all cores would come up without an error message, but had no success (despite that all cores ARE up and I can query them, _sigh_)
- I think I will move all the Solr yearly statistics back into the main statistics core
- Start testing export/import of yearly Solr statistics data into the main statistics core on DSpace Test, for example:
```
$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
```
- I deployed Tomcat 7.0.107 on DSpace Test (CGSpace is still Tomcat 7.0.104)
- I finished migrating all the statistics from the yearly shards back to the main core
## 2020-12-05
- I deleted all the yearly statistics shards and restarted Tomcat on DSpace Test (linode26)
## 2020-12-06
- Looking into the statistics on DSpace Test after I migrated them back to the main core
- All stats are working as expected... indexing time for the DSpace Statistics API is the same... and I don't even see a difference in the JVM or memory stats in Munin other than a minor jump last week when I was processing them
- I will migrate them on CGSpace too I think
- First I will start with the statistics-2010 and statistics-2015 cores because they were the ones that were failing to load recently (despite actually being available in Solr WTF)
![Error message in Solr admin UI about the statistics-2010 core failing to load](/cgspace-notes/2020/12/solr-statistics-2010-failed.png)
- First the 2010 core:
```console
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
```
- Judging by the DSpace logs all these cores had a problem starting up in the last month:
```console
# grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
24 statistics-2010
24 statistics-2015
18 statistics-2016
6 statistics-2018
```
- The message is always this:
```
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
```
- I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
- I removed the statistics-2010, statistics-2015, statistics-2016, and statistics-2018 cores and restarted Tomcat and _all the statistics cores came up OK and the CUA statistics are OK_!
## 2020-12-07
- Run `dspace cleanup -v` on CGSpace to clean up deleted bitstreams
- Atmire sent a [pull request](https://github.com/ilri/DSpace/pull/457) to address the duplicate owningComm and owningColl
- Built and deployed it on DSpace Test but I am not sure how to run it yet
- I sent feedback to Atmire on their tracker: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
- Abenet and Tezira are having issues with committing to the archive in their workflow
- I looked at the server and indeed the locks and transactions are back up:
![PostgreSQL Transactions day](/cgspace-notes/2020/12/postgres_transactions_ALL-day.png)
![PostgreSQL Locks day](/cgspace-notes/2020/12/postgres_locks_ALL-day.png)
![PostgreSQL Locks day](/cgspace-notes/2020/12/postgres_querylength_ALL-day.png)
![PostgreSQL Connections day](/cgspace-notes/2020/12/postgres_connections_ALL-day.png)
- There are apparently 1,700 locks right now:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1739
```
## 2020-12-08
- Atmire sent some instructions for using the DeduplicateValuesProcessor
- I modified `atmire-cua-update.xml` as they instructed, but I get a million errors like this when I run AtomicStatisticsUpdateCLI with that configuration:
```
Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: java.lang.UnsupportedOperationException
at org.apache.solr.common.SolrDocument$1.entrySet(SolrDocument.java:256)
at java.util.HashMap.putMapEntries(HashMap.java:512)
at java.util.HashMap.<init>(HashMap.java:490)
at com.atmire.statistics.util.update.atomic.record.Record.getFieldValuesMap(SourceFile:86)
at com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor.process(SourceFile:38)
at com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor.visit(SourceFile:34)
at com.atmire.statistics.util.update.atomic.record.UsageRecord.accept(SourceFile:23)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:301)
... 10 more
```
- I sent some feedback to Atmire
- They responded with an updated CUA (6.x-4.1.10-ilri-RC7) that has a fix for the duplicates processor _and_ a possible fix for the database locking issues (a bug in CUASolrLoggerServiceImpl that causes an infinite loop and a Tomcat timeout)
- I deployed the changes on DSpace Test and CGSpace, hopefully it will fix both issues!
- In other news, after I restarted Tomcat on CGSpace the statistics-2013 core didn't come back up properly, so I exported it and imported it into the main statistics core like I did for the others a few days ago
- Sync DSpace Test with CGSpace's Solr, PostgreSQL database, and assetstore...
## 2020-12-09
- I was running the AtomicStatisticsUpdateCLI to remove duplicates on DSpace Test but it failed near the end of the statistics core (after 20 hours or so) with a memory error:
```
Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
Run 1 —  67% — 10,000/14,935 docs — 6m 6s — 6m 6s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.noggit.CharArr.toString(CharArr.java:164)
```
- I increased the JVM heap to 2048m and tried again, but it failed with a memory error again...
- I increased the JVM heap to 4096m and tried again, but it failed with another error:
```
Successfully finished updating Solr Storage Reports | Wed Dec 09 15:53:40 CET 2020
Exception: parsing error
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: parsing error
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:530)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: org.apache.solr.common.SolrException: parsing error
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:45)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
... 14 more
Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 8192; actual size: 2843)
at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:152)
...
```
## 2020-12-10
- The statistics-2019 core finished processing the duplicate removal so I started the statistics-2017 core
- Peter asked me to add ONE HEALTH to ILRI subjects on CGSpace
- A few items that got "lost" after approval during the database issues earlier this week seem to have gone back into their workflows
- Abenet approved them again and they got new handles, phew
- Abenet was having an issue with the date filter on AReS and it turns out that it's the same `.keyword` issue I had noticed before that causes the filter to stop working
- I fixed the filter to use the correct field name and filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/63
- I checked the Solr statistics on DSpace Test to see if the Atmire duplicates remover was working, but now I see a comical amount of duplicates...
![Solr stats with dozens of duplicates](/cgspace-notes/2020/12/solr-stats-duplicates.png)
- I sent feedback about this to Atmire
- I will re-sync the Solr stats from CGSpace so we can try again...
- In other news, it has been a few days since we deployed the fix for the database locking issue and things seem much better now:
![PostgreSQL connections all week](/cgspace-notes/2020/12/postgres_connections_ALL-week.png)
![PostgreSQL locks all week](/cgspace-notes/2020/12/postgres_locks_ALL-week.png)
## 2020-12-13
- I tried to harvest a few times on OpenRXV in the last few days and every time it appends all the new records to the items index instead of overwriting it:
![OpenRXV duplicates](/cgspace-notes/2020/12/openrxv-duplicates.png)
- I can see it in the `openrxv-items-final` index:
```console
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
{
"_shards" : {
"failed" : 0,
"skipped" : 0,
"successful" : 1,
"total" : 1
},
"count" : 299922
}
```
- I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/64
- For now I will try to delete the index and start a re-harvest in the Admin UI:
```
$ curl -XDELETE http://localhost:9200/openrxv-items-final
{"acknowledged":true}%
```
- Moayad said he's working on the harvesting so I stopped it for now to re-deploy his latest changes
- I updated Tomcat to version 7.0.107 on CGSpace (linode18), ran all updates, and restarted the server
- I deleted both items indexes and restarted the harvesting:
```
$ curl -XDELETE http://localhost:9200/openrxv-items-final
$ curl -XDELETE http://localhost:9200/openrxv-items-temp
```
- Peter asked me for a list of all submitters and approvers that were active recently on CGSpace
- I can probably extract that from the `dc.description.provenance` field, for example any that contains a 2020 date:
```console
localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
```
## 2020-12-14
- The re-harvesting finished last night on AReS but there are no records in the `openrxv-items-final` index
- Strangely, there are 99,000 items in the temp index:
```console
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
{
"count" : 99992,
"_shards" : {
"skipped" : 0,
"total" : 1,
"failed" : 0,
"successful" : 1
}
}
```
- I'm going to try to [clone](https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-clone-index.html) the temp index to the final one...
- First, set the `openrxv-items-temp` index to block writes (read only) and then clone it to `openrxv-items-final`:
```console
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```
- Now I see that the `openrxv-items-final` index has items, but there are still none in AReS Explorer UI!
```console
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
"count" : 99992,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
```
- The api logs show this from last night after the harvesting:
```console
[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
```
- But I'm not sure why the frontend doesn't show any data despite there being documents in the index...
- I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the `openrxv-items` index
- I cloned the `openrxv-items-final` index to `openrxv-items` index and now I see items in the explorer UI
- The PDF report was broken and I looked in the API logs and saw this:
```console
(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
```
- I installed `unoconv` in the backend api container and now it works... but I wonder why this changed...
- Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
- Peter noticed that [this item](https://hdl.handle.net/10568/110133) from the [ILRI policy and research briefs](https://cgspace.cgiar.org/handle/10568/24450) collection is missing in AReS, despite it being added one month ago in CGSpace and me harvesting on AReS last night
- The item appears fine in the REST API when I check the items in that collection
- Peter also noticed that [this item](https://hdl.handle.net/10568/110447) appears twice in AReS
- The item is _not_ duplicated on CGSpace or in the REST API
- We noticed that there are 136 items in the ILRI policy and research briefs collection according to AReS, yet on CGSpace there are only 132
- This is confirmed in the REST API (using [query-json](https://github.com/davesnx/query-json)):
```
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
$ query-json '.items | length' /tmp/policy1.json
100
$ query-json '.items | length' /tmp/policy2.json
32
```
- I realized that the issue of missing/duplicate items in AReS might be because of this [REST API bug that causes /items to return items in non-deterministic order](https://jira.lyrasis.org/browse/DS-3849)
- I decided to cherry-pick the following two patches from DSpace 6.4 into our `6_x-prod` (6.3) branch:
- High CPU usage when calling the collection_id/items REST endpoint
- Jira: https://jira.lyrasis.org/browse/DS-4342
- c2e6719fa763e291b81b2d61da2f8c758fe38ff3
- REST API items resource returns items in non-deterministic order
- Jira: https://jira.lyrasis.org/browse/DS-3849
- 2a2ea0cb5d03e6da9355a2eff12aad667e465433
- After deploying the REST API fixes I decided to harvest from AReS again to see if the missing and duplicate items get fixed
- I made a backup of the current `openrxv-items-temp` index just in case:
```console
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```
## 2020-12-15
- After the re-harvest last night there were 200,000 items in the `openrxv-items-temp` index again
- I cleared the core and started a re-harvest, but Peter sent me a bunch of author corrections for CGSpace so I decided to cancel it until after I apply them and re-index Discovery
- I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter's CSVs) and then applied them using the `fix-metadata-values.py` script:
```console
$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
```
- Since I was re-indexing Discovery anyways I decided to check for any uppercase AGROVOC and lowercase them:
```console
dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 406
dspace=# COMMIT;
COMMIT
```
- I also updated the Font Awesome icon classes for version 5 syntax:
```console
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
UPDATE 74
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
UPDATE 74
dspace=# COMMIT;
```
- Then I started a full Discovery re-index:
```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
```
- Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow
<!-- vim: set sw=2 ts=2: -->