cgspace-notes/content/posts/2020-12.md

24 KiB

title date author categories
December, 2020 2020-12-01T11:32:54+02:00 Alan Orth
Notes

2020-12-01

  • Atmire responded about the issue with duplicate data in our Solr statistics
    • They noticed that some records in the statistics-2015 core haven't been migrated with the AtomicStatisticsUpdateCLI tool yet and assumed that I haven't migrated any of the records yet
    • That's strange, as I checked all ten cores and 2015 is the only one with some unmigrated documents, as according to the cua_version field
    • I started processing those (about 411,000 records):
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
  • AReS went down when the renew-letsencrypt service stopped the angular_nginx container in the pre-update hook and failed to bring it back up
    • I ran all system updates on the host and rebooted it and AReS came back up OK

2020-12-02

  • Udana emailed me yesterday to ask why the CGSpace usage statistics were showing "No Data"
    • I noticed a message in the Solr Admin UI that one of the statistics cores failed to load, but it is up and I can query it...
    • Nevertheless, I restarted Tomcat a few times to see if all cores would come up without an error message, but had no success (despite that all cores ARE up and I can query them, sigh)
    • I think I will move all the Solr yearly statistics back into the main statistics core
  • Start testing export/import of yearly Solr statistics data into the main statistics core on DSpace Test, for example:
$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
  • I deployed Tomcat 7.0.107 on DSpace Test (CGSpace is still Tomcat 7.0.104)
  • I finished migrating all the statistics from the yearly shards back to the main core

2020-12-05

  • I deleted all the yearly statistics shards and restarted Tomcat on DSpace Test (linode26)

2020-12-06

  • Looking into the statistics on DSpace Test after I migrated them back to the main core
    • All stats are working as expected... indexing time for the DSpace Statistics API is the same... and I don't even see a difference in the JVM or memory stats in Munin other than a minor jump last week when I was processing them
  • I will migrate them on CGSpace too I think
    • First I will start with the statistics-2010 and statistics-2015 cores because they were the ones that were failing to load recently (despite actually being available in Solr WTF)

Error message in Solr admin UI about the statistics-2010 core failing to load

  • First the 2010 core:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
  • Judging by the DSpace logs all these cores had a problem starting up in the last month:
# grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
     24 statistics-2010
     24 statistics-2015
     18 statistics-2016
      6 statistics-2018
  • The message is always this:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
  • I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
    • I removed the statistics-2010, statistics-2015, statistics-2016, and statistics-2018 cores and restarted Tomcat and all the statistics cores came up OK and the CUA statistics are OK!

2020-12-07

  • Run dspace cleanup -v on CGSpace to clean up deleted bitstreams
  • Atmire sent a pull request to address the duplicate owningComm and owningColl
  • Abenet and Tezira are having issues with committing to the archive in their workflow
    • I looked at the server and indeed the locks and transactions are back up:

PostgreSQL Transactions day PostgreSQL Locks day PostgreSQL Locks day PostgreSQL Connections day

  • There are apparently 1,700 locks right now:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1739

2020-12-08

  • Atmire sent some instructions for using the DeduplicateValuesProcessor
    • I modified atmire-cua-update.xml as they instructed, but I get a million errors like this when I run AtomicStatisticsUpdateCLI with that configuration:
Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: java.lang.UnsupportedOperationException
        at org.apache.solr.common.SolrDocument$1.entrySet(SolrDocument.java:256)
        at java.util.HashMap.putMapEntries(HashMap.java:512)
        at java.util.HashMap.<init>(HashMap.java:490)
        at com.atmire.statistics.util.update.atomic.record.Record.getFieldValuesMap(SourceFile:86)
        at com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor.process(SourceFile:38)
        at com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor.visit(SourceFile:34)
        at com.atmire.statistics.util.update.atomic.record.UsageRecord.accept(SourceFile:23)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:301)
        ... 10 more
  • I sent some feedback to Atmire
    • They responded with an updated CUA (6.x-4.1.10-ilri-RC7) that has a fix for the duplicates processor and a possible fix for the database locking issues (a bug in CUASolrLoggerServiceImpl that causes an infinite loop and a Tomcat timeout)
    • I deployed the changes on DSpace Test and CGSpace, hopefully it will fix both issues!
  • In other news, after I restarted Tomcat on CGSpace the statistics-2013 core didn't come back up properly, so I exported it and imported it into the main statistics core like I did for the others a few days ago
  • Sync DSpace Test with CGSpace's Solr, PostgreSQL database, and assetstore...

2020-12-09

  • I was running the AtomicStatisticsUpdateCLI to remove duplicates on DSpace Test but it failed near the end of the statistics core (after 20 hours or so) with a memory error:
Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
Run 1 —  67% — 10,000/14,935 docs — 6m 6s — 6m 6s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.noggit.CharArr.toString(CharArr.java:164)
  • I increased the JVM heap to 2048m and tried again, but it failed with a memory error again...
  • I increased the JVM heap to 4096m and tried again, but it failed with another error:
Successfully finished updating Solr Storage Reports | Wed Dec 09 15:53:40 CET 2020
Exception: parsing error
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: parsing error
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:530)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
        at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: org.apache.solr.common.SolrException: parsing error
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:45)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
        ... 14 more
Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 8192; actual size: 2843)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
        at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
        at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
        at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
        at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
        at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:152)
...

2020-12-10

  • The statistics-2019 core finished processing the duplicate removal so I started the statistics-2017 core
  • Peter asked me to add ONE HEALTH to ILRI subjects on CGSpace
  • A few items that got "lost" after approval during the database issues earlier this week seem to have gone back into their workflows
    • Abenet approved them again and they got new handles, phew
  • Abenet was having an issue with the date filter on AReS and it turns out that it's the same .keyword issue I had noticed before that causes the filter to stop working
  • I checked the Solr statistics on DSpace Test to see if the Atmire duplicates remover was working, but now I see a comical amount of duplicates...

Solr stats with dozens of duplicates

  • I sent feedback about this to Atmire
  • I will re-sync the Solr stats from CGSpace so we can try again...
  • In other news, it has been a few days since we deployed the fix for the database locking issue and things seem much better now:

PostgreSQL connections all week PostgreSQL locks all week

2020-12-13

  • I tried to harvest a few times on OpenRXV in the last few days and every time it appends all the new records to the items index instead of overwriting it:

OpenRXV duplicates

  • I can see it in the openrxv-items-final index:
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
{
   "_shards" : {
      "failed" : 0,
      "skipped" : 0,
      "successful" : 1,
      "total" : 1
   },
   "count" : 299922
}
$ curl -XDELETE http://localhost:9200/openrxv-items-final
{"acknowledged":true}%
  • Moayad said he's working on the harvesting so I stopped it for now to re-deploy his latest changes
  • I updated Tomcat to version 7.0.107 on CGSpace (linode18), ran all updates, and restarted the server
  • I deleted both items indexes and restarted the harvesting:
$ curl -XDELETE http://localhost:9200/openrxv-items-final
$ curl -XDELETE http://localhost:9200/openrxv-items-temp
  • Peter asked me for a list of all submitters and approvers that were active recently on CGSpace
    • I can probably extract that from the dc.description.provenance field, for example any that contains a 2020 date:
localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';

2020-12-14

  • The re-harvesting finished last night on AReS but there are no records in the openrxv-items-final index
    • Strangely, there are 99,000 items in the temp index:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
{
   "count" : 99992,
   "_shards" : {
      "skipped" : 0,
      "total" : 1,
      "failed" : 0,
      "successful" : 1
   }
}
  • I'm going to try to clone the temp index to the final one...
    • First, set the openrxv-items-temp index to block writes (read only) and then clone it to openrxv-items-final:
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
  • Now I see that the openrxv-items-final index has items, but there are still none in AReS Explorer UI!
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 99992,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
  • The api logs show this from last night after the harvesting:
[Nest] 92   - 12/13/2020, 1:58:52 PM   [HarvesterService] Starting Harvest
[Nest] 92   - 12/13/2020, 10:50:20 PM   [FetchConsumer] OnGlobalQueueDrained
[Nest] 92   - 12/13/2020, 11:00:20 PM   [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92   - 12/13/2020, 11:00:20 PM   [HarvesterService] reindex function is called
(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
    at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
    at IncomingMessage.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
  • But I'm not sure why the frontend doesn't show any data despite there being documents in the index...
  • I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the openrxv-items index
  • I cloned the openrxv-items-final index to openrxv-items index and now I see items in the explorer UI
  • The PDF report was broken and I looked in the API logs and saw this:
(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
    at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
  • I installed unoconv in the backend api container and now it works... but I wonder why this changed...
  • Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
    • Peter noticed that this item from the ILRI policy and research briefs collection is missing in AReS, despite it being added one month ago in CGSpace and me harvesting on AReS last night
      • The item appears fine in the REST API when I check the items in that collection
    • Peter also noticed that this item appears twice in AReS
      • The item is not duplicated on CGSpace or in the REST API
    • We noticed that there are 136 items in the ILRI policy and research briefs collection according to AReS, yet on CGSpace there are only 132
      • This is confirmed in the REST API (using query-json):
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
$ query-json '.items | length' /tmp/policy1.json
100
$ query-json '.items | length' /tmp/policy2.json
32
  • I realized that the issue of missing/duplicate items in AReS might be because of this REST API bug that causes /items to return items in non-deterministic order
  • I decided to cherry-pick the following two patches from DSpace 6.4 into our 6_x-prod (6.3) branch:
  • After deploying the REST API fixes I decided to harvest from AReS again to see if the missing and duplicate items get fixed
    • I made a backup of the current openrxv-items-temp index just in case:
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'

2020-12-15

  • After the re-harvest last night there were 200,000 items in the openrxv-items-temp index again
    • I cleared the core and started a re-harvest, but Peter sent me a bunch of author corrections for CGSpace so I decided to cancel it until after I apply them and re-index Discovery
  • I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter's CSVs) and then applied them using the fix-metadata-values.py script:
$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
  • Since I was re-indexing Discovery anyways I decided to check for any uppercase AGROVOC and lowercase them:
dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 406
dspace=# COMMIT;
COMMIT
  • I also updated the Font Awesome icon classes for version 5 syntax:
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
UPDATE 74
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
UPDATE 74
dspace=# COMMIT;
  • Then I started a full Discovery re-index:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    265m11.224s
user    171m29.141s
sys     2m41.097s
  • Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow
  • Yesterday Atmire responded about the owningComm and owningColl duplicates in Solr saying they didn't see any anymore...
    • Indeed I spent a few minutes looking randomly and I didn't find any either...
    • I did, however, see lots of duplicates in countryCode_search, countryCode_ngram, ip_search, ip_ngram, userAgent_search, userAgent_ngram, referrer_search, referrer_ngram fields
    • I sent feedback to them
  • On the database locking front we haven't had issues in over a week and the Munin graphs look normal:

PostgreSQL connections all week PostgreSQL locks all week

  • After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the openrxv-items-temp index was empty and that the backup index I made yesterday was still there:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
{
  "acknowledged" : true
}
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 0,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
$ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
{
  "count" : 99992,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

2020-12-16

  • The harvesting on AReS finished last night so this morning I manually cloned the openrxv-items-temp index to openrxv-items
    • First check the number of items in the temp index, then set it to read only, then delete the items index, then delete the temp index:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 100046,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -s -X POST "http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty"
$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
  "count" : 100046,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
  • Interestingly the item that we noticed was duplicated now only appears once
  • The missing item is still missing