<li>Atmire responded about the issue with duplicate data in our Solr statistics
<ul>
<li>They noticed that some records in the statistics-2015 core haven’t been migrated with the AtomicStatisticsUpdateCLI tool yet and assumed that I haven’t migrated any of the records yet</li>
<li>That’s strange, as I checked all ten cores and 2015 is the only one with some unmigrated documents, according to the <code>cua_version</code> field</li>
<li>I started processing those (about 411,000 records):</li>
<li>AReS went down when the <code>renew-letsencrypt</code> service stopped the <code>angular_nginx</code> container in the pre-update hook and failed to bring it back up
<ul>
<li>I ran all system updates on the host and rebooted it and AReS came back up OK</li>
<li>Udana emailed me yesterday to ask why the CGSpace usage statistics were showing “No Data”
<ul>
<li>I noticed a message in the Solr Admin UI that one of the statistics cores failed to load, but it is up and I can query it…</li>
<li>Nevertheless, I restarted Tomcat a few times to see if all cores would come up without an error message, but had no success (despite that all cores ARE up and I can query them, <em>sigh</em>)</li>
<li>I think I will move all the Solr yearly statistics back into the main statistics core</li>
</ul>
</li>
<li>Start testing export/import of yearly Solr statistics data into the main statistics core on DSpace Test, for example:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
<li>I deployed Tomcat 7.0.107 on DSpace Test (CGSpace is still Tomcat 7.0.104)</li>
<li>I finished migrating all the statistics from the yearly shards back to the main core</li>
</ul>
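<p>The export/import pair used for statistics-2010 can be repeated for each yearly core. A minimal Python sketch, assuming the same <code>run.sh</code> flags that appear in these notes (<code>-s</code>, <code>-a</code>, <code>-o</code>, <code>-k</code>); the function name <code>migration_commands</code> and the list of years are hypothetical, and the code only builds the command strings without running anything:</p>

```python
import shlex


def migration_commands(years, solr_url="http://localhost:8081/solr", script="./run.sh"):
    """Build (export, import) command pairs for moving yearly cores into the main core.

    The flags mirror the run.sh invocations in these notes; nothing is executed here.
    """
    pairs = []
    for year in years:
        dump = f"statistics-{year}.json"
        export_cmd = [script, "-s", f"{solr_url}/statistics-{year}",
                      "-a", "export", "-o", dump, "-k", "uid"]
        import_cmd = [script, "-s", f"{solr_url}/statistics",
                      "-a", "import", "-o", dump, "-k", "uid"]
        pairs.append((shlex.join(export_cmd), shlex.join(import_cmd)))
    return pairs


# Hypothetical list of years; adjust to the cores that actually exist
for export_cmd, import_cmd in migration_commands([2010, 2015]):
    print(export_cmd)
    print(import_cmd)
```

<p>Printing the commands first makes it easy to review them before pasting them into a shell one core at a time.</p>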
<h2 id="2020-12-05">2020-12-05</h2>
<ul>
<li>I deleted all the yearly statistics shards and restarted Tomcat on DSpace Test (linode26)</li>
</ul>
<h2 id="2020-12-06">2020-12-06</h2>
<ul>
<li>Looking into the statistics on DSpace Test after I migrated them back to the main core
<ul>
<li>All stats are working as expected… indexing time for the DSpace Statistics API is the same… and I don’t even see a difference in the JVM or memory stats in Munin other than a minor jump last week when I was processing them</li>
</ul>
</li>
<li>I will migrate them on CGSpace too I think
<ul>
<li>First I will start with the statistics-2010 and statistics-2015 cores because they were the ones that were failing to load recently (despite actually being available in Solr WTF)</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/12/solr-statistics-2010-failed.png" alt="Error message in Solr admin UI about the statistics-2010 core failing to load"></p>
<li>I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
<ul>
<li>I removed the statistics-2010, statistics-2015, statistics-2016, and statistics-2018 cores and restarted Tomcat and <em>all the statistics cores came up OK and the CUA statistics are OK</em>!</li>
<li>Run <code>dspace cleanup -v</code> on CGSpace to clean up deleted bitstreams</li>
<li>Atmire sent a <a href="https://github.com/ilri/DSpace/pull/457">pull request</a> to address the duplicate owningComm and owningColl
<ul>
<li>Built and deployed it on DSpace Test but I am not sure how to run it yet</li>
<li>I sent feedback to Atmire on their tracker: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
</ul>
</li>
<li>Abenet and Tezira are having issues with committing to the archive in their workflow
<ul>
<li>I looked at the server and indeed the locks and transactions are back up:</li>
<li>There are apparently 1,700 locks right now:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1739
</code></pre><h2 id="2020-12-08">2020-12-08</h2>
<ul>
<li>Atmire sent some instructions for using the DeduplicateValuesProcessor
<ul>
<li>I modified <code>atmire-cua-update.xml</code> as they instructed, but I get a million errors like this when I run AtomicStatisticsUpdateCLI with that configuration:</li>
</ul>
</li>
</ul>
<pre><code>Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
<li>They responded with an updated CUA (6.x-4.1.10-ilri-RC7) that has a fix for the duplicates processor <em>and</em> a possible fix for the database locking issues (a bug in CUASolrLoggerServiceImpl that causes an infinite loop and a Tomcat timeout)</li>
<li>I deployed the changes on DSpace Test and CGSpace, hopefully it will fix both issues!</li>
</ul>
</li>
<li>In other news, after I restarted Tomcat on CGSpace the statistics-2013 core didn’t come back up properly, so I exported it and imported it into the main statistics core like I did for the others a few days ago</li>
<li>Sync DSpace Test with CGSpace’s Solr, PostgreSQL database, and assetstore…</li>
</ul>
<h2 id="2020-12-09">2020-12-09</h2>
<ul>
<li>I was running the AtomicStatisticsUpdateCLI to remove duplicates on DSpace Test but it failed near the end of the statistics core (after 20 hours or so) with a memory error:</li>
</ul>
<pre><code>Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
<li>The statistics-2019 core finished processing the duplicate removal so I started the statistics-2017 core</li>
<li>Peter asked me to add ONE HEALTH to ILRI subjects on CGSpace</li>
<li>A few items that got “lost” after approval during the database issues earlier this week seem to have gone back into their workflows
<ul>
<li>Abenet approved them again and they got new handles, phew</li>
</ul>
</li>
<li>Abenet was having an issue with the date filter on AReS and it turns out that it’s the same <code>.keyword</code> issue I had noticed before that causes the filter to stop working
<ul>
<li>I fixed the filter to use the correct field name and filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/63">https://github.com/ilri/OpenRXV/issues/63</a></li>
</ul>
</li>
<li>I checked the Solr statistics on DSpace Test to see if the Atmire duplicates remover was working, but now I see a comical amount of duplicates…</li>
</ul>
<p><img src="/cgspace-notes/2020/12/solr-stats-duplicates.png" alt="Solr stats with dozens of duplicates"></p>
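<p>Duplicates like these could be quantified rather than eyeballed. A rough Python sketch, assuming the Solr documents have already been fetched as JSON and parsed into dictionaries; the function name <code>count_duplicate_values</code> and the sample records are made up for illustration:</p>

```python
from collections import Counter


def count_duplicate_values(docs):
    """Count redundant repeated values per multi-valued field across documents."""
    extras = Counter()
    for doc in docs:
        for field, value in doc.items():
            if isinstance(value, list):
                # Every value beyond the first occurrence counts as one duplicate
                extras[field] += len(value) - len(set(value))
    # Report only the fields that actually contain duplicates
    return {field: n for field, n in extras.items() if n > 0}


# Made-up sample records, not real statistics data
sample = [
    {"uid": "a", "owningComm": ["c1", "c1", "c2"], "owningColl": ["k1"]},
    {"uid": "b", "owningComm": ["c1"], "owningColl": ["k1", "k1", "k1"]},
]
print(count_duplicate_values(sample))
```

<p>Running the same count before and after a deduplication pass would give a number to include in feedback instead of a screenshot.</p>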
<ul>
<li>I sent feedback about this to Atmire</li>
<li>I will re-sync the Solr stats from CGSpace so we can try again…</li>
<li>In other news, it has been a few days since we deployed the fix for the database locking issue and things seem much better now:</li>
</ul>
<p><img src="/cgspace-notes/2020/12/postgres_connections_ALL-week.png" alt="PostgreSQL connections all week">
<img src="/cgspace-notes/2020/12/postgres_locks_ALL-week.png" alt="PostgreSQL locks all week"></p>
<li>I tried to harvest a few times on OpenRXV in the last few days and every time it appends all the new records to the items index instead of overwriting it:</li>
<li>Peter asked me for a list of all submitters and approvers that were active recently on CGSpace
<ul>
<li>I can probably extract that from the <code>dc.description.provenance</code> field, for example any that contains a 2020 date:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
</code></pre><h2 id="2020-12-14">2020-12-14</h2>
<ul>
<li>The re-harvesting finished last night on AReS but there are no records in the <code>openrxv-items-final</code> index
<ul>
<li>Strangely, there are 99,000 items in the temp index:</li>
<li>I’m going to try to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-clone-index.html">clone</a> the temp index to the final one…
<ul>
<li>First, set the <code>openrxv-items-temp</code> index to block writes (read only) and then clone it to <code>openrxv-items-final</code>:</li>
at IncomingMessage.&lt;anonymous&gt; (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
</code></pre><ul>
<li>But I’m not sure why the frontend doesn’t show any data despite there being documents in the index…</li>
<li>I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the <code>openrxv-items</code> index</li>
<li>I cloned the <code>openrxv-items-final</code> index to <code>openrxv-items</code> index and now I see items in the explorer UI</li>
<li>The PDF report was broken and I looked in the API logs and saw this:</li>
</ul>
<pre><code class="language-console" data-lang="console">(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
</code></pre><ul>
<li>I installed <code>unoconv</code> in the backend api container and now it works… but I wonder why this changed…</li>
<li>Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
<ul>
<li>Peter noticed that <a href="https://hdl.handle.net/10568/110133">this item</a> from the <a href="https://cgspace.cgiar.org/handle/10568/24450">ILRI policy and research briefs</a> collection is missing in AReS, despite it being added one month ago in CGSpace and me harvesting on AReS last night
<ul>
<li>The item appears fine in the REST API when I check the items in that collection</li>
</ul>
</li>
<li>Peter also noticed that <a href="https://hdl.handle.net/10568/110447">this item</a> appears twice in AReS
<ul>
<li>The item is <em>not</em> duplicated on CGSpace or in the REST API</li>
</ul>
</li>
<li>We noticed that there are 136 items in the ILRI policy and research briefs collection according to AReS, yet on CGSpace there are only 132
<ul>
<li>This is confirmed in the REST API (using <a href="https://github.com/davesnx/query-json">query-json</a>):</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre><code>$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
$ query-json '.items | length' /tmp/policy1.json
100
$ query-json '.items | length' /tmp/policy2.json
32
</code></pre><ul>
<li>I realized that the issue of missing/duplicate items in AReS might be because of this <a href="https://jira.lyrasis.org/browse/DS-3849">REST API bug that causes /items to return items in non-deterministic order</a></li>
<li>I decided to cherry-pick the following two patches from DSpace 6.4 into our <code>6_x-prod</code> (6.3) branch:
<ul>
<li>High CPU usage when calling the collection_id/items REST endpoint
<li>After the re-harvest last night there were 200,000 items in the <code>openrxv-items-temp</code> index again
<ul>
<li>I cleared the core and started a re-harvest, but Peter sent me a bunch of author corrections for CGSpace so I decided to cancel it until after I apply them and re-index Discovery</li>
</ul>
</li>
<li>I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter’s CSVs) and then applied them using the <code>fix-metadata-values.py</code> script:</li>
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 406
dspace=# COMMIT;
COMMIT
</code></pre><ul>
<li>I also updated the Font Awesome icon classes for version 5 syntax:</li>
<li>Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow</li>
<li>Yesterday Atmire responded about the owningComm and owningColl duplicates in Solr saying they didn’t see any anymore…
<ul>
<li>Indeed I spent a few minutes looking randomly and I didn’t find any either…</li>
<li>I did, however, see lots of duplicates in countryCode_search, countryCode_ngram, ip_search, ip_ngram, userAgent_search, userAgent_ngram, referrer_search, referrer_ngram fields</li>
<li>I sent feedback to them</li>
</ul>
</li>
<li>On the database locking front we haven’t had issues in over a week and the Munin graphs look normal:</li>
</ul>
<p><img src="/cgspace-notes/2020/12/postgres_connections_ALL-week2.png" alt="PostgreSQL connections all week">
<img src="/cgspace-notes/2020/12/postgres_locks_ALL-week2.png" alt="PostgreSQL locks all week"></p>
<ul>
<li>After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the <code>openrxv-items-temp</code> index was empty and that the backup index I made yesterday was still there:</li>
<li>The harvesting on AReS finished last night so this morning I manually cloned the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>
<ul>
<li>First check the number of items in the temp index, then set it to read only, then delete the items index, then delete the temp index:</li>
<li>I see some errors from CUA in our Tomcat logs:</li>
</ul>
<pre><code class="language-console" data-lang="console">Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
Error while updating
java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:241)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1140)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1129)
...
</code></pre><ul>
<li>I sent the full stack to Atmire to investigate
<ul>
<li>I know we’ve had this “Multiple update components target the same field” error in the past with DSpace 5.x and Atmire said it was harmless, but would nevertheless be fixed in a future update</li>
</ul>
</li>
<li>I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author’s names, but it throws an error…</li>
- I added support for indexing community views and downloads to [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api)
- I still have to add the API endpoints to make the stats available
- Also, I played a little bit with Swagger via [falcon-swagger-ui](https://github.com/rdidyk/falcon-swagger-ui) and I think I can get that working for better API documentation / testing
- Atmire sent some feedback on the DeduplicateValuesProcessor
- They confirm that it should process _all_ duplicates, not just those in `owningComm` and `owningColl`
- They asked me to try it again on DSpace Test now that I've resync'd the Solr statistics cores from production
- I started processing the statistics core on DSpace Test
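
To make "process all duplicates" concrete, here is a small Python sketch of order-preserving deduplication across every multi-valued field of one record. This is only an illustration of the idea and not Atmire's DeduplicateValuesProcessor; the function name `deduplicate_values` and the sample record are hypothetical, though the field names are ones mentioned in these notes.

```python
def deduplicate_values(doc):
    """Return a copy of doc with repeated values removed from every list-valued field."""
    cleaned = {}
    for field, value in doc.items():
        if isinstance(value, list):
            # dict.fromkeys keeps the first occurrence of each value, in order
            cleaned[field] = list(dict.fromkeys(value))
        else:
            # Single-valued fields are copied through unchanged
            cleaned[field] = value
    return cleaned


# Hypothetical record; the values are invented for illustration
record = {
    "uid": "example-uid",
    "owningComm": ["c1", "c1", "c2"],
    "countryCode_search": ["ke", "ke"],
    "ip_ngram": ["10.0.0.1"],
}
print(deduplicate_values(record))
```

The point of looping over all fields, rather than naming `owningComm` and `owningColl` explicitly, is that `_search` and `_ngram` copies get cleaned in the same pass.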
## 2020-12-20
- The DeduplicateValuesProcessor has been running on DSpace Test for two days and had almost completed its second twelve-hour run, but it crashed near the end: