mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-18 11:12:18 +01:00
37 KiB
37 KiB
title | date | author | categories | |
---|---|---|---|---|
December, 2020 | 2020-12-01T11:32:54+02:00 | Alan Orth |
|
2020-12-01
- Atmire responded about the issue with duplicate data in our Solr statistics
- They noticed that some records in the statistics-2015 core haven't been migrated with the AtomicStatisticsUpdateCLI tool yet and assumed that I haven't migrated any of the records yet
- That's strange, as I checked all ten cores and 2015 is the only one with some unmigrated documents, as according to the
cua_version
field - I started processing those (about 411,000 records):
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
- AReS went down when the
renew-letsencrypt
service stopped theangular_nginx
container in the pre-update hook and failed to bring it back up- I ran all system updates on the host and rebooted it and AReS came back up OK
2020-12-02
- Udana emailed me yesterday to ask why the CGSpace usage statistics were showing "No Data"
- I noticed a message in the Solr Admin UI that one of the statistics cores failed to load, but it is up and I can query it...
- Nevertheless, I restarted Tomcat a few times to see if all cores would come up without an error message, but had no success (despite that all cores ARE up and I can query them, sigh)
- I think I will move all the Solr yearly statistics back into the main statistics core
- Start testing export/import of yearly Solr statistics data into the main statistics core on DSpace Test, for example:
$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
- I deployed Tomcat 7.0.107 on DSpace Test (CGSpace is still Tomcat 7.0.104)
- I finished migrating all the statistics from the yearly shards back to the main core
2020-12-05
- I deleted all the yearly statistics shards and restarted Tomcat on DSpace Test (linode26)
2020-12-06
- Looking into the statistics on DSpace Test after I migrated them back to the main core
- All stats are working as expected... indexing time for the DSpace Statistics API is the same... and I don't even see a difference in the JVM or memory stats in Munin other than a minor jump last week when I was processing them
- I will migrate them on CGSpace too I think
- First I will start with the statistics-2010 and statistics-2015 cores because they were the ones that were failing to load recently (despite actually being available in Solr WTF)
- First the 2010 core:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
- Judging by the DSpace logs all these cores had a problem starting up in the last month:
# grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
24 statistics-2010
24 statistics-2015
18 statistics-2016
6 statistics-2018
- The message is always this:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
- I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
- I removed the statistics-2010, statistics-2015, statistics-2016, and statistics-2018 cores and restarted Tomcat and all the statistics cores came up OK and the CUA statistics are OK!
2020-12-07
- Run
dspace cleanup -v
on CGSpace to clean up deleted bitstreams - Atmire sent a pull request to address the duplicate owningComm and owningColl
- Built and deployed it on DSpace Test but I am not sure how to run it yet
- I sent feedback to Atmire on their tracker: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
- Abenet and Tezira are having issues with committing to the archive in their workflow
- I looked at the server and indeed the locks and transactions are back up:
- There are apparently 1,700 locks right now:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1739
2020-12-08
- Atmire sent some instructions for using the DeduplicateValuesProcessor
- I modified
atmire-cua-update.xml
as they instructed, but I get a million errors like this when I run AtomicStatisticsUpdateCLI with that configuration:
- I modified
Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: java.lang.UnsupportedOperationException
at org.apache.solr.common.SolrDocument$1.entrySet(SolrDocument.java:256)
at java.util.HashMap.putMapEntries(HashMap.java:512)
at java.util.HashMap.<init>(HashMap.java:490)
at com.atmire.statistics.util.update.atomic.record.Record.getFieldValuesMap(SourceFile:86)
at com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor.process(SourceFile:38)
at com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor.visit(SourceFile:34)
at com.atmire.statistics.util.update.atomic.record.UsageRecord.accept(SourceFile:23)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:301)
... 10 more
- I sent some feedback to Atmire
- They responded with an updated CUA (6.x-4.1.10-ilri-RC7) that has a fix for the duplicates processor and a possible fix for the database locking issues (a bug in CUASolrLoggerServiceImpl that causes an infinite loop and a Tomcat timeout)
- I deployed the changes on DSpace Test and CGSpace, hopefully it will fix both issues!
- In other news, after I restarted Tomcat on CGSpace the statistics-2013 core didn't come back up properly, so I exported it and imported it into the main statistics core like I did for the others a few days ago
- Sync DSpace Test with CGSpace's Solr, PostgreSQL database, and assetstore...
2020-12-09
- I was running the AtomicStatisticsUpdateCLI to remove duplicates on DSpace Test but it failed near the end of the statistics core (after 20 hours or so) with a memory error:
Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
Run 1 — 67% — 10,000/14,935 docs — 6m 6s — 6m 6s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.noggit.CharArr.toString(CharArr.java:164)
- I increased the JVM heap to 2048m and tried again, but it failed with a memory error again...
- I increased the JVM heap to 4096m and tried again, but it failed with another error:
Successfully finished updating Solr Storage Reports | Wed Dec 09 15:53:40 CET 2020
Exception: parsing error
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: parsing error
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:530)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: org.apache.solr.common.SolrException: parsing error
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:45)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
... 14 more
Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 8192; actual size: 2843)
at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:152)
...
2020-12-10
- The statistics-2019 core finished processing the duplicate removal so I started the statistics-2017 core
- Peter asked me to add ONE HEALTH to ILRI subjects on CGSpace
- A few items that got "lost" after approval during the database issues earlier this week seem to have gone back into their workflows
- Abenet approved them again and they got new handles, phew
- Abenet was having an issue with the date filter on AReS and it turns out that it's the same
.keyword
issue I had noticed before that causes the filter to stop working- I fixed the filter to use the correct field name and filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/63
- I checked the Solr statistics on DSpace Test to see if the Atmire duplicates remover was working, but now I see a comical amount of duplicates...
- I sent feedback about this to Atmire
- I will re-sync the Solr stats from CGSpace so we can try again...
- In other news, it has been a few days since we deployed the fix for the database locking issue and things seem much better now:
2020-12-13
- I tried to harvest a few times on OpenRXV in the last few days and every time it appends all the new records to the items index instead of overwriting it:
- I can see it in the
openrxv-items-final
index:
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
{
"_shards" : {
"failed" : 0,
"skipped" : 0,
"successful" : 1,
"total" : 1
},
"count" : 299922
}
- I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/64
- For now I will try to delete the index and start a re-harvest in the Admin UI:
$ curl -XDELETE http://localhost:9200/openrxv-items-final
{"acknowledged":true}%
- Moayad said he's working on the harvesting so I stopped it for now to re-deploy his latest changes
- I updated Tomcat to version 7.0.107 on CGSpace (linode18), ran all updates, and restarted the server
- I deleted both items indexes and restarted the harvesting:
$ curl -XDELETE http://localhost:9200/openrxv-items-final
$ curl -XDELETE http://localhost:9200/openrxv-items-temp
- Peter asked me for a list of all submitters and approvers that were active recently on CGSpace
- I can probably extract that from the
dc.description.provenance
field, for example any that contains a 2020 date:
- I can probably extract that from the
localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
2020-12-14
- The re-harvesting finished last night on AReS but there are no records in the
openrxv-items-final
index- Strangely, there are 99,000 items in the temp index:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
{
"count" : 99992,
"_shards" : {
"skipped" : 0,
"total" : 1,
"failed" : 0,
"successful" : 1
}
}
- I'm going to try to clone the temp index to the final one...
- First, set the
openrxv-items-temp
index to block writes (read only) and then clone it toopenrxv-items-final
:
- First, set the
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
- Now I see that the
openrxv-items-final
index has items, but there are still none in AReS Explorer UI!
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
"count" : 99992,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
- The api logs show this from last night after the harvesting:
[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
at IncomingMessage.<anonymous> (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
- But I'm not sure why the frontend doesn't show any data despite there being documents in the index...
- I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the
openrxv-items
index - I cloned the
openrxv-items-final
index toopenrxv-items
index and now I see items in the explorer UI - The PDF report was broken and I looked in the API logs and saw this:
(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
- I installed
unoconv
in the backend api container and now it works... but I wonder why this changed... - Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
- Peter noticed that this item from the ILRI policy and research briefs collection is missing in AReS, despite it being added one month ago in CGSpace and me harvesting on AReS last night
- The item appears fine in the REST API when I check the items in that collection
- Peter also noticed that this item appears twice in AReS
- The item is not duplicated on CGSpace or in the REST API
- We noticed that there are 136 items in the ILRI policy and research briefs collection according to AReS, yet on CGSpace there are only 132
- This is confirmed in the REST API (using query-json):
- Peter noticed that this item from the ILRI policy and research briefs collection is missing in AReS, despite it being added one month ago in CGSpace and me harvesting on AReS last night
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
$ query-json '.items | length' /tmp/policy1.json
100
$ query-json '.items | length' /tmp/policy2.json
32
- I realized that the issue of missing/duplicate items in AReS might be because of this REST API bug that causes /items to return items in non-deterministic order
- I decided to cherry-pick the following two patches from DSpace 6.4 into our
6_x-prod
(6.3) branch:- High CPU usage when calling the collection_id/items REST endpoint
- Jira: https://jira.lyrasis.org/browse/DS-4342
- c2e6719fa763e291b81b2d61da2f8c758fe38ff3
- REST API items resource returns items in non-deterministic order
- Jira: https://jira.lyrasis.org/browse/DS-3849
- 2a2ea0cb5d03e6da9355a2eff12aad667e465433
- High CPU usage when calling the collection_id/items REST endpoint
- After deploying the REST API fixes I decided to harvest from AReS again to see if the missing and duplicate items get fixed
- I made a backup of the current
openrxv-items-temp
index just in case:
- I made a backup of the current
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
2020-12-15
- After the re-harvest last night there were 200,000 items in the
openrxv-items-temp
index again- I cleared the core and started a re-harvest, but Peter sent me a bunch of author corrections for CGSpace so I decided to cancel it until after I apply them and re-index Discovery
- I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter's CSVs) and then applied them using the
fix-metadata-values.py
script:
$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
- Since I was re-indexing Discovery anyways I decided to check for any uppercase AGROVOC and lowercase them:
dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 406
dspace=# COMMIT;
COMMIT
- I also updated the Font Awesome icon classes for version 5 syntax:
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
UPDATE 74
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
UPDATE 74
dspace=# COMMIT;
- Then I started a full Discovery re-index:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 265m11.224s
user 171m29.141s
sys 2m41.097s
- Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow
- Yesterday Atmire responded about the owningComm and owningColl duplicates in Solr saying they didn't see any anymore...
- Indeed I spent a few minutes looking randomly and I didn't find any either...
- I did, however, see lots of duplicates in countryCode_search, countryCode_ngram, ip_search, ip_ngram, userAgent_search, userAgent_ngram, referrer_search, referrer_ngram fields
- I sent feedback to them
- On the database locking front we haven't had issues in over a week and the Munin graphs look normal:
- After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the
openrxv-items-temp
index was empty and that the backup index I made yesterday was still there:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
{
"acknowledged" : true
}
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
"count" : 0,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
$ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
{
"count" : 99992,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
2020-12-16
- The harvesting on AReS finished last night so this morning I manually cloned the
openrxv-items-temp
index toopenrxv-items
- First check the number of items in the temp index, then set it to read only, then delete the items index, then delete the temp index:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
"count" : 100046,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -s -X POST "http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty"
$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
"count" : 100046,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
- Interestingly the item that we noticed was duplicated now only appears once
- The missing item is still missing
- Jane Poole noticed that the "previous page" and "next page" buttons are not working on AReS
- I filed a bug on GitHub: https://github.com/ilri/OpenRXV/issues/65
- Generate a list of submitters and approvers active in the last months using the Provenance field on CGSpace:
$ psql -h localhost -U postgres dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'" > /tmp/provenance.txt
$ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E "( on |checksum)" | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq > /tmp/recent-submitters-approvers.csv
- Peter wanted it to send some mail to the users...
2020-12-17
- I see some errors from CUA in our Tomcat logs:
Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
Error while updating
java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:241)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1140)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1129)
...
- I sent the full stack to Atmire to investigate
- I know we've had this "Multiple update components target the same field" error in the past with DSpace 5.x and Atmire said it was harmless, but would nevertheless be fixed in a future update
- I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author's names, but it throws an error...
$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
Exception: null
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
at com.google.common.collect.Iterators.concat(Iterators.java:464)
at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
at org.dspace.app.bulkedit.MetadataExport.<init>(MetadataExport.java:77)
at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
- I did it via CSV with
fix-metadata-values.py
instead:
$ cat 2020-12-17-update-ILRI-author.csv
dc.contributor.author,correct
"Padmakumar, V.P.","Varijakshapanicker, Padmakumar"
$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
- Abenet needed a list of all 2020 outputs from the Livestock CRP that were Limited Access
- I exported the community from CGSpace and used
csvcut
andcsvgrep
to get a list:
- I exported the community from CGSpace and used
$ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 > /tmp/limited-2020.csv
2020-12-18
- I added support for indexing community views and downloads to dspace-statistics-api
- I still have to add the API endpoints to make the stats available
- Also, I played a little bit with Swagger via falcon-swagger-ui and I think I can get that working for better API documentation / testing
- Atmire sent some feedback on the DeduplicateValuesProcessor
- They confirm that it should process all duplicates, not just those in
owningComm
andowningColl
- They asked me to try it again on DSpace Test now that I've resync'd the Solr statistics cores from production
- I started processing the statistics core on DSpace Test
- They confirm that it should process all duplicates, not just those in
2020-12-20
- The DeduplicateValuesProcessor has been running on DSpace Test since two days ago and it almost completed its second twelve-hour run, but crashed near the end:
...
Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at org.noggit.CharArr.toString(CharArr.java:164)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:599)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:180)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:360)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:219)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:374)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:125)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:43)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
- That was with a JVM heap of 512m
- I looked in Solr and found dozens of duplicates of each field again...
- I sent feedback to Atmire
- I finished the technical work on adding community and collection support to the DSpace Statistics API
- I still need to update
the testsas well as the documentation
- I still need to update
- I started a harvesting of AReS
2020-12-21
- The AReS harvest finished this morning and I moved the Elasticsearch index manually
- First, check the number of records in the temp index to make sure it seems complete and not with double data:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
"count" : 100135,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
- Then delete the old backup and clone the current items index as a backup:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
- Then delete the current items index and clone it from temp:
$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
2020-12-22
- I finished getting the Swagger UI integrated into the dspace-statistics-api
- I can't figure out how to get it to work on the server without hard-coding all the paths
- Falcon is smart about its own routes, so I can retrieve the
openapi.json
file OK, but the paths in the OpenAPI schema are relative to the base URL, which isdspacetest.cgiar.org
- Abenet told me about a bug with shared links and strange values in the top counters
- I took a video reproducing the issue and filed a bug on the GitHub: https://github.com/ilri/OpenRXV/issues/66
2020-12-23
- Finalize Swagger UI support in the dspace-statistics-api
- I had to do some last minute changes to get it to work in both production and local development environments
2020-12-27
- More finishing touches on paging and versioning of the dspace-statistics-api
- I tagged v1.4.0 and released it on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.0
- I deployed it on DSpace Test and CGSpace
2020-12-28
- Peter noticed that the Atmire CUA stats on CGSpace weren't working
- I looked in Solr Admin UI and saw that the statistics-2012 core failed to load:
statistics-2012: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
- I exported the 2012 stats from the year core and imported them to the main statistics core with solr-import-export-json:
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
- I decided to do the same for the remaining 2011, 2014, 2017, and 2019 cores...
2020-12-29
- Start a fresh re-index on AReS, since it's been over a week since the last time
- Before then I cleared the old
openrxv-items-temp
index and made a backup of the currentopenrxv-items
index:
- Before then I cleared the old
$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
"count" : 100135,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-29
$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
2020-12-30
- The indexing on AReS finished so I cloned the
openrxv-items-temp
index toopenrxv-items
and deleted the backup index:
$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-29?pretty'