CGSpace DSpace 6 Upgrade
Notes about the DSpace 6 upgrade on CGSpace in 2020-11.
- Re-import OAI with clean index
- Processing Solr statistics with solr-upgrade-statistics-6x
- Processing Solr statistics with AtomicStatisticsUpdateCLI
Re-import OAI with clean index
After the upgrade is complete, re-index all items into OAI with a clean index:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace oai -c import
The process ran out of memory several times, so I had to keep retrying with more JVM heap memory each time.
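The retry-with-more-heap routine could be scripted. This is a minimal sketch, not what I actually ran: the `retry_with_heap` helper name is hypothetical, and the heap sizes in the example are illustrative (only 2048m appears in the command above).

```shell
# Hypothetical helper: run a command with progressively larger JVM heap sizes
# until it succeeds. $1 is a space-separated list of heap sizes in MB, the
# remaining arguments are the command to run.
retry_with_heap() {
    heaps=$1
    shift
    for heap in $heaps; do
        export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx${heap}m"
        echo "Running with JAVA_OPTS=$JAVA_OPTS" >&2
        if "$@"; then
            return 0
        fi
    done
    echo "Command failed with every heap size tried" >&2
    return 1
}

# Example (heap sizes are illustrative):
# retry_with_heap "2048 3072 4096" dspace oai -c import
```

Note that this retries on any non-zero exit, not only on out-of-memory errors, so it is worth reading the output of a failed attempt before trusting the next one.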
Processing Solr statistics with solr-upgrade-statistics-6x
After the main upgrade process was finished and DSpace was running, I started processing the Solr statistics with solr-upgrade-statistics-6x to migrate all IDs to UUIDs.
statistics
First process the current year’s statistics core:
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
...
=================================================================
*** Statistics Records with Legacy Id ***
3,817,407 Bistream View
1,693,443 Item View
105,974 Collection View
62,383 Community View
163,192 Community Search
162,581 Collection Search
470,288 Unexpected Type & Full Site
--------------------------------------
6,475,268 TOTAL
=================================================================
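As far as I understand the DSpace documentation, -n caps the total number of records updated in a single run, so the TOTAL in this report gives a rough lower bound on the number of runs needed. A small sketch of that arithmetic (the `rounds_needed` helper is hypothetical):

```shell
# Hypothetical helper: minimum number of runs needed to process $1 records
# when each run handles at most $2 records (ceiling division).
rounds_needed() {
    total=$1
    per_run=$2
    echo $(( (total + per_run - 1) / per_run ))
}

# 6,475,268 legacy records at 2,500,000 per run
rounds_needed 6475268 2500000   # prints 3
```

This is only a lower bound, because a run can also stop early (for example when it runs out of heap) and have to be restarted.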
After several rounds of processing it finished. Here are some statistics about unmigrated documents:
- 227,000: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 471,000: id:/.+-unmigrated/
- 698,000: *:* NOT id:/.{36}/
- The majority are type: 5 (aka SITE, according to Constants.java) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
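Counts like the ones above can be gathered from Solr's select handler by asking for zero rows and reading numFound from the response. A sketch, assuming the same Solr URL as the delete command; the `count_docs` and `extract_num_found` helper names are hypothetical:

```shell
SOLR_URL="http://localhost:8081/solr"

# Pull the numFound value out of a Solr JSON response read from stdin.
extract_num_found() {
    sed -n 's/.*"numFound": *\([0-9][0-9]*\).*/\1/p'
}

# Print the number of documents in core $1 matching query $2.
# curl -G sends the --data-urlencode parameters as a URL-encoded query string.
count_docs() {
    curl -s -G "$SOLR_URL/$1/select" \
        --data-urlencode "q=$2" \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json" | extract_num_found
}

# Example:
# count_docs statistics '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
# count_docs statistics 'id:/.+-unmigrated/'
# count_docs statistics '*:* NOT id:/.{36}/'
```

Letting curl do the URL encoding avoids having to escape the regex characters in the query by hand.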
statistics-2019
Processing the statistics-2019 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2019
...
=================================================================
*** Statistics Records with Legacy Id ***
5,569,344 Bistream View
2,179,105 Item View
117,194 Community View
104,091 Collection View
774,138 Community Search
568,347 Collection Search
1,482,620 Unexpected Type & Full Site
--------------------------------------
10,794,839 TOTAL
=================================================================
After several rounds of processing it finished. Here are some statistics about unmigrated documents:
- 2,690,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 1,494,587: id:/.+-unmigrated/
- 4,184,896: *:* NOT id:/.{36}/
- 4,172,929 are type: 5 (aka SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
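A quick sanity check on these counts: a document without a 36-character id is either marked -unmigrated or it is not, so the first two counts should add up to the third (assuming no -unmigrated id happens to be exactly 36 characters long). A sketch of that check with the statistics-2019 numbers; the `check_counts` helper is hypothetical:

```shell
# Hypothetical helper: succeed if the first two counts add up to the third.
check_counts() {
    [ $(( $1 + $2 )) -eq "$3" ]
}

# statistics-2019: 2,690,309 + 1,494,587 = 4,184,896
check_counts 2690309 1494587 4184896 && echo "counts are consistent"
```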
statistics-2018
Processing the statistics-2018 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
...
=================================================================
*** Statistics Records with Legacy Id ***
3,561,532 Bistream View
1,129,326 Item View
97,401 Community View
63,508 Collection View
207,827 Community Search
43,752 Collection Search
457,820 Unexpected Type & Full Site
--------------------------------------
5,561,166 TOTAL
=================================================================
After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
Eventually the processing finished. Here are some statistics about unmigrated documents:
- 365,473: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 546,955: id:/.+-unmigrated/
- 923,158: *:* NOT id:/.{36}/
- 823,293 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
statistics-2017
Processing the statistics-2017 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
...
=================================================================
*** Statistics Records with Legacy Id ***
2,529,208 Bistream View
1,618,717 Item View
144,945 Community View
74,249 Collection View
479,647 Community Search
114,658 Collection Search
852,215 Unexpected Type & Full Site
--------------------------------------
5,813,639 TOTAL
=================================================================
Eventually the processing finished. Here are some statistics about unmigrated documents:
- 808,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 893,868: id:/.+-unmigrated/
- 1,702,177: *:* NOT id:/.{36}/
- 1,660,524 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
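To see how the leftover documents break down by type before purging, a facet query on the type field is one option. A sketch, assuming the same Solr URL as the delete command; the `type_breakdown` helper name is hypothetical:

```shell
SOLR_URL="http://localhost:8081/solr"

# Print the raw JSON response, including facet counts on the "type" field,
# for documents in core $1 whose id is not a 36-character UUID.
type_breakdown() {
    curl -s -G "$SOLR_URL/$1/select" \
        --data-urlencode 'q=*:* NOT id:/.{36}/' \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json" \
        --data-urlencode "facet=true" \
        --data-urlencode "facet.field=type"
}

# Example:
# type_breakdown statistics-2017
```

The per-type counts should appear under facet_counts in the response, which makes it easy to confirm that most of the documents are type 5 before running the delete.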
statistics-2016
Processing the statistics-2016 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
...
=================================================================
*** Statistics Records with Legacy Id ***
1,765,924 Bistream View
1,151,575 Item View
187,110 Community View
51,204 Collection View
347,382 Community Search
66,605 Collection Search
620,298 Unexpected Type & Full Site
--------------------------------------
4,190,098 TOTAL
=================================================================
Summary of unmigrated documents after processing:
- 849,408: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 627,747: id:/.+-unmigrated/
- 1,477,155: *:* NOT id:/.{36}/
- 1,469,706 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
statistics-2015
Processing the statistics-2015 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
...
=================================================================
*** Statistics Records with Legacy Id ***
990,916 Bistream View
506,070 Item View
116,153 Community View
33,282 Collection View
21,062 Community Search
10,788 Collection Search
52,107 Unexpected Type & Full Site
--------------------------------------
1,730,378 TOTAL
=================================================================
Summary of unmigrated documents after processing:
- 195,293: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 67,146: id:/.+-unmigrated/
- 262,439: *:* NOT id:/.{36}/
- 247,400 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
statistics-2014
Processing the statistics-2014 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
...
=================================================================
*** Statistics Records with Legacy Id ***
2,381,603 Item View
1,323,357 Bistream View
501,545 Community View
247,805 Collection View
250 Collection Search
188 Community Search
50 Item Search
10,918 Unexpected Type & Full Site
--------------------------------------
4,465,716 TOTAL
=================================================================
Summary of unmigrated documents after processing:
- 182,131: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 39,947: id:/.+-unmigrated/
- 222,078: *:* NOT id:/.{36}/
- 188,791 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
statistics-2013
Processing the statistics-2013 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
...
=================================================================
*** Statistics Records with Legacy Id ***
2,352,124 Item View
1,117,676 Bistream View
575,711 Community View
171,639 Collection View
248 Item Search
7 Collection Search
5 Community Search
1,452 Unexpected Type & Full Site
--------------------------------------
4,218,862 TOTAL
=================================================================
Summary of unmigrated docs after processing:
- 2,548: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 29,772: id:/.+-unmigrated/
- 32,320: *:* NOT id:/.{36}/
- 15,691 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
statistics-2012
Processing the statistics-2012 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
...
=================================================================
*** Statistics Records with Legacy Id ***
2,229,332 Item View
913,577 Bistream View
215,577 Collection View
104,734 Community View
--------------------------------------
3,463,220 TOTAL
=================================================================
Summary of unmigrated docs after processing:
- 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 33,161: id:/.+-unmigrated/
- 33,161: *:* NOT id:/.{36}/
- 33,161 are type: 3 (COLLECTION), which is different from what I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
statistics-2011
Processing the statistics-2011 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
...
=================================================================
*** Statistics Records with Legacy Id ***
904,896 Item View
385,789 Bistream View
154,356 Collection View
62,978 Community View
--------------------------------------
1,508,019 TOTAL
=================================================================
Summary of unmigrated docs after processing:
- 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 17,551: id:/.+-unmigrated/
- 17,551: *:* NOT id:/.{36}/
- 12,116 are type: 3 (COLLECTION), as in statistics-2012 and unlike the newer cores, but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
statistics-2010
Processing the statistics-2010 core:
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
...
=================================================================
*** Statistics Records with Legacy Id ***
26,067 Item View
15,615 Bistream View
4,116 Collection View
1,094 Community View
--------------------------------------
46,892 TOTAL
=================================================================
Summary of unmigrated docs after processing:
- 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 1,012: id:/.+-unmigrated/
- 1,012: *:* NOT id:/.{36}/
- 654 are type: 3 (COLLECTION), as in statistics-2012 and unlike the newer cores, but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
Processing Solr statistics with AtomicStatisticsUpdateCLI
On 2020-11-18 I finished processing the Solr statistics with solr-upgrade-statistics-6x and I started processing them with AtomicStatisticsUpdateCLI.
statistics
First the current year’s statistics core, in 12-hour batches:
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
It took ~38 hours to finish processing this core.
statistics-2019
The statistics-2019 core, in 12-hour batches:
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2019
It took ~32 hours to finish processing this core.
statistics-2018
The statistics-2018 core, in 12-hour batches:
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2018
It took ~28 hours to finish processing this core.
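With runs this long it can help to record the duration automatically rather than noting it by hand. A sketch of a wrapper that reports wall-clock time; the `timed` helper is hypothetical and relies on `date +%s`, which GNU and BSD date both support:

```shell
# Hypothetical wrapper: run a command, then report how long it took and its
# exit status on stderr. The command's own exit status is passed through.
timed() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "Finished in $(( (end - start) / 3600 )) hours with exit status $status: $*" >&2
    return "$status"
}

# Example:
# for core in statistics statistics-2019 statistics-2018; do
#     timed chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c "$core"
# done
```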