CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

CGSpace DSpace 6 Upgrade

Notes about the DSpace 6 upgrade on CGSpace in 2020-11.

Processing Solr Statistics With solr-upgrade-statistics-6x

After the main upgrade process was finished and DSpace was running I started processing the Solr statistics with solr-upgrade-statistics-6x to migrate all IDs to UUIDs.

statistics

First process the current year’s statistics core:

$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
...
=================================================================
        *** Statistics Records with Legacy Id ***

           3,817,407    Bistream View
           1,693,443    Item View
             105,974    Collection View
              62,383    Community View
             163,192    Community Search
             162,581    Collection Search
             470,288    Unexpected Type & Full Site
        --------------------------------------
           6,475,268    TOTAL
=================================================================

After several rounds of processing it finished. Here are some statistics about unmigrated documents:

  • 227,000: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 471,000: id:/.+-unmigrated/
  • 698,000: *:* NOT id:/.{36}/
  • Majority are type: 5 (aka SITE, according to Constants.java) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"

statistics-2019

Processing the statistics-2019 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
...
=================================================================
        *** Statistics Records with Legacy Id ***

           5,569,344    Bistream View
           2,179,105    Item View
             117,194    Community View
             104,091    Collection View
             774,138    Community Search
             568,347    Collection Search
           1,482,620    Unexpected Type & Full Site
        --------------------------------------
          10,794,839    TOTAL
=================================================================

After several rounds of processing it finished. Here are some statistics about unmigrated documents:

  • 2,690,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 1,494,587: id:/.+-unmigrated/
  • 4,184,896: *:* NOT id:/.{36}/
  • 4,172,929 are type: 5 (aka SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"

statistics-2018

Processing the statistics-2018 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
...
=================================================================
        *** Statistics Records with Legacy Id ***

           3,561,532    Bistream View
           1,129,326    Item View
              97,401    Community View
              63,508    Collection View
             207,827    Community Search
              43,752    Collection Search
             457,820    Unexpected Type & Full Site
        --------------------------------------
           5,561,166    TOTAL
=================================================================

After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:

$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018

Eventually the processing finished. Here are some statistics about unmigrated documents:

  • 365,473: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 546,955: id:/.+-unmigrated/
  • 923,158: *:* NOT id:/.{36}/
  • 823,293: are type: 5 so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"

statistics-2017

Processing the statistics-2017 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
...
=================================================================
        *** Statistics Records with Legacy Id ***

           2,529,208    Bistream View
           1,618,717    Item View
             144,945    Community View
              74,249    Collection View
             479,647    Community Search
             114,658    Collection Search
             852,215    Unexpected Type & Full Site
        --------------------------------------
           5,813,639    TOTAL
=================================================================

Eventually the processing finished. Here are some statistics about unmigrated documents:

  • 808,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 893,868: id:/.+-unmigrated/
  • 1,702,177: *:* NOT id:/.{36}/
  • 1,660,524 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"

statistics-2016

Processing the statistics-2016 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
...
=================================================================
        *** Statistics Records with Legacy Id ***

           1,765,924    Bistream View
           1,151,575    Item View
             187,110    Community View
              51,204    Collection View
             347,382    Community Search
              66,605    Collection Search
             620,298    Unexpected Type & Full Site
        --------------------------------------
           4,190,098    TOTAL
=================================================================
  • 849,408: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 627,747: id:/.+-unmigrated/
  • 1,477,155: *:* NOT id:/.{36}/
  • 1,469,706 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"

statistics-2015

Processing the statistics-2015 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
...
=================================================================
        *** Statistics Records with Legacy Id ***

             990,916    Bistream View
             506,070    Item View
             116,153    Community View
              33,282    Collection View
              21,062    Community Search
              10,788    Collection Search
              52,107    Unexpected Type & Full Site
        --------------------------------------
           1,730,378    TOTAL
=================================================================

Summary of stats after processing:

  • 195,293: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 67,146: id:/.+-unmigrated/
  • 262,439: *:* NOT id:/.{36}/
  • 247,400 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"

statistics-2014

Processing the statistics-2014 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
...
=================================================================
        *** Statistics Records with Legacy Id ***

           2,381,603    Item View
           1,323,357    Bistream View
             501,545    Community View
             247,805    Collection View
                 250    Collection Search
                 188    Community Search
                  50    Item Search
              10,918    Unexpected Type & Full Site
        --------------------------------------
           4,465,716    TOTAL
=================================================================

Summary of unmigrated documents after processing:

  • 182,131: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 39,947: id:/.+-unmigrated/
  • 222,078: *:* NOT id:/.{36}/
  • 188,791 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"

statistics-2013

Processing the statistics-2013 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
...
=================================================================
        *** Statistics Records with Legacy Id ***

           2,352,124    Item View
           1,117,676    Bistream View
             575,711    Community View
             171,639    Collection View
                 248    Item Search
                   7    Collection Search
                   5    Community Search
               1,452    Unexpected Type & Full Site
        --------------------------------------
           4,218,862    TOTAL
=================================================================

Summary of unmigrated docs after processing:

  • 2,548 : (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
  • 29,772: id:/.+-unmigrated/
  • 32,320: *:* NOT id:/.{36}/
  • 15,691 are type: 5 (SITE) so we can purge them:
$ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"