CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2020

2020-02-02

  • Continue working on porting CGSpace’s DSpace 5 code to DSpace 6.3 that I started yesterday
    • Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database
    • I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks
    • Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff
    • The code finally builds and runs with a fresh install
  • Now we don’t specify the build environment because site modification are in local.cfg, so we just build like this:
$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
  • And it seems that we need to enabled pg_crypto now (used for UUIDs):
$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
CREATE EXTENSION pgcrypto;
  • I tried importing a PostgreSQL snapshot from CGSpace and had errors due to missing Atmire database migrations
    • If I try to run dspace database migrate I get the IDs of the migrations that are missing
    • I delete them manually in psql:
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
  • Then I ran dspace database migrate and got an error:
$ ~/dspace63/bin/dspace database migrate

Database URL: jdbc:postgresql://localhost:5432/dspace63?ApplicationName=dspaceCli
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:673)
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:576)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:221)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: org.flywaydb.core.internal.dbsupport.FlywaySqlScriptException:
Migration V6.0_2015.03.07__DS-2701_Hibernate_migration.sql failed
-----------------------------------------------------------------
SQL State  : 2BP01
Error Code : 0
Message    : ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
  Detail: view eperson_metadata depends on table metadatavalue column resource_id
  Hint: Use DROP ... CASCADE to drop the dependent objects too.
Location   : org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql (/home/aorth/src/git/DSpace-6.3/file:/home/aorth/dspace63/lib/dspace-api-6.3.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql)
Line       : 391
Statement  : ALTER TABLE metadatavalue DROP COLUMN IF EXISTS resource_id

        at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:117)
        at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:71)
        at org.flywaydb.core.internal.command.DbMigrate.doMigrate(DbMigrate.java:352)
        at org.flywaydb.core.internal.command.DbMigrate.access$1100(DbMigrate.java:47)
        at org.flywaydb.core.internal.command.DbMigrate$4.doInTransaction(DbMigrate.java:308)
        at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
        at org.flywaydb.core.internal.command.DbMigrate.applyMigration(DbMigrate.java:305)
        at org.flywaydb.core.internal.command.DbMigrate.access$1000(DbMigrate.java:47)
        at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:230)
        at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:173)
        at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
        at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:173)
        at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
        at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
        at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
        at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:662)
        ... 8 more
Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
  Detail: view eperson_metadata depends on table metadatavalue column resource_id
  Hint: Use DROP ... CASCADE to drop the dependent objects too.
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
        at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
        at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
        at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
        at org.flywaydb.core.internal.dbsupport.JdbcTemplate.executeStatement(JdbcTemplate.java:238)
        at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:114)
        ... 24 more
  • I think I might need to update the sequences first… nope
  • Perhaps it’s due to some missing bitstream IDs and I need to run dspace cleanup on CGSpace and take a new PostgreSQL dump… nope
  • A thread on the dspace-tech mailing list regarding this migration noticed that his database had some views created that were using the resource_id column
  • Our database had the same issue, where the eperson_metadata view was created by something (Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again:
dspace63=# DROP VIEW eperson_metadata;
DROP VIEW
  • After that the migration was successful and DSpace starts up successfully and begins indexing
    • xmlui, solr, jspui, rest, and oai are working (rest was redirecting to HTTPS, so I set the Tomcat connector to secure="true" and it fixed it on localhost, but caused other issues so I disabled it for now)
    • I started diffing our themes against the Mirage 2 reference theme to capture the latest changes

2020-02-03

  • Update DSpace mimetype fallback images from KDE Breeze Icons project
  • Issues remaining in the DSpace 6 port of our CGSpace 5.x code:
    • Community and collection pages only show one recent submission (seems that there is only one item in Solr?)
    • Community and collection pages have tons of “Browse” buttons that we need to remove
    • Order of navigation elements in right side bar (“My Account” etc, compare to DSpace Test)
    • Home page trail says “CGSpace Home” instead of “CGSpace Home / Community List” (see DSpace Test)
  • There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:
2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception: 
org.dspace.discovery.SearchServiceException: Invalid UUID string: 1
2020-02-03 13:20:20,475 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
  • If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…
  • I dropped all the documents in the search core:
$ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
  • Still didn’t work, so I’m going to try a clean database import and migration:
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspace63                               
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
$ ~/dspace63/bin/dspace database migrate
  • I notice that the indexing doesn’t work correctly if I start it manually with dspace index-discovery -b (search.resourceid becomes an integer!)
    • If I induce an indexing by touching dspace/solr/search/conf/reindex.flag the search.resourceid are all UUIDs…
  • Speaking of database stuff, there was a performance-related update for the indexes that we used in DSpace 5
    • We might want to apply it in DSpace 6, as it was never merged to 6.x, but it helped with the performance of /submissions in XMLUI for us in 2018-03

2020-02-04

  • The indexing issue I was having yesterday seems to only present itself the first time a new installation is running DSpace 6
    • Once the indexing induced by touching dspace/solr/search/conf/reindex.flag has finished, subsequent manual invocations of dspace index-discovery -b work as expected
    • Nevertheless, I sent a message to the dspace-tech mailing list describing the issue to see if anyone has any comments
  • I am seeing that the number of important commits on the unreleased DSpace 6.4 are really numerous and it might be better for us to target that version
    • I did a simple test and it’s easy to rebase my current 6.3 branch on top of the upstream dspace-6_x branch:
$ git checkout -b 6_x-dev64 6_x-dev
$ git rebase -i upstream/dspace-6_x
  • I finally understand why our themes show all the “Browse by” buttons on community and collection pages in DSpace 6.x
    • The code in ./dspace-xmlui/src/main/java/org/dspace/app/xmlui/aspect/browseArtifacts/CommunityBrowse.java iterates over all the browse indexes and prints them when it is called
    • The XMLUI theme code in dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/browse.xsl calls the template because the id of the div matches “aspect.browseArtifacts.CommunityBrowse.list.community-browse”
    • I checked the DRI of a community page on my local 6.x and DSpace Test 5.x by appending ?XML to the URL and I see the ID is missing on DSpace 5.x
    • The issue is the same with the ordering of the “My Account” link, but in Navigation.java
    • I tried modifying preprocess/browse.xsl but it always ends up printing some default list of browse by links…
    • I’m starting to wonder if Atmire’s modules somehow override this, as I don’t see how CommunityBrowse.java can behave like ours on DSpace 5.x unless they have overridden it (as the open source code is the same in 5.x and 6.x)
    • At least the “account” link in the sidebar is overridden in our 5.x branch because Atmire copied a modified Navigation.java to the local xmlui modules folder… so that explains that (and it’s easy to replicate in 6.x)

2020-02-05

  • UptimeRobot told me that AReS Explorer crashed last night, so I logged into it, ran all updates, and rebooted it
  • Testing Discovery indexing speed on my local DSpace 6.3:
$ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3771.78s user 93.63s system 41% cpu 2:34:19.53 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3360.28s user 82.63s system 38% cpu 2:30:22.07 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  4678.72s user 138.87s system 42% cpu 3:08:35.72 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3334.19s user 86.54s system 35% cpu 2:41:56.73 total
  • DSpace 5.8 was taking about 1 hour (or less on this laptop), so this is 2-3 times longer!
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  299.53s user 69.67s system 20% cpu 30:34.47 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  270.31s user 69.88s system 19% cpu 29:01.38 total
  • Checking out the DSpace 6.x REST API query client
    • There is a tutorial that explains how it works and I see it is very powerful because you can export a CSV of results in order to fix and re-upload them with batch import!
    • Custom queries can be added in dspace-rest/src/main/webapp/static/reports/restQueryReport.js
  • I noticed two new bots in the logs with the following user agents:
    • Jersey/2.6 (HttpUrlConnection 1.8.0_152)
    • magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
  • I filed an issue to add Jersey to the COUNTER-Robots list
  • Peter noticed that the statlets on community, collection, and item pages aren’t working on CGSpace
    • I thought it might be related to the fact that the yearly sharding didn’t complete successfully this year so the statistics-2019 core is empty
    • I removed the statistics-2019 core and had to restart Tomcat like six times before all cores would load properly (ugh!!!!)
    • After that the statlets were working properly…
  • Run all system updates on DSpace Test (linode19) and restart it

2020-02-06

  • I sent a mail to the dspace-tech mailing list asking about slow Discovery indexing speed in DSpace 6
  • I destroyed my PostgreSQL 9.6 containers and re-created them using PostgreSQL 10 to see if there are any speedups with DSpace 6.x:
$ podman pull postgres:10-alpine
$ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:10-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspace63                               
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
  • I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my check-spider-hits.sh script:
$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
  • I noticed another user agen in the logs that we should add to the list:
ReactorNetty/0.9.2.RELEASE
$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
$ ls -lh /tmp/statistics-2019-01.json
-rw-rw-r-- 1 aorth aorth 3.7G Feb  6 09:26 /tmp/statistics-2019-01.json
  • Then I tested importing this by creating a new core in my development environment:
$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
  • This imports the records into the core, but DSpace can’t see them, and when I restart Tomcat the core is not seen by Solr…
  • I got the core to load by adding it to dspace/solr/solr.xml manually, ie:
  <cores adminPath="/admin/cores">
  ...
    <core name="statistics" instanceDir="statistics" />
    <core name="statistics-2019" instanceDir="statistics">
        <property name="dataDir" value="/home/aorth/dspace/solr/statistics-2019/data" />
    </core>
  ...
  </cores>
  • But I don’t like having to do that… why doesn’t it load automatically?
  • I sent a mail to the dspace-tech mailing list to ask about it
  • Just for fun I tried to load these stats into a Solr 7.7.2 instance using the DSpace 7 solr config:
  • First, create a Solr statistics core using the DSpace 7 config:
$ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
  • Then try to import the stats, skipping a shitload of fields that are apparently added to our Solr statistics by Atmire modules:
$ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
  • OK that imported! I wonder if it works… maybe I’ll try another day

2020-02-07

$ cd ~/src/git/perf-map-agent
$ cmake     .
$ make
$ ./bin/create-links-in ~/.local/bin
$ export FLAMEGRAPH_DIR=/home/aorth/src/git/FlameGraph
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
$ ~/dspace63/bin/dspace index-discovery -b &
# pid of tomcat java process
$ perf-java-flames 4478
# pid of java indexing process
$ perf-java-flames 11359
  • All Java processes need to have -XX:+PreserveFramePointer if you want to trace their methods
  • I did the same tests against DSpace 5.8 and 6.4-SNAPSHOT’s CLI indexing process and Tomcat process
    • For what it’s worth, it appears all the Hibernate stuff is in the CLI processes, so we don’t need to trace the Tomcat process
  • Here is the flame graph for DSpace 5.8’s dspace index-discovery -b java process:

DSpace 5.8 index-discovery flame graph

  • Here is the flame graph for DSpace 6.4-SNAPSHOT’s dspace index-discovery -b java process:

DSpace 6.4-SNAPSHOT index-discovery flame graph

  • If the width of the stacks indicates time, then it’s clear that Hibernate takes longer…
  • Apparently there is a “flame diff” tool, I wonder if we can use that to compare!

2020-02-09

  • This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:
# CGSpace 5.8
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  385.72s user 131.16s system 19% cpu 43:21.18 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  382.95s user 127.31s system 20% cpu 42:10.07 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  368.56s user 143.97s system 20% cpu 42:22.66 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  360.09s user 104.03s system 19% cpu 39:24.41 total

# Vanilla DSpace 5.10
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b  236.19s user 59.70s system 3% cpu 2:03:31.14 total
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b  232.41s user 50.38s system 3% cpu 2:04:16.00 total

# Vanilla DSpace 6.4-SNAPSHOT
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  5112.96s user 127.80s system 40% cpu 3:36:53.98 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  5112.96s user 127.80s system 40% cpu 3:21:0.0 total
  • I generated better flame graphs for the DSpace indexing process by using perf-record-stack and filtering out the java process:
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export PERF_RECORD_SECONDS=60
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &
# process id of java indexing process (not Tomcat)
$ perf-java-record-stack 169639
$ sudo perf script -i /tmp/perf-169639.data > out.dspace510-1
$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl --color=java --hash > out.dspace510-1.svg
  • All data recorded on my laptop with the same kernel, same boot, etc.
  • CGSpace 5.8 (with Atmire patches):

DSpace 5.8 (with Atmire modules) index-discovery flame graph

  • Vanilla DSpace 5.10:

Vanilla DSpace 5.10 index-discovery flame graph

  • Vanilla DSpace 6.4-SNAPSHOT:

Vanilla DSpace 6.4-SNAPSHOT index-discovery flame graph

  • I sent my feedback to the dspace-tech mailing list so someone can hopefully comment.