cgspace-notes/docs/2020-02/index.html

698 lines
38 KiB
HTML
Raw Normal View History

2020-02-02 16:15:48 +01:00
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="February, 2020" />
<meta property="og:description" content="2020-02-02
Continue working on porting CGSpace&rsquo;s DSpace 5 code to DSpace 6.3 that I started yesterday
Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database
I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks
Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff
The code finally builds and runs with a fresh install
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
2020-02-23 08:16:50 +01:00
<meta property="article:modified_time" content="2020-02-19T15:17:32+02:00" />
2020-02-02 16:15:48 +01:00
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2020"/>
<meta name="twitter:description" content="2020-02-02
Continue working on porting CGSpace&rsquo;s DSpace 5 code to DSpace 6.3 that I started yesterday
Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database
I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks
Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff
The code finally builds and runs with a fresh install
"/>
2020-02-23 08:16:50 +01:00
<meta name="generator" content="Hugo 0.65.2" />
2020-02-02 16:15:48 +01:00
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "February, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
2020-02-23 08:16:50 +01:00
"wordCount": "3245",
2020-02-02 16:15:48 +01:00
"datePublished": "2020-02-02T11:56:30+02:00",
2020-02-23 08:16:50 +01:00
"dateModified": "2020-02-19T15:17:32+02:00",
2020-02-02 16:15:48 +01:00
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-02/">
<title>February, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.90e14c13cee52929ac33e1c21694a3cc95063a194eb22aad9f7976434e1a9125.js" integrity="sha256-kOFME87lKSmsM&#43;HCFpSjzJUGOhlOsiqtn3l2Q04akSU=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-02/">February, 2020</a></h2>
<p class="blog-post-meta"><time datetime="2020-02-02T11:56:30&#43;02:00">Sun Feb 02, 2020</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-02-02">2020-02-02</h2>
<ul>
<li>Continue working on porting CGSpace&rsquo;s DSpace 5 code to DSpace 6.3 that I started yesterday
<ul>
<li>Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database</li>
<li>I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks</li>
<li>Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff</li>
<li>The code finally builds and runs with a fresh install</li>
</ul>
</li>
</ul>
<ul>
<li>Now we don&rsquo;t specify the build environment because site modification are in <code>local.cfg</code>, so we just build like this:</li>
</ul>
<pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
</code></pre><ul>
<li>And it seems that we need to enabled <code>pg_crypto</code> now (used for UUIDs):</li>
</ul>
<pre><code>$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>I tried importing a PostgreSQL snapshot from CGSpace and had errors due to missing Atmire database migrations
<ul>
<li>If I try to run <code>dspace database migrate</code> I get the IDs of the migrations that are missing</li>
<li>I delete them manually in psql:</li>
</ul>
</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
</code></pre><ul>
<li>Then I ran <code>dspace database migrate</code> and got an error:</li>
</ul>
<pre><code>$ ~/dspace63/bin/dspace database migrate
Database URL: jdbc:postgresql://localhost:5432/dspace63?ApplicationName=dspaceCli
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:673)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:576)
at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:221)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: org.flywaydb.core.internal.dbsupport.FlywaySqlScriptException:
Migration V6.0_2015.03.07__DS-2701_Hibernate_migration.sql failed
-----------------------------------------------------------------
SQL State : 2BP01
Error Code : 0
Message : ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
Detail: view eperson_metadata depends on table metadatavalue column resource_id
Hint: Use DROP ... CASCADE to drop the dependent objects too.
Location : org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql (/home/aorth/src/git/DSpace-6.3/file:/home/aorth/dspace63/lib/dspace-api-6.3.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql)
Line : 391
Statement : ALTER TABLE metadatavalue DROP COLUMN IF EXISTS resource_id
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:117)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:71)
at org.flywaydb.core.internal.command.DbMigrate.doMigrate(DbMigrate.java:352)
at org.flywaydb.core.internal.command.DbMigrate.access$1100(DbMigrate.java:47)
at org.flywaydb.core.internal.command.DbMigrate$4.doInTransaction(DbMigrate.java:308)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.applyMigration(DbMigrate.java:305)
at org.flywaydb.core.internal.command.DbMigrate.access$1000(DbMigrate.java:47)
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:230)
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:173)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:173)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:662)
... 8 more
Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
Detail: view eperson_metadata depends on table metadatavalue column resource_id
Hint: Use DROP ... CASCADE to drop the dependent objects too.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.flywaydb.core.internal.dbsupport.JdbcTemplate.executeStatement(JdbcTemplate.java:238)
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:114)
... 24 more
</code></pre><ul>
<li>I think I might need to update the sequences first&hellip; nope</li>
<li>Perhaps it&rsquo;s due to some missing bitstream IDs and I need to run <code>dspace cleanup</code> on CGSpace and take a new PostgreSQL dump&hellip; nope</li>
<li>A thread on the dspace-tech mailing list regarding this migration noticed that his database had some views created that were using the <code>resource_id</code> column</li>
<li>Our database had the same issue, where the <code>eperson_metadata</code> view was created by something (Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again:</li>
</ul>
<pre><code>dspace63=# DROP VIEW eperson_metadata;
DROP VIEW
</code></pre><ul>
<li>After that the migration was successful and DSpace starts up successfully and begins indexing
<ul>
2020-02-04 07:44:50 +01:00
<li>xmlui, solr, jspui, rest, and oai are working (rest was redirecting to HTTPS, so I set the Tomcat connector to <code>secure=&quot;true&quot;</code> and it fixed it on localhost, but caused other issues so I disabled it for now)</li>
2020-02-02 16:15:48 +01:00
<li>I started diffing our themes against the Mirage 2 reference theme to capture the latest changes</li>
</ul>
</li>
</ul>
2020-02-04 07:44:50 +01:00
<h2 id="2020-02-03">2020-02-03</h2>
<ul>
<li>Update DSpace mimetype fallback images from <a href="https://github.com/KDE/breeze-icons">KDE Breeze Icons</a> project
<ul>
<li>Our icons are four years old (see <a href="https://alanorth.github.io/dspace-bitstream-icons/">my bitstream icons demo</a>)</li>
</ul>
</li>
<li>Issues remaining in the DSpace 6 port of our CGSpace 5.x code:
<ul>
2020-02-04 12:09:24 +01:00
<li><input checked="" disabled="" type="checkbox">Community and collection pages only show one recent submission (seems that there is only one item in Solr?)</li>
2020-02-05 12:50:58 +01:00
<li><input checked="" disabled="" type="checkbox">Community and collection pages have tons of &ldquo;Browse&rdquo; buttons that we need to remove</li>
2020-02-04 17:45:07 +01:00
<li><input checked="" disabled="" type="checkbox">Order of navigation elements in right side bar (&ldquo;My Account&rdquo; etc, compare to DSpace Test)</li>
2020-02-04 07:44:50 +01:00
<li><input disabled="" type="checkbox">Home page trail says &ldquo;CGSpace Home&rdquo; instead of &ldquo;CGSpace Home / Community List&rdquo; (see DSpace Test)</li>
</ul>
</li>
<li>There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:</li>
</ul>
<pre><code>2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception:
org.dspace.discovery.SearchServiceException: Invalid UUID string: 1
2020-02-03 13:20:20,475 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
</code></pre><ul>
<li>If I look in Solr&rsquo;s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now&hellip;</li>
<li>I dropped all the documents in the search core:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
</code></pre><ul>
<li>Still didn&rsquo;t work, so I&rsquo;m going to try a clean database import and migration:</li>
</ul>
<pre><code>$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
$ ~/dspace63/bin/dspace database migrate
</code></pre><ul>
<li>I notice that the indexing doesn&rsquo;t work correctly if I start it manually with <code>dspace index-discovery -b</code> (search.resourceid becomes an integer!)
<ul>
<li>If I induce an indexing by touching <code>dspace/solr/search/conf/reindex.flag</code> the search.resourceid are all UUIDs&hellip;</li>
</ul>
</li>
<li>Speaking of database stuff, there was a performance-related update for the <a href="https://github.com/DSpace/DSpace/pull/1791/">indexes that we used in DSpace 5</a>
<ul>
<li>We might want to <a href="https://github.com/DSpace/DSpace/pull/1792">apply it in DSpace 6</a>, as it was never merged to 6.x, but it helped with the performance of <code>/submissions</code> in XMLUI for us in <a href="/cgspace-notes/2018-03/">2018-03</a></li>
</ul>
</li>
</ul>
2020-02-04 12:09:24 +01:00
<h2 id="2020-02-04">2020-02-04</h2>
<ul>
<li>The indexing issue I was having yesterday seems to only present itself the first time a new installation is running DSpace 6
<ul>
<li>Once the indexing induced by touching <code>dspace/solr/search/conf/reindex.flag</code> has finished, subsequent manual invocations of <code>dspace index-discovery -b</code> work as expected</li>
<li>Nevertheless, I sent a message to the dspace-tech mailing list describing the issue to see if anyone has any comments</li>
</ul>
</li>
2020-02-04 14:52:38 +01:00
<li>I am seeing that the number of important commits on the unreleased DSpace 6.4 are really numerous and it might be better for us to target that version
<ul>
<li>I did a simple test and it&rsquo;s easy to rebase my current 6.3 branch on top of the upstream <code>dspace-6_x</code> branch:</li>
</ul>
</li>
</ul>
<pre><code>$ git checkout -b 6_x-dev64 6_x-dev
$ git rebase -i upstream/dspace-6_x
</code></pre><ul>
<li>I finally understand why our themes show all the &ldquo;Browse by&rdquo; buttons on community and collection pages in DSpace 6.x
<ul>
<li>The code in <code>./dspace-xmlui/src/main/java/org/dspace/app/xmlui/aspect/browseArtifacts/CommunityBrowse.java</code> iterates over all the browse indexes and prints them when it is called</li>
<li>The XMLUI theme code in <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/browse.xsl</code> calls the template because the id of the div matches &ldquo;aspect.browseArtifacts.CommunityBrowse.list.community-browse&rdquo;</li>
<li>I checked the DRI of a community page on my local 6.x and DSpace Test 5.x by appending <code>?XML</code> to the URL and I see the ID is missing on DSpace 5.x</li>
<li>The issue is the same with the ordering of the &ldquo;My Account&rdquo; link, but in Navigation.java</li>
<li>I tried modifying <code>preprocess/browse.xsl</code> but it always ends up printing some default list of browse by links&hellip;</li>
<li>I&rsquo;m starting to wonder if Atmire&rsquo;s modules somehow override this, as I don&rsquo;t see how <code>CommunityBrowse.java</code> can behave like ours on DSpace 5.x unless they have overridden it (as the open source code is the same in 5.x and 6.x)</li>
2020-02-04 17:45:07 +01:00
<li>At least the &ldquo;account&rdquo; link in the sidebar is overridden in our 5.x branch because Atmire copied a modified <code>Navigation.java</code> to the local xmlui modules folder&hellip; so that explains that (and it&rsquo;s easy to replicate in 6.x)</li>
2020-02-04 14:52:38 +01:00
</ul>
</li>
2020-02-04 12:09:24 +01:00
</ul>
2020-02-05 12:50:58 +01:00
<h2 id="2020-02-05">2020-02-05</h2>
<ul>
<li>UptimeRobot told me that AReS Explorer crashed last night, so I logged into it, ran all updates, and rebooted it</li>
<li>Testing Discovery indexing speed on my local DSpace 6.3:</li>
</ul>
<pre><code>$ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3771.78s user 93.63s system 41% cpu 2:34:19.53 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3360.28s user 82.63s system 38% cpu 2:30:22.07 total
2020-02-06 09:01:17 +01:00
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 4678.72s user 138.87s system 42% cpu 3:08:35.72 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3334.19s user 86.54s system 35% cpu 2:41:56.73 total
2020-02-05 12:50:58 +01:00
</code></pre><ul>
2020-02-06 09:01:17 +01:00
<li>DSpace 5.8 was taking about 1 hour (or less on this laptop), so this is 2-3 times longer!</li>
2020-02-05 12:50:58 +01:00
</ul>
<pre><code>$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 299.53s user 69.67s system 20% cpu 30:34.47 total
2020-02-06 09:01:17 +01:00
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 270.31s user 69.88s system 19% cpu 29:01.38 total
2020-02-05 17:58:04 +01:00
</code></pre><ul>
<li>Checking out the DSpace 6.x REST API query client
<ul>
<li>There is a <a href="https://terrywbrady.github.io/restReportTutorial/intro">tutorial</a> that explains how it works and I see it is very powerful because you can export a CSV of results in order to fix and re-upload them with batch import!</li>
<li>Custom queries can be added in <code>dspace-rest/src/main/webapp/static/reports/restQueryReport.js</code></li>
</ul>
</li>
<li>I noticed two new bots in the logs with the following user agents:
<ul>
<li><code>Jersey/2.6 (HttpUrlConnection 1.8.0_152)</code></li>
<li><code>magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)</code></li>
</ul>
</li>
<li>I filed an <a href="https://github.com/atmire/COUNTER-Robots/issues/30">issue to add Jersey to the COUNTER-Robots</a> list</li>
<li>Peter noticed that the statlets on community, collection, and item pages aren&rsquo;t working on CGSpace
<ul>
<li>I thought it might be related to the fact that the yearly sharding didn&rsquo;t complete successfully this year so the <code>statistics-2019</code> core is empty</li>
<li>I removed the <code>statistics-2019</code> core and had to restart Tomcat like six times before all cores would load properly (ugh!!!!)</li>
<li>After that the statlets were working properly&hellip;</li>
</ul>
</li>
<li>Run all system updates on DSpace Test (linode19) and restart it</li>
</ul>
2020-02-06 09:01:17 +01:00
<h2 id="2020-02-06">2020-02-06</h2>
<ul>
<li>I sent a mail to the dspace-tech mailing list asking about slow Discovery indexing speed in DSpace 6</li>
<li>I destroyed my PostgreSQL 9.6 containers and re-created them using PostgreSQL 10 to see if there are any speedups with DSpace 6.x:</li>
</ul>
<pre><code>$ podman pull postgres:10-alpine
$ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:10-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
</code></pre><ul>
<li>I purged ~33,000 hits from the &ldquo;Jersey/2.6&rdquo; bot in CGSpace&rsquo;s statistics using my <code>check-spider-hits.sh</code> script:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s &quot;statistics-${year}&quot; -u http://localhost:8081/solr; done
2020-02-06 11:47:25 +01:00
</code></pre><ul>
<li>I noticed another user agen in the logs that we should add to the list:</li>
</ul>
<pre><code>ReactorNetty/0.9.2.RELEASE
</code></pre><ul>
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to workfor exporting our 2019 stats from the large statistics core!</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
$ ls -lh /tmp/statistics-2019-01.json
-rw-rw-r-- 1 aorth aorth 3.7G Feb 6 09:26 /tmp/statistics-2019-01.json
</code></pre><ul>
<li>Then I tested importing this by creating a new core in my development environment:</li>
</ul>
2020-02-06 15:54:41 +01:00
<pre><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace/solr/statistics&amp;dataDir=/home/aorth/dspace/solr/statistics-2019/data'
2020-02-06 11:47:25 +01:00
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
</code></pre><ul>
<li>This imports the records into the core, but DSpace can&rsquo;t see them, and when I restart Tomcat the core is not seen by Solr&hellip;</li>
2020-02-06 15:54:41 +01:00
<li>I got the core to load by adding it to <code>dspace/solr/solr.xml</code> manually, ie:</li>
</ul>
<pre><code> &lt;cores adminPath=&quot;/admin/cores&quot;&gt;
...
&lt;core name=&quot;statistics&quot; instanceDir=&quot;statistics&quot; /&gt;
&lt;core name=&quot;statistics-2019&quot; instanceDir=&quot;statistics&quot;&gt;
&lt;property name=&quot;dataDir&quot; value=&quot;/home/aorth/dspace/solr/statistics-2019/data&quot; /&gt;
&lt;/core&gt;
...
&lt;/cores&gt;
</code></pre><ul>
<li>But I don&rsquo;t like having to do that&hellip; why doesn&rsquo;t it load automatically?</li>
<li>I sent a mail to the dspace-tech mailing list to ask about it</li>
<li>Just for fun I tried to load these stats into a Solr 7.7.2 instance using the DSpace 7 solr config:</li>
<li>First, create a Solr statistics core using the DSpace 7 config:</li>
</ul>
<pre><code>$ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
</code></pre><ul>
<li>Then try to import the stats, skipping a shitload of fields that are apparently added to our Solr statistics by Atmire modules:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>OK that imported! I wonder if it works&hellip; maybe I&rsquo;ll try another day</li>
2020-02-06 11:47:25 +01:00
</ul>
2020-02-07 13:44:08 +01:00
<h2 id="2020-02-07">2020-02-07</h2>
<ul>
<li>I did some investigation into DSpace indexing performance using flame graphs
<ul>
<li>Excellent introduction: <a href="http://www.brendangregg.com/flamegraphs.html">http://www.brendangregg.com/flamegraphs.html</a></li>
<li>Using flame graphs with java: <a href="https://netflixtechblog.com/java-in-flames-e763b3d32166">https://netflixtechblog.com/java-in-flames-e763b3d32166</a></li>
<li>Fantastic wrapper scripts for doing perf on Java processes: <a href="https://github.com/jvm-profiling-tools/perf-map-agent">https://github.com/jvm-profiling-tools/perf-map-agent</a></li>
</ul>
</li>
</ul>
<pre><code>$ cd ~/src/git/perf-map-agent
$ cmake .
$ make
$ ./bin/create-links-in ~/.local/bin
$ export FLAMEGRAPH_DIR=/home/aorth/src/git/FlameGraph
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export JAVA_OPTS=&quot;-XX:+PreserveFramePointer&quot;
$ ~/dspace63/bin/dspace index-discovery -b &amp;
# pid of tomcat java process
$ perf-java-flames 4478
# pid of java indexing process
$ perf-java-flames 11359
</code></pre><ul>
<li>All Java processes need to have <code>-XX:+PreserveFramePointer</code> if you want to trace their methods</li>
<li>I did the same tests against DSpace 5.8 and 6.4-SNAPSHOT&rsquo;s CLI indexing process and Tomcat process
<ul>
<li>For what it&rsquo;s worth, it appears all the Hibernate stuff is in the CLI processes, so we don&rsquo;t need to trace the Tomcat process</li>
</ul>
</li>
<li>Here is the flame graph for DSpace 5.8&rsquo;s <code>dspace index-discovery -b</code> java process:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/flamegraph-java-cli-dspace58.svg" alt="DSpace 5.8 index-discovery flame graph"></p>
<ul>
<li>Here is the flame graph for DSpace 6.4-SNAPSHOT&rsquo;s <code>dspace index-discovery -b</code> java process:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/flamegraph-java-cli-dspace64-snapshot.svg" alt="DSpace 6.4-SNAPSHOT index-discovery flame graph"></p>
<ul>
<li>If the width of the stacks indicates time, then it&rsquo;s clear that Hibernate takes longer&hellip;</li>
<li>Apparently there is a &ldquo;flame diff&rdquo; tool, I wonder if we can use that to compare!</li>
</ul>
2020-02-09 13:08:16 +01:00
<h2 id="2020-02-09">2020-02-09</h2>
<ul>
<li>This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:</li>
</ul>
<pre><code># CGSpace 5.8
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 385.72s user 131.16s system 19% cpu 43:21.18 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 382.95s user 127.31s system 20% cpu 42:10.07 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 368.56s user 143.97s system 20% cpu 42:22.66 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 360.09s user 104.03s system 19% cpu 39:24.41 total
# Vanilla DSpace 5.10
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 236.19s user 59.70s system 3% cpu 2:03:31.14 total
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 232.41s user 50.38s system 3% cpu 2:04:16.00 total
# Vanilla DSpace 6.4-SNAPSHOT
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:36:53.98 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:21:0.0 total
</code></pre><ul>
<li>I generated better flame graphs for the DSpace indexing process by using <code>perf-record-stack</code> and filtering out the java process:</li>
</ul>
<pre><code>$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export PERF_RECORD_SECONDS=60
$ export JAVA_OPTS=&quot;-XX:+PreserveFramePointer&quot;
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &amp;
# process id of java indexing process (not Tomcat)
$ perf-java-record-stack 169639
$ sudo perf script -i /tmp/perf-169639.data &gt; out.dspace510-1
$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl --color=java --hash &gt; out.dspace510-1.svg
</code></pre><ul>
<li>All data recorded on my laptop with the same kernel, same boot, etc.</li>
<li>CGSpace 5.8 (with Atmire patches):</li>
</ul>
<p><img src="/cgspace-notes/2020/02/out.dspace58-2.svg" alt="DSpace 5.8 (with Atmire modules) index-discovery flame graph"></p>
<ul>
<li>Vanilla DSpace 5.10:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/out.dspace510-3.svg" alt="Vanilla DSpace 5.10 index-discovery flame graph"></p>
<ul>
<li>Vanilla DSpace 6.4-SNAPSHOT:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/out.dspace64-3.svg" alt="Vanilla DSpace 6.4-SNAPSHOT index-discovery flame graph"></p>
2020-02-09 14:55:49 +01:00
<ul>
<li>I sent my feedback to the dspace-tech mailing list so someone can hopefully comment.</li>
2020-02-09 16:34:12 +01:00
<li>Last week Peter asked Sisay to upload some items to CGSpace in the GENNOVATE collection (part of Gender CRP)
<ul>
<li>He uploaded them here: <a href="https://cgspace.cgiar.org/handle/10568/105926">https://cgspace.cgiar.org/handle/10568/105926</a></li>
<li>On a whim I checked and found five duplicates there, which means Sisay didn&rsquo;t even check</li>
</ul>
</li>
2020-02-09 14:55:49 +01:00
</ul>
2020-02-11 11:52:11 +01:00
<h2 id="2020-02-10">2020-02-10</h2>
<ul>
<li>Follow up with <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">Atmire about DSpace 6.x upgrade</a>
<ul>
<li>I raised the issue of targetting 6.4-SNAPSHOT as well as the Discovery indexing performance issues in 6.x</li>
</ul>
</li>
</ul>
<h2 id="2020-02-11">2020-02-11</h2>
<ul>
<li>Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-02-11-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using <code>fix-metadata-values.py</code>:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
</code></pre><ul>
<li>On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
<ul>
<li>I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:</li>
</ul>
</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Staver, Charles&quot;,charles staver: 0000-0002-4532-6077
&quot;Staver, C.&quot;,charles staver: 0000-0002-4532-6077
&quot;Fungo, R.&quot;,Robert Fungo: 0000-0002-4264-6905
&quot;Remans, R.&quot;,Roseline Remans: 0000-0003-3659-8529
&quot;Remans, Roseline&quot;,Roseline Remans: 0000-0003-3659-8529
&quot;Rietveld A.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Rietveld, A.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Rietveld, A.M.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Rietveld, Anne M.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Fongar, A.&quot;,Andrea Fongar: 0000-0003-2084-1571
&quot;Müller, Anna&quot;,Anna Müller: 0000-0003-3120-8560
&quot;Müller, A.&quot;,Anna Müller: 0000-0003-3120-8560
</code></pre><ul>
2020-02-17 07:42:15 +01:00
<li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 144 ORCID iDs to items on CGSpace!</li>
2020-02-11 11:52:11 +01:00
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
2020-02-17 07:42:15 +01:00
</code></pre><ul>
<li>Minor updates to all Python utility scripts in the CGSpace git repository</li>
<li>Update the spider agent patterns in CGSpace <code>5_x-prod</code> branch from the latest <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> project
<ul>
<li>I ran the <code>check-spider-hits.sh</code> script with the updated file and purged 6,000 hits from our Solr statistics core on CGSpace</li>
</ul>
</li>
</ul>
<h2 id="2020-02-12">2020-02-12</h2>
<ul>
<li>Follow up with people about AReS funding for next phase</li>
<li>Peter asked about the &ldquo;stats&rdquo; and &ldquo;summary&rdquo; reports that he had requested in December
<ul>
<li>I opened a <a href="https://github.com/ilri/AReS/issues/13">new issue on AReS for the &ldquo;summary&rdquo; report</a></li>
</ul>
</li>
<li>Peter asked me to update John McIntire&rsquo;s name format on CGSpace so I ran the following PostgreSQL query:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
UPDATE 26
</code></pre><h2 id="2020-02-17">2020-02-17</h2>
<ul>
<li>A few days ago Atmire responded to my question about DSpace 6.4-SNAPSHOT saying that they can only confirm that 6.3 works with their modules
<ul>
<li>I responded to say that we agree to target 6.3, but that I will cherry-pick important patches from the <code>dspace-6_x</code> branch at our own responsibility</li>
</ul>
</li>
<li>Send a message to dspace-devel asking them to tag DSpace 6.4</li>
2020-02-19 14:17:32 +01:00
<li>Udana from IWMI asked about the OAI base URL for their community on CGSpace
<ul>
<li>I think it should be this: <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_16814">https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_16814</a></li>
</ul>
</li>
</ul>
<h2 id="2020-02-19">2020-02-19</h2>
<ul>
<li>I noticed a thread on the mailing list about the Tomcat header size and Solr max boolean clauses error
<ul>
<li>The solution is to do as we have done and increase the headers / boolean clauses, or to simply <a href="https://wiki.lyrasis.org/display/DSPACE/TechnicalFaq#TechnicalFAQ-I'mgetting%22SolrException:BadRequest%22followedbyalongqueryora%22tooManyClauses%22Exception">disable access rights awareness</a> in Discovery</li>
<li>I applied the fix to the <code>5_x-prod</code> branch and cherry-picked it to <code>6_x-dev</code></li>
</ul>
</li>
<li>Upgrade Tomcat from 7.0.99 to 7.0.100 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Upgrade PostgreSQL JDBC driver from 42.2.9 to 42.2.10 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Run Tomcat and PostgreSQL JDBC driver updates on DSpace Test (linode19)</li>
2020-02-17 07:42:15 +01:00
</ul>
2020-02-23 08:16:50 +01:00
<h2 id="2020-02-23">2020-02-23</h2>
<ul>
<li>I see a new spider in the nginx logs on CGSpace:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
</code></pre><ul>
<li>I think this should be covered by the <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> patterns for the statistics at least&hellip;</li>
<li>I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
</code></pre><ul>
<li>Another IP address (31.6.77.23) in the UK making a few hundred requests without a user agent</li>
<li>I will add the IP addresses to the nginx badbots list</li>
<li>Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)</li>
<li>Deploy latest <code>5_x-prod</code> branch on CGSpace (linode18)</li>
<li>Run all system updates on CGSpace (linode18) server and reboot it
<ul>
<li>After the server came back up Tomcat started, but there were errors loading some Solr statistics cores</li>
<li>Luckily after restarting Tomcat once more they all came back up</li>
</ul>
</li>
</ul>
2020-02-17 07:42:15 +01:00
<!-- raw HTML omitted -->
2020-02-02 16:15:48 +01:00
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2020-02/">February, 2020</a></li>
<li><a href="/cgspace-notes/2020-01/">January, 2020</a></li>
<li><a href="/cgspace-notes/2019-12/">December, 2019</a></li>
<li><a href="/cgspace-notes/2019-11/">November, 2019</a></li>
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>