<!DOCTYPE html>
|
||
<html lang="en" >
|
||
|
||
<head>
|
||
<meta charset="utf-8">
|
||
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
||
|
||
<meta property="og:title" content="February, 2020" />
|
||
<meta property="og:description" content="2020-02-02
|
||
|
||
Continue working on porting CGSpace’s DSpace 5 code to DSpace 6.3 that I started yesterday
|
||
|
||
Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database
|
||
I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks
|
||
Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff
|
||
The code finally builds and runs with a fresh install
|
||
|
||
|
||
" />
|
||
<meta property="og:type" content="article" />
|
||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
|
||
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
|
||
<meta property="article:modified_time" content="2020-02-23T09:16:50+02:00" />
|
||
|
||
<meta name="twitter:card" content="summary"/>
|
||
<meta name="twitter:title" content="February, 2020"/>
|
||
<meta name="twitter:description" content="2020-02-02
|
||
|
||
Continue working on porting CGSpace’s DSpace 5 code to DSpace 6.3 that I started yesterday
|
||
|
||
Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database
|
||
I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks
|
||
Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff
|
||
The code finally builds and runs with a fresh install
|
||
|
||
|
||
"/>
|
||
<meta name="generator" content="Hugo 0.65.3" />
|
||
|
||
|
||
|
||
<script type="application/ld+json">
|
||
{
|
||
"@context": "http://schema.org",
|
||
"@type": "BlogPosting",
|
||
"headline": "February, 2020",
|
||
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
|
||
"wordCount": "4210",
|
||
"datePublished": "2020-02-02T11:56:30+02:00",
|
||
"dateModified": "2020-02-23T09:16:50+02:00",
|
||
"author": {
|
||
"@type": "Person",
|
||
"name": "Alan Orth"
|
||
},
|
||
"keywords": "Notes"
|
||
}
|
||
</script>
|
||
|
||
|
||
|
||
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-02/">
|
||
|
||
<title>February, 2020 | CGSpace Notes</title>
|
||
|
||
|
||
<!-- combined, minified CSS -->
|
||
|
||
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
|
||
|
||
|
||
<!-- minified Font Awesome for SVG icons -->
|
||
|
||
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.90e14c13cee52929ac33e1c21694a3cc95063a194eb22aad9f7976434e1a9125.js" integrity="sha256-kOFME87lKSmsM+HCFpSjzJUGOhlOsiqtn3l2Q04akSU=" crossorigin="anonymous"></script>
|
||
|
||
<!-- RSS 2.0 feed -->
|
||
|
||
|
||
|
||
|
||
|
||
|
||
</head>
|
||
|
||
<body>
|
||
|
||
|
||
<div class="blog-masthead">
|
||
<div class="container">
|
||
<nav class="nav blog-nav">
|
||
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
||
</nav>
|
||
</div>
|
||
</div>
|
||
|
||
|
||
|
||
|
||
<header class="blog-header">
|
||
<div class="container">
|
||
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
||
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
||
</div>
|
||
</header>
|
||
|
||
|
||
|
||
|
||
<div class="container">
|
||
<div class="row">
|
||
<div class="col-sm-8 blog-main">
|
||
|
||
|
||
|
||
|
||
<article class="blog-post">
|
||
<header>
|
||
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-02/">February, 2020</a></h2>
|
||
<p class="blog-post-meta"><time datetime="2020-02-02T11:56:30+02:00">Sun Feb 02, 2020</time> by Alan Orth in
|
||
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
|
||
|
||
|
||
</p>
|
||
</header>
|
||
<h2 id="2020-02-02">2020-02-02</h2>
|
||
<ul>
|
||
<li>Continue working on porting CGSpace’s DSpace 5 code to DSpace 6.3 that I started yesterday
|
||
<ul>
|
||
<li>Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database</li>
|
||
<li>I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks</li>
|
||
<li>Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff</li>
|
||
<li>The code finally builds and runs with a fresh install</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<ul>
|
||
<li>Now we don’t specify the build environment because site modifications are in <code>local.cfg</code>, so we just build like this:</li>
|
||
</ul>
|
||
<pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
|
||
</code></pre><ul>
|
||
<li>And it seems that we need to enable the <code>pgcrypto</code> extension now (used for UUIDs):</li>
|
||
</ul>
|
||
<pre><code>$ psql -h localhost -U postgres dspace63
|
||
dspace63=# CREATE EXTENSION pgcrypto;
|
||
CREATE EXTENSION
|
||
</code></pre><ul>
|
||
<li>I tried importing a PostgreSQL snapshot from CGSpace and had errors due to missing Atmire database migrations
|
||
<ul>
|
||
<li>If I try to run <code>dspace database migrate</code> I get the IDs of the migrations that are missing</li>
|
||
<li>I delete them manually in psql:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
|
||
</code></pre><ul>
|
||
<li>Then I ran <code>dspace database migrate</code> and got an error:</li>
|
||
</ul>
|
||
<pre><code>$ ~/dspace63/bin/dspace database migrate
|
||
|
||
Database URL: jdbc:postgresql://localhost:5432/dspace63?ApplicationName=dspaceCli
|
||
Migrating database to latest version... (Check dspace logs for details)
|
||
Migration exception:
|
||
java.sql.SQLException: Flyway migration error occurred
|
||
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:673)
|
||
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:576)
|
||
at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:221)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
|
||
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
|
||
at java.lang.reflect.Method.invoke(Method.java:498)
|
||
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
|
||
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
|
||
Caused by: org.flywaydb.core.internal.dbsupport.FlywaySqlScriptException:
|
||
Migration V6.0_2015.03.07__DS-2701_Hibernate_migration.sql failed
|
||
-----------------------------------------------------------------
|
||
SQL State : 2BP01
|
||
Error Code : 0
|
||
Message : ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
|
||
Detail: view eperson_metadata depends on table metadatavalue column resource_id
|
||
Hint: Use DROP ... CASCADE to drop the dependent objects too.
|
||
Location : org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql (/home/aorth/src/git/DSpace-6.3/file:/home/aorth/dspace63/lib/dspace-api-6.3.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql)
|
||
Line : 391
|
||
Statement : ALTER TABLE metadatavalue DROP COLUMN IF EXISTS resource_id
|
||
|
||
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:117)
|
||
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:71)
|
||
at org.flywaydb.core.internal.command.DbMigrate.doMigrate(DbMigrate.java:352)
|
||
at org.flywaydb.core.internal.command.DbMigrate.access$1100(DbMigrate.java:47)
|
||
at org.flywaydb.core.internal.command.DbMigrate$4.doInTransaction(DbMigrate.java:308)
|
||
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
|
||
at org.flywaydb.core.internal.command.DbMigrate.applyMigration(DbMigrate.java:305)
|
||
at org.flywaydb.core.internal.command.DbMigrate.access$1000(DbMigrate.java:47)
|
||
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:230)
|
||
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:173)
|
||
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
|
||
at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:173)
|
||
at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
|
||
at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
|
||
at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
|
||
at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
|
||
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:662)
|
||
... 8 more
|
||
Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
|
||
Detail: view eperson_metadata depends on table metadatavalue column resource_id
|
||
Hint: Use DROP ... CASCADE to drop the dependent objects too.
|
||
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
|
||
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
|
||
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
|
||
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
|
||
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
|
||
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
|
||
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
|
||
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
|
||
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
|
||
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
|
||
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
|
||
at org.flywaydb.core.internal.dbsupport.JdbcTemplate.executeStatement(JdbcTemplate.java:238)
|
||
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:114)
|
||
... 24 more
|
||
</code></pre><ul>
|
||
<li>I think I might need to update the sequences first… nope</li>
|
||
<li>Perhaps it’s due to some missing bitstream IDs and I need to run <code>dspace cleanup</code> on CGSpace and take a new PostgreSQL dump… nope</li>
|
||
<li>Someone in a thread on the dspace-tech mailing list regarding this migration noticed that his database had some views created that were using the <code>resource_id</code> column</li>
<li>Our database had the same issue, where the <code>eperson_metadata</code> view was created by something (an Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again (see the end of this section for a quick way to find such views):</li>
|
||
</ul>
|
||
<pre><code>dspace63=# DROP VIEW eperson_metadata;
|
||
DROP VIEW
|
||
</code></pre><ul>
|
||
<li>After that the migration completed successfully and DSpace starts up and begins indexing
|
||
<ul>
|
||
<li>xmlui, solr, jspui, rest, and oai are working (rest was redirecting to HTTPS, so I set the Tomcat connector to <code>secure="true"</code> and it fixed it on localhost, but caused other issues so I disabled it for now)</li>
|
||
<li>I started diffing our themes against the Mirage 2 reference theme to capture the latest changes</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
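<ul>
<li>For future reference, a quick way to check whether any other views reference the <code>resource_id</code> column before dropping them (just a sketch against PostgreSQL’s standard information_schema, nothing DSpace specific):</li>
</ul>
<pre><code>$ psql -h localhost -U postgres dspace63 -c "SELECT view_schema, view_name FROM information_schema.view_column_usage WHERE table_name='metadatavalue' AND column_name='resource_id';"
</code></pre>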
|
||
<h2 id="2020-02-03">2020-02-03</h2>
|
||
<ul>
|
||
<li>Update DSpace mimetype fallback images from <a href="https://github.com/KDE/breeze-icons">KDE Breeze Icons</a> project
|
||
<ul>
|
||
<li>Our icons are four years old (see <a href="https://alanorth.github.io/dspace-bitstream-icons/">my bitstream icons demo</a>)</li>
|
||
</ul>
|
||
</li>
|
||
<li>Issues remaining in the DSpace 6 port of our CGSpace 5.x code:
|
||
<ul>
|
||
<li><input checked="" disabled="" type="checkbox">Community and collection pages only show one recent submission (seems that there is only one item in Solr?)</li>
|
||
<li><input checked="" disabled="" type="checkbox">Community and collection pages have tons of “Browse” buttons that we need to remove</li>
|
||
<li><input checked="" disabled="" type="checkbox">Order of navigation elements in right side bar (“My Account” etc, compare to DSpace Test)</li>
|
||
<li><input disabled="" type="checkbox">Home page trail says “CGSpace Home” instead of “CGSpace Home / Community List” (see DSpace Test)</li>
|
||
</ul>
|
||
</li>
|
||
<li>There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:</li>
|
||
</ul>
|
||
<pre><code>2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception:
|
||
org.dspace.discovery.SearchServiceException: Invalid UUID string: 1
|
||
2020-02-03 13:20:20,475 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
|
||
org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
|
||
</code></pre><ul>
|
||
<li>If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…</li>
|
||
<li>I dropped all the documents in the search core:</li>
|
||
</ul>
|
||
<pre><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
|
||
</code></pre><ul>
|
||
<li>Still didn’t work, so I’m going to try a clean database import and migration:</li>
|
||
</ul>
|
||
<pre><code>$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
|
||
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
|
||
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
|
||
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
|
||
$ psql -h localhost -U postgres dspace63
|
||
dspace63=# CREATE EXTENSION pgcrypto;
|
||
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
|
||
dspace63=# DROP VIEW eperson_metadata;
|
||
dspace63=# \q
|
||
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
|
||
$ ~/dspace63/bin/dspace database migrate
|
||
</code></pre><ul>
|
||
<li>I notice that the indexing doesn’t work correctly if I start it manually with <code>dspace index-discovery -b</code> (search.resourceid becomes an integer!)
|
||
<ul>
|
||
<li>If I induce an indexing by touching <code>dspace/solr/search/conf/reindex.flag</code> the search.resourceid values are all UUIDs… (see the note at the end of this section)</li>
|
||
</ul>
|
||
</li>
|
||
<li>Speaking of database stuff, there was a performance-related update for the <a href="https://github.com/DSpace/DSpace/pull/1791/">indexes that we used in DSpace 5</a>
|
||
<ul>
|
||
<li>We might want to <a href="https://github.com/DSpace/DSpace/pull/1792">apply it in DSpace 6</a>, as it was never merged to 6.x, but it helped with the performance of <code>/submissions</code> in XMLUI for us in <a href="/cgspace-notes/2018-03/">2018-03</a></li>
|
||
</ul>
|
||
</li>
|
||
</ul>
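<ul>
<li>For the record, this is roughly how I trigger the flag-based reindex and spot check the resource IDs afterwards (paths and port are from my local dev setup):</li>
</ul>
<pre><code>$ touch ~/dspace63/solr/search/conf/reindex.flag
# wait for the background reindex to finish, then verify search.resourceid values are UUIDs rather than integers
$ curl -s 'http://localhost:8080/solr/search/select?q=*:*&fl=search.resourceid&rows=5&wt=json'
</code></pre>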
|
||
<h2 id="2020-02-04">2020-02-04</h2>
|
||
<ul>
|
||
<li>The indexing issue I was having yesterday seems to only present itself the first time a new installation is running DSpace 6
|
||
<ul>
|
||
<li>Once the indexing induced by touching <code>dspace/solr/search/conf/reindex.flag</code> has finished, subsequent manual invocations of <code>dspace index-discovery -b</code> work as expected</li>
|
||
<li>Nevertheless, I sent a message to the dspace-tech mailing list describing the issue to see if anyone has any comments</li>
|
||
</ul>
|
||
</li>
|
||
<li>I am seeing that there are quite a few important commits on the unreleased DSpace 6.4, so it might be better for us to target that version
|
||
<ul>
|
||
<li>I did a simple test and it’s easy to rebase my current 6.3 branch on top of the upstream <code>dspace-6_x</code> branch:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code>$ git checkout -b 6_x-dev64 6_x-dev
|
||
$ git rebase -i upstream/dspace-6_x
|
||
</code></pre><ul>
|
||
<li>I finally understand why our themes show all the “Browse by” buttons on community and collection pages in DSpace 6.x
|
||
<ul>
|
||
<li>The code in <code>./dspace-xmlui/src/main/java/org/dspace/app/xmlui/aspect/browseArtifacts/CommunityBrowse.java</code> iterates over all the browse indexes and prints them when it is called</li>
|
||
<li>The XMLUI theme code in <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/browse.xsl</code> calls the template because the id of the div matches “aspect.browseArtifacts.CommunityBrowse.list.community-browse”</li>
|
||
<li>I checked the DRI of a community page on my local 6.x and DSpace Test 5.x by appending <code>?XML</code> to the URL and I see the ID is missing on DSpace 5.x</li>
|
||
<li>The issue is the same with the ordering of the “My Account” link, but in Navigation.java</li>
|
||
<li>I tried modifying <code>preprocess/browse.xsl</code> but it always ends up printing some default list of browse by links…</li>
|
||
<li>I’m starting to wonder if Atmire’s modules somehow override this, as I don’t see how <code>CommunityBrowse.java</code> can behave like ours on DSpace 5.x unless they have overridden it (as the open source code is the same in 5.x and 6.x)</li>
|
||
<li>At least the “account” link in the sidebar is overridden in our 5.x branch because Atmire copied a modified <code>Navigation.java</code> to the local xmlui modules folder… so that explains that (and it’s easy to replicate in 6.x)</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
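<ul>
<li>For reference, the <code>?XML</code> trick above can be done from the command line too; something like this should print a non-zero count on 6.x and zero on 5.x (the handle and XMLUI context path here are just placeholders for whatever community you test):</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8080/xmlui/handle/10568/1?XML' | grep -c 'aspect.browseArtifacts.CommunityBrowse.list.community-browse'
</code></pre>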
|
||
<h2 id="2020-02-05">2020-02-05</h2>
|
||
<ul>
|
||
<li>UptimeRobot told me that AReS Explorer crashed last night, so I logged into it, ran all updates, and rebooted it</li>
|
||
<li>Testing Discovery indexing speed on my local DSpace 6.3:</li>
|
||
</ul>
|
||
<pre><code>$ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
|
||
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3771.78s user 93.63s system 41% cpu 2:34:19.53 total
|
||
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3360.28s user 82.63s system 38% cpu 2:30:22.07 total
|
||
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 4678.72s user 138.87s system 42% cpu 3:08:35.72 total
|
||
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3334.19s user 86.54s system 35% cpu 2:41:56.73 total
|
||
</code></pre><ul>
|
||
<li>DSpace 5.8 was taking about 1 hour (or less on this laptop), so this is 2-3 times longer!</li>
|
||
</ul>
|
||
<pre><code>$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
|
||
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 299.53s user 69.67s system 20% cpu 30:34.47 total
|
||
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 270.31s user 69.88s system 19% cpu 29:01.38 total
|
||
</code></pre><ul>
|
||
<li>Checking out the DSpace 6.x REST API query client
|
||
<ul>
|
||
<li>There is a <a href="https://terrywbrady.github.io/restReportTutorial/intro">tutorial</a> that explains how it works and I see it is very powerful because you can export a CSV of results in order to fix and re-upload them with batch import!</li>
|
||
<li>Custom queries can be added in <code>dspace-rest/src/main/webapp/static/reports/restQueryReport.js</code></li>
|
||
</ul>
|
||
</li>
|
||
<li>I noticed two new bots in the logs with the following user agents:
|
||
<ul>
|
||
<li><code>Jersey/2.6 (HttpUrlConnection 1.8.0_152)</code></li>
|
||
<li><code>magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)</code></li>
|
||
</ul>
|
||
</li>
|
||
<li>I filed an <a href="https://github.com/atmire/COUNTER-Robots/issues/30">issue to add Jersey to the COUNTER-Robots</a> list</li>
|
||
<li>Peter noticed that the statlets on community, collection, and item pages aren’t working on CGSpace
|
||
<ul>
|
||
<li>I thought it might be related to the fact that the yearly sharding didn’t complete successfully this year so the <code>statistics-2019</code> core is empty</li>
|
||
<li>I removed the <code>statistics-2019</code> core and had to restart Tomcat like six times before all cores would load properly (ugh!!!!)</li>
|
||
<li>After that the statlets were working properly…</li>
|
||
</ul>
|
||
</li>
|
||
<li>Run all system updates on DSpace Test (linode19) and restart it</li>
|
||
</ul>
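<ul>
<li>To gauge the scale of those two new bots I could count their hits in the nginx logs, for example (log paths as on our servers):</li>
</ul>
<pre><code># zcat /var/log/nginx/*.log.*.gz | grep -c 'Jersey/2.6'
# zcat /var/log/nginx/*.log.*.gz | grep -c 'magpie-crawler'
</code></pre>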
|
||
<h2 id="2020-02-06">2020-02-06</h2>
|
||
<ul>
|
||
<li>I sent a mail to the dspace-tech mailing list asking about slow Discovery indexing speed in DSpace 6</li>
|
||
<li>I destroyed my PostgreSQL 9.6 containers and re-created them using PostgreSQL 10 to see if there are any speedups with DSpace 6.x:</li>
|
||
</ul>
|
||
<pre><code>$ podman pull postgres:10-alpine
|
||
$ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:10-alpine
|
||
$ createuser -h localhost -U postgres --pwprompt dspacetest
|
||
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
|
||
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
|
||
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
|
||
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
|
||
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
|
||
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
|
||
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
|
||
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
|
||
$ psql -h localhost -U postgres dspace63
|
||
dspace63=# CREATE EXTENSION pgcrypto;
|
||
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
|
||
dspace63=# DROP VIEW eperson_metadata;
|
||
dspace63=# \q
|
||
</code></pre><ul>
|
||
<li>I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my <code>check-spider-hits.sh</code> script:</li>
|
||
</ul>
|
||
<pre><code>$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
|
||
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
|
||
</code></pre><ul>
|
||
<li>I noticed another user agent in the logs that we should add to the list:</li>
|
||
</ul>
|
||
<pre><code>ReactorNetty/0.9.2.RELEASE
|
||
</code></pre><ul>
|
||
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
|
||
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to work for exporting our 2019 stats from the large statistics core!</li>
|
||
</ul>
|
||
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
|
||
$ ls -lh /tmp/statistics-2019-01.json
|
||
-rw-rw-r-- 1 aorth aorth 3.7G Feb 6 09:26 /tmp/statistics-2019-01.json
|
||
</code></pre><ul>
|
||
<li>Then I tested importing this by creating a new core in my development environment:</li>
|
||
</ul>
|
||
<pre><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
|
||
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
|
||
</code></pre><ul>
|
||
<li>This imports the records into the core, but DSpace can’t see them, and when I restart Tomcat the core is not seen by Solr…</li>
|
||
<li>I got the core to load by adding it to <code>dspace/solr/solr.xml</code> manually, ie:</li>
|
||
</ul>
|
||
<pre><code> <cores adminPath="/admin/cores">
|
||
...
|
||
<core name="statistics" instanceDir="statistics" />
|
||
<core name="statistics-2019" instanceDir="statistics">
|
||
<property name="dataDir" value="/home/aorth/dspace/solr/statistics-2019/data" />
|
||
</core>
|
||
...
|
||
</cores>
|
||
</code></pre><ul>
|
||
<li>But I don’t like having to do that… why doesn’t it load automatically?</li>
|
||
<li>I sent a mail to the dspace-tech mailing list to ask about it</li>
|
||
<li>Just for fun I tried to load these stats into a Solr 7.7.2 instance using the DSpace 7 solr config:</li>
|
||
<li>First, create a Solr statistics core using the DSpace 7 config:</li>
|
||
</ul>
|
||
<pre><code>$ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
|
||
</code></pre><ul>
|
||
<li>Then try to import the stats, skipping a shitload of fields that are apparently added to our Solr statistics by Atmire modules:</li>
|
||
</ul>
|
||
<pre><code>$ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
|
||
</code></pre><ul>
|
||
<li>OK that imported! I wonder if it works… maybe I’ll try another day</li>
|
||
</ul>
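<ul>
<li>Note to self: after one of these imports the document count in the target core should match what the export reported, which is easy to check with a plain Solr query (standard select syntax, port from my dev environment):</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8080/solr/statistics-2019/select?q=*:*&rows=0&wt=json'
</code></pre>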
|
||
<h2 id="2020-02-07">2020-02-07</h2>
|
||
<ul>
|
||
<li>I did some investigation into DSpace indexing performance using flame graphs
|
||
<ul>
|
||
<li>Excellent introduction: <a href="http://www.brendangregg.com/flamegraphs.html">http://www.brendangregg.com/flamegraphs.html</a></li>
|
||
<li>Using flame graphs with java: <a href="https://netflixtechblog.com/java-in-flames-e763b3d32166">https://netflixtechblog.com/java-in-flames-e763b3d32166</a></li>
|
||
<li>Fantastic wrapper scripts for doing perf on Java processes: <a href="https://github.com/jvm-profiling-tools/perf-map-agent">https://github.com/jvm-profiling-tools/perf-map-agent</a></li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code>$ cd ~/src/git/perf-map-agent
|
||
$ cmake .
|
||
$ make
|
||
$ ./bin/create-links-in ~/.local/bin
|
||
$ export FLAMEGRAPH_DIR=/home/aorth/src/git/FlameGraph
|
||
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
|
||
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
|
||
$ ~/dspace63/bin/dspace index-discovery -b &
|
||
# pid of tomcat java process
|
||
$ perf-java-flames 4478
|
||
# pid of java indexing process
|
||
$ perf-java-flames 11359
|
||
</code></pre><ul>
|
||
<li>All Java processes need to have <code>-XX:+PreserveFramePointer</code> if you want to trace their methods</li>
|
||
<li>I did the same tests against DSpace 5.8 and 6.4-SNAPSHOT’s CLI indexing process and Tomcat process
|
||
<ul>
|
||
<li>For what it’s worth, it appears all the Hibernate stuff is in the CLI processes, so we don’t need to trace the Tomcat process</li>
|
||
</ul>
|
||
</li>
|
||
<li>Here is the flame graph for DSpace 5.8’s <code>dspace index-discovery -b</code> java process:</li>
|
||
</ul>
|
||
<p><img src="/cgspace-notes/2020/02/flamegraph-java-cli-dspace58.svg" alt="DSpace 5.8 index-discovery flame graph"></p>
|
||
<ul>
|
||
<li>Here is the flame graph for DSpace 6.4-SNAPSHOT’s <code>dspace index-discovery -b</code> java process:</li>
|
||
</ul>
|
||
<p><img src="/cgspace-notes/2020/02/flamegraph-java-cli-dspace64-snapshot.svg" alt="DSpace 6.4-SNAPSHOT index-discovery flame graph"></p>
|
||
<ul>
|
||
<li>If the width of the stacks indicates time, then it’s clear that Hibernate takes longer…</li>
|
||
<li>Apparently there is a “flame diff” tool, I wonder if we can use that to compare!</li>
|
||
</ul>
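<ul>
<li>I haven’t tried the flame diff yet, but presumably it would look something like this with difffolded.pl from the FlameGraph repository (the folded stack file names here are hypothetical):</li>
</ul>
<pre><code>$ cd ~/src/git/FlameGraph
$ ./difffolded.pl out.dspace58.folded out.dspace64.folded | ./flamegraph.pl > flamegraph-diff.svg
</code></pre>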
|
||
<h2 id="2020-02-09">2020-02-09</h2>
|
||
<ul>
|
||
<li>This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:</li>
|
||
</ul>
|
||
<pre><code># CGSpace 5.8
|
||
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 385.72s user 131.16s system 19% cpu 43:21.18 total
|
||
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 382.95s user 127.31s system 20% cpu 42:10.07 total
|
||
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 368.56s user 143.97s system 20% cpu 42:22.66 total
|
||
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 360.09s user 104.03s system 19% cpu 39:24.41 total
|
||
|
||
# Vanilla DSpace 5.10
|
||
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 236.19s user 59.70s system 3% cpu 2:03:31.14 total
|
||
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 232.41s user 50.38s system 3% cpu 2:04:16.00 total
|
||
|
||
# Vanilla DSpace 6.4-SNAPSHOT
|
||
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:36:53.98 total
|
||
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:21:0.0 total
|
||
</code></pre><ul>
|
||
<li>I generated better flame graphs for the DSpace indexing process by using <code>perf-record-stack</code> and filtering out the java process:</li>
|
||
</ul>
|
||
<pre><code>$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
|
||
$ export PERF_RECORD_SECONDS=60
|
||
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
|
||
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &
|
||
# process id of java indexing process (not Tomcat)
|
||
$ perf-java-record-stack 169639
|
||
$ sudo perf script -i /tmp/perf-169639.data > out.dspace510-1
|
||
$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl --color=java --hash > out.dspace510-1.svg
|
||
</code></pre><ul>
|
||
<li>All data recorded on my laptop with the same kernel, same boot, etc.</li>
|
||
<li>CGSpace 5.8 (with Atmire patches):</li>
|
||
</ul>
|
||
<p><img src="/cgspace-notes/2020/02/out.dspace58-2.svg" alt="DSpace 5.8 (with Atmire modules) index-discovery flame graph"></p>
|
||
<ul>
|
||
<li>Vanilla DSpace 5.10:</li>
|
||
</ul>
|
||
<p><img src="/cgspace-notes/2020/02/out.dspace510-3.svg" alt="Vanilla DSpace 5.10 index-discovery flame graph"></p>
|
||
<ul>
|
||
<li>Vanilla DSpace 6.4-SNAPSHOT:</li>
|
||
</ul>
|
||
<p><img src="/cgspace-notes/2020/02/out.dspace64-3.svg" alt="Vanilla DSpace 6.4-SNAPSHOT index-discovery flame graph"></p>
|
||
<ul>
|
||
<li>I sent my feedback to the dspace-tech mailing list so someone can hopefully comment.</li>
|
||
<li>Last week Peter asked Sisay to upload some items to CGSpace in the GENNOVATE collection (part of Gender CRP)
|
||
<ul>
|
||
<li>He uploaded them here: <a href="https://cgspace.cgiar.org/handle/10568/105926">https://cgspace.cgiar.org/handle/10568/105926</a></li>
|
||
<li>On a whim I checked and found five duplicates there, which means Sisay didn’t even check</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
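<ul>
<li>For next time, a rough way to check a collection for duplicate titles is to export its metadata and look for repeats (the exact title column name depends on the language qualifiers in the export, and csvcut is from csvkit):</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace metadata-export -i 10568/105926 -f /tmp/gennovate.csv
$ csvcut -c 'dc.title[en_US]' /tmp/gennovate.csv | sort | uniq -d
</code></pre>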
|
||
<h2 id="2020-02-10">2020-02-10</h2>
|
||
<ul>
|
||
<li>Follow up with <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">Atmire about DSpace 6.x upgrade</a>
|
||
<ul>
|
||
<li>I raised the issue of targeting 6.4-SNAPSHOT as well as the Discovery indexing performance issues in 6.x</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<h2 id="2020-02-11">2020-02-11</h2>
|
||
<ul>
|
||
<li>Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:</li>
|
||
</ul>
|
||
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-02-11-combined-orcids.txt
|
||
$ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
|
||
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
|
||
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
|
||
</code></pre><ul>
|
||
<li>Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using <code>fix-metadata-values.py</code>:</li>
|
||
</ul>
|
||
<pre><code>$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
|
||
</code></pre><ul>
|
||
<li>On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
|
||
<ul>
|
||
<li>I checked the database for likely matches to the author names (see the example query at the end of this section) and then created a CSV with the author names and ORCID iDs:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code>dc.contributor.author,cg.creator.id
|
||
"Staver, Charles",charles staver: 0000-0002-4532-6077
|
||
"Staver, C.",charles staver: 0000-0002-4532-6077
|
||
"Fungo, R.",Robert Fungo: 0000-0002-4264-6905
|
||
"Remans, R.",Roseline Remans: 0000-0003-3659-8529
|
||
"Remans, Roseline",Roseline Remans: 0000-0003-3659-8529
|
||
"Rietveld A.",Anne Rietveld: 0000-0002-9400-9473
|
||
"Rietveld, A.",Anne Rietveld: 0000-0002-9400-9473
|
||
"Rietveld, A.M.",Anne Rietveld: 0000-0002-9400-9473
|
||
"Rietveld, Anne M.",Anne Rietveld: 0000-0002-9400-9473
|
||
"Fongar, A.",Andrea Fongar: 0000-0003-2084-1571
|
||
"Müller, Anna",Anna Müller: 0000-0003-3120-8560
|
||
"Müller, A.",Anna Müller: 0000-0003-3120-8560
|
||
</code></pre><ul>
|
||
<li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 144 ORCID iDs to items on CGSpace!</li>
|
||
</ul>
|
||
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
|
||
</code></pre><ul>
|
||
<li>Minor updates to all Python utility scripts in the CGSpace git repository</li>
|
||
<li>Update the spider agent patterns in CGSpace <code>5_x-prod</code> branch from the latest <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> project
|
||
<ul>
|
||
<li>I ran the <code>check-spider-hits.sh</code> script with the updated file and purged 6,000 hits from our Solr statistics core on CGSpace</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
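<ul>
<li>The database check mentioned above was nothing fancy, just looking for likely name variants for each author, something like this (one example pattern shown, run against a local copy of the CGSpace database):</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE 'Rietveld%';
</code></pre>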
|
||
<h2 id="2020-02-12">2020-02-12</h2>
|
||
<ul>
|
||
<li>Follow up with people about AReS funding for next phase</li>
|
||
<li>Peter asked about the “stats” and “summary” reports that he had requested in December
|
||
<ul>
|
||
<li>I opened a <a href="https://github.com/ilri/AReS/issues/13">new issue on AReS for the “summary” report</a></li>
|
||
</ul>
|
||
</li>
|
||
<li>Peter asked me to update John McIntire’s name format on CGSpace so I ran the following PostgreSQL query:</li>
|
||
</ul>
|
||
<pre><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
|
||
UPDATE 26
|
||
</code></pre><h2 id="2020-02-17">2020-02-17</h2>
|
||
<ul>
|
||
<li>A few days ago Atmire responded to my question about DSpace 6.4-SNAPSHOT saying that they can only confirm that 6.3 works with their modules
|
||
<ul>
|
||
<li>I responded to say that we agree to target 6.3, but that I will cherry-pick important patches from the <code>dspace-6_x</code> branch at our own responsibility</li>
|
||
</ul>
|
||
</li>
|
||
<li>Send a message to dspace-devel asking them to tag DSpace 6.4</li>
|
||
<li>Udana from IWMI asked about the OAI base URL for their community on CGSpace
|
||
<ul>
|
||
<li>I think it should be this: <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814">https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814</a></li>
|
||
</ul>
|
||
</li>
|
||
</ul>
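<ul>
<li>A quick sanity check that the set actually returns records (this only fetches the first page, since OAI-PMH paginates with resumption tokens):</li>
</ul>
<pre><code>$ curl -s 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814' | grep -c '<record>'
</code></pre>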
|
||
<h2 id="2020-02-19">2020-02-19</h2>
|
||
<ul>
|
||
<li>I noticed a thread on the mailing list about the Tomcat header size and Solr max boolean clauses error
|
||
<ul>
|
||
<li>The solution is to do as we have done and increase the headers / boolean clauses, or to simply <a href="https://wiki.lyrasis.org/display/DSPACE/TechnicalFaq#TechnicalFAQ-I'mgetting%22SolrException:BadRequest%22followedbyalongqueryora%22tooManyClauses%22Exception">disable access rights awareness</a> in Discovery</li>
|
||
<li>I applied the fix to the <code>5_x-prod</code> branch and cherry-picked it to <code>6_x-dev</code></li>
|
||
</ul>
|
||
</li>
|
||
<li>Upgrade Tomcat from 7.0.99 to 7.0.100 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
|
||
<li>Upgrade PostgreSQL JDBC driver from 42.2.9 to 42.2.10 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
|
||
<li>Run Tomcat and PostgreSQL JDBC driver updates on DSpace Test (linode19)</li>
|
||
</ul>
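<ul>
<li>For reference, these are the two knobs in question from that thread, which can be checked like this (paths depend on where the Solr core configs and Tomcat’s server.xml live on the host):</li>
</ul>
<pre><code>$ grep maxBooleanClauses ~/dspace/solr/search/conf/solrconfig.xml
# grep maxHttpHeaderSize /etc/tomcat7/server.xml
</code></pre>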
|
||
<h2 id="2020-02-23">2020-02-23</h2>
|
||
<ul>
|
||
<li>I see a new spider in the nginx logs on CGSpace:</li>
|
||
</ul>
|
||
<pre><code>Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
|
||
</code></pre><ul>
|
||
<li>I think this should be covered by the <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> patterns for the statistics at least…</li>
|
||
<li>I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:</li>
|
||
</ul>
|
||
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
|
||
</code></pre><ul>
|
||
<li>Another IP address (31.6.77.23) in the UK making a few hundred requests without a user agent</li>
|
||
<li>I will add the IP addresses to the nginx badbots list</li>
|
||
<li>31.6.77.23 is in the UK and judging by its DNS it belongs to a <a href="https://www.bronco.co.uk/">web marketing company called Bronco</a>
|
||
<ul>
|
||
<li>I looked for its DNS entry in Solr statistics and found a few hundred thousand over the years:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=dns:/squeeze3.bronco.co.uk./&rows=0"
|
||
<?xml version="1.0" encoding="UTF-8"?>
|
||
<response>
|
||
<lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">dns:/squeeze3.bronco.co.uk./</str><str name="rows">0</str></lst></lst><result name="response" numFound="86044" start="0"></result>
|
||
</response>
|
||
</code></pre><ul>
|
||
<li>The totals in each core are:
|
||
<ul>
|
||
<li>statistics: 86044</li>
|
||
<li>statistics-2018: 65144</li>
|
||
<li>statistics-2017: 79405</li>
|
||
<li>statistics-2016: 121316</li>
|
||
<li>statistics-2015: 30720</li>
|
||
<li>statistics-2014: 4524</li>
|
||
<li>… so about 387,000 hits!</li>
|
||
</ul>
|
||
</li>
|
||
<li>I will purge them from each core one by one, ie:</li>
|
||
</ul>
|
||
<pre><code>$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
|
||
$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
|
||
</code></pre><ul>
|
||
<li>Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)</li>
|
||
<li>Deploy latest <code>5_x-prod</code> branch on CGSpace (linode18)</li>
|
||
<li>Run all system updates on CGSpace (linode18) server and reboot it
|
||
<ul>
|
||
<li>After the server came back up Tomcat started, but there were errors loading some Solr statistics cores</li>
|
||
<li>Luckily after restarting Tomcat once more they all came back up</li>
|
||
</ul>
|
||
</li>
|
||
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
|
||
</ul>
|
||
<pre><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
|
||
Detail: Key (bitstream_id)=(183996) is still referenced from table "bundle".
|
||
</code></pre><ul>
|
||
<li>The solution is, as always:</li>
|
||
</ul>
|
||
<pre><code># su - postgres
|
||
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
|
||
UPDATE 1
|
||
</code></pre><ul>
|
||
<li>Add one more new Bioversity ORCID iD to the controlled vocabulary on CGSpace</li>
|
||
<li>Felix Shaw from Earlham emailed me to ask about his admin account on DSpace Test
|
||
<ul>
|
||
<li>His old one got lost when I re-sync’d DSpace Test with CGSpace a few weeks ago</li>
|
||
<li>I added a new account for him and added it to the Administrators group:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
|
||
</code></pre><ul>
|
||
<li>For some reason the Atmire Content and Usage Analysis (CUA) module’s Usage Statistics is drawing blank graphs
|
||
<ul>
|
||
<li>I looked in the dspace.log and see:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code>2020-02-23 11:28:13,696 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
|
||
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoClassDefFoundError: Could not
|
||
initialize class org.jfree.chart.JFreeChart
|
||
</code></pre><ul>
|
||
<li>The same error happens on DSpace Test, but graphs are working on my local instance
|
||
<ul>
|
||
<li>The only thing I’ve changed recently is the Tomcat version, but it’s working locally…</li>
|
||
<li>I see the following file on my local instance, CGSpace, and DSpace Test: <code>dspace/webapps/xmlui/WEB-INF/lib/jfreechart-1.0.5.jar</code></li>
|
||
<li>I deployed Tomcat 7.0.99 on DSpace Test but the JFreeChart class still can’t be found…</li>
|
||
<li>So it must be something with the library search path…</li>
|
||
<li>Strangely, it works with Tomcat 7.0.100 on my local machine</li>
|
||
</ul>
|
||
</li>
|
||
<li>I copied the <code>jfreechart-1.0.5.jar</code> file to the Tomcat lib folder and then there was a different error when I loaded Atmire CUA:</li>
|
||
</ul>
|
||
<pre><code>2020-02-23 16:25:10,841 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request! org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
|
||
</code></pre><ul>
|
||
<li>Some search results suggested commenting out the following line in <code>/etc/java-8-openjdk/accessibility.properties</code>:</li>
|
||
</ul>
|
||
<pre><code>assistive_technologies=org.GNOME.Accessibility.AtkWrapper
|
||
</code></pre><ul>
|
||
<li>After doing that, removing the extra jfreechart library, and restarting Tomcat, I was able to load the usage statistics graph on DSpace Test…
|
||
<ul>
|
||
<li>Hmm, actually I think this is a Java bug, perhaps introduced or at <a href="https://bugs.openjdk.java.net/browse/JDK-8204862">least present in 18.04</a>, with lots of <a href="https://code-maven.com/slides/jenkins-intro/no-graph-error">references</a> to it <a href="https://issues.jenkins-ci.org/browse/JENKINS-39636">happening in other</a> configurations like Debian 9 with Jenkins, etc…</li>
|
||
<li>Apparently if you use the <em>non-headless</em> version of openjdk this doesn’t happen… but that pulls in X11 stuff so no thanks</li>
|
||
<li>Also, I see dozens of occurrences of this going back over one month (we have logs for about that period):</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre><code># grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
|
||
dspace.log.2020-01-12:4
|
||
dspace.log.2020-01-13:66
|
||
dspace.log.2020-01-14:4
|
||
dspace.log.2020-01-15:36
|
||
dspace.log.2020-01-16:88
|
||
dspace.log.2020-01-17:4
|
||
dspace.log.2020-01-18:4
|
||
dspace.log.2020-01-19:4
|
||
dspace.log.2020-01-20:4
|
||
dspace.log.2020-01-21:4
|
||
...
|
||
</code></pre><ul>
|
||
<li>I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics…</li>
|
||
<li>On an unrelated note there is something weird going on in that I see millions of hits from IP 34.218.226.147 in Solr statistics, but if I remember correctly that IP belongs to CodeObia’s AReS explorer, which should only be using REST and therefore generate no Solr statistics…?</li>
|
||
</ul>
|
||
<pre><code>$ curl -s "http://localhost:8081/solr/statistics-2018/select" -d "q=ip:34.218.226.147&rows=0"
|
||
<?xml version="1.0" encoding="UTF-8"?>
|
||
<response>
|
||
<lst name="responseHeader"><int name="status">0</int><int name="QTime">811</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="rows">0</str></lst></lst><result name="response" numFound="5536097" start="0"></result>
|
||
</response>
|
||
</code></pre><ul>
|
||
<li>And there are apparently two million from last month (2020-01):</li>
|
||
</ul>
|
||
<pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=ip:34.218.226.147&fq=dateYearMonth:2020-01&rows=0"
|
||
<?xml version="1.0" encoding="UTF-8"?>
|
||
<response>
|
||
<lst name="responseHeader"><int name="status">0</int><int name="QTime">248</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="fq">dateYearMonth:2020-01</str><str name="rows">0</str></lst></lst><result name="response" numFound="2173455" start="0"></result>
|
||
</response>
|
||
</code></pre><ul>
|
||
<li>But when I look at the nginx access logs for the past month or so I only see 84,000, all of which are on <code>/rest</code> and none of which are to XMLUI:</li>
|
||
</ul>
|
||
<pre><code># zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
|
||
84322
|
||
# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
|
||
84322
|
||
</code></pre><ul>
|
||
<li>Either the requests didn’t get logged, or there is some mixup with the Solr documents (fuck!)
|
||
<ul>
|
||
<li>On second inspection, I <em>do</em> see lots of notes here about 34.218.226.147, including 150,000 on one day in October, 2018 alone…</li>
|
||
</ul>
|
||
</li>
|
||
<li>To make matters worse, I see hits from REST in the regular nginx access log!
|
||
<ul>
|
||
<li>I did a few tests and I can’t figure out why, but it seems that hits appear in either one log or the other (not both)</li>
|
||
<li>Also, I see <em>zero</em> hits to <code>/rest</code> in the access.log on DSpace Test (linode19)</li>
|
||
</ul>
|
||
</li>
|
||
<li>Anyways, I faceted by IP in 2020-01 and see:</li>
|
||
</ul>
|
||
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-01&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
|
||
...
|
||
"172.104.229.92",2686876,
|
||
"34.218.226.147",2173455,
|
||
"163.172.70.248",80945,
|
||
"163.172.71.24",55211,
|
||
"163.172.68.99",38427,
|
||
</code></pre><ul>
|
||
<li>Surprise surprise, the top two IPs are from AReS servers… wtf.</li>
|
||
<li>The next three are from Online in France and they are all using this weird user agent and making tens of thousands of requests to Discovery:</li>
|
||
</ul>
|
||
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
|
||
</code></pre><ul>
|
||
<li>And all the same three are already inflating the statistics for 2020-02… hmmm.</li>
|
||
<li>I need to see why AReS harvesting is inflating the stats, as it should only be making REST requests…</li>
|
||
<li>Shiiiiit, I see 84,000 requests from the AReS IP today alone:</li>
|
||
</ul>
|
||
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&rows=0&wt=json&indent=true'
|
||
...
|
||
"response":{"numFound":84594,"start":0,"docs":[]
|
||
</code></pre><ul>
|
||
<li>Fuck! And of course the ILRI websites doing their daily REST harvesting are causing issues too, from today alone:</li>
|
||
</ul>
|
||
<pre><code> "2a01:7e00::f03c:91ff:fe9a:3a37",35512,
|
||
"2a01:7e00::f03c:91ff:fe18:7396",26155,
|
||
</code></pre><ul>
|
||
<li>I need to try to make some requests for these URLs and observe if they make a statistics hit:
|
||
<ul>
|
||
<li><code>/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450</code></li>
|
||
<li><code>/rest/handle/10568/28702?expand=all</code></li>
|
||
</ul>
|
||
</li>
|
||
<li>Those are the requests AReS and ILRI servers are making… nearly 150,000 per day!</li>
|
||
</ul>
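<ul>
<li>A sketch of how I plan to test that: make one of those REST requests against DSpace Test and then look for a freshly recorded document in the statistics core (the IP is a placeholder to be replaced with the requesting host, Solr port as on our servers):</li>
</ul>
<pre><code>$ curl -s -o /dev/null 'https://dspacetest.cgiar.org/rest/handle/10568/28702?expand=all'
$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=ip:REQUESTING.IP.HERE&rows=1&sort=time+desc'
</code></pre>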
|
||
<!-- raw HTML omitted -->
|
||
|
||
|
||
|
||
|
||
|
||
</article>
|
||
|
||
|
||
|
||
</div> <!-- /.blog-main -->
|
||
|
||
<aside class="col-sm-3 ml-auto blog-sidebar">
|
||
|
||
|
||
|
||
<section class="sidebar-module">
|
||
<h4>Recent Posts</h4>
|
||
<ol class="list-unstyled">
|
||
|
||
|
||
<li><a href="/cgspace-notes/2020-02/">February, 2020</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2020-01/">January, 2020</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2019-12/">December, 2019</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2019-11/">November, 2019</a></li>
|
||
|
||
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
|
||
|
||
</ol>
|
||
</section>
|
||
|
||
|
||
|
||
|
||
<section class="sidebar-module">
|
||
<h4>Links</h4>
|
||
<ol class="list-unstyled">
|
||
|
||
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
||
|
||
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
||
|
||
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
||
|
||
</ol>
|
||
</section>
|
||
|
||
</aside>
|
||
|
||
|
||
</div> <!-- /.row -->
|
||
</div> <!-- /.container -->
|
||
|
||
|
||
|
||
<footer class="blog-footer">
|
||
<p dir="auto">
|
||
|
||
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
||
|
||
</p>
|
||
<p>
|
||
<a href="#">Back to top</a>
|
||
</p>
|
||
</footer>
|
||
|
||
|
||
</body>
|
||
|
||
</html>
|