<li>I tried importing a PostgreSQL snapshot from CGSpace and had errors due to missing Atmire database migrations
<ul>
<li>If I try to run <code>dspace database migrate</code> I get the IDs of the migrations that are missing</li>
<li>I delete them manually in psql:</li>
</ul>
</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
</code></pre><ul>
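<li>To double-check which migrations the imported snapshot has recorded I can list Flyway’s <code>schema_version</code> table directly (a quick sketch, assuming Flyway’s usual metadata columns):</li>
</ul>
<pre><code>dspace63=# SELECT version, description, success FROM schema_version ORDER BY installed_rank;
</code></pre><ul>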
<li>Then I ran <code>dspace database migrate</code> and got an error:</li>
</ul>
<pre><code>Statement : ALTER TABLE metadatavalue DROP COLUMN IF EXISTS resource_id
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:117)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:71)
at org.flywaydb.core.internal.command.DbMigrate.doMigrate(DbMigrate.java:352)
at org.flywaydb.core.internal.command.DbMigrate.access$1100(DbMigrate.java:47)
at org.flywaydb.core.internal.command.DbMigrate$4.doInTransaction(DbMigrate.java:308)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.applyMigration(DbMigrate.java:305)
at org.flywaydb.core.internal.command.DbMigrate.access$1000(DbMigrate.java:47)
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:230)
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:173)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:173)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:662)
... 8 more
Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
Detail: view eperson_metadata depends on table metadatavalue column resource_id
Hint: Use DROP ... CASCADE to drop the dependent objects too.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.flywaydb.core.internal.dbsupport.JdbcTemplate.executeStatement(JdbcTemplate.java:238)
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:114)
... 24 more
</code></pre><ul>
<li>I think I might need to update the sequences first… nope</li>
<li>Perhaps it’s due to some missing bitstream IDs and I need to run <code>dspace cleanup</code> on CGSpace and take a new PostgreSQL dump… nope</li>
<li>Someone in a thread on the dspace-tech mailing list about this migration noticed that their database had some views that were still using the <code>resource_id</code> column</li>
<li>Our database had the same issue: the <code>eperson_metadata</code> view was created by something (an Atmire module?) but is not referenced anywhere in the vanilla DSpace code, so I dropped it and tried the migration again (a query for finding other such dependent views is sketched below):</li>
</ul>
<pre><code>dspace63=# DROP VIEW eperson_metadata;
DROP VIEW
</code></pre><ul>
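<li>For reference, any other leftover views that still mention <code>resource_id</code> can be found with a query against <code>information_schema</code> (a sketch; in our case dropping <code>eperson_metadata</code> was enough):</li>
</ul>
<pre><code>dspace63=# SELECT table_schema, table_name FROM information_schema.views WHERE view_definition LIKE '%resource_id%';
</code></pre><ul>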
<li>After dropping the view the migration was successful and DSpace starts up and begins indexing
<ul>
<li>xmlui, solr, jspui, rest, and oai are working (rest was redirecting to HTTPS, so I set the Tomcat connector to <code>secure="true"</code> and it fixed it on localhost, but caused other issues so I disabled it for now; a connector sketch is below)</li>
<li><input checked="" disabled="" type="checkbox"> Community and collection pages only show one recent submission (seems that there is only one item in Solr?)</li>
<li><input checked="" disabled="" type="checkbox"> Order of navigation elements in right side bar (“My Account” etc, compare to DSpace Test)</li>
<li><input disabled="" type="checkbox"> Home page trail says “CGSpace Home” instead of “CGSpace Home / Community List” (see DSpace Test)</li>
</ul>
</li>
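<li>For the record, the connector change was roughly this (a sketch using standard Tomcat connector attributes, not our exact <code>server.xml</code>):</li>
</ul>
<pre><code><!-- secure="true" makes Tomcat report the connection as already secure, so the app stops redirecting to HTTPS -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           secure="true" />
</code></pre><ul>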
<li>There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:</li>
<li>If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…</li>
<li>I dropped all the documents in the search core:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
</code></pre><ul>
<li>Still didn’t work, so I’m going to try a clean database import and migration:</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
</code></pre><ul>
<li>I notice that the indexing doesn’t work correctly if I start it manually with <code>dspace index-discovery -b</code> (search.resourceid becomes an integer!)
<ul>
<li>If I induce an indexing by touching <code>dspace/solr/search/conf/reindex.flag</code> the search.resourceid are all UUIDs… (a spot-check query is sketched below)</li>
</ul>
</li>
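<li>A quick way to spot-check the <code>search.resourceid</code> values after an indexing run (assuming the same local Solr search core as above):</li>
</ul>
<pre><code>$ http 'http://localhost:8080/solr/search/select?q=*:*&fl=search.resourceid&rows=5&wt=json'
</code></pre><ul>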
<li>Speaking of database stuff, there was a performance-related update for the <a href="https://github.com/DSpace/DSpace/pull/1791/">indexes that we used in DSpace 5</a>
<ul>
<li>We might want to <a href="https://github.com/DSpace/DSpace/pull/1792">apply it in DSpace 6</a>, as it was never merged to 6.x, but it helped with the performance of <code>/submissions</code> in XMLUI for us in <a href="/cgspace-notes/2018-03/">2018-03</a></li>
</ul>
</li>
<li>The indexing issue I was having yesterday seems to only present itself the first time a new installation is running DSpace 6
<ul>
<li>Once the indexing induced by touching <code>dspace/solr/search/conf/reindex.flag</code> has finished, subsequent manual invocations of <code>dspace index-discovery -b</code> work as expected</li>
<li>Nevertheless, I sent a message to the dspace-tech mailing list describing the issue to see if anyone has any comments</li>
</ul>
</li>
<li>I am seeing that the number of important commits on the unreleased DSpace 6.4 is quite large and it might be better for us to target that version
<ul>
<li>I did a simple test and it’s easy to rebase my current 6.3 branch on top of the upstream <code>dspace-6_x</code> branch:</li>
</ul>
</li>
</ul>
<pre><code>$ git checkout -b 6_x-dev64 6_x-dev
$ git rebase -i upstream/dspace-6_x
</code></pre><ul>
<li>I finally understand why our themes show all the “Browse by” buttons on community and collection pages in DSpace 6.x
<ul>
<li>The code in <code>./dspace-xmlui/src/main/java/org/dspace/app/xmlui/aspect/browseArtifacts/CommunityBrowse.java</code> iterates over all the browse indexes and prints them when it is called</li>
<li>The XMLUI theme code in <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/browse.xsl</code> calls the template because the id of the div matches “aspect.browseArtifacts.CommunityBrowse.list.community-browse”</li>
<li>I checked the DRI of a community page on my local 6.x and DSpace Test 5.x by appending <code>?XML</code> to the URL and I see the ID is missing on DSpace 5.x (a grep sketch for this check is below)</li>
<li>The issue is the same with the ordering of the “My Account” link, but in Navigation.java</li>
<li>I tried modifying <code>preprocess/browse.xsl</code> but it always ends up printing some default list of browse by links…</li>
<li>I’m starting to wonder if Atmire’s modules somehow override this, as I don’t see how <code>CommunityBrowse.java</code> can behave like ours on DSpace 5.x unless they have overridden it (as the open source code is the same in 5.x and 6.x)</li>
<li>At least the “account” link in the sidebar is overridden in our 5.x branch because Atmire copied a modified <code>Navigation.java</code> to the local xmlui modules folder… so that explains that (and it’s easy to replicate in 6.x)</li>
</ul>
</li>
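<li>A quick way to check whether a community page’s DRI contains that div id is to request the <code>?XML</code> view and grep for it (a sketch; the base URL and handle are just examples):</li>
</ul>
<pre><code>$ http 'http://localhost:8080/xmlui/handle/10568/16814?XML' | grep -c 'aspect.browseArtifacts.CommunityBrowse.list.community-browse'
</code></pre><ul>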
<li>Checking out the DSpace 6.x REST API query client
<ul>
<li>There is a <a href="https://terrywbrady.github.io/restReportTutorial/intro">tutorial</a> that explains how it works and I see it is very powerful because you can export a CSV of results in order to fix and re-upload them with batch import!</li>
<li>Custom queries can be added in <code>dspace-rest/src/main/webapp/static/reports/restQueryReport.js</code></li>
</ul>
</li>
<li>I noticed two new bots in the logs, one being the “Jersey/2.6” agent I mention below, and the other with the following user agent:
<ul>
<li><code>magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)</code></li>
</ul>
</li>
<li>I filed an <a href="https://github.com/atmire/COUNTER-Robots/issues/30">issue to add Jersey to the COUNTER-Robots</a> list</li>
<li>Peter noticed that the statlets on community, collection, and item pages aren’t working on CGSpace
<ul>
<li>I thought it might be related to the fact that the yearly sharding didn’t complete successfully this year so the <code>statistics-2019</code> core is empty</li>
<li>I removed the <code>statistics-2019</code> core and had to restart Tomcat like six times before all cores would load properly (ugh!!!!)</li>
<li>After that the statlets were working properly…</li>
</ul>
</li>
<li>Run all system updates on DSpace Test (linode19) and restart it</li>
<li>For the clean database import I applied the same Flyway deletions and dropped the <code>eperson_metadata</code> view again before migrating:</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
</code></pre><ul>
<li>I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my <code>check-spider-hits.sh</code> script:</li>
</ul>
<pre><code>$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
</code></pre><ul>
<li>I noticed another user agent in the logs that we should add to the list:</li>
</ul>
<pre><code>ReactorNetty/0.9.2.RELEASE
</code></pre><ul>
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to work for exporting our 2019 stats from the large statistics core!</li>
<li>Using flame graphs with java: <a href="https://netflixtechblog.com/java-in-flames-e763b3d32166">https://netflixtechblog.com/java-in-flames-e763b3d32166</a></li>
<li>Fantastic wrapper scripts for doing perf on Java processes: <a href="https://github.com/jvm-profiling-tools/perf-map-agent">https://github.com/jvm-profiling-tools/perf-map-agent</a></li>
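<li>The underlying pipeline is the standard perf + FlameGraph one; a rough sketch (the wrapper scripts above automate this, and perf-map-agent is what provides the JIT symbol maps so the Java frames are readable):</li>
</ul>
<pre><code># sample the java process for 60 seconds (PID lookup is just an example)
$ perf record -F 99 -g -p $(pgrep -f java | head -n1) -- sleep 60
# fold the stacks and render the graph with Brendan Gregg's FlameGraph scripts
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > dspace-indexing.svg
</code></pre><ul>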
<li>This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:</li>
</ul>
<pre><code># CGSpace 5.8
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 385.72s user 131.16s system 19% cpu 43:21.18 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 382.95s user 127.31s system 20% cpu 42:10.07 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 368.56s user 143.97s system 20% cpu 42:22.66 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 360.09s user 104.03s system 19% cpu 39:24.41 total
# Vanilla DSpace 5.10
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 236.19s user 59.70s system 3% cpu 2:03:31.14 total
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 232.41s user 50.38s system 3% cpu 2:04:16.00 total
# Vanilla DSpace 6.4-SNAPSHOT
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:36:53.98 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:21:0.0 total
</code></pre><ul>
<li>I generated better flame graphs for the DSpace indexing process by using <code>perf-record-stack</code> and filtering out the java process:</li>
<li>Follow up with <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">Atmire about DSpace 6.x upgrade</a>
<ul>
<li>I raised the issue of targeting 6.4-SNAPSHOT as well as the Discovery indexing performance issues in 6.x</li>
</ul>
</li>
</ul>
<h2 id="2020-02-11">2020-02-11</h2>
<ul>
<li>Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:</li>
<li>Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using <code>fix-metadata-values.py</code>:</li>
<li>Minor updates to all Python utility scripts in the CGSpace git repository</li>
<li>Update the spider agent patterns in CGSpace <code>5_x-prod</code> branch from the latest <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> project
<ul>
<li>I ran the <code>check-spider-hits.sh</code> script with the updated file and purged 6,000 hits from our Solr statistics core on CGSpace</li>
</ul>
</li>
</ul>
<h2 id="2020-02-12">2020-02-12</h2>
<ul>
<li>Follow up with people about AReS funding for next phase</li>
<li>Peter asked about the “stats” and “summary” reports that he had requested in December
<ul>
<li>I opened a <a href="https://github.com/ilri/AReS/issues/13">new issue on AReS for the “summary” report</a></li>
</ul>
</li>
<li>Peter asked me to update John McIntire’s name format on CGSpace so I ran the following PostgreSQL query:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
UPDATE 26
</code></pre><h2 id="2020-02-17">2020-02-17</h2>
<ul>
<li>A few days ago Atmire responded to my question about DSpace 6.4-SNAPSHOT saying that they can only confirm that 6.3 works with their modules
<ul>
<li>I responded to say that we agree to target 6.3, but that I will cherry-pick important patches from the <code>dspace-6_x</code> branch at our own responsibility</li>
</ul>
</li>
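<li>In practice cherry-picking those patches is just something like this on our branch (a sketch; the commit hash is a placeholder):</li>
</ul>
<pre><code>$ git checkout 6_x-dev
# pick a specific commit from the upstream dspace-6_x branch (placeholder hash)
$ git cherry-pick 1a2b3c4d
</code></pre><ul>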
<li>Send a message to dspace-devel asking them to tag DSpace 6.4</li>
<li>Udana from IWMI asked about the OAI base URL for their community on CGSpace
<ul>
<li>I think it should be this: <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814">https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814</a></li>
</ul>
</li>
</ul>
<h2 id="2020-02-19">2020-02-19</h2>
<ul>
<li>I noticed a thread on the mailing list about the Tomcat header size and Solr max boolean clauses error
<ul>
<li>The solution is to do as we have done and increase the headers / boolean clauses (the settings involved are sketched below), or to simply <a href="https://wiki.lyrasis.org/display/DSPACE/TechnicalFaq#TechnicalFAQ-I'mgetting%22SolrException:BadRequest%22followedbyalongqueryora%22tooManyClauses%22Exception">disable access rights awareness</a> in Discovery</li>
<li>I applied the fix to the <code>5_x-prod</code> branch and cherry-picked it to <code>6_x-dev</code></li>
</ul>
</li>
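<li>For reference, the settings involved are <code>maxHttpHeaderSize</code> on the Tomcat connector and <code>maxBooleanClauses</code> in Solr’s <code>solrconfig.xml</code> (a sketch with illustrative values, not our actual config):</li>
</ul>
<pre><code><!-- Tomcat server.xml: allow larger request headers -->
<Connector port="8080" protocol="HTTP/1.1" maxHttpHeaderSize="16384" ... />

<!-- Solr solrconfig.xml: allow more boolean clauses per query -->
<maxBooleanClauses>2048</maxBooleanClauses>
</code></pre><ul>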
<li>Upgrade Tomcat from 7.0.99 to 7.0.100 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Upgrade PostgreSQL JDBC driver from 42.2.9 to 42.2.10 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Run Tomcat and PostgreSQL JDBC driver updates on DSpace Test (linode19)</li>
<li>I think this should be covered by the <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> patterns for the statistics at least…</li>
<li>I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
</code></pre><ul>
<li>Another IP address (31.6.77.23) in the UK making a few hundred requests without a user agent</li>
<li>I will add the IP addresses to the nginx badbots list</li>
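<li>One generic way to handle that in nginx is a <code>geo</code> map keyed on the client IP plus a check in the server block (a sketch, not our actual Ansible template):</li>
</ul>
<pre><code># map the offending client IPs to a flag
geo $badbot {
    default           0;
    186.32.217.255    1;
    31.6.77.23        1;
}

# and in the server/location block, refuse them
if ($badbot) {
    return 403;
}
</code></pre><ul>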
<li>After removing the extra jfreechart library and restarting Tomcat I was able to load the usage statistics graph on DSpace Test…
<ul>
<li>Hmm, actually I think this is a Java bug, perhaps introduced in or at <a href="https://bugs.openjdk.java.net/browse/JDK-8204862">least present in 18.04</a>, with lots of <a href="https://code-maven.com/slides/jenkins-intro/no-graph-error">references</a> to it <a href="https://issues.jenkins-ci.org/browse/JENKINS-39636">happening in other</a> configurations like Debian 9 with Jenkins, etc…</li>
<li>Apparently if you use the <em>non-headless</em> version of openjdk this doesn’t happen… but that pulls in X11 stuff so no thanks</li>
<li>Also, I see dozens of occurrences of this going back over one month (we have logs for about that period):</li>
</ul>
</li>
</ul>
<pre><code># grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
dspace.log.2020-01-12:4
dspace.log.2020-01-13:66
dspace.log.2020-01-14:4
dspace.log.2020-01-15:36
dspace.log.2020-01-16:88
dspace.log.2020-01-17:4
dspace.log.2020-01-18:4
dspace.log.2020-01-19:4
dspace.log.2020-01-20:4
dspace.log.2020-01-21:4
...
</code></pre><ul>
<li>I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics…</li>
<li>On an unrelated note there is something weird going on in that I see millions of hits from IP 34.218.226.147 in Solr statistics, but if I remember correctly that IP belongs to CodeObia’s AReS explorer, but it should only be using REST and therefore no Solr statistics…?</li>
<li>But when I look at the nginx access logs for the past month or so I only see 84,000 requests, all of which are on <code>/rest</code> and none of which are to XMLUI:</li>