<li>Add tests for the new <code>/items</code> POST handlers to the DSpace 6.x branch of my <a href="https://github.com/ilri/dspace-statistics-api/tree/v6_x">dspace-statistics-api</a>
<ul>
<li>It took a bit of extra work because I had to learn how to mock the responses for when Solr is not available</li>
<li>Tag and release version 1.3.0 on GitHub: <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0">https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0</a></li>
</ul>
</li>
</ul>
<pre><code>2020-10-06 21:36:04,143 ERROR org.dspace.storage.rdbms.DatabaseRegistryUpdater @ Error attempting to update Bitstream Format and/or Metadata Registries
org.hibernate.exception.ConstraintViolationException: could not execute batch
at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:129)
at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:49)
at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:124)
at org.hibernate.engine.jdbc.batch.internal.BatchingBatch.performExecution(BatchingBatch.java:122)
at org.hibernate.engine.jdbc.batch.internal.BatchingBatch.doExecuteBatch(BatchingBatch.java:101)
at org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl.execute(AbstractBatchImpl.java:161)
at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.executeBatch(JdbcCoordinatorImpl.java:207)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:390)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:304)
at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:349)
at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:56)
at org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.hibernate.context.internal.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:352)
at com.sun.proxy.$Proxy162.flush(Unknown Source)
at org.dspace.core.HibernateDBConnection.commit(HibernateDBConnection.java:83)
at org.dspace.core.Context.commit(Context.java:435)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.administer.MetadataImporter.loadRegistry(MetadataImporter.java:164)
at org.dspace.storage.rdbms.DatabaseRegistryUpdater.updateRegistries(DatabaseRegistryUpdater.java:72)
at org.dspace.storage.rdbms.DatabaseRegistryUpdater.afterMigrate(DatabaseRegistryUpdater.java:121)
at org.flywaydb.core.internal.command.DbMigrate$3.doInTransaction(DbMigrate.java:250)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:246)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:663)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:575)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:551)
at org.dspace.core.Context.&lt;clinit&gt;(Context.java:103)
at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5197)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5720)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:1016)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:992)
</code></pre><ul>
<li>I checked the database migrations with <code>dspace database info</code> and they were all OK
<ul>
<li>Then I restarted Tomcat again and it started up OK…</li>
</ul>
</li>
<li>There were two issues I had reported to Atmire last month:
<ul>
<li>Importing items from the command line throws a <code>NullPointerException</code> from <code>com.atmire.dspace.cua.CUASolrLoggerServiceImpl</code> for every item, but the item still gets imported</li>
<li>No results for author name in Listings and Reports, despite there being hits in Discovery search</li>
</ul>
</li>
<li>To test the first one I imported a very simple CSV file with one item with minimal data
<ul>
<li>There is a new error now (but the item does get imported):</li>
</ul>
</li>
</ul>
<pre><code>New item: aff5e78d-87c9-438d-94f8-1050b649961c (10568/108548)
+ New owning collection (10568/3): ILRI articles in journals
+ Added (dc.contributor.author): Orth, Alan
+ Added (dc.date.issued): 2020-09-01
+ Added (dc.title): Testing CUA import NPE
Tue Oct 06 22:06:14 CEST 2020 | Query:containerItem:aff5e78d-87c9-438d-94f8-1050b649961c
Error while updating
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. &lt;!doctype html&gt;&lt;html lang="en"&gt;&lt;head&gt;&lt;title&gt;HTTP Status 404 – Not Found&lt;/title&gt;&lt;style type="text/css"&gt;body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}&lt;/style&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;HTTP Status 404 – Not Found&lt;/h1&gt;&lt;hr class="line" /&gt;&lt;p&gt;&lt;b&gt;Type&lt;/b&gt; Status Report&lt;/p&gt;&lt;p&gt;&lt;b&gt;Message&lt;/b&gt; The requested resource [/solr/update] is not available&lt;/p&gt;&lt;p&gt;&lt;b&gt;Description&lt;/b&gt; The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.&lt;/p&gt;&lt;hr class="line" /&gt;&lt;h3&gt;Apache Tomcat/7.0.104&lt;/h3&gt;&lt;/body&gt;&lt;/html&gt;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
<li>Also, I tested Listings and Reports and there are still no hits for “Orth, Alan” as a contributor, despite there being dozens of items in the repository and the Solr query generated by Listings and Reports actually returning hits:</li>
<li>What is unclear to me is the <code>archived</code> parameter: it seems to do nothing… perhaps it is only used for the <code>/items</code> endpoint when printing information about an item
<ul>
<li>If I submit to a collection that has a workflow, even as a super admin and with “archived=false” in the JSON, the item enters the workflow (“Awaiting editor’s attention”)</li>
<li>If I submit to a new collection without a workflow the item gets archived immediately</li>
<li>I created <a href="https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a">some notes</a> to share with Salem and Leroy for future reference when we start discussing POSTing items to the REST API</li>
</ul>
</li>
<li>I created an account for Salem on DSpace Test and added it to the submitters group of an ICARDA collection with no other workflow steps so we can see what happens
<ul>
<li>We are curious to see if he gets a UUID when posting from MEL</li>
<li>I did some testing of the DSpace 5 REST API because Salem and I were curious
<ul>
<li>The authentication is a little different (uses a serialized JSON object instead of a form and the token is an HTTP header instead of a cookie):</li>
</ul>
</li>
</ul>
<pre><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
</code></pre><ul>
<li>We discussed removing Atmire Listings and Reports from DSpace 6 because we can probably make the same reports in AReS and this module is the one that is currently holding us back from the upgrade</li>
<li>We discussed allowing partners to submit content via the REST API and perhaps making it an extra fee due to the burden it incurs with unfinished submissions, manual duplicate checking, developer support, etc</li>
<li>He was excited about the possibility of using my statistics API for more things on AReS as well as item view pages</li>
</ul>
</li>
<li>Also I fixed a bunch of the CRP mappings in the AReS value mapper and started a fresh re-indexing</li>
</ul>
<h2 id="2020-10-12">2020-10-12</h2>
<ul>
<li>Looking at CGSpace’s Solr statistics for 2020-09 and I see:
<ul>
<li><code>RTB website BOT</code>: 212916</li>
<li><code>Java/1.8.0_66</code>: 3122</li>
<li><code>Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1</code>: 614</li>
<li>After a few minutes I saw these four hits in Solr… WTF
<ul>
<li>So is there some issue with DSpace’s parsing of the spider agent files?</li>
<li>I added <code>RTB website BOT</code> to the ilri pattern file, restarted Tomcat, and made four more requests to the bitstream</li>
<li>These four requests were recorded in Solr too, WTF!</li>
<li>It seems like the patterns aren’t working at all…</li>
<li>I decided to try something drastic and removed all pattern files, adding only one single pattern <code>bot</code> to make sure this is not because of a syntax or precedence issue</li>
<li>Now even those four requests were recorded in Solr, WTF!</li>
<li>I will try one last thing, to put a single entry with the exact pattern <code>RTB website BOT</code> in a single spider agents pattern file…</li>
<li>Nope! Still records the hits… WTF</li>
<li>As a last resort I tried to use the vanilla <a href="https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/config/spiders/agents/example">DSpace 6 <code>example</code> file</a></li>
<li>And the hits still get recorded… WTF</li>
<li>So now I’m wondering if this is because of our custom Atmire shit?</li>
<li>I will have to test on a vanilla DSpace instance I guess before I can complain to the dspace-tech mailing list</li>
</ul>
</li>
<li>I refactored the <code>check-spider-hits.sh</code> script to read patterns from a text file rather than sed’s stdout, and to properly search for spaces in patterns that use <code>\s</code> because Lucene’s search syntax doesn’t support it (and spaces work just fine)</li>
<li>I added <code>[Ss]pider</code> to the Tomcat Crawler Sessions Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID</li>
<li>I added a few of the patterns from above to our local agents list and ran the <code>check-spider-hits.sh</code> on CGSpace:</li>
<li>We decided to use Title Case for our countries on CGSpace to minimize the need for mapping on AReS</li>
<li>We did some work to add a dozen more mappings for strange and incorrect CRPs on AReS</li>
</ul>
</li>
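<li>A quick way to sanity-check a spider agent pattern outside of DSpace is to test it against the recorded user agents with Python’s <code>re</code> module; this is only a sketch (the <code>is_spider</code> helper is hypothetical and DSpace’s own Java matching may behave differently), and it does not explain why the exact pattern also failed, but it shows that a lowercase <code>bot</code> pattern only matches <code>RTB website BOT</code> when case is ignored:</li>
</ul>
<pre><code>import re

# User agents taken from the Solr statistics listed above.
AGENTS = ["RTB website BOT", "Java/1.8.0_66"]


def is_spider(agent, patterns, ignore_case=False):
    """Return True if any pattern is found anywhere in the agent string."""
    flags = re.IGNORECASE if ignore_case else 0
    return any(re.search(pattern, agent, flags) for pattern in patterns)


for agent in AGENTS:
    # Print the case-sensitive result first, then the case-insensitive one.
    print(agent, is_spider(agent, ["bot"]), is_spider(agent, ["bot"], ignore_case=True))
</code></pre><ul>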
<li>I can update the country metadata in PostgreSQL like this:</li>
</ul>
<pre><code>dspace=> BEGIN;
dspace=> UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
UPDATE 51756
dspace=> COMMIT;
</code></pre><ul>
<li>I will need to pay special attention to Côte d’Ivoire, Bosnia and Herzegovina, and a few others though… it may be better to do a search and replace using <code>fix-metadata-values.csv</code>
<ul>
<li>Export a list of distinct values from the database:</li>
</ul>
</li>
</ul>
<pre><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: <code>value.toTitlecase()</code>
<ul>
<li>I still had to double check everything to catch some corner cases (Andorra, Timor-leste, etc)</li>
</ul>
</li>
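<li>Corner cases like these could also be handled in a script; a minimal Python sketch, assuming the stored values are upper case (the <code>title_case_country</code> helper, the set of lowercase joining words, and the override table are all hypothetical, and the output would still need a manual check):</li>
</ul>
<pre><code># Joining words that stay lowercase unless they start the name (assumed list).
LOWERCASE_WORDS = {"and", "of", "the"}

# Names that simple rules get wrong, keyed by their lowercase form (assumed).
OVERRIDES = {"côte d’ivoire": "Côte d’Ivoire"}


def title_case_country(name):
    """Title-case a country name, honoring the overrides and joining words."""
    lowered = name.lower()
    if lowered in OVERRIDES:
        return OVERRIDES[lowered]
    words = []
    for index, word in enumerate(lowered.split()):
        if index != 0 and word in LOWERCASE_WORDS:
            words.append(word)
        else:
            # Capitalize each hyphen-separated part, e.g. "timor-leste".
            words.append("-".join(part.capitalize() for part in word.split("-")))
    return " ".join(words)


for raw in ("BOSNIA AND HERZEGOVINA", "TIMOR-LESTE", "CÔTE D’IVOIRE", "KENYA"):
    print(raw, "becomes", title_case_country(raw))
</code></pre><ul>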
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka “lookaround” in PCRE?) to match words that are <em>not</em> “pair”, “displayed”, etc because we don’t want to edit the XML tags themselves…
<ul>
<li>I had to fix a few manually after doing this, as above with PostgreSQL</li>
</ul>
</li>
</ul>
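<ul>
<li>The negative lookahead idea can also be sketched with Python’s <code>re</code> module; this is not the vim command used for the input forms, the list of protected words is an assumption, and the tag names appear without angle brackets to keep the example short:</li>
</ul>
<pre><code>import re

# Words that belong to the XML tags and must not be changed (assumed list).
PROTECTED = ("pair", "displayed", "stored", "value")

# The negative lookahead (?!...) rejects a word if it is one of the protected words.
WORD = re.compile(r"\b(?!(?:" + "|".join(PROTECTED) + r")\b)[A-Za-z]+")


def title_case_unprotected(text):
    """Capitalize every word that the negative lookahead lets through."""
    return WORD.sub(lambda match: match.group(0).capitalize(), text)


print(title_case_unprotected("pair displayed-value SOUTH AFRICA stored-value KENYA"))
</code></pre>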
<h2 id="2020-10-14">2020-10-14</h2>
<ul>
<li>I discussed the title casing of countries with Abenet and she suggested we also apply title casing to regions
<ul>
<li>I exported the list of regions from the database:</li>
</ul>
</li>
</ul>
<pre><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I did the same as for the countries: OpenRefine for the database values and vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
<li>I added a dozen or so more mappings to fix some country outliers on AReS
<ul>
<li>I will start a fresh harvest there once the Discovery update is done on CGSpace</li>
</ul>
</li>
<li>I also adjusted my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts to work on DSpace 6 where there is no more <code>resource_type_id</code> field
<ul>
<li>I will need to do it on a few more scripts as well, but I’ll do that after we migrate to DSpace 6 because those scripts are less important</li>
</ul>
</li>
<li>I found a new setting in DSpace 6’s <code>usage-statistics.cfg</code> about case insensitive matching of bots that defaults to false, so I enabled it in our DSpace 6 branch
<ul>
<li>I am curious to see if that resolves the strange issue I noticed yesterday where the patterns in the spider agents file were not matching bots at all