<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2020" />
<meta property="og:description" content="2020-10-06
Add tests for the new /items POST handlers to the DSpace 6.x branch of my dspace-statistics-api
It took a bit of extra work because I had to learn how to mock the responses for when Solr is not available
Tag and release version 1.3.0 on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0
Trying to test the changes Atmire sent last week but I had to re-create my local database from a recent CGSpace dump
During the FlywayDB migration I got an error:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-10/" />
<meta property="article:published_time" content="2020-10-06T16:55:54+03:00" />
<meta property="article:modified_time" content="2020-10-08T15:54:02+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2020"/>
<meta name="twitter:description" content="2020-10-06
Add tests for the new /items POST handlers to the DSpace 6.x branch of my dspace-statistics-api
It took a bit of extra work because I had to learn how to mock the responses for when Solr is not available
Tag and release version 1.3.0 on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0
Trying to test the changes Atmire sent last week but I had to re-create my local database from a recent CGSpace dump
During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.76.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-10/",
"wordCount": "1895",
"datePublished": "2020-10-06T16:55:54+03:00",
"dateModified": "2020-10-08T15:54:02+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-10/">
<title>October, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-10/">October, 2020</a></h2>
<p class="blog-post-meta"><time datetime="2020-10-06T16:55:54+03:00">Tue Oct 06, 2020</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-10-06">2020-10-06</h2>
<ul>
<li>Add tests for the new <code>/items</code> POST handlers to the DSpace 6.x branch of my <a href="https://github.com/ilri/dspace-statistics-api/tree/v6_x">dspace-statistics-api</a>
<ul>
<li>It took a bit of extra work because I had to learn how to mock the responses for when Solr is not available</li>
<li>Tag and release version 1.3.0 on GitHub: <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0">https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0</a></li>
</ul>
</li>
<li>Trying to test the changes Atmire sent last week but I had to re-create my local database from a recent CGSpace dump
<ul>
<li>During the FlywayDB migration I got an error:</li>
</ul>
</li>
</ul>
<pre><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
Detail: Key (short_description)=(EPUB) already exists.
2020-10-06 21:36:04,142 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [could not execute batch]
2020-10-06 21:36:04,143 ERROR org.dspace.storage.rdbms.DatabaseRegistryUpdater @ Error attempting to update Bitstream Format and/or Metadata Registries
org.hibernate.exception.ConstraintViolationException: could not execute batch
at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:129)
at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:49)
at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:124)
at org.hibernate.engine.jdbc.batch.internal.BatchingBatch.performExecution(BatchingBatch.java:122)
at org.hibernate.engine.jdbc.batch.internal.BatchingBatch.doExecuteBatch(BatchingBatch.java:101)
at org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl.execute(AbstractBatchImpl.java:161)
at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.executeBatch(JdbcCoordinatorImpl.java:207)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:390)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:304)
at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:349)
at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:56)
at org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.hibernate.context.internal.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:352)
at com.sun.proxy.$Proxy162.flush(Unknown Source)
at org.dspace.core.HibernateDBConnection.commit(HibernateDBConnection.java:83)
at org.dspace.core.Context.commit(Context.java:435)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.administer.MetadataImporter.loadRegistry(MetadataImporter.java:164)
at org.dspace.storage.rdbms.DatabaseRegistryUpdater.updateRegistries(DatabaseRegistryUpdater.java:72)
at org.dspace.storage.rdbms.DatabaseRegistryUpdater.afterMigrate(DatabaseRegistryUpdater.java:121)
at org.flywaydb.core.internal.command.DbMigrate$3.doInTransaction(DbMigrate.java:250)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:246)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:663)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:575)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:551)
at org.dspace.core.Context.&lt;clinit&gt;(Context.java:103)
at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5197)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5720)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:1016)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:992)
</code></pre><ul>
<li>I checked the database migrations with <code>dspace database info</code> and they were all OK
<ul>
<li>Then I restarted Tomcat again and it started up OK&hellip;</li>
</ul>
</li>
<li>There were two issues I had reported to Atmire last month:
<ul>
<li>Importing items from the command line throws a <code>NullPointerException</code> from <code>com.atmire.dspace.cua.CUASolrLoggerServiceImpl</code> for every item, but the item still gets imported</li>
<li>No results for author name in Listings and Reports, despite there being hits in Discovery search</li>
</ul>
</li>
<li>To test the first one I imported a very simple CSV file with one item with minimal data
<ul>
<li>There is a new error now (but the item does get imported):</li>
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
Loading @mire database changes for module MQM
Changes have been processed
-----------------------------------------------------------
New item:
+ New owning collection (10568/3): ILRI articles in journals
+ Add (dc.contributor.author): Orth, Alan
+ Add (dc.date.issued): 2020-09-01
+ Add (dc.title): Testing CUA import NPE
1 item(s) will be changed
Do you want to make these changes? [y/n] y
-----------------------------------------------------------
New item: aff5e78d-87c9-438d-94f8-1050b649961c (10568/108548)
+ New owning collection (10568/3): ILRI articles in journals
+ Added (dc.contributor.author): Orth, Alan
+ Added (dc.date.issued): 2020-09-01
+ Added (dc.title): Testing CUA import NPE
Tue Oct 06 22:06:14 CEST 2020 | Query:containerItem:aff5e78d-87c9-438d-94f8-1050b649961c
Error while updating
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. &lt;!doctype html&gt;&lt;html lang=&quot;en&quot;&gt;&lt;head&gt;&lt;title&gt;HTTP Status 404 Not Found&lt;/title&gt;&lt;style type=&quot;text/css&quot;&gt;body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}&lt;/style&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;HTTP Status 404 Not Found&lt;/h1&gt;&lt;hr class=&quot;line&quot; /&gt;&lt;p&gt;&lt;b&gt;Type&lt;/b&gt; Status Report&lt;/p&gt;&lt;p&gt;&lt;b&gt;Message&lt;/b&gt; The requested resource [/solr/update] is not available&lt;/p&gt;&lt;p&gt;&lt;b&gt;Description&lt;/b&gt; The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.&lt;/p&gt;&lt;hr class=&quot;line&quot; /&gt;&lt;h3&gt;Apache Tomcat/7.0.104&lt;/h3&gt;&lt;/body&gt;&lt;/html&gt;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
<li>Also, I tested Listings and Reports and there are still no hits for &ldquo;Orth, Alan&rdquo; as a contributor, despite there being dozens of items in the repository and the Solr query generated by Listings and Reports actually returning hits:</li>
</ul>
<pre><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=search.resourcetype:2&amp;fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&amp;fq=dateIssued.year:[2013+TO+2021]&amp;rows=500&amp;wt=javabin&amp;version=2} hits=18 status=0 QTime=10
</code></pre><ul>
<li>Solr returns <code>hits=18</code> for the L&amp;R query, but no results are shown in the L&amp;R UI</li>
<li>I sent all this feedback to Atmire&hellip;</li>
</ul>
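<ul>
<li>For reference, the Solr mocking mentioned in the first item of this entry could look roughly like the sketch below. It is not the project&rsquo;s actual test code: the helper <code>get_views</code>, the Solr URL, and the response shape are assumptions made for illustration, and it uses only the standard library&rsquo;s <code>unittest.mock</code> to fake both an unreachable Solr and a canned reply:</li>
</ul>

```python
import io
import json
import urllib.error
import urllib.parse
import urllib.request
from unittest import mock

# Hypothetical Solr endpoint, not taken from the project
SOLR_SELECT = "http://localhost:8983/solr/statistics/select"


def get_views(item_id):
    """Hypothetical helper: return an item's hit count from Solr."""
    params = urllib.parse.urlencode({"q": f"id:{item_id}", "rows": 0, "wt": "json"})
    try:
        with urllib.request.urlopen(f"{SOLR_SELECT}?{params}", timeout=5) as response:
            body = json.load(response)
    except urllib.error.URLError:
        # Solr is down or unreachable: report it instead of crashing
        return {"id": item_id, "views": None, "error": "solr unavailable"}
    return {"id": item_id, "views": body["response"]["numFound"]}


def views_when_solr_is_down(item_id):
    """Call get_views() with urlopen patched to fail, as a test would."""
    failure = urllib.error.URLError("connection refused")
    with mock.patch("urllib.request.urlopen", side_effect=failure):
        return get_views(item_id)


def views_with_canned_response(item_id, num_found):
    """Call get_views() with urlopen patched to return a fixed Solr reply."""
    payload = json.dumps({"response": {"numFound": num_found}}).encode()
    with mock.patch("urllib.request.urlopen") as fake_urlopen:
        # urlopen() is used as a context manager, so fake its __enter__ value
        fake_urlopen.return_value.__enter__.return_value = io.BytesIO(payload)
        return get_views(item_id)
```

<ul>
<li>Patching at the <code>urlopen</code> boundary means neither path needs a running Solr, which is the point of mocking here</li>
</ul>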
<h2 id="2020-10-07">2020-10-07</h2>
<ul>
<li>Udana from IWMI had asked about stats discrepancies from reports they had generated in previous months or years
<ul>
<li>I told him that we very often purge bots and the number of stats can change drastically</li>
<li>Also, I told him that it is not possible to compare stats from previous exports and that the stats should be taken with a grain of salt</li>
</ul>
</li>
<li>Testing POSTing items to the DSpace 6 REST API
<ul>
<li>We need to authenticate to get a JSESSIONID cookie first:</li>
</ul>
</li>
</ul>
<pre><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
</ul>
<pre><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE &lt; item-object.json
</code></pre><ul>
<li>Format of JSON is:</li>
</ul>
<pre><code>{ &quot;metadata&quot;: [
{
&quot;key&quot;: &quot;dc.title&quot;,
&quot;value&quot;: &quot;Testing REST API post&quot;,
&quot;language&quot;: &quot;en_US&quot;
},
{
&quot;key&quot;: &quot;dc.contributor.author&quot;,
&quot;value&quot;: &quot;Orth, Alan&quot;,
&quot;language&quot;: &quot;en_US&quot;
},
{
&quot;key&quot;: &quot;dc.date.issued&quot;,
&quot;value&quot;: &quot;2020-09-01&quot;,
&quot;language&quot;: &quot;en_US&quot;
}
],
&quot;archived&quot;:&quot;false&quot;,
&quot;withdrawn&quot;:&quot;false&quot;
}
</code></pre><ul>
<li>What is unclear to me is the <code>archived</code> parameter: it seems to do nothing&hellip; perhaps it is only used for the <code>/items</code> endpoint when printing information about an item
<ul>
<li>If I submit to a collection that has a workflow, even as a super admin and with &ldquo;archived=false&rdquo; in the JSON, the item enters the workflow (&ldquo;Awaiting editor&rsquo;s attention&rdquo;)</li>
<li>If I submit to a new collection without a workflow the item gets archived immediately</li>
<li>I created <a href="https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a">some notes</a> to share with Salem and Leroy for future reference when we start discussing POSTing items to the REST API</li>
</ul>
</li>
<li>I created an account for Salem on DSpace Test and added it to the submitters group of an ICARDA collection with no other workflow steps so we can see what happens
<ul>
<li>We are curious to see if he gets a UUID when posting from MEL</li>
</ul>
</li>
<li>I did some tests by adding his account to certain workflow steps and trying to POST the item</li>
<li>Member of collection &ldquo;Submitters&rdquo; step:
<ul>
<li>HTTP Status 401 Unauthorized</li>
<li>The request has not been applied because it lacks valid authentication credentials for the target resource.</li>
</ul>
</li>
<li>Member of collection &ldquo;Accept/Reject&rdquo; step:
<ul>
<li>Same error&hellip;</li>
</ul>
</li>
<li>Member of collection &ldquo;Accept/Reject/Edit Metadata&rdquo; step:
<ul>
<li>Same error&hellip;</li>
</ul>
</li>
<li>Member of collection Administrators with no other workflow steps&hellip;:
<ul>
<li>Posts straight to archive</li>
</ul>
</li>
<li>Member of collection Administrators with empty &ldquo;Accept/Reject/Edit Metadata&rdquo; step:
<ul>
<li>Posts straight to archive</li>
</ul>
</li>
<li>Member of collection Administrators with populated &ldquo;Accept/Reject/Edit Metadata&rdquo; step:
<ul>
<li>Does <em>not</em> post straight to archive, goes to workflow</li>
</ul>
</li>
<li>Note that community administrators have no role in item submission other than being able to create/manage collection groups</li>
</ul>
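<ul>
<li>The item JSON and the submission results above can be summarised in a short Python sketch. The function and enum names are hypothetical, and the outcome logic only encodes what I observed with Salem&rsquo;s account on DSpace Test (it does not model site administrators or anything in DSpace&rsquo;s own code):</li>
</ul>

```python
import json
from enum import Enum


def build_item_payload(fields, language="en_US"):
    """Serialize a {field: value} dict into the item JSON format shown above."""
    metadata = [
        {"key": key, "value": value, "language": language}
        for key, value in fields.items()
    ]
    # "archived" and "withdrawn" are sent as strings, as in the example above
    item = {"metadata": metadata, "archived": "false", "withdrawn": "false"}
    return json.dumps(item, indent=2)


class Outcome(Enum):
    UNAUTHORIZED = "HTTP Status 401 Unauthorized"
    ARCHIVED = "posts straight to archive"
    WORKFLOW = "enters the workflow"


def post_outcome(is_collection_admin, workflow_step_populated):
    """Observed result of POSTing an item, based only on the tests above."""
    if not is_collection_admin:
        # Submitters and workflow step members were all rejected
        return Outcome.UNAUTHORIZED
    if workflow_step_populated:
        return Outcome.WORKFLOW
    # No workflow steps, or only empty ones
    return Outcome.ARCHIVED
```

<ul>
<li>For example, <code>post_outcome(True, False)</code> corresponds to the collection administrator case with an empty &ldquo;Accept/Reject/Edit Metadata&rdquo; step</li>
</ul>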
<h2 id="2020-10-08">2020-10-08</h2>
<ul>
<li>I did some testing of the DSpace 5 REST API because Salem and I were curious
<ul>
<li>The authentication is a little different (uses a serialized JSON object instead of a form and the token is an HTTP header instead of a cookie):</li>
</ul>
</li>
</ul>
<pre><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 &lt; item-object.json
</code></pre><ul>
<li>The item submission works exactly the same as in DSpace 6:</li>
</ul>
<ol>
<li>The submitting user must be a collection admin</li>
<li>If the collection has a workflow the item will enter it and the API returns an item ID</li>
<li>If the collection does not have a workflow then the item is committed to the archive and you get a Handle</li>
</ol>
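<ul>
<li>A client that has to talk to both versions could hide the authentication difference behind two small helpers. This is only a sketch based on the commands above: the function names are made up, and it builds the request pieces without sending anything:</li>
</ul>

```python
import json
import urllib.parse


def login_request(dspace_major_version, email, password):
    """Return (content_type, body) for POSTing to /rest/login."""
    credentials = {"email": email, "password": password}
    if dspace_major_version == 5:
        # DSpace 5 takes a serialized JSON object
        return "application/json", json.dumps(credentials)
    if dspace_major_version == 6:
        # DSpace 6 takes a form
        return "application/x-www-form-urlencoded", urllib.parse.urlencode(credentials)
    raise ValueError(f"untested DSpace version: {dspace_major_version}")


def auth_headers(dspace_major_version, token):
    """Return the header that carries the session for later requests."""
    if dspace_major_version == 5:
        # The token is sent back in an HTTP header
        return {"rest-dspace-token": token}
    if dspace_major_version == 6:
        # The token is the JSESSIONID cookie
        return {"Cookie": f"JSESSIONID={token}"}
    raise ValueError(f"untested DSpace version: {dspace_major_version}")
```

<ul>
<li>Anything other than versions 5 and 6 raises an error because those are the only two I tested</li>
</ul>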
<h2 id="2020-10-09">2020-10-09</h2>
<ul>
<li>Skype with Peter about AReS and CGSpace
<ul>
<li>We discussed removing Atmire Listings and Reports from DSpace 6 because we can probably make the same reports in AReS and this module is the one that is currently holding us back from the upgrade</li>
<li>We discussed allowing partners to submit content via the REST API and perhaps charging an extra fee for it due to the burden it incurs with unfinished submissions, manual duplicate checking, developer support, etc</li>
<li>He was excited about the possibility of using my statistics API for more things on AReS as well as item view pages</li>
</ul>
</li>
<li>Also I fixed a bunch of the CRP mappings in the AReS value mapper and started a fresh re-indexing</li>
</ul>
<h2 id="2020-10-12">2020-10-12</h2>
<ul>
<li>Looking at CGSpace&rsquo;s Solr statistics for 2020-09, I see:
<ul>
<li><code>RTB website BOT</code>: 212916</li>
<li><code>Java/1.8.0_66</code>: 3122</li>
<li><code>Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1</code>: 614</li>
<li><code>omgili/0.5 +http://omgili.com</code>: 272</li>
<li><code>Mozilla/5.0 (compatible; TrendsmapResolver/0.1)</code>: 199</li>
<li><code>Vizzit</code>: 160</li>
<li><code>Scoop.it</code>: 151</li>
</ul>
</li>
<li>I&rsquo;m confused because a pattern for <code>bot</code> has existed in the default DSpace spider agents file forever&hellip;
<ul>
<li>I see 259,000 hits in CGSpace&rsquo;s 2020 Solr core when I search for this: <code>userAgent:/.*[Bb][Oo][Tt].*/</code>
<ul>
<li>This includes 228,000 for <code>RTB website BOT</code> and 18,000 for <code>ILRI Livestock Website Publications importer BOT</code></li>
</ul>
</li>
<li>I made a few requests to DSpace Test with the RTB user agent to see if it gets logged or not:</li>
</ul>
</li>
</ul>
<pre><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
</code></pre><ul>
<li>After a few minutes I saw these four hits in Solr&hellip; WTF
<ul>
<li>So is there some issue with DSpace&rsquo;s parsing of the spider agent files?</li>
<li>I added <code>RTB website BOT</code> to the ilri pattern file, restarted Tomcat, and made four more requests to the bitstream</li>
<li>These four requests were recorded in Solr too, WTF!</li>
<li>It seems like the patterns aren&rsquo;t working at all&hellip;</li>
<li>I decided to try something drastic and removed all pattern files, adding only one single pattern <code>bot</code> to make sure this is not because of a syntax or precedence issue</li>
<li>Now even those four requests were recorded in Solr, WTF!</li>
<li>I will try one last thing, to put a single entry with the exact pattern <code>RTB website BOT</code> in a single spider agents pattern file&hellip;</li>
<li>Nope! Still records the hits&hellip; WTF</li>
<li>As a last resort I tried to use the vanilla <a href="https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/config/spiders/agents/example">DSpace 6 <code>example</code> file</a></li>
<li>And the hits still get recorded&hellip; WTF</li>
<li>So now I&rsquo;m wondering if this is because of our custom Atmire shit?</li>
<li>I will have to test on a vanilla DSpace instance I guess before I can complain to the dspace-tech mailing list</li>
</ul>
</li>
<li>I re-factored the <code>check-spider-hits.sh</code> script to read patterns from a text file rather than sed&rsquo;s stdout, and to properly search for spaces in patterns that use <code>\s</code> because Lucene&rsquo;s search syntax doesn&rsquo;t support it (and spaces work just fine)
<ul>
<li>Reference: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html">https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html</a></li>
<li>Reference: <a href="https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches">https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches</a></li>
</ul>
</li>
<li>I added <code>[Ss]pider</code> to the Tomcat Crawler Sessions Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID</li>
<li>I added a few of the patterns from above to our local agents list and ran <code>check-spider-hits.sh</code> on CGSpace:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
Purging 228916 hits from RTB website BOT in statistics
Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
Purging 199 hits from [Ss]pider in statistics
Purging 2326 hits from ubermetrics in statistics
Purging 888 hits from omgili\.com in statistics
Purging 1888 hits from TrendsmapResolver in statistics
Purging 3546 hits from Vizzit in statistics
Purging 2127 hits from Scoop\.it in statistics
Total number of bot hits purged: 261258
$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2019 -u http://localhost:8083/solr -p
Purging 2952 hits from TrendsmapResolver in statistics-2019
Purging 4252 hits from Vizzit in statistics-2019
Purging 2976 hits from Scoop\.it in statistics-2019
Total number of bot hits purged: 10180
$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2018 -u http://localhost:8083/solr -p
Purging 1702 hits from TrendsmapResolver in statistics-2018
Purging 1062 hits from Vizzit in statistics-2018
Purging 920 hits from Scoop\.it in statistics-2018
Total number of bot hits purged: 3684
</code></pre><!-- raw HTML omitted -->
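<ul>
<li>The pattern handling described above (turning a spider agent pattern into a Solr regexp query on <code>userAgent</code>) can be sketched in Python. This is an illustration, not the contents of <code>check-spider-hits.sh</code>: the function name is hypothetical, and dropping a leading <code>^</code> is my assumption, based on Lucene regexps always matching the whole term and not supporting anchors:</li>
</ul>

```python
def solr_query_for_pattern(pattern):
    """Build a Solr regexp query on userAgent from a spider agent pattern."""
    # Lucene's regexp syntax has no \s, but a literal space works fine
    lucene_pattern = pattern.replace(r"\s", " ")
    if lucene_pattern.startswith("^"):
        # Assumption: Lucene regexps are implicitly anchored, so drop the
        # caret and only allow arbitrary text after the pattern
        return f"userAgent:/{lucene_pattern[1:]}.*/"
    # Unanchored pattern: allow arbitrary text on both sides
    return f"userAgent:/.*{lucene_pattern}.*/"
```

<ul>
<li>With the pattern <code>[Bb][Oo][Tt]</code> this gives <code>userAgent:/.*[Bb][Oo][Tt].*/</code>, the same query I used earlier in this entry</li>
</ul>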
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2020-10/">October, 2020</a></li>
<li><a href="/cgspace-notes/2020-09/">September, 2020</a></li>
<li><a href="/cgspace-notes/2020-08/">August, 2020</a></li>
<li><a href="/cgspace-notes/2020-07/">July, 2020</a></li>
<li><a href="/cgspace-notes/2020-06/">June, 2020</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>