mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-16 03:47:04 +01:00
1296 lines
72 KiB
HTML
<!DOCTYPE html>
|
||
<html lang="en" >
|
||
|
||
<head>
|
||
<meta charset="utf-8">
|
||
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
||
|
||
|
||
<meta property="og:title" content="October, 2020" />
|
||
<meta property="og:description" content="2020-10-06

Add tests for the new /items POST handlers to the DSpace 6.x branch of my dspace-statistics-api

It took a bit of extra work because I had to learn how to mock the responses for when Solr is not available
Tag and release version 1.3.0 on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0


Trying to test the changes Atmire sent last week but I had to re-create my local database from a recent CGSpace dump

During the FlywayDB migration I got an error:


" />
|
||
<meta property="og:type" content="article" />
|
||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-10/" />
|
||
<meta property="article:published_time" content="2020-10-06T16:55:54+03:00" />
|
||
<meta property="article:modified_time" content="2020-11-16T10:53:45+02:00" />
|
||
|
||
|
||
|
||
<meta name="twitter:card" content="summary"/>
|
||
<meta name="twitter:title" content="October, 2020"/>
|
||
<meta name="twitter:description" content="2020-10-06

Add tests for the new /items POST handlers to the DSpace 6.x branch of my dspace-statistics-api

It took a bit of extra work because I had to learn how to mock the responses for when Solr is not available
Tag and release version 1.3.0 on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0


Trying to test the changes Atmire sent last week but I had to re-create my local database from a recent CGSpace dump

During the FlywayDB migration I got an error:


"/>
|
||
<meta name="generator" content="Hugo 0.88.1" />
|
||
|
||
|
||
|
||
<script type="application/ld+json">
|
||
{
|
||
"@context": "http://schema.org",
|
||
"@type": "BlogPosting",
|
||
"headline": "October, 2020",
|
||
"url": "https://alanorth.github.io/cgspace-notes/2020-10/",
|
||
"wordCount": "6709",
|
||
"datePublished": "2020-10-06T16:55:54+03:00",
|
||
"dateModified": "2020-11-16T10:53:45+02:00",
|
||
"author": {
|
||
"@type": "Person",
|
||
"name": "Alan Orth"
|
||
},
|
||
"keywords": "Notes"
|
||
}
|
||
</script>
|
||
|
||
|
||
|
||
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-10/">
|
||
|
||
<title>October, 2020 | CGSpace Notes</title>
|
||
|
||
|
||
<!-- combined, minified CSS -->
|
||
|
||
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
|
||
|
||
|
||
<!-- minified Font Awesome for SVG icons -->
|
||
|
||
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
|
||
|
||
<!-- RSS 2.0 feed -->
|
||
|
||
|
||
|
||
|
||
</head>
|
||
|
||
<body>
|
||
|
||
|
||
<div class="blog-masthead">
|
||
<div class="container">
|
||
<nav class="nav blog-nav">
|
||
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
||
</nav>
|
||
</div>
|
||
</div>
|
||
|
||
|
||
|
||
|
||
<header class="blog-header">
|
||
<div class="container">
|
||
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
||
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
||
</div>
|
||
</header>
|
||
|
||
|
||
|
||
|
||
<div class="container">
|
||
<div class="row">
|
||
<div class="col-sm-8 blog-main">
|
||
|
||
|
||
|
||
|
||
<article class="blog-post">
|
||
<header>
|
||
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-10/">October, 2020</a></h2>
|
||
<p class="blog-post-meta">
|
||
<time datetime="2020-10-06T16:55:54+03:00">Tue Oct 06, 2020</time>
|
||
in
|
||
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
|
||
|
||
|
||
</p>
|
||
</header>
|
||
<h2 id="2020-10-06">2020-10-06</h2>
|
||
<ul>
|
||
<li>Add tests for the new <code>/items</code> POST handlers to the DSpace 6.x branch of my <a href="https://github.com/ilri/dspace-statistics-api/tree/v6_x">dspace-statistics-api</a>
|
||
<ul>
|
||
<li>It took a bit of extra work because I had to learn how to mock the responses for when Solr is not available</li>
|
||
<li>Tag and release version 1.3.0 on GitHub: <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0">https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.0</a></li>
|
||
</ul>
|
||
</li>
|
||
<li>Trying to test the changes Atmire sent last week but I had to re-create my local database from a recent CGSpace dump
|
||
<ul>
|
||
<li>During the FlywayDB migration I got an error:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
  Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
  Detail: Key (short_description)=(EPUB) already exists.
2020-10-06 21:36:04,142 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [could not execute batch]
2020-10-06 21:36:04,143 ERROR org.dspace.storage.rdbms.DatabaseRegistryUpdater @ Error attempting to update Bitstream Format and/or Metadata Registries
org.hibernate.exception.ConstraintViolationException: could not execute batch
    at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:129)
    at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:49)
    at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:124)
    at org.hibernate.engine.jdbc.batch.internal.BatchingBatch.performExecution(BatchingBatch.java:122)
    at org.hibernate.engine.jdbc.batch.internal.BatchingBatch.doExecuteBatch(BatchingBatch.java:101)
    at org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl.execute(AbstractBatchImpl.java:161)
    at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.executeBatch(JdbcCoordinatorImpl.java:207)
    at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:390)
    at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:304)
    at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:349)
    at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:56)
    at org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.hibernate.context.internal.ThreadLocalSessionContext$TransactionProtectionWrapper.invoke(ThreadLocalSessionContext.java:352)
    at com.sun.proxy.$Proxy162.flush(Unknown Source)
    at org.dspace.core.HibernateDBConnection.commit(HibernateDBConnection.java:83)
    at org.dspace.core.Context.commit(Context.java:435)
    at org.dspace.core.Context.complete(Context.java:380)
    at org.dspace.administer.MetadataImporter.loadRegistry(MetadataImporter.java:164)
    at org.dspace.storage.rdbms.DatabaseRegistryUpdater.updateRegistries(DatabaseRegistryUpdater.java:72)
    at org.dspace.storage.rdbms.DatabaseRegistryUpdater.afterMigrate(DatabaseRegistryUpdater.java:121)
    at org.flywaydb.core.internal.command.DbMigrate$3.doInTransaction(DbMigrate.java:250)
    at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
    at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:246)
    at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
    at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
    at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
    at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
    at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:663)
    at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:575)
    at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:551)
    at org.dspace.core.Context.<clinit>(Context.java:103)
    at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
    at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
    at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5197)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5720)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:1016)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:992)
</code></pre><ul>
|
||
<li>I checked the database migrations with <code>dspace database info</code> and they were all OK
|
||
<ul>
|
||
<li>Then I restarted Tomcat again and it started up OK…</li>
|
||
</ul>
|
||
</li>
|
||
<li>There were two issues I had reported to Atmire last month:
|
||
<ul>
|
||
<li>Importing items from the command line throws a <code>NullPointerException</code> from <code>com.atmire.dspace.cua.CUASolrLoggerServiceImpl</code> for every item, but the item still gets imported</li>
|
||
<li>No results for author name in Listings and Reports, despite there being hits in Discovery search</li>
|
||
</ul>
|
||
</li>
|
||
<li>To test the first one I imported a very simple CSV file with one item with minimal data
|
||
<ul>
|
||
<li>There is a new error now (but the item does get imported):</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
Loading @mire database changes for module MQM
Changes have been processed
-----------------------------------------------------------
New item:
+ New owning collection (10568/3): ILRI articles in journals
+ Add (dc.contributor.author): Orth, Alan
+ Add (dc.date.issued): 2020-09-01
+ Add (dc.title): Testing CUA import NPE

1 item(s) will be changed

Do you want to make these changes? [y/n] y
-----------------------------------------------------------
New item: aff5e78d-87c9-438d-94f8-1050b649961c (10568/108548)
+ New owning collection (10568/3): ILRI articles in journals
+ Added (dc.contributor.author): Orth, Alan
+ Added (dc.date.issued): 2020-09-01
+ Added (dc.title): Testing CUA import NPE
Tue Oct 06 22:06:14 CEST 2020 | Query:containerItem:aff5e78d-87c9-438d-94f8-1050b649961c
Error while updating
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <!doctype html><html lang="en"><head><title>HTTP Status 404 – Not Found</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 404 – Not Found</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Message</b> The requested resource [/solr/update] is not available</p><p><b>Description</b> The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.</p><hr class="line" /><h3>Apache Tomcat/7.0.104</h3></body></html>
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
    at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
    at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
    at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
    at org.dspace.core.Context.dispatchEvents(Context.java:455)
    at org.dspace.core.Context.commit(Context.java:424)
    at org.dspace.core.Context.complete(Context.java:380)
    at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
|
||
<li>Also, I tested Listings and Reports and there are still no hits for “Orth, Alan” as a contributor, despite there being dozens of items in the repository and the Solr query generated by Listings and Reports actually returning hits:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&fq=dateIssued.year:[2013+TO+2021]&rows=500&wt=javabin&version=2} hits=18 status=0 QTime=10
|
||
</code></pre><ul>
|
||
<li>Solr returns <code>hits=18</code> for the L&R query, but there are no results shown in the L&R UI</li>
|
||
<li>I sent all this feedback to Atmire…</li>
|
||
</ul>
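<ul>
<li>For reference, the first note in this entry mentions mocking responses for when Solr is not available. The following is only a minimal sketch of that idea using Python's standard <code>unittest.mock</code>, not the actual test code from dspace-statistics-api: the <code>count_views</code> helper, the <code>SOLR_URL</code> value, and the query it builds are all assumptions made up for illustration.</li>
</ul>

```python
import io
import json
import urllib.error
import urllib.request
from unittest import mock

# Hypothetical Solr core URL, for illustration only.
SOLR_URL = "http://localhost:8983/solr/statistics"


def count_views(item_id: str) -> dict:
    """Hypothetical helper: ask Solr how many statistics documents match an item."""
    query = f"{SOLR_URL}/select?q=id:{item_id}&rows=0&wt=json"
    try:
        with urllib.request.urlopen(query, timeout=5) as response:
            body = json.load(response)
    except urllib.error.URLError:
        # Solr is unreachable: report "service unavailable" instead of crashing.
        return {"status": 503, "views": None}
    return {"status": 200, "views": body["response"]["numFound"]}


def simulate_solr_down() -> dict:
    """Patch urlopen so that every call fails, as if Solr were offline."""
    error = urllib.error.URLError("connection refused")
    with mock.patch("urllib.request.urlopen", side_effect=error):
        return count_views("aff5e78d-87c9-438d-94f8-1050b649961c")


def simulate_solr_up() -> dict:
    """Patch urlopen so that it returns a canned JSON response."""
    canned = io.BytesIO(b'{"response": {"numFound": 18}}')
    with mock.patch("urllib.request.urlopen", return_value=canned):
        return count_views("aff5e78d-87c9-438d-94f8-1050b649961c")
```

<ul>
<li>Neither helper touches the network, so a test can assert on both the failure path and the success path without a running Solr.</li>
</ul>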
|
||
<h2 id="2020-10-07">2020-10-07</h2>
|
||
<ul>
|
||
<li>Udana from IWMI had asked about stats discrepancies from reports they had generated in previous months or years
|
||
<ul>
|
||
<li>I told him that we very often purge bots and the number of stats can change drastically</li>
|
||
<li>Also, I told him that it is not possible to compare stats from previous exports and that the stats should be taken with a grain of salt</li>
|
||
</ul>
|
||
</li>
|
||
<li>Testing POSTing items to the DSpace 6 REST API
|
||
<ul>
|
||
<li>We need to authenticate to get a JSESSIONID cookie first:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
|
||
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE < item-object.json
|
||
</code></pre><ul>
|
||
<li>Format of JSON is:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>{ "metadata": [
    {
      "key": "dc.title",
      "value": "Testing REST API post",
      "language": "en_US"
    },
    {
      "key": "dc.contributor.author",
      "value": "Orth, Alan",
      "language": "en_US"
    },
    {
      "key": "dc.date.issued",
      "value": "2020-09-01",
      "language": "en_US"
    }
  ],
  "archived":"false",
  "withdrawn":"false"
}
</code></pre><ul>
|
||
<li>What is unclear to me is the <code>archived</code> parameter: it seems to do nothing… perhaps it is only used for the <code>/items</code> endpoint when printing information about an item
|
||
<ul>
|
||
<li>If I submit to a collection that has a workflow, even as a super admin and with “archived=false” in the JSON, the item enters the workflow (“Awaiting editor’s attention”)</li>
|
||
<li>If I submit to a new collection without a workflow the item gets archived immediately</li>
|
||
<li>I created <a href="https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a">some notes</a> to share with Salem and Leroy for future reference when we start discussing POSTing items to the REST API</li>
|
||
</ul>
|
||
</li>
|
||
<li>I created an account for Salem on DSpace Test and added it to the submitters group of an ICARDA collection with no other workflow steps so we can see what happens
|
||
<ul>
|
||
<li>We are curious to see if he gets a UUID when posting from MEL</li>
|
||
</ul>
|
||
</li>
|
||
<li>I did some tests by adding his account to certain workflow steps and trying to POST the item</li>
|
||
<li>Member of collection “Submitters” step:
|
||
<ul>
|
||
<li>HTTP Status 401 – Unauthorized</li>
|
||
<li>The request has not been applied because it lacks valid authentication credentials for the target resource.</li>
|
||
</ul>
|
||
</li>
|
||
<li>Member of collection “Accept/Reject” step:
|
||
<ul>
|
||
<li>Same error…</li>
|
||
</ul>
|
||
</li>
|
||
<li>Member of collection “Accept/Reject/Edit Metadata” step:
|
||
<ul>
|
||
<li>Same error…</li>
|
||
</ul>
|
||
</li>
|
||
<li>Member of collection Administrators with no other workflow steps…:
|
||
<ul>
|
||
<li>Posts straight to archive</li>
|
||
</ul>
|
||
</li>
|
||
<li>Member of collection Administrators with empty “Accept/Reject/Edit Metadata” step:
|
||
<ul>
|
||
<li>Posts straight to archive</li>
|
||
</ul>
|
||
</li>
|
||
<li>Member of collection Administrators with populated “Accept/Reject/Edit Metadata” step:
|
||
<ul>
|
||
<li>Does <em>not</em> post straight to archive, goes to workflow</li>
|
||
</ul>
|
||
</li>
|
||
<li>Note that community administrators have no role in item submission other than being able to create/manage collection groups</li>
|
||
</ul>
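<ul>
<li>The DSpace 6 POST from this entry could also be prepared from Python. This is a rough sketch using only the standard library, not something I ran against DSpace Test: <code>build_item_request</code> is a hypothetical helper name, the <code>Content-Type</code> and <code>Accept</code> headers are my assumption about what the REST API expects, and the sketch leaves out <code>archived</code> and <code>withdrawn</code> because <code>archived</code> appeared to do nothing in the tests above. It only builds the request; nothing is sent.</li>
</ul>

```python
import json
import urllib.request


def build_item_request(base_url, collection_uuid, session_id, metadata):
    """Hypothetical helper: build (but do not send) an item POST request."""
    payload = {
        "metadata": [
            {"key": key, "value": value, "language": "en_US"}
            for key, value in metadata.items()
        ]
    }
    return urllib.request.Request(
        f"{base_url}/rest/collections/{collection_uuid}/items",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumed headers for a JSON body.
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Session cookie obtained from /rest/login beforehand.
            "Cookie": f"JSESSIONID={session_id}",
        },
        method="POST",
    )


request = build_item_request(
    "https://dspacetest.cgiar.org",
    "f10ad667-2746-4705-8b16-4439abe61d22",
    "EABAC9EFF942028AA52DFDA16DBCAFDE",
    {
        "dc.title": "Testing REST API post",
        "dc.contributor.author": "Orth, Alan",
        "dc.date.issued": "2020-09-01",
    },
)
print(request.get_method(), request.full_url)
```

<ul>
<li>Passing the result to <code>urllib.request.urlopen()</code> would actually submit the item, so that step is deliberately left out here.</li>
</ul>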
|
||
<h2 id="2020-10-08">2020-10-08</h2>
|
||
<ul>
|
||
<li>I did some testing of the DSpace 5 REST API because Salem and I were curious
|
||
<ul>
|
||
<li>The authentication is a little different (uses a serialized JSON object instead of a form and the token is an HTTP header instead of a cookie):</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 < item-object.json
</code></pre><ul>
|
||
<li>The item submission works exactly the same as in DSpace 6:</li>
|
||
</ul>
|
||
<ol>
|
||
<li>The submitting user must be a collection admin</li>
|
||
<li>If the collection has a workflow the item will enter it and the API returns an item ID</li>
|
||
<li>If the collection does not have a workflow then the item is committed to the archive and you get a Handle</li>
|
||
</ol>
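<ul>
<li>The three rules above can be written down as a tiny decision function. This only restates what I observed in these tests on DSpace 5 and 6; it is not DSpace's own logic, and the function name and return strings are made up for illustration.</li>
</ul>

```python
def expected_outcome(is_collection_admin: bool, workflow_populated: bool) -> str:
    """Summarize the observed result of POSTing an item to a collection."""
    if not is_collection_admin:
        # Submitters and workflow-step members all got the same error.
        return "HTTP 401 Unauthorized"
    if workflow_populated:
        # A populated workflow step catches the item before it is archived.
        return "enters workflow, API returns an item ID"
    # No workflow, or only empty workflow steps.
    return "archived immediately, item gets a Handle"


for admin in (False, True):
    for workflow in (False, True):
        print(admin, workflow, "->", expected_outcome(admin, workflow))
```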
|
||
<h2 id="2020-10-09">2020-10-09</h2>
|
||
<ul>
|
||
<li>Skype with Peter about AReS and CGSpace
|
||
<ul>
|
||
<li>We discussed removing Atmire Listings and Reports from DSpace 6 because we can probably make the same reports in AReS and this module is the one that is currently holding us back from the upgrade</li>
|
||
<li>We discussed allowing partners to submit content via the REST API and perhaps making it an extra fee due to the burden it incurs with unfinished submissions, manual duplicate checking, developer support, etc</li>
|
||
<li>He was excited about the possibility of using my statistics API for more things on AReS as well as item view pages</li>
|
||
</ul>
|
||
</li>
|
||
<li>Also I fixed a bunch of the CRP mappings in the AReS value mapper and started a fresh re-indexing</li>
|
||
</ul>
|
||
<h2 id="2020-10-12">2020-10-12</h2>
|
||
<ul>
|
||
<li>Looking at CGSpace’s Solr statistics for 2020-09 and I see:
|
||
<ul>
|
||
<li><code>RTB website BOT</code>: 212916</li>
|
||
<li><code>Java/1.8.0_66</code>: 3122</li>
|
||
<li><code>Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1</code>: 614</li>
|
||
<li><code>omgili/0.5 +http://omgili.com</code>: 272</li>
|
||
<li><code>Mozilla/5.0 (compatible; TrendsmapResolver/0.1)</code>: 199</li>
|
||
<li><code>Vizzit</code>: 160</li>
|
||
<li><code>Scoop.it</code>: 151</li>
|
||
</ul>
|
||
</li>
|
||
<li>I’m confused because a pattern for <code>bot</code> has existed in the default DSpace spider agents file forever…
|
||
<ul>
|
||
<li>I see 259,000 hits in CGSpace’s 2020 Solr core when I search for this: <code>userAgent:/.*[Bb][Oo][Tt].*/</code>
|
||
<ul>
|
||
<li>This includes 228,000 for <code>RTB website BOT</code> and 18,000 for <code>ILRI Livestock Website Publications importer BOT</code></li>
|
||
</ul>
|
||
</li>
|
||
<li>I made a few requests to DSpace Test with the RTB user agent to see if it gets logged or not:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
</code></pre><ul>
|
||
<li>After a few minutes I saw these four hits in Solr… WTF
|
||
<ul>
|
||
<li>So is there some issue with DSpace’s parsing of the spider agent files?</li>
|
||
<li>I added <code>RTB website BOT</code> to the ilri pattern file, restarted Tomcat, and made four more requests to the bitstream</li>
|
||
<li>These four requests were recorded in Solr too, WTF!</li>
|
||
<li>It seems like the patterns aren’t working at all…</li>
|
||
<li>I decided to try something drastic and removed all pattern files, adding only one single pattern <code>bot</code> to make sure this is not because of a syntax or precedence issue</li>
|
||
<li>Now even those four requests were recorded in Solr, WTF!</li>
|
||
<li>I will try one last thing, to put a single entry with the exact pattern <code>RTB website BOT</code> in a single spider agents pattern file…</li>
|
||
<li>Nope! Still records the hits… WTF</li>
|
||
<li>As a last resort I tried to use the vanilla <a href="https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/config/spiders/agents/example">DSpace 6 <code>example</code> file</a></li>
|
||
<li>And the hits still get recorded… WTF</li>
|
||
<li>So now I’m wondering if this is because of our custom Atmire shit?</li>
|
||
<li>I will have to test on a vanilla DSpace instance I guess before I can complain to the dspace-tech mailing list</li>
|
||
</ul>
|
||
</li>
|
||
<li>I re-factored the <code>check-spider-hits.sh</code> script to read patterns from a text file rather than sed’s stdout, and to properly search for spaces in patterns that use <code>\s</code> because Lucene’s search syntax doesn’t support it (and spaces work just fine)
|
||
<ul>
|
||
<li>Reference: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html">https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html</a></li>
|
||
<li>Reference: <a href="https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches">https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches</a></li>
|
||
</ul>
|
||
</li>
|
||
<li>I added <code>[Ss]pider</code> to the Tomcat Crawler Session Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID</li>
|
||
<li>I added a few of the patterns from above to our local agents list and ran the <code>check-spider-hits.sh</code> on CGSpace:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
Purging 228916 hits from RTB website BOT in statistics
Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
Purging 199 hits from [Ss]pider in statistics
Purging 2326 hits from ubermetrics in statistics
Purging 888 hits from omgili\.com in statistics
Purging 1888 hits from TrendsmapResolver in statistics
Purging 3546 hits from Vizzit in statistics
Purging 2127 hits from Scoop\.it in statistics

Total number of bot hits purged: 261258
$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2019 -u http://localhost:8083/solr -p
Purging 2952 hits from TrendsmapResolver in statistics-2019
Purging 4252 hits from Vizzit in statistics-2019
Purging 2976 hits from Scoop\.it in statistics-2019

Total number of bot hits purged: 10180
$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2018 -u http://localhost:8083/solr -p
Purging 1702 hits from TrendsmapResolver in statistics-2018
Purging 1062 hits from Vizzit in statistics-2018
Purging 920 hits from Scoop\.it in statistics-2018

Total number of bot hits purged: 3684
</code></pre><h2 id="2020-10-13">2020-10-13</h2>
|
||
<ul>
|
||
<li>Skype with Peter about AReS again
|
||
<ul>
|
||
<li>We decided to use Title Case for our countries on CGSpace to minimize the need for mapping on AReS</li>
|
||
<li>We did some work to add a dozen more mappings for strange and incorrect CRPs on AReS</li>
|
||
</ul>
|
||
</li>
|
||
<li>I can update the country metadata in PostgreSQL like this:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>dspace=> BEGIN;
dspace=> UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
UPDATE 51756
dspace=> COMMIT;
</code></pre><ul>
|
||
<li>I will need to pay special attention to Côte d’Ivoire, Bosnia and Herzegovina, and a few others though… maybe better to do a search and replace using <code>fix-metadata-values.py</code>
|
||
<ul>
|
||
<li>Export a list of distinct values from the database:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
|
||
<li>Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: <code>value.toTitlecase()</code>
|
||
<ul>
|
||
<li>I still had to double check everything to catch some corner cases (Andorra, Timor-leste, etc)</li>
|
||
</ul>
|
||
</li>
|
||
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>:'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
|
||
</code></pre><ul>
|
||
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka “lookaround” in PCRE?) to match words that are <em>not</em> “pair”, “displayed”, etc because we don’t want to edit the XML tags themselves…
|
||
<ul>
|
||
<li>I had to fix a few manually after doing this, as above with PostgreSQL</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
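<ul>
<li>To show why a blanket <code>INITCAP()</code> or <code>toTitlecase()</code> needs manual follow-up, here is a small sketch of title casing with exceptions. It is not what I ran on the database: the <code>SMALL_WORDS</code> set and the <code>OVERRIDES</code> table are hypothetical and far from complete, and real data would need every override checked by hand.</li>
</ul>

```python
# Hypothetical list of words that stay lower case after the first word.
SMALL_WORDS = {"and", "of", "the"}

# Hypothetical overrides for names that simple rules get wrong.
OVERRIDES = {"côte d'ivoire": "Côte d'Ivoire"}


def title_case_country(value: str) -> str:
    """Title-case a country name, honoring small words and overrides."""
    # Collapse runs of whitespace and normalize to lower case first.
    key = " ".join(value.split()).lower()
    if key in OVERRIDES:
        return OVERRIDES[key]
    words = []
    for index, word in enumerate(key.split(" ")):
        if index > 0 and word in SMALL_WORDS:
            words.append(word)
        else:
            # Capitalize each hyphen-separated part, e.g. "timor-leste".
            words.append("-".join(part.capitalize() for part in word.split("-")))
    return " ".join(words)


for name in ("KENYA", "BOSNIA AND HERZEGOVINA", "TIMOR-LESTE", "CÔTE D'IVOIRE"):
    print(name, "->", title_case_country(name))
```

<ul>
<li>Without the override entry, the last name would come out as “Côte D'ivoire”, which is the kind of corner case that still needs a manual check.</li>
</ul>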
|
||
<h2 id="2020-10-14">2020-10-14</h2>
|
||
<ul>
|
||
<li>I discussed the title casing of countries with Abenet and she suggested we also apply title casing to regions
|
||
<ul>
|
||
<li>I exported the list of regions from the database:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
|
||
<li>I did the same as the countries in OpenRefine for the database values and in vim for the input forms</li>
|
||
<li>After testing the replacements locally I ran them on CGSpace:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
</code></pre><ul>
|
||
<li>Then I started a full re-indexing:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 88m21.678s
user 7m59.182s
sys 2m22.713s
</code></pre><ul>
|
||
<li>I added a dozen or so more mappings to fix some country outliers on AReS
|
||
<ul>
|
||
<li>I will start a fresh harvest there once the Discovery update is done on CGSpace</li>
|
||
</ul>
|
||
</li>
|
||
<li>I also adjusted my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts to work on DSpace 6 where there is no more <code>resource_type_id</code> field
|
||
<ul>
|
||
<li>I will need to do it on a few more scripts as well, but I’ll do that after we migrate to DSpace 6 because those scripts are less important</li>
|
||
</ul>
|
||
</li>
|
||
<li>I found a new setting in DSpace 6’s <code>usage-statistics.cfg</code> about case insensitive matching of bots that defaults to false, so I enabled it in our DSpace 6 branch
|
||
<ul>
|
||
<li>I am curious to see if that resolves the strange issue I noticed yesterday, where the patterns in the spider agents file were not matching bots at all</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
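<ul>
<li>To illustrate why case sensitivity could matter for the <code>bot</code> pattern, here is a minimal sketch of regex matching with and without case folding. This is plain Python, not DSpace's Java implementation, and <code>is_spider</code> is a hypothetical helper. It would explain the generic <code>bot</code> pattern missing <code>RTB website BOT</code>, but not why the exact pattern also failed on 2020-10-12.</li>
</ul>

```python
import re


def is_spider(user_agent: str, patterns, ignore_case: bool = False) -> bool:
    """Return True if any pattern matches somewhere in the user agent."""
    flags = re.IGNORECASE if ignore_case else 0
    return any(re.search(pattern, user_agent, flags) for pattern in patterns)


agent = "RTB website BOT"

# Case-sensitive matching: lower-case "bot" does not match upper-case "BOT".
print(is_spider(agent, ["bot"]))

# Case-insensitive matching: the same pattern now matches.
print(is_spider(agent, ["bot"], ignore_case=True))

# An exact, same-case pattern matches either way.
print(is_spider(agent, ["RTB website BOT"]))
```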
|
||
<h2 id="2020-10-15">2020-10-15</h2>
|
||
<ul>
|
||
<li>Re-deploy latest code on both CGSpace and DSpace Test to get the input forms changes
|
||
<ul>
|
||
<li>Run system updates and reboot each server (linode18 and linode26)</li>
|
||
<li>I had to restart Tomcat seven times on CGSpace before all Solr stats cores came up OK</li>
|
||
</ul>
|
||
</li>
|
||
<li>Skype with Peter and Abenet about AReS and CGSpace
|
||
<ul>
|
||
<li>We agreed to lower case the AGROVOC subjects on CGSpace to harmonize them with MELSpace and WorldFish</li>
|
||
<li>We agreed to separate the AGROVOC from the other center- and CRP-specific subjects so that the search and tag clouds are cleaner and more useful</li>
|
||
<li>We added a filter for journal title</li>
|
||
</ul>
|
||
</li>
|
||
<li>I enabled anonymous access to the “Export search metadata” option on DSpace Test
|
||
<ul>
|
||
<li>If I search for author containing “Orth, Alan” or “Orth Alan” the export search metadata returns HTTP 400</li>
|
||
<li>If I search for author containing “Orth” it exports a CSV properly…</li>
|
||
</ul>
|
||
</li>
|
||
<li>I created issues on the OpenRXV repository:
|
||
<ul>
|
||
<li><a href="https://github.com/ilri/OpenRXV/issues/42">Can’t download templates that have spaces in their file name</a></li>
|
||
<li><a href="https://github.com/ilri/OpenRXV/issues/43">Can’t search for text values with a space in “Mapping Values” interface</a></li>
|
||
</ul>
|
||
</li>
|
||
<li>Atmire responded about the Listings and Reports and Content and Usage Statistics issues with DSpace 6 that I reported last week
|
||
<ul>
|
||
<li>They said that the CUA issue was a mistake and should be fixed in a minor version bump</li>
|
||
<li>They asked me to confirm if the L&R version bump from last week did not solve the issue there (which I had tested locally, but not on DSpace Test)</li>
|
||
<li>I will test them both again on DSpace Test and report back</li>
|
||
</ul>
|
||
</li>
|
||
<li>I posted a message on Yammer to inform all our users about the changes to countries, regions, and AGROVOC subjects</li>
|
||
<li>I modified all AGROVOC subjects to be lower case in PostgreSQL and then exported a list of the top 1500 to update the controlled vocabulary in our submission form:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>dspace=> BEGIN;
|
||
dspace=> UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
|
||
UPDATE 335063
|
||
dspace=> COMMIT;
|
||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.subject", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY "dc.subject" ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER;
|
||
COPY 1500
|
||
</code></pre><ul>
|
||
<li>Use my <code>agrovoc-lookup.py</code> script to validate subject terms against the AGROVOC REST API, extract matches with <code>csvgrep</code>, and then update and format the controlled vocabulary:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 > /tmp/subjects.txt
|
||
$ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
|
||
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' > dspace/config/controlled-vocabularies/dc-subject.xml
|
||
# apply formatting in XML file
|
||
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
|
||
</code></pre><ul>
|
||
<li>Then I started a full re-indexing on CGSpace:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||
|
||
real 88m21.678s
|
||
user 7m59.182s
|
||
sys 2m22.713s
|
||
</code></pre><h2 id="2020-10-18">2020-10-18</h2>
|
||
<ul>
|
||
<li>Macaroni Bros wrote to me to ask why some of their CCAFS harvesting is failing
|
||
<ul>
|
||
<li>They are scraping HTML from /browse responses like this:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<p><a href="https://cgspace.cgiar.org/browse?type=crpsubject&value=Climate+Change%2C+Agriculture+and+Food+Security&XML&rpp=5000">https://cgspace.cgiar.org/browse?type=crpsubject&value=Climate+Change%2C+Agriculture+and+Food+Security&XML&rpp=5000</a></p>
|
||
<ul>
|
||
<li>They are using the user agent “CCAFS Website Publications importer BOT” so they are getting rate limited by nginx</li>
|
||
<li>Ideally they would use the REST <code>find-by-metadata-field</code> endpoint, but it is <em>really</em> slow for large result sets (like twenty minutes!):</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -f -H "User-Agent: CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
|
||
</code></pre><ul>
|
||
<li>For now I will whitelist their user agent so that they can continue scraping /browse</li>
|
||
<li>I figured out that the mappings for AReS are stored in Elasticsearch
|
||
<ul>
|
||
<li>There is a Kibana interface running on port 5601 that can help explore the values in the index</li>
|
||
<li>I can interact with Elasticsearch by sending requests, for example to delete an item by its <code>_id</code>:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
|
||
{
  "query": {
    "match": {
      "_id": "64j_THMBiwiQ-PKfCSlI"
    }
  }
}
'
|
||
</code></pre><ul>
|
||
<li>I added a new find/replace:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
|
||
{
  "find": "ALAN1",
  "replace": "ALAN2"
}
|
||
'
|
||
</code></pre><ul>
|
||
<li>I see it in Kibana, and I can search it in Elasticsearch, but I don’t see it in OpenRXV’s mapping values dashboard</li>
|
||
<li>Now I deleted everything in the <code>openrxv-values</code> index:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
|
||
</code></pre><ul>
|
||
<li>Then I tried posting it again:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
|
||
{
  "find": "ALAN1",
  "replace": "ALAN2"
}
|
||
'
|
||
</code></pre><ul>
|
||
<li>But I still don’t see it in AReS</li>
|
||
<li>Interesting! I added a find/replace manually in AReS and now I see the one I POSTed…</li>
|
||
<li>I fixed a few bugs in the Simple and Extended PDF reports on AReS
|
||
<ul>
|
||
<li>Add missing ISI Journal and Type to Simple PDF report</li>
|
||
<li>Fix DOIs in Simple PDF report</li>
|
||
<li>Add missing “<a href="https://hdl.handle.net">https://hdl.handle.net</a>” to Handles in Extended PDF report</li>
|
||
</ul>
|
||
</li>
|
||
<li>Testing Atmire CUA and L&R based on their feedback from a few days ago
|
||
<ul>
|
||
<li>I no longer get the NullPointerException from CUA when importing metadata on the command line (!)</li>
|
||
<li>Listings and Reports now shows results for simple queries that I tested (!), though it seems that there are some new JavaScript libraries I need to allow in nginx</li>
|
||
</ul>
|
||
</li>
|
||
<li>I sent a mail to the dspace-tech mailing list asking about the error with DSpace 6’s “Export Search Metadata” function
|
||
<ul>
|
||
<li>If I search for an author like “Orth, Alan” it gives an HTTP 400, but if I search for “Orth” alone it exports a CSV</li>
|
||
<li>I replicated the same issue on demo.dspace.org</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
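<p>The request bodies used with Elasticsearch above can also be built programmatically. A minimal sketch, assuming the <code>openrxv-values</code> index and the <code>find</code>/<code>replace</code> fields shown above; the helper names are hypothetical and nothing is sent over the network. Serializing with <code>json.dumps</code> guarantees syntactically valid JSON, which is easy to get wrong when typing a body by hand (a stray trailing comma, for example).</p>

```python
import json


def delete_by_id_body(doc_id):
    """Body for POST /openrxv-values/_delete_by_query matching one _id."""
    return json.dumps({"query": {"match": {"_id": doc_id}}})


def mapping_body(find, replace):
    """Body for POST /openrxv-values/_doc adding one find/replace mapping."""
    return json.dumps({"find": find, "replace": replace})


if __name__ == "__main__":
    print(delete_by_id_body("64j_THMBiwiQ-PKfCSlI"))
    print(mapping_body("ALAN1", "ALAN2"))
```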
|
||
<h2 id="2020-10-19">2020-10-19</h2>
|
||
<ul>
|
||
<li>Last night I learned how to POST mappings to Elasticsearch for AReS:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
|
||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
|
||
</code></pre><ul>
|
||
<li>The JSON file looks like this, with one instruction on each line:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>{"index":{}}
|
||
{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
|
||
{"index":{}}
|
||
{ "find": "FISH", "replace": "Fish" }
|
||
</code></pre><ul>
|
||
<li>Adjust the report templates on AReS based on some of Peter’s feedback</li>
|
||
<li>I wrote a quick Python script to filter and convert the old AReS mappings to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html">Elasticsearch’s Bulk API</a> format:</li>
|
||
</ul>
|
||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#75715e">#!/usr/bin/env python3</span>
|
||
|
||
<span style="color:#f92672">import</span> json
|
||
<span style="color:#f92672">import</span> re
|
||
|
||
f <span style="color:#f92672">=</span> open(<span style="color:#e6db74">'/tmp/mapping.json'</span>, <span style="color:#e6db74">'r'</span>)
|
||
data <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
|
||
|
||
<span style="color:#75715e"># Iterate over old mapping file, which is in format "find": "replace", ie:</span>
|
||
<span style="color:#75715e">#</span>
|
||
<span style="color:#75715e"># "alan": "ALAN"</span>
|
||
<span style="color:#75715e">#</span>
|
||
<span style="color:#75715e"># And convert to proper dictionaries for import into Elasticsearch's Bulk API:</span>
|
||
<span style="color:#75715e">#</span>
|
||
<span style="color:#75715e"># { "find": "alan", "replace": "ALAN" }</span>
|
||
<span style="color:#75715e">#</span>
|
||
<span style="color:#66d9ef">for</span> find, replace <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
    <span style="color:#75715e"># Skip all upper and all lower case strings because they are indicative of</span>
    <span style="color:#75715e"># some AGROVOC or other mappings we no longer want to do</span>
    <span style="color:#66d9ef">if</span> find<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> find<span style="color:#f92672">.</span>islower() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>islower():
        <span style="color:#66d9ef">continue</span>

    <span style="color:#75715e"># Skip replacements with acronyms like:</span>
    <span style="color:#75715e">#</span>
    <span style="color:#75715e"># International Livestock Research Institute - ILRI</span>
    <span style="color:#75715e">#</span>
    acronym_pattern <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(<span style="color:#e6db74">r</span><span style="color:#e6db74">"[A-Z]+$"</span>)
    acronym_pattern_match <span style="color:#f92672">=</span> acronym_pattern<span style="color:#f92672">.</span>search(replace)
    <span style="color:#66d9ef">if</span> acronym_pattern_match <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
        <span style="color:#66d9ef">continue</span>

    mapping <span style="color:#f92672">=</span> { <span style="color:#e6db74">"find"</span>: find, <span style="color:#e6db74">"replace"</span>: replace }

    <span style="color:#75715e"># Print command for Elasticsearch</span>
    print(<span style="color:#e6db74">'{"index":</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">}'</span>)
    print(json<span style="color:#f92672">.</span>dumps(mapping))

f<span style="color:#f92672">.</span>close()
|
||
</code></pre></div><ul>
|
||
<li>It filters all upper and lower case strings as well as any replacements that end in an acronym like “- ILRI”, reducing the number of mappings from around 4,000 to about 900</li>
|
||
<li>I deleted the existing <code>openrxv-values</code> Elasticsearch index and then POSTed the new mappings:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ ./convert-mapping.py > /tmp/elastic-mappings.txt
|
||
$ curl -XDELETE http://localhost:9200/openrxv-values
|
||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
|
||
</code></pre><ul>
|
||
<li>Then in AReS I didn’t see the mappings in the dashboard until I added a new one manually, after which they all appeared
|
||
<ul>
|
||
<li>I started a new harvest</li>
|
||
</ul>
|
||
</li>
|
||
<li>I checked the CIMMYT DSpace repository and I see they have <a href="https://repository.cimmyt.org/rest">the REST API enabled</a>
|
||
<ul>
|
||
<li>The data doesn’t look too bad actually: they have countries in title case, AGROVOC in upper case, CRPs, etc</li>
|
||
<li>According to <a href="https://repository.cimmyt.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc">their OAI</a> they have 6,500 items in the repository</li>
|
||
<li>I would be interested to explore the possibility of harvesting them…</li>
|
||
</ul>
|
||
</li>
|
||
<li>Bosede said they were having problems with the “Access” step during item submission
|
||
<ul>
|
||
<li>I looked at the Munin graphs for PostgreSQL and both connections and locks look normal so I’m not sure what it could be</li>
|
||
<li>I restarted the PostgreSQL service just to see if that would help</li>
|
||
<li>She said she was still experiencing the issue…</li>
|
||
</ul>
|
||
</li>
|
||
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
|
||
Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
|
||
</code></pre><ul>
|
||
<li>The solution is, as always:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
|
||
UPDATE 1
|
||
</code></pre><ul>
|
||
<li>After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
|
||
|
||
Purging 2474 hits from ShortLinkTranslate in statistics
|
||
Purging 2568 hits from RI\/1\.0 in statistics
|
||
Purging 1851 hits from ILRI Livestock Website Publications importer BOT in statistics
|
||
Purging 1282 hits from curl in statistics
|
||
|
||
Total number of bot hits purged: 8174
|
||
</code></pre><ul>
|
||
<li>Add “Infographic” to types in input form</li>
|
||
<li>Looking into the spider agent issue from last week, where hits seem to be logged regardless of ANY spider agent patterns being loaded
|
||
<ul>
|
||
<li>I changed the following two options:
|
||
<ul>
|
||
<li><code>usage-statistics.logBots = false</code></li>
|
||
<li><code>usage-statistics.bots.case-insensitive = true</code></li>
|
||
</ul>
|
||
</li>
|
||
<li>Then I made several requests with a bot user agent:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
|
||
$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
|
||
</code></pre><ul>
|
||
<li>And I saw three hits in Solr with <code>isBot: true</code>!!!
|
||
<ul>
|
||
<li>I made a few more requests with user agent “fumanchu” and it logs them with <code>isBot: false</code>…</li>
|
||
<li>I made a request with user agent “Delphi 2009” which is in the ilri pattern file, and it was logged with <code>isBot: true</code></li>
|
||
<li>I made a few more requests and confirmed that if a pattern is in the list it gets logged with <code>isBot: true</code> despite the fact that <code>usage-statistics.logBots</code> is false…</li>
|
||
<li>So WTF this means that it <em>knows</em> they are from a bot, but it logs them anyways</li>
|
||
<li>Is this an issue with Atmire’s modules?</li>
|
||
<li>I sent them feedback on the ticket</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
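<p>For reference, the difference between case-sensitive and case-insensitive matching of user agents against regex patterns can be illustrated with a short sketch. This is not DSpace’s or Atmire’s implementation, only an assumption about how such a setting would behave; the function name <code>is_bot</code> is hypothetical and the patterns are the ones mentioned in these notes.</p>

```python
import re


def is_bot(user_agent, patterns, case_insensitive=False):
    """Return True if any regex pattern matches somewhere in the user agent."""
    flags = re.IGNORECASE if case_insensitive else 0
    return any(re.search(pattern, user_agent, flags) for pattern in patterns)


if __name__ == "__main__":
    # Patterns taken from the notes above; a real agents file has many more
    patterns = [r"RTB website BOT", r"Delphi 2009", r"RI\/1\.0"]
    for agent in ["RTB website BOT", "rtb website bot", "fumanchu"]:
        print(agent, is_bot(agent, patterns), is_bot(agent, patterns, True))
```

<p>With case-sensitive matching a client that sends the same agent string in a different case slips past the pattern list, which is the kind of miss the case-insensitive option is meant to prevent.</p>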
|
||
<h2 id="2020-10-21">2020-10-21</h2>
|
||
<ul>
|
||
<li>Peter needs to do some reporting on gender across the entirety of CGSpace so he asked me to tag a bunch of items with the AGROVOC “gender” subject (in CGIAR Gender Platform community, all ILRI items with subject “gender” or “women”, all CCAFS with “gender and social inclusion” etc)
|
||
<ul>
|
||
<li>First I exported the Gender Platform community and tagged all the items there with “gender” in OpenRefine</li>
|
||
<li>Then I exported all of CGSpace and extracted just the ILRI and other center-specific tags with <code>csvcut</code>:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
|
||
$ dspace metadata-export -f /tmp/cgspace.csv
|
||
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
|
||
</code></pre><ul>
|
||
<li>Then I went through all center subjects looking for “WOMEN” or “GENDER” and checking if they were missing the associated AGROVOC subject
|
||
<ul>
|
||
<li>To reduce the size of the CSV file I removed all center subject columns after filtering them, and I flagged all rows that I changed so I could upload a CSV with only the items that were modified</li>
|
||
<li>In total it was about 1,100 items that I tagged across the Gender Platform community and elsewhere</li>
|
||
<li>Also, I ran the CSVs through my <code>csv-metadata-quality</code> checker to do basic sanity checks, which ended up removing a few dozen duplicated subjects</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
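<p>The tagging rule described above (add the AGROVOC subject when a center subject matches and it is missing) could be sketched like this. I did the actual work in OpenRefine, so this is only an illustration: the function name and its arguments are hypothetical, and it assumes multiple values in a cell are separated by <code>||</code> as in DSpace’s metadata CSV exports.</p>

```python
def add_agrovoc_subject(row, center_columns, triggers, subject,
                        subject_column="dc.subject[en_US]"):
    """Add `subject` to the row if a center subject matches and it is missing.

    Returns True when the row was modified so it can be flagged for upload.
    """
    center_values = []
    for column in center_columns:
        cell = row.get(column) or ""
        center_values += [v.strip().upper() for v in cell.split("||") if v.strip()]

    # Substring match, so "GENDER AND SOCIAL INCLUSION" also triggers on "GENDER"
    if not any(trigger in value for value in center_values for trigger in triggers):
        return False

    subjects = [v for v in (row.get(subject_column) or "").split("||") if v]
    if subject in subjects:
        return False

    row[subject_column] = "||".join(subjects + [subject])
    return True


if __name__ == "__main__":
    row = {
        "id": "1",
        "dc.subject[en_US]": "livestock",
        "cg.subject.ilri[en_US]": "WOMEN||DAIRYING",
    }
    changed = add_agrovoc_subject(
        row, ["cg.subject.ilri[en_US]"], ["WOMEN", "GENDER"], "gender"
    )
    print(changed, row["dc.subject[en_US]"])
```

<p>Returning a flag for modified rows makes it easy to write out only the changed items, which keeps the CSV for re-import small.</p>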
|
||
<h2 id="2020-10-22">2020-10-22</h2>
|
||
<ul>
|
||
<li>Bosede was getting this error on CGSpace yesterday:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
|
||
</code></pre><ul>
|
||
<li>Collection 1072 appears to be <a href="https://cgspace.cgiar.org/handle/10568/69542">IITA Miscellaneous</a>
|
||
<ul>
|
||
<li>The submit step is defined, but has no users or groups</li>
|
||
<li>I added the IITA submitters there and told Bosede to try again</li>
|
||
</ul>
|
||
</li>
|
||
<li>Add two new blocks to list the top communities and collections on AReS</li>
|
||
<li>I want to extract all CRPs and affiliations from AReS to do some text processing and create some mappings…
|
||
<ul>
|
||
<li>First extract 10,000 affiliations from Elasticsearch by only including the <code>affiliation</code> source:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
|
||
</code></pre><ul>
|
||
<li>Then I decided to try a different approach and I adjusted my <code>convert-mapping.py</code> script to re-consider some replacement patterns with acronyms from the original AReS <code>mapping.json</code> file to hopefully address some MEL to CGSpace mappings
|
||
<ul>
|
||
<li>For example, to change this:
|
||
<ul>
|
||
<li>find: International Livestock Research Institute</li>
|
||
<li>replace: International Livestock Research Institute - ILRI</li>
|
||
</ul>
|
||
</li>
|
||
<li>… into this:
|
||
<ul>
|
||
<li>find: International Livestock Research Institute - ILRI</li>
|
||
<li>replace: International Livestock Research Institute</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
<li>I re-uploaded the mappings to Elasticsearch like I did yesterday and restarted the harvesting</li>
|
||
</ul>
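<p>The inversion described above can be sketched as follows. This is not the actual change I made to <code>convert-mapping.py</code>; the function name and the “ - ACRONYM” suffix pattern are assumptions for illustration.</p>

```python
import re

# A trailing " - ACRONYM" such as " - ILRI" (assumed pattern for illustration)
ACRONYM_SUFFIX = re.compile(r"\s+-\s+[A-Z]+$")


def invert_acronym_mapping(find, replace):
    """Turn 'Name' -> 'Name - ACRONYM' into 'Name - ACRONYM' -> 'Name'.

    Returns None when the replacement is not simply `find` plus an
    acronym suffix, so unrelated mappings are left alone.
    """
    if replace == find or ACRONYM_SUFFIX.sub("", replace) != find:
        return None
    return {"find": replace, "replace": find}


if __name__ == "__main__":
    print(
        invert_acronym_mapping(
            "International Livestock Research Institute",
            "International Livestock Research Institute - ILRI",
        )
    )
    print(invert_acronym_mapping("FISH", "Fish"))
```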
|
||
<h2 id="2020-10-24">2020-10-24</h2>
|
||
<ul>
|
||
<li>Atmire sent a small version bump to CUA (6.x-4.1.10-ilri-RC5) to fix the logging of bot requests when <code>usage-statistics.logBots</code> is false
|
||
<ul>
|
||
<li>I tested it by making several requests to DSpace Test with the <code>RTB website BOT</code> and <code>Delphi 2009</code> user agents and can verify that they are no longer logged</li>
|
||
</ul>
|
||
</li>
|
||
<li>I spent a few hours working on mappings on AReS
|
||
<ul>
|
||
<li>I decided to do a full re-harvest on AReS with <em>no mappings</em> so I could extract the CRPs and affiliations to see how much work they needed</li>
|
||
<li>I worked on my Python script to process some cleanups of the values to create find/replace mappings for common scenarios:
|
||
<ul>
|
||
<li>Removing acronyms from the end of strings</li>
|
||
<li>Removing “CRP on ” from strings</li>
|
||
</ul>
|
||
</li>
|
||
<li>The problem is that the mappings are applied to all fields, and we want to keep “CGIAR Research Program on …” in the authors, but not in the CRPs field</li>
|
||
<li>Really the best solution is to have each repository use the same controlled vocabularies</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
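<p>The two cleanup rules listed above can be sketched as a function that proposes a find/replace mapping for a harvested value. This is a simplified illustration, not my actual script: the name <code>suggest_mapping</code> and both regular expressions are assumptions.</p>

```python
import re

# Assumed patterns: a trailing " - ACRONYM" and a leading "CRP on "
TRAILING_ACRONYM = re.compile(r"\s+-\s+[A-Z]+$")
CRP_PREFIX = re.compile(r"^CRP on\s+")


def suggest_mapping(value):
    """Return a find/replace dict if a cleanup rule changes the value, else None."""
    cleaned = TRAILING_ACRONYM.sub("", value)
    cleaned = CRP_PREFIX.sub("", cleaned)
    if cleaned == value:
        return None
    return {"find": value, "replace": cleaned}


if __name__ == "__main__":
    for value in ["CRP on Dryland Systems - DS", "Fish"]:
        print(value, "->", suggest_mapping(value))
```

<p>Because a mapping produced this way carries no field name, it would apply wherever the <code>find</code> string occurs, which is exactly the limitation noted above for authors versus CRPs.</p>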
|
||
<h2 id="2020-10-25">2020-10-25</h2>
|
||
<ul>
|
||
<li>I re-installed DSpace Test with a fresh snapshot of CGSpace to test the DSpace 6 upgrade (the last time was in 2020-05, and we’ve fixed a lot of issues since then):</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
|
||
$ git checkout origin/6_x-dev-atmire-modules
|
||
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
|
||
$ sudo su - postgres
|
||
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
|
||
$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');"
|
||
$ exit
|
||
$ sudo systemctl stop tomcat7
|
||
$ cd dspace/target/dspace-installer
|
||
$ rm -rf /blah/dspacetest/config/spring
|
||
$ ant update
|
||
$ dspace database migrate
|
||
(10 minutes)
|
||
$ sudo systemctl start tomcat7
|
||
(discovery indexing starts)
|
||
</code></pre><ul>
|
||
<li>Then I started processing the Solr stats one core and 1 million records at a time:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
|
||
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
|
||
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
|
||
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
|
||
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
|
||
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
|
||
</code></pre><ul>
|
||
<li>After the fifth or so run I got this error:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
|
||
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
|
||
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
|
||
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
|
||
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
|
||
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
|
||
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
|
||
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
|
||
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
|
||
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
|
||
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
|
||
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
|
||
at java.lang.reflect.Method.invoke(Method.java:498)
|
||
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
|
||
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
|
||
</code></pre><ul>
|
||
<li>So basically, as I saw at this same step in 2020-05, there are some documents that have IDs that have <em>not</em> been converted to UUID, and have <em>not</em> been labeled as “unmigrated” either…
|
||
<ul>
|
||
<li>I see there are about 217,000 of them, 99% of which are of <code>type: 5</code> which is “site”</li>
|
||
<li>I purged them:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;"
|
||
</code></pre><ul>
|
||
<li>Then I restarted the <code>solr-upgrade-statistics-6x</code> process, which apparently had no records left to process</li>
|
||
<li>I started processing the statistics-2019 core…
|
||
<ul>
|
||
<li>I managed to process 7.5 million records in 7 hours without any errors!</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
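<p>To make the purge query above easier to read: Lucene regular expression queries match the whole term, so <code>id:/.{36}/</code> selects ids that are exactly 36 characters long (the length of a UUID) and <code>id:/.+-unmigrated/</code> selects ids ending in <code>-unmigrated</code>. The following sketch mimics that classification in Python; the function name and labels are hypothetical.</p>

```python
import re


def classify_id(doc_id):
    """Classify a statistics document id the way the purge query does."""
    # re.fullmatch mirrors Lucene's whole-term regex matching
    if re.fullmatch(r".+-unmigrated", doc_id):
        return "unmigrated"
    if re.fullmatch(r".{36}", doc_id):
        return "uuid-length"
    # Neither converted nor labeled: these are the documents that get deleted
    return "purge"


if __name__ == "__main__":
    for doc_id in ["dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21", "10-unmigrated", "10"]:
        print(doc_id, classify_id(doc_id))
```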
|
||
<h2 id="2020-10-26">2020-10-26</h2>
|
||
<ul>
|
||
<li>The statistics processing on the statistics-2018 core errored after 1.8 million records:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>Exception: Java heap space
|
||
java.lang.OutOfMemoryError: Java heap space
|
||
</code></pre><ul>
|
||
<li>I had the same problem when I processed the statistics-2018 core in 2020-07 and 2020-08
|
||
<ul>
|
||
<li>I will try to purge some unmigrated records (around 460,000), most of which are of <code>type: 5</code> (site) <del>so not relevant to our views and downloads anyways</del>:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;"
|
||
</code></pre><ul>
|
||
<li>I restarted the process and it crashed again a few minutes later
|
||
<ul>
|
||
<li>I increased the memory to 4096m and tried again</li>
|
||
<li>It eventually completed, after which I purged the remaining 350,000 unmigrated records (99% of which were <code>type: 5</code>):</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;"
|
||
</code></pre><ul>
|
||
<li>Then I started processing the statistics-2017 core…
|
||
<ul>
|
||
<li>The processing finished with no errors and afterwards I purged 800,000 unmigrated records (all with <code>type: 5</code>):</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;"
|
||
</code></pre><ul>
|
||
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
|
||
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
|
||
<li>Add new ORCID identifier for <a href="https://orcid.org/0000-0003-3871-6277">Perle LATRE DE LATE</a> to controlled vocabulary</li>
|
||
<li>Use <code>move-collections.sh</code> to move a few AgriFood Tools collections on CGSpace into a new <a href="https://hdl.handle.net/10568/109982">sub community</a></li>
|
||
</ul>
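<p>Since the same purge is repeated for each yearly core, the request could be generated rather than typed. A minimal sketch, assuming the Solr base URL and query used in the commands above; the helper name is hypothetical and it only builds the URL and XML body without sending anything.</p>

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

PURGE_QUERY = "(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)"


def build_purge_request(core, base_url="http://localhost:8083/solr", query=PURGE_QUERY):
    """Return (url, xml_body) for a delete-by-query against one statistics core."""
    url = f"{base_url}/{core}/update?" + urlencode({"softCommit": "true"})
    # Escape &, < and > so the query is safe inside the XML body
    body = f"<delete><query>{escape(query)}</query></delete>"
    return url, body


if __name__ == "__main__":
    for core in ["statistics-2017", "statistics-2018"]:
        print(*build_purge_request(core), sep="\n")
```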
|
||
<h2 id="2020-10-27">2020-10-27</h2>
|
||
<ul>
|
||
<li>I purged 849,408 unmigrated records from the statistics-2016 core after it finished processing…</li>
|
||
<li>I purged 285,000 unmigrated records from the statistics-2015 core after it finished processing…</li>
|
||
<li>I purged 196,000 unmigrated records from the statistics-2014 core after it finished processing…</li>
|
||
<li>I finally finished processing all the statistics cores with the <code>solr-upgrade-statistics-6x</code> utility on DSpace Test
|
||
<ul>
|
||
<li>I started the Atmire stats processing:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
|
||
</code></pre><ul>
|
||
<li>Peter asked me to add the new preferred AGROVOC subject “covid-19” to all items to which we had previously added “coronavirus disease”, and to make sure all items with ILRI subject “ZOONOTIC DISEASES” have the AGROVOC subject “zoonoses”
|
||
<ul>
|
||
<li>I exported all the records on CGSpace from the CLI and extracted the columns I needed to process them in OpenRefine:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ dspace metadata-export -f /tmp/cgspace.csv
|
||
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
|
||
</code></pre><ul>
|
||
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
|
||
<li>Skype with Peter and Abenet about CGSpace Explorer (AReS)
|
||
<ul>
|
||
<li>They want to do a big push at ILRI and among our partners to use it in mid-November (around the 16th), so we need to clean up the metadata and try to fix the views/downloads issue by then</li>
|
||
<li>I filed <a href="https://github.com/ilri/OpenRXV/issues/45">an issue</a> on OpenRXV for the views/downloads</li>
|
||
<li>We also talked about harvesting CIMMYT’s repository into AReS, perhaps with only a subset of their data, though they seem to have some issues with their data:
|
||
<ul>
|
||
<li>dc.contributor.author and dcterms.creator</li>
|
||
<li>dc.title and dcterms.title</li>
|
||
<li>dc.region.focus</li>
|
||
<li>dc.coverage.countryfocus</li>
|
||
<li>dc.rights.accesslevel (access status)</li>
|
||
<li>dc.source.journal (source)</li>
|
||
<li>dcterms.type and dc.type</li>
|
||
<li>dc.subject.agrovoc</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
<li>I did some work on my previous <code>create-mappings.py</code> script to process journal titles and sponsors/investors as well as CRPs and affiliations
|
||
<ul>
|
||
<li>I converted it to use the Elasticsearch scroll API directly rather than consuming a JSON file</li>
|
||
<li>The result is about 1200 mappings, mostly to remove acronyms at the end of metadata values</li>
|
||
<li>I added a few custom mappings using <code>convert-mapping.py</code> and then uploaded them to AReS:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
|
||
$ ./convert-mapping.py >> /tmp/elasticsearch-mappings.txt
|
||
$ curl -XDELETE http://localhost:9200/openrxv-values
|
||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
|
||
</code></pre><ul>
|
||
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
|
||
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ docker-compose up --build -d angular_nginx
|
||
</code></pre><h2 id="2020-10-28">2020-10-28</h2>
|
||
<ul>
|
||
<li>Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ docker-compose up --build -d --force-recreate angular_nginx
|
||
</code></pre><ul>
|
||
<li>Also, I realized that the mysterious issue with countries getting changed to inconsistent lower case like “Burkina faso” is due to the country formatter (see: <code>backend/src/harvester/consumers/fetch.consumer.ts</code>)
|
||
<ul>
|
||
<li>I don’t understand TypeScript syntax, so for now I will just disable that formatter in each repository configuration. I expect the results will be better, as we’re all using title case like “Kenya” and “Burkina Faso” now anyway</li>
|
||
</ul>
|
||
</li>
|
||
<li>Also, I fixed a few mappings with WorldFish data</li>
|
||
<li>Peter really wants us to move forward with the alignment of our regions to UN M.49, and the CKM web team hasn’t responded to any of the mails we’ve sent recently so I will just do it
|
||
<ul>
|
||
<li>These are the changes that will happen in the input forms:
|
||
<ul>
|
||
<li>East Africa → Eastern Africa</li>
|
||
<li>West Africa → Western Africa</li>
|
||
<li>Southeast Asia → South-eastern Asia</li>
|
||
<li>South Asia → Southern Asia</li>
|
||
<li>Africa South of Sahara → Sub-Saharan Africa</li>
|
||
<li>North Africa → Northern Africa</li>
|
||
<li>West Asia → Western Asia</li>
|
||
</ul>
|
||
</li>
|
||
<li>Some regions we use are not present in UN M.49, for example Sahel, ACP, Middle East, and West and Central Africa. I will advocate for closer alignment later</li>
|
||
<li>I ran my <code>fix-metadata-values.py</code> script to update the values in the database:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ cat 2020-10-28-update-regions.csv
|
||
cg.coverage.region,correct
|
||
East Africa,Eastern Africa
|
||
West Africa,Western Africa
|
||
Southeast Asia,South-eastern Asia
|
||
South Asia,Southern Asia
|
||
Africa South Of Sahara,Sub-Saharan Africa
|
||
North Africa,Northern Africa
|
||
West Asia,Western Asia
|
||
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
|
||
</code></pre><ul>
|
||
<li>Then I started a full Discovery re-indexing:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||
|
||
real 92m14.294s
|
||
user 7m59.840s
|
||
sys 2m22.327s
|
||
</code></pre><ul>
|
||
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>…
|
||
<ul>
|
||
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
|
||
<li>I should have used this: <code>id:/.+-unmigrated/</code></li>
|
||
<li>Or perhaps this (with a check first!): <code>*:* NOT id:/.{36}/</code></li>
|
||
<li>We need to make sure to explicitly purge the unmigrated records, then purge any that are not matching the UUID pattern (after inspecting manually!)</li>
|
||
<li>There are still 3.7 million records in our ten years of Solr statistics that are unmigrated (I only noticed because the DSpace Statistics API indexer kept failing)</li>
|
||
<li>I don’t think this is serious enough to re-start the simulation of the DSpace 6 migration, but I definitely want to make sure I use the correct query when I do CGSpace</li>
|
||
</ul>
|
||
</li>
|
||
<li>The AReS indexing finished after I removed the country formatting from all the repository configurations and now I see values like “SA”, “CA”, etc…
|
||
<ul>
|
||
<li>So we really do need the formatter to fix MELSpace countries, and I will re-enable the country formatting for their repository</li>
|
||
</ul>
|
||
</li>
|
||
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
|
||
COPY 6357
|
||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
|
||
COPY 730
|
||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
|
||
COPY 71748
|
||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.publisher", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
|
||
COPY 3882
|
||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.source", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
|
||
COPY 3684
|
||
dspace=> \COPY (SELECT DISTINCT text_value as "dc.relation.ispartofseries", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
|
||
COPY 5598
|
||
</code></pre><ul>
|
||
<li>I noticed there are still some mappings for acronyms and other fixes that haven’t been applied, so I ran my <code>create-mappings.py</code> script against Elasticsearch again
|
||
<ul>
|
||
<li>Now I’m comparing yesterday’s mappings with today’s and I don’t see any duplicates…</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ grep -c '"find"' /tmp/elasticsearch-mappings*
|
||
/tmp/elasticsearch-mappings2.txt:350
|
||
/tmp/elasticsearch-mappings.txt:1228
|
||
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
|
||
1578
|
||
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | uniq | wc -l
|
||
1578
|
||
</code></pre><ul>
|
||
<li>I have no idea why they wouldn’t have been caught yesterday when I originally ran the script on a clean AReS with no mappings…
|
||
<ul>
|
||
<li>In any case, I combined the mappings and then uploaded them to AReS:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
|
||
$ curl -XDELETE http://localhost:9200/openrxv-values
|
||
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
|
||
</code></pre><ul>
|
||
<li>The latest indexing (second for today!) finally finished on AReS and the countries and affiliations/CRPs/journals all look MUCH better
|
||
<ul>
|
||
<li>There are still a few acronyms present, some of which are in the value mappings and some of which aren’t</li>
|
||
</ul>
|
||
</li>
|
||
<li>Lower case some straggling AGROVOC subjects on CGSpace:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>dspace=# BEGIN;
|
||
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
|
||
UPDATE 123
|
||
dspace=# COMMIT;
|
||
</code></pre><ul>
|
||
<li>Move some top-level communities to the CGIAR System community for Peter:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
|
||
$ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
|
||
</code></pre><h2 id="2020-10-30">2020-10-30</h2>
|
||
<ul>
|
||
<li>The <code>AtomicStatisticsUpdateCLI</code> process finished on the current DSpace Test statistics core after about 32 hours
|
||
<ul>
|
||
<li>I started it on the statistics-2019 core</li>
|
||
</ul>
|
||
</li>
|
||
<li>Atmire responded about the duplicate values in Solr that I had asked about a few days ago
|
||
<ul>
|
||
<li>They said it could be due to the schema and asked if I see it only on old records or even on new ones created in the new CUA with DSpace 6</li>
|
||
<li>I did a test and found that I got duplicate data after browsing for a minute on DSpace Test (version 6) and sent them a screenshot</li>
|
||
</ul>
|
||
</li>
|
||
<li>Looking over Peter’s corrections to journal titles (dc.source) and publishers (dc.publisher)
|
||
<ul>
|
||
<li>I had to check the corrections for strange Unicode errors and replacements with “|” and “;” in OpenRefine using this GREL:</li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>or(
|
||
isNotNull(value.match(/.*\uFFFD.*/)),
|
||
isNotNull(value.match(/.*\u00A0.*/)),
|
||
isNotNull(value.match(/.*\u200A.*/)),
|
||
isNotNull(value.match(/.*\u2019.*/)),
|
||
isNotNull(value.match(/.*\u00b4.*/)),
|
||
isNotNull(value.match(/.*\u007e.*/))
|
||
).toString()
|
||
</code></pre><ul>
|
||
<li>Then I did a test to apply the corrections and deletions on my local DSpace:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
|
||
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
|
||
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
|
||
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
|
||
</code></pre><ul>
|
||
<li>I will wait to apply them on CGSpace when I have all the other corrections from Peter processed</li>
|
||
</ul>
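<ul>
<li>As a rough cross-check outside OpenRefine, the GREL expression above could be approximated in Python. This is only a minimal sketch: the names <code>SUSPICIOUS</code>, <code>is_suspicious</code>, and <code>flag_values</code> are hypothetical, and it covers only the six code points listed in the GREL:</li>
</ul>

```python
import re

# The same six code points tested by the GREL expression:
# U+FFFD replacement character, U+00A0 no-break space, U+200A hair space,
# U+2019 right single quotation mark, U+00B4 acute accent, U+007E tilde.
SUSPICIOUS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4\u007E]")


def is_suspicious(value: str) -> bool:
    """Return True if the value contains any of the flagged characters."""
    return SUSPICIOUS.search(value) is not None


def flag_values(values):
    """Return only the values that need a manual look (hypothetical helper)."""
    return [v for v in values if is_suspicious(v)]


if __name__ == "__main__":
    # Made-up sample values, for illustration only
    samples = ["Journal of Dairy Science", "Journal\u00a0of Dairy Science"]
    for value in flag_values(samples):
        print(repr(value))
```

<ul>
<li>Printing with <code>repr()</code> makes invisible characters such as the no-break space show up as escape sequences</li>
</ul>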
|
||
<h2 id="2020-10-31">2020-10-31</h2>
|
||
<ul>
|
||
<li>I had the idea to use the country formatter for CGSpace on the AReS Explorer because we have the <code>cg.coverage.iso3166-alpha2</code> field…
|
||
<ul>
|
||
<li>This will be better than using the raw text values because AReS will match directly from the ISO 3166-1 list when using the country formatter</li>
|
||
</ul>
|
||
</li>
|
||
<li>Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
|
||
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
|
||
</code></pre><ul>
|
||
<li>I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:</li>
|
||
</ul>
|
||
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||
</code></pre><!-- raw HTML omitted -->
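<ul>
<li>The country formatter idea from earlier in this entry amounts to a lookup from ISO 3166-1 alpha-2 codes to country names. A minimal sketch, assuming a tiny hand-written table; this is not the OpenRXV implementation, and <code>ISO_3166_1_ALPHA2</code> and <code>format_country</code> are hypothetical names:</li>
</ul>

```python
# Hypothetical, deliberately tiny subset of ISO 3166-1 alpha-2 codes;
# a real formatter would load the complete list.
ISO_3166_1_ALPHA2 = {
    "KE": "Kenya",
    "BF": "Burkina Faso",
    "ET": "Ethiopia",
}


def format_country(code: str, table=ISO_3166_1_ALPHA2) -> str:
    """Map an alpha-2 code to its country name.

    Unknown codes are returned unchanged so that nothing is silently lost.
    """
    key = code.strip().upper()
    return table.get(key, code)


if __name__ == "__main__":
    for raw in ["KE", "bf", "XX"]:
        print(raw, "->", format_country(raw))
```

<ul>
<li>Matching on the code rather than the text value sidesteps casing problems like “Burkina faso”, because the display name always comes from the table</li>
</ul>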
</article>
</div> <!-- /.blog-main -->
|
||
|
||
<aside class="col-sm-3 ml-auto blog-sidebar">
|
||
|
||
|
||
|
||
<section class="sidebar-module">
|
||
<h4>Recent Posts</h4>
|
||
<ol class="list-unstyled">
|
||
|
||
|
||
<li><a href="/cgspace-notes/2021-09/">September, 2021</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
|
||
|
||
</ol>
|
||
</section>
|
||
|
||
|
||
|
||
|
||
<section class="sidebar-module">
|
||
<h4>Links</h4>
|
||
<ol class="list-unstyled">
|
||
|
||
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
||
|
||
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
||
|
||
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
||
|
||
</ol>
|
||
</section>
|
||
|
||
</aside>
|
||
|
||
|
||
</div> <!-- /.row -->
|
||
</div> <!-- /.container -->
<footer class="blog-footer">
|
||
<p dir="auto">
|
||
|
||
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
||
|
||
</p>
|
||
<p>
|
||
<a href="#">Back to top</a>
|
||
</p>
|
||
</footer>
|
||
|
||
|
||
</body>
|
||
|
||
</html>
|