cgspace-notes/docs/2018-09/index.html

643 lines
32 KiB
HTML
Raw Normal View History

2018-09-02 11:03:43 +02:00
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2018" />
<meta property="og:description" content="2018-09-02
New PostgreSQL JDBC driver version 42.2.5
2018-09-02 16:37:18 +02:00
I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
2018-09-02 11:03:43 +02:00
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
" />
<meta property="og:type" content="article" />
2018-09-03 15:47:24 +02:00
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-09/" /><meta property="article:published_time" content="2018-09-02T09:55:54&#43;03:00"/>
2018-09-21 23:49:53 +02:00
<meta property="article:modified_time" content="2018-09-20T13:11:48&#43;03:00"/>
2018-09-02 11:03:43 +02:00
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2018"/>
<meta name="twitter:description" content="2018-09-02
New PostgreSQL JDBC driver version 42.2.5
2018-09-02 16:37:18 +02:00
I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
2018-09-02 11:03:43 +02:00
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
2018-09-03 15:47:24 +02:00
<meta name="generator" content="Hugo 0.48" />
2018-09-02 11:03:43 +02:00
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-09/",
2018-09-21 23:49:53 +02:00
"wordCount": "2816",
2018-09-02 11:03:43 +02:00
"datePublished": "2018-09-02T09:55:54&#43;03:00",
2018-09-21 23:49:53 +02:00
"dateModified": "2018-09-20T13:11:48&#43;03:00",
2018-09-02 11:03:43 +02:00
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-09/">
<title>September, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-Upm5uY/SXdvbjuIGH6fBjF5vOYUr9DguqBskM&#43;EQpLBzO9U&#43;9fMVmWEt&#43;TTlGrWQ" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-09/">September, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-09-02T09:55:54&#43;03:00">Sun Sep 02, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2018-09-02">2018-09-02</h2>
<ul>
<li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li>
2018-09-02 16:37:18 +02:00
<li>I&rsquo;ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li>
2018-09-02 11:03:43 +02:00
<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
<p></p>
<pre><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5240)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:754)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:730)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1838)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
</code></pre>
<ul>
<li>Full log here: <a href="https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2">https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2</a></li>
2018-09-02 11:07:04 +02:00
<li>XMLUI fails to load, but the REST, SOLR, JSPUI, etc work</li>
2018-09-02 11:03:43 +02:00
<li>The old <code>5_x-prod-dspace-5.5</code> branch does work in Ubuntu 18.04 with Tomcat 8.5.30-1ubuntu1.4, however!</li>
<li>And the <code>5_x-prod</code> DSpace 5.8 branch does work in Tomcat 8.5.x on my Arch Linux laptop&hellip;</li>
<li>I&rsquo;m not sure where the issue is then!</li>
</ul>
2018-09-03 15:47:24 +02:00
<h2 id="2018-09-03">2018-09-03</h2>
<ul>
<li>Abenet says she&rsquo;s getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week</li>
<li>They are from the CUA module</li>
<li>Two of them have &ldquo;no data&rdquo; and one has a &ldquo;null&rdquo; title</li>
<li>The last one is a report of the top downloaded items, and includes a graph</li>
<li>She will try to click the &ldquo;Unsubscribe&rdquo; link in the first two to see if it works, otherwise we should contact Atmire</li>
<li>The only one she remembers subscribing to is the top downloads one</li>
</ul>
2018-09-04 12:25:13 +02:00
<h2 id="2018-09-04">2018-09-04</h2>
<ul>
<li>I&rsquo;m looking over the latest round of IITA records from Sisay: <a href="https://dspacetest.cgiar.org/handle/10568/104230">Mercy1806_August_29</a>
<ul>
<li>All fields are split with multiple columns like <code>cg.authorship.types</code> and <code>cg.authorship.types[]</code></li>
<li>This makes it super annoying to do the checks and cleanup, so I will merge them (also time consuming)</li>
2018-09-04 16:08:34 +02:00
<li>Five items had <code>dc.date.issued</code> values like <code>2013-5</code> so I corrected them to be <code>2013-05</code></li>
2018-09-04 12:25:13 +02:00
<li>Several metadata fields had values with newlines in them (even in some titles!), which I fixed by trimming the consecutive whitespaces in Open Refine</li>
2018-09-04 16:31:20 +02:00
<li>Many (91!) items from before 2011 are indicated as having a CRP, but CRPs didn&rsquo;t exist then so this is impossible</li>
2018-09-04 16:08:34 +02:00
<li>I got all items that were from 2011 and onwards using a custom facet with this GREL on the <code>dc.date.issued</code> column: <code>isNotNull(value.match(/201[1-8].*/))</code> and then blanking their CRPs</li>
<li>Some affiliations with only one separator (|) for multiple values</li>
<li>I replaced smart quotes like <code></code> with plain ones</li>
2018-09-04 16:31:20 +02:00
<li>Some inconsistencies in <code>cg.subject.iita</code> like COWPEA and COWPEAS, and YAM and YAMS, etc, as well as some spelling mistakes like IMPACT ASSESSMENTN</li>
2018-09-04 16:08:34 +02:00
<li>Some values in the <code>dc.identifier.isbn</code> are actually ISSNs so I moved them to the <code>dc.identifier.issn</code> column</li>
<li>I found one invalid ISSN using a custom text facet with the regex from the <a href="https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format">ISSN page on Wikipedia</a>: <code>isNotBlank(value.match(/^\d{4}-\d{3}[\dxX]$/))</code></li>
<li>One invalid value for <code>dc.type</code></li>
2018-09-04 12:25:13 +02:00
</ul></li>
2018-09-04 16:33:30 +02:00
<li>Abenet says she hasn&rsquo;t received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don&rsquo;t need create an issue on Atmire&rsquo;s bug tracker anymore</li>
2018-09-04 12:25:13 +02:00
</ul>
2018-09-10 10:59:08 +02:00
<h2 id="2018-09-10">2018-09-10</h2>
<ul>
<li>Playing with <a href="https://github.com/eykhagen/strest">strest</a> to test the DSpace REST API programatically</li>
<li>For example, given this <code>test.yaml</code>:</li>
</ul>
<pre><code>version: 1
requests:
test:
method: GET
url: https://dspacetest.cgiar.org/rest/test
validate:
raw: &quot;REST api is running.&quot;
login:
url: https://dspacetest.cgiar.org/rest/login
method: POST
data:
json: {&quot;email&quot;:&quot;test@dspace&quot;,&quot;password&quot;:&quot;thepass&quot;}
status:
url: https://dspacetest.cgiar.org/rest/status
method: GET
headers:
rest-dspace-token: Value(login)
logout:
url: https://dspacetest.cgiar.org/rest/logout
method: POST
headers:
rest-dspace-token: Value(login)
# vim: set sw=2 ts=2:
</code></pre>
<ul>
<li>Works pretty well, though the DSpace <code>logout</code> always returns an HTTP 415 error for some reason</li>
<li>We could eventually use this to test sanity of the API for creating collections etc</li>
<li>A user is getting an error in her workflow:</li>
</ul>
<pre><code>2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step:
org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
</code></pre>
<ul>
<li>Seems to be during submit step, because it&rsquo;s workflow step 1&hellip;?</li>
2018-09-10 17:19:00 +02:00
<li>Move some top-level CRP communities to be below the new <a href="https://cgspace.cgiar.org/handle/10568/97114">CGIAR Research Programs and Platforms</a> community:</li>
</ul>
<pre><code>$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
$ dspace community-filiator --set -p 10568/97114 -c 10568/35409
$ dspace community-filiator --set -p 10568/97114 -c 10568/3112
</code></pre>
<ul>
<li>Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:</li>
</ul>
<pre><code>update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
UPDATE 1
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
UPDATE 23
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
UPDATE 1
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
DELETE 17
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
UPDATE 15
</code></pre>
<ul>
<li>Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)</li>
<li>The current <code>cg.identifier.status</code> field will become &ldquo;Access rights&rdquo; and <code>dc.rights</code> will become &ldquo;Usage rights&rdquo;</li>
<li>I have some work in progress on the <a href="https://github.com/alanorth/DSpace/tree/5_x-rights"><code>5_x-rights</code> branch</a></li>
2018-09-10 23:37:38 +02:00
<li>Linode said that CGSpace (linode18) had a high CPU load earlier today</li>
<li>When I looked, I see it&rsquo;s the same Russian IP that I noticed last month:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1459 157.55.39.202
1579 95.108.181.88
1615 157.55.39.147
1714 66.249.64.91
1924 50.116.102.77
3696 157.55.39.106
3763 157.55.39.148
4470 70.32.83.92
4724 35.237.175.180
14132 5.9.6.51
</code></pre>
<ul>
<li>And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):</li>
</ul>
<pre><code># grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10
14133
</code></pre>
<ul>
<li>The user agent is still the same:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre>
<ul>
<li>I added <code>.*crawl.*</code> to the Tomcat Session Crawler Manager Valve, so I&rsquo;m not sure why the bot is creating so many sessions&hellip;</li>
<li>I just tested that user agent on CGSpace and it <em>does not</em> create a new session:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 10 Sep 2018 20:43:04 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
</code></pre>
<ul>
<li>I will have to keep an eye on it and perhaps add it to the list of &ldquo;bad bots&rdquo; that get rate limited</li>
2018-09-10 10:59:08 +02:00
</ul>
2018-09-12 12:49:33 +02:00
<h2 id="2018-09-12">2018-09-12</h2>
<ul>
<li>Merge AReS explorer changes to nginx config and deploy on CGSpace so CodeObia can start testing more</li>
<li>Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:</li>
</ul>
<pre><code>$ sudo docker volume create --name dspacetest_data
$ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre>
2018-09-12 16:02:14 +02:00
<ul>
<li>Sisay is still having problems with the controlled vocabulary for top authors</li>
<li>I took a look at the submission template and Firefox complains that the XML file is missing a root element</li>
<li>I guess it&rsquo;s because Firefox is receiving an empty XML file</li>
<li>I told Sisay to run the XML file through tidy</li>
<li>More testing of the access and usage rights changes</li>
</ul>
2018-09-13 11:48:20 +02:00
<h2 id="2018-09-13">2018-09-13</h2>
<ul>
<li>Peter was communicating with Altmetric about the OAI mapping issue for item <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/82810"><sup>10568</sup>&frasl;<sub>82810</sub></a> again</li>
<li>Altmetric said it was somehow related to the OAI <code>dateStamp</code> not getting updated when the mappings changed, but I said that back in <a href="/cgspace-notes/2018-07/">2018-07</a> when this happened it was because the OAI was actually just not reflecting all the item&rsquo;s mappings</li>
<li>After forcing a complete re-indexing of OAI the mappings were fine</li>
<li>The <code>dateStamp</code> is most probably only updated when the item&rsquo;s metadata changes, not its mappings, so if Altmetric is relying on that we&rsquo;re in a tricky spot</li>
2018-09-13 15:15:01 +02:00
<li>We need to make sure that our OAI isn&rsquo;t publicizing stale data&hellip; I was going to post something on the dspace-tech mailing list, but never did</li>
<li>Linode says that CGSpace (linode18) has had high CPU for the past two hours</li>
<li>The top IP addresses today are:</li>
2018-09-13 11:48:20 +02:00
</ul>
2018-09-02 11:03:43 +02:00
2018-09-13 15:15:01 +02:00
<pre><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;13/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
32 46.229.161.131
38 104.198.9.108
39 66.249.64.91
56 157.55.39.224
57 207.46.13.49
58 40.77.167.120
78 169.255.105.46
702 54.214.112.202
1840 50.116.102.77
4469 70.32.83.92
</code></pre>
<ul>
<li>And the top two addresses seem to be re-using their Tomcat sessions properly:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
7
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
2
</code></pre>
<ul>
<li>So I&rsquo;m not sure what&rsquo;s going on</li>
<li>Valerio asked me if there&rsquo;s a way to get the page views and downloads from CGSpace</li>
<li>I said no, but that we might be able to piggyback on the Atmire statlet REST API</li>
<li>For example, when you expand the &ldquo;statlet&rdquo; at the bottom of an item like <a href="https://cgspace.cgiar.org/handle/10568/97103"><sup>10568</sup>&frasl;<sub>97103</sub></a> you can see the following request in the browser console:</li>
</ul>
2018-09-17 16:34:48 +02:00
<pre><code>https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
2018-09-13 15:15:01 +02:00
</code></pre>
<ul>
<li>That JSON file has the total page views and item downloads for the item&hellip;</li>
2018-09-13 23:21:41 +02:00
<li>Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds</li>
<li>I had a quick look at the DSpace 5.x manual and it doesn&rsquo;t not seem that this is possible (you can only add metadata)</li>
<li>Testing the new LDAP server the CGNET says will be replacing the old one, it doesn&rsquo;t seem that they are using the global catalog on port 3269 anymore, now only 636 is open</li>
<li>I did a clean deploy of DSpace 5.8 on Ubuntu 18.04 with some stripped down Tomcat 8 configuration and actually managed to get it up and running without the autowire errors that I had previously experienced</li>
<li>I realized that it always works on my local machine with Tomcat 8.5.x, but not when I do the deployment from Ansible in Ubuntu 18.04</li>
<li>So there must be something in my Tomcat 8 <code>server.xml</code> template</li>
2018-09-13 23:31:05 +02:00
<li>Now I re-deployed it with the normal server template and it&rsquo;s working, WTF?</li>
<li>Must have been something like an old DSpace 5.5 file in the spring folder&hellip; weird</li>
<li>But yay, this means we can update DSpace Test to Ubuntu 18.04, Tomcat 8, PostgreSQL 9.6, etc&hellip;</li>
2018-09-13 15:15:01 +02:00
</ul>
2018-09-14 11:30:17 +02:00
<h2 id="2018-09-14">2018-09-14</h2>
<ul>
<li>Sisay uploaded the IITA records to CGSpace, but forgot to remove the old Handles</li>
<li>I explicitly told him not to forget to remove them yesterday!</li>
</ul>
2018-09-16 15:30:32 +02:00
<h2 id="2018-09-16">2018-09-16</h2>
<ul>
<li>Add the DSpace build.properties as a template into my <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> for configuring DSpace machines</li>
<li>One stupid thing there is that I add all the variables in a private vars file, which is apparently higher precedence than host vars, meaning that I can&rsquo;t override them (like SMTP server) on a per-host basis</li>
<li>Discuss access and usage rights with Peter</li>
<li>I suggested that we leave access rights (<code>cg.identifier.access</code>) as it is now, with &ldquo;Open Access&rdquo; or &ldquo;Limited Access&rdquo;, and then simply re-brand that as &ldquo;Access rights&rdquo; in the UIs and relevant drop downs</li>
<li>Then we continue as planned to add <code>dc.rights</code> as &ldquo;Usage rights&rdquo;</li>
</ul>
2018-09-17 16:34:48 +02:00
<h2 id="2018-09-17">2018-09-17</h2>
<ul>
<li>Skype meeting with CGSpace team in Addis</li>
<li>Change <code>cg.identifier.status</code> &ldquo;Access rights&rdquo; options to:
<ul>
<li>Open Access→Unrestricted Access</li>
<li>Limited Access→Restricted Access</li>
<li>Metadata Only</li>
</ul></li>
<li>Update these immediately, but talk to CodeObia to create a mapping between the old and new values</li>
<li>Finalize <code>dc.rights</code> &ldquo;Usage rights&rdquo; with seven combinations of Creative Commons, plus the others</li>
2018-09-17 18:53:08 +02:00
<li>Need to double check the new <a href="https://cgspace.cgiar.org/handle/10568/97114">CRP community</a> to see why the collection counts aren&rsquo;t updated after we moved the communities there last week
2018-09-17 16:34:48 +02:00
<ul>
2018-09-17 18:53:08 +02:00
<li>I forced a full Discovery re-index and now the community shows 1,600 items</li>
2018-09-17 16:34:48 +02:00
</ul></li>
<li>Check if it&rsquo;s possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL</li>
<li>Agree that we&rsquo;ll publicize AReS explorer on the week before the Big Data Platform workshop
<ul>
<li>Put a link and or picture on the CGSpace homepage saying &ldquo;Visualized CGSpace research&rdquo; or something, and post a message on Yammer</li>
</ul></li>
2018-09-18 00:16:21 +02:00
<li>I want to explore creating a thin API to make the item view and download stats available from Solr so CodeObia can use them in the AReS explorer</li>
<li>Currently CodeObia is exploring using the Atmire statlets internal API, but I don&rsquo;t really like that&hellip;</li>
<li>There are some example queries on the <a href="https://wiki.duraspace.org/display/DSPACE/Solr">DSpace Solr wiki</a></li>
<li>For example, this query returns 1655 rows for item <a href="https://cgspace.cgiar.org/handle/10568/10630"><sup>10568</sup>&frasl;<sub>10630</sub></a>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false'
</code></pre>
<ul>
<li>The id in the Solr query is the item&rsquo;s database id (get it from the REST API or something)</li>
<li>Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire&rsquo;s statlet shows, though the query logic here is confusing:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
</code></pre>
<ul>
<li>According to the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">SolrQuerySyntax</a> page on the Apache wiki, the <code>[* TO *]</code> syntax just selects a range (in this case all values for a field)</li>
<li>So it seems to be:
<ul>
<li><code>type:0</code> is for bitstreams according to the DSpace Solr documentation</li>
<li><code>-(bundleName:[*+TO+*]-bundleName:ORIGINAL)</code> seems to be a <a href="https://wiki.apache.org/solr/NegativeQueryProblems">negative query starting with all documents</a>, subtracting those with <code>bundleName:ORIGINAL</code>, and then negating the whole thing&hellip; meaning only documents from <code>bundleName:ORIGINAL</code>?</li>
</ul></li>
<li>What the shit, I think I&rsquo;m right: the simplified logic in <em>this</em> query returns the same 889:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
</code></pre>
<ul>
<li>And if I simplify the <code>statistics_type</code> logic the same way, it still returns the same 889!</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view'
</code></pre>
<ul>
<li>As for item views, I suppose that&rsquo;s just the same query, minus the <code>bundleName:ORIGINAL</code>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view'
</code></pre>
<ul>
<li>That one returns 766, which is exactly 1655 minus 889&hellip;</li>
<li>Also, Solr&rsquo;s <code>fq</code> is similar to the regular <code>q</code> query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries</li>
2018-09-17 16:34:48 +02:00
</ul>
2018-09-18 14:52:20 +02:00
<h2 id="2018-09-18">2018-09-18</h2>
<ul>
<li>I managed to create a simple proof of concept REST API to expose item view and download statistics: <a href="https://github.com/alanorth/cgspace-statistics-api">cgspace-statistics-api</a></li>
<li>It uses the Python-based <a href="https://falcon.readthedocs.io">Falcon</a> web framework and talks to Solr directly using the <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> library (which seems to have issues in Python 3.7 currently)</li>
<li>After deploying on DSpace Test I can then get the stats for an item using its ID:</li>
</ul>
<pre><code>$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
{
&quot;downloads&quot;: 2,
&quot;id&quot;: 110988,
&quot;views&quot;: 15
}
</code></pre>
<ul>
<li>The numbers are different than those that come from Atmire&rsquo;s statlets for some reason, but as I&rsquo;m querying Solr directly, I have no idea where their numbers come from!</li>
<li>Moayad from CodeObia asked if I could make the API be able to paginate over all items, for example: /statistics?limit=100&amp;page=1</li>
<li>Getting all the item IDs from PostgreSQL is certainly easy:</li>
</ul>
<pre><code>dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
</code></pre>
<ul>
<li>The rest of the Falcon tooling will be more difficult&hellip;</li>
</ul>
2018-09-19 19:40:18 +02:00
<h2 id="2018-09-19">2018-09-19</h2>
<ul>
<li>I emailed Jane Poole to ask if there is some money we can use from the Big Data Platform (BDP) to fund the purchase of some Atmire credits for CGSpace</li>
<li>I learned that there is an efficient way to do <a href="http://yonik.com/solr/paging-and-deep-paging/">&ldquo;deep paging&rdquo; in large Solr results sets by using <code>cursorMark</code></a>, but it doesn&rsquo;t work with faceting</li>
</ul>
2018-09-20 12:11:48 +02:00
<h2 id="2018-09-20">2018-09-20</h2>
<ul>
<li>Contact Atmire to ask how we can buy more credits for future development</li>
2018-09-21 23:49:53 +02:00
<li>I researched the Solr <code>filterCache</code> size and I found out that the formula for calculating the potential memory use of <strong>each entry</strong> in the cache is:</li>
</ul>
<pre><code>((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
</code></pre>
<ul>
<li>Which means that, for our statistics core with <em>149 million</em> documents, each entry in our <code>filterCache</code> would use 8.9 GB!</li>
</ul>
<pre><code>((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
</code></pre>
<ul>
<li>So I think we can forget about tuning this for now!</li>
<li><a href="http://lucene.472066.n3.nabble.com/Calculating-filterCache-size-td4142526.html">Discussion on the mailing list about <code>filterCache</code> size</a></li>
<li><a href="https://docs.google.com/document/d/1vl-nmlprSULvNZKQNrqp65eLnLhG9s_ydXQtg9iML10/edit">Article discussing testing methodology for different <code>filterCache</code> sizes</a></li>
<li>Discuss Handle links on Twitter with IWMI</li>
</ul>
<h2 id="2018-09-21">2018-09-21</h2>
<ul>
<li>I see that there was a nice optimization to the ImageMagick PDF CMYK detection in the upstream <code>dspace-5_x</code> branch: <a href="https://github.com/DSpace/DSpace/pull/2204">DS-3664</a></li>
<li>The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I&rsquo;ll cherry-pick that fix into our <code>5_x-prod</code> branch:
<ul>
<li>4e8c7b578bdbe26ead07e36055de6896bbf02f83: ImageMagick: Only execute &ldquo;identify&rdquo; on first page</li>
</ul></li>
<li>I think it would also be nice to cherry-pick the fixes for <a href="https://github.com/DSpace/DSpace/pull/2020">DS-3883</a>, which is related to optimizing the XMLUI item display of items with many bitstreams
<ul>
<li>a0ea20bd1821720b111e2873b08e03ce2bf93307: DS-3883: Don&rsquo;t loop through original bitstreams if only displaying thumbnails</li>
<li>8d81e825dee62c2aa9d403a505e4a4d798964e8d: DS-3883: If only including thumbnails, only load the main item thumbnail.</li>
</ul></li>
2018-09-20 12:11:48 +02:00
</ul>
2018-09-13 15:15:01 +02:00
<!-- vim: set sw=2 ts=2: -->
2018-09-02 11:03:43 +02:00
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-09/">September, 2018</a></li>
<li><a href="/cgspace-notes/2018-08/">August, 2018</a></li>
<li><a href="/cgspace-notes/2018-07/">July, 2018</a></li>
<li><a href="/cgspace-notes/2018-06/">June, 2018</a></li>
<li><a href="/cgspace-notes/2018-05/">May, 2018</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>