<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2018" />
<meta property="og:description" content="2018-09-02
New PostgreSQL JDBC driver version 42.2.5
I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-09/" /><meta property="article:published_time" content="2018-09-02T09:55:54&#43;03:00"/>
<meta property="article:modified_time" content="2018-09-24T16:24:35&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2018"/>
<meta name="twitter:description" content="2018-09-02
New PostgreSQL JDBC driver version 42.2.5
I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and run the updated playbooks on CGSpace and DSpace Test
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.48" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-09/",
"wordCount": "3180",
"datePublished": "2018-09-02T09:55:54&#43;03:00",
"dateModified": "2018-09-24T16:24:35&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-09/">
<title>September, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-Upm5uY/SXdvbjuIGH6fBjF5vOYUr9DguqBskM&#43;EQpLBzO9U&#43;9fMVmWEt&#43;TTlGrWQ" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-09/">September, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-09-02T09:55:54&#43;03:00">Sun Sep 02, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2018-09-02">2018-09-02</h2>
<ul>
<li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li>
<li>I&rsquo;ll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li>
<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
<p></p>
<pre><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5240)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:754)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:730)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1838)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
</code></pre>
<ul>
<li>Full log here: <a href="https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2">https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2</a></li>
<li>XMLUI fails to load, but REST, Solr, JSPUI, etc. work</li>
<li>The old <code>5_x-prod-dspace-5.5</code> branch does work in Ubuntu 18.04 with Tomcat 8.5.30-1ubuntu1.4, however!</li>
<li>And the <code>5_x-prod</code> DSpace 5.8 branch does work in Tomcat 8.5.x on my Arch Linux laptop&hellip;</li>
<li>I&rsquo;m not sure where the issue is then!</li>
</ul>
<h2 id="2018-09-03">2018-09-03</h2>
<ul>
<li>Abenet says she&rsquo;s getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week</li>
<li>They are from the CUA module</li>
<li>Two of them have &ldquo;no data&rdquo; and one has a &ldquo;null&rdquo; title</li>
<li>The last one is a report of the top downloaded items, and includes a graph</li>
<li>She will try to click the &ldquo;Unsubscribe&rdquo; link in the first two to see if it works, otherwise we should contact Atmire</li>
<li>The only one she remembers subscribing to is the top downloads one</li>
</ul>
<h2 id="2018-09-04">2018-09-04</h2>
<ul>
<li>I&rsquo;m looking over the latest round of IITA records from Sisay: <a href="https://dspacetest.cgiar.org/handle/10568/104230">Mercy1806_August_29</a>
<ul>
<li>All fields are split with multiple columns like <code>cg.authorship.types</code> and <code>cg.authorship.types[]</code></li>
<li>This makes it super annoying to do the checks and cleanup, so I will merge them (also time consuming)</li>
<li>Five items had <code>dc.date.issued</code> values like <code>2013-5</code> so I corrected them to be <code>2013-05</code></li>
<li>Several metadata fields had values with newlines in them (even in some titles!), which I fixed by collapsing consecutive whitespace in Open Refine</li>
<li>Many (91!) items from before 2011 are indicated as having a CRP, but CRPs didn&rsquo;t exist then so this is impossible</li>
<li>I isolated these using a custom facet with this GREL on the <code>dc.date.issued</code> column, which matches items from 2011 onwards: <code>isNotNull(value.match(/201[1-8].*/))</code>; I then blanked the CRPs of the items that did not match</li>
<li>Some affiliations with only one separator (|) for multiple values</li>
<li>I replaced smart quotes with plain ones</li>
<li>Some inconsistencies in <code>cg.subject.iita</code> like COWPEA and COWPEAS, and YAM and YAMS, etc, as well as some spelling mistakes like IMPACT ASSESSMENTN</li>
<li>Some values in the <code>dc.identifier.isbn</code> are actually ISSNs so I moved them to the <code>dc.identifier.issn</code> column</li>
<li>I found one invalid ISSN using a custom text facet with the regex from the <a href="https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format">ISSN page on Wikipedia</a>: <code>isNotBlank(value.match(/^\d{4}-\d{3}[\dxX]$/))</code></li>
<li>One invalid value for <code>dc.type</code></li>
</ul></li>
<li>Abenet says she hasn&rsquo;t received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don&rsquo;t need to create an issue on Atmire&rsquo;s bug tracker anymore</li>
</ul>
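<p>The date, ISSN and whitespace checks in the IITA cleanup above could also be scripted outside Open Refine. This is only a minimal sketch: the function names are hypothetical, the ISSN pattern is the one quoted in the GREL expression, and it checks the format only (not the ISSN check digit).</p>

```python
import re

# Same pattern as in the GREL facet (format check only, no check digit)
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dxX]$")


def normalize_issued(value):
    """Zero-pad a one-digit month or day, e.g. 2013-5 becomes 2013-05."""
    parts = value.strip().split("-")
    return "-".join([parts[0]] + [part.zfill(2) for part in parts[1:]])


def looks_like_issn(value):
    """Return True if the value has the NNNN-NNNC shape of an ISSN."""
    return ISSN_PATTERN.match(value.strip()) is not None


def collapse_whitespace(value):
    """Replace runs of whitespace (including newlines) with a single space."""
    return " ".join(value.split())
```

<p>Values that fail <code>looks_like_issn</code> would still need a manual look, since some of them may be ISBNs that are in the right column after all.</p>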
<h2 id="2018-09-10">2018-09-10</h2>
<ul>
<li>Playing with <a href="https://github.com/eykhagen/strest">strest</a> to test the DSpace REST API programmatically</li>
<li>For example, given this <code>test.yaml</code>:</li>
</ul>
<pre><code>version: 1
requests:
test:
method: GET
url: https://dspacetest.cgiar.org/rest/test
validate:
raw: &quot;REST api is running.&quot;
login:
url: https://dspacetest.cgiar.org/rest/login
method: POST
data:
json: {&quot;email&quot;:&quot;test@dspace&quot;,&quot;password&quot;:&quot;thepass&quot;}
status:
url: https://dspacetest.cgiar.org/rest/status
method: GET
headers:
rest-dspace-token: Value(login)
logout:
url: https://dspacetest.cgiar.org/rest/logout
method: POST
headers:
rest-dspace-token: Value(login)
# vim: set sw=2 ts=2:
</code></pre>
<ul>
<li>Works pretty well, though the DSpace <code>logout</code> always returns an HTTP 415 error for some reason</li>
<li>We could eventually use this to test sanity of the API for creating collections etc</li>
<li>A user is getting an error in her workflow:</li>
</ul>
<pre><code>2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step:
org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
</code></pre>
<ul>
<li>Seems to be during submit step, because it&rsquo;s workflow step 1&hellip;?</li>
<li>Move some top-level CRP communities to be below the new <a href="https://cgspace.cgiar.org/handle/10568/97114">CGIAR Research Programs and Platforms</a> community:</li>
</ul>
<pre><code>$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
$ dspace community-filiator --set -p 10568/97114 -c 10568/35409
$ dspace community-filiator --set -p 10568/97114 -c 10568/3112
</code></pre>
<ul>
<li>Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:</li>
</ul>
<pre><code>update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
UPDATE 1
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
UPDATE 23
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
UPDATE 1
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
DELETE 17
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
UPDATE 15
</code></pre>
<ul>
<li>Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)</li>
<li>The current <code>cg.identifier.status</code> field will become &ldquo;Access rights&rdquo; and <code>dc.rights</code> will become &ldquo;Usage rights&rdquo;</li>
<li>I have some work in progress on the <a href="https://github.com/alanorth/DSpace/tree/5_x-rights"><code>5_x-rights</code> branch</a></li>
<li>Linode said that CGSpace (linode18) had a high CPU load earlier today</li>
<li>When I looked, I saw it&rsquo;s the same Russian IP that I noticed last month:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1459 157.55.39.202
1579 95.108.181.88
1615 157.55.39.147
1714 66.249.64.91
1924 50.116.102.77
3696 157.55.39.106
3763 157.55.39.148
4470 70.32.83.92
4724 35.237.175.180
14132 5.9.6.51
</code></pre>
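<p>The same tally can be sketched in Python, which is handy when the counts need further processing. This assumes the client address is the first whitespace-separated field of each access log line, as in the <code>awk '{print $1}'</code> step above; the function name is hypothetical.</p>

```python
from collections import Counter


def top_clients(lines, date, n=10):
    """Count requests per client IP for log lines containing the given date.

    Returns (ip, count) pairs sorted from most to least active, which is
    the reverse of the ascending order printed by `sort -n | tail`.
    """
    counts = Counter(line.split()[0] for line in lines if date in line)
    return counts.most_common(n)
```

<p>For example, <code>top_clients(open("/var/log/nginx/access.log"), "10/Sep/2018")</code> would read an uncompressed log; rotated and compressed logs would need <code>gzip.open</code> instead.</p>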
<ul>
<li>And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):</li>
</ul>
<pre><code># grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10
14133
</code></pre>
<ul>
<li>The user agent is still the same:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre>
<ul>
<li>I added <code>.*crawl.*</code> to the Tomcat Session Crawler Manager Valve, so I&rsquo;m not sure why the bot is creating so many sessions&hellip;</li>
<li>I just tested that user agent on CGSpace and it <em>does not</em> create a new session:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 10 Sep 2018 20:43:04 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
</code></pre>
<ul>
<li>I will have to keep an eye on it and perhaps add it to the list of &ldquo;bad bots&rdquo; that get rate limited</li>
</ul>
<h2 id="2018-09-12">2018-09-12</h2>
<ul>
<li>Merge AReS explorer changes to nginx config and deploy on CGSpace so CodeObia can start testing more</li>
<li>Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:</li>
</ul>
<pre><code>$ sudo docker volume create --name dspacetest_data
$ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre>
<ul>
<li>Sisay is still having problems with the controlled vocabulary for top authors</li>
<li>I took a look at the submission template and Firefox complains that the XML file is missing a root element</li>
<li>I guess it&rsquo;s because Firefox is receiving an empty XML file</li>
<li>I told Sisay to run the XML file through tidy</li>
<li>More testing of the access and usage rights changes</li>
</ul>
<h2 id="2018-09-13">2018-09-13</h2>
<ul>
<li>Peter was communicating with Altmetric about the OAI mapping issue for item <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/82810"><sup>10568</sup>&frasl;<sub>82810</sub></a> again</li>
<li>Altmetric said it was somehow related to the OAI <code>dateStamp</code> not getting updated when the mappings changed, but I said that back in <a href="/cgspace-notes/2018-07/">2018-07</a> when this happened it was because the OAI was actually just not reflecting all the item&rsquo;s mappings</li>
<li>After forcing a complete re-indexing of OAI the mappings were fine</li>
<li>The <code>dateStamp</code> is most probably only updated when the item&rsquo;s metadata changes, not its mappings, so if Altmetric is relying on that we&rsquo;re in a tricky spot</li>
<li>We need to make sure that our OAI isn&rsquo;t publicizing stale data&hellip; I was going to post something on the dspace-tech mailing list, but never did</li>
<li>Linode says that CGSpace (linode18) has had high CPU for the past two hours</li>
<li>The top IP addresses today are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;13/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
32 46.229.161.131
38 104.198.9.108
39 66.249.64.91
56 157.55.39.224
57 207.46.13.49
58 40.77.167.120
78 169.255.105.46
702 54.214.112.202
1840 50.116.102.77
4469 70.32.83.92
</code></pre>
<ul>
<li>And the top two addresses seem to be re-using their Tomcat sessions properly:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
7
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
2
</code></pre>
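<p>Note that <code>grep -c</code> prints the number of matching lines, so piping it through <code>sort | uniq</code> does not change the result. A minimal Python sketch for counting <em>distinct</em> session IDs per IP address, assuming the <code>session_id=...:ip_addr=...</code> log format used in the patterns above (the function name is hypothetical):</p>

```python
import re


def distinct_sessions(lines, ip):
    """Return the set of distinct Tomcat session IDs logged for one IP."""
    # re.escape makes the dots in the address literal, and the lookahead
    # stops 5.9.6.51 from also matching a longer address such as 5.9.6.510
    pattern = re.compile(
        r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip) + r"(?!\d)"
    )
    return {match.group(1) for line in lines for match in pattern.finditer(line)}
```

<p>Calling <code>len(distinct_sessions(open("dspace.log.2018-09-13"), "70.32.83.92"))</code> would then give the number of sessions rather than the number of log lines.</p>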
<ul>
<li>So I&rsquo;m not sure what&rsquo;s going on</li>
<li>Valerio asked me if there&rsquo;s a way to get the page views and downloads from CGSpace</li>
<li>I said no, but that we might be able to piggyback on the Atmire statlet REST API</li>
<li>For example, when you expand the &ldquo;statlet&rdquo; at the bottom of an item like <a href="https://cgspace.cgiar.org/handle/10568/97103"><sup>10568</sup>&frasl;<sub>97103</sub></a> you can see the following request in the browser console:</li>
</ul>
<pre><code>https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
</code></pre>
<ul>
<li>That JSON file has the total page views and item downloads for the item&hellip;</li>
<li>Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds</li>
<li>I had a quick look at the DSpace 5.x manual and it does not seem that this is possible (you can only add metadata)</li>
<li>Testing the new LDAP server that CGNET says will be replacing the old one, it doesn&rsquo;t seem that they are using the global catalog on port 3269 anymore, now only 636 is open</li>
<li>I did a clean deploy of DSpace 5.8 on Ubuntu 18.04 with some stripped down Tomcat 8 configuration and actually managed to get it up and running without the autowire errors that I had previously experienced</li>
<li>I realized that it always works on my local machine with Tomcat 8.5.x, but not when I do the deployment from Ansible in Ubuntu 18.04</li>
<li>So there must be something in my Tomcat 8 <code>server.xml</code> template</li>
<li>Now I re-deployed it with the normal server template and it&rsquo;s working, WTF?</li>
<li>Must have been something like an old DSpace 5.5 file in the spring folder&hellip; weird</li>
<li>But yay, this means we can update DSpace Test to Ubuntu 18.04, Tomcat 8, PostgreSQL 9.6, etc&hellip;</li>
</ul>
<h2 id="2018-09-14">2018-09-14</h2>
<ul>
<li>Sisay uploaded the IITA records to CGSpace, but forgot to remove the old Handles</li>
<li>I explicitly told him not to forget to remove them yesterday!</li>
</ul>
<h2 id="2018-09-16">2018-09-16</h2>
<ul>
<li>Add the DSpace build.properties as a template into my <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> for configuring DSpace machines</li>
<li>One stupid thing there is that I add all the variables in a private vars file, which apparently has higher precedence than host vars, meaning that I can&rsquo;t override them (like the SMTP server) on a per-host basis</li>
<li>Discuss access and usage rights with Peter</li>
<li>I suggested that we leave access rights (<code>cg.identifier.status</code>) as it is now, with &ldquo;Open Access&rdquo; or &ldquo;Limited Access&rdquo;, and then simply re-brand that as &ldquo;Access rights&rdquo; in the UIs and relevant drop downs</li>
<li>Then we continue as planned to add <code>dc.rights</code> as &ldquo;Usage rights&rdquo;</li>
</ul>
<h2 id="2018-09-17">2018-09-17</h2>
<ul>
<li>Skype meeting with CGSpace team in Addis</li>
<li>Change <code>cg.identifier.status</code> &ldquo;Access rights&rdquo; options to:
<ul>
<li>Open Access→Unrestricted Access</li>
<li>Limited Access→Restricted Access</li>
<li>Metadata Only</li>
</ul></li>
<li>Update these immediately, but talk to CodeObia to create a mapping between the old and new values</li>
<li>Finalize <code>dc.rights</code> &ldquo;Usage rights&rdquo; with seven combinations of Creative Commons, plus the others</li>
<li>Need to double check the new <a href="https://cgspace.cgiar.org/handle/10568/97114">CRP community</a> to see why the collection counts aren&rsquo;t updated after we moved the communities there last week
<ul>
<li>I forced a full Discovery re-index and now the community shows 1,600 items</li>
</ul></li>
<li>Check if it&rsquo;s possible to have items deposited via REST use a workflow so we can perhaps tell ICARDA to use that from MEL</li>
<li>Agree that we&rsquo;ll publicize AReS explorer on the week before the Big Data Platform workshop
<ul>
<li>Put a link and/or picture on the CGSpace homepage saying &ldquo;Visualized CGSpace research&rdquo; or something, and post a message on Yammer</li>
</ul></li>
<li>I want to explore creating a thin API to make the item view and download stats available from Solr so CodeObia can use them in the AReS explorer</li>
<li>Currently CodeObia is exploring using the Atmire statlets internal API, but I don&rsquo;t really like that&hellip;</li>
<li>There are some example queries on the <a href="https://wiki.duraspace.org/display/DSPACE/Solr">DSpace Solr wiki</a></li>
<li>For example, this query returns 1655 rows for item <a href="https://cgspace.cgiar.org/handle/10568/10630"><sup>10568</sup>&frasl;<sub>10630</sub></a>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false'
</code></pre>
<ul>
<li>The id in the Solr query is the item&rsquo;s database id (get it from the REST API or something)</li>
<li>Next, I adapted a query to get the downloads and it shows 889, which is similar to the number Atmire&rsquo;s statlet shows, though the query logic here is confusing:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
</code></pre>
<ul>
<li>According to the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">SolrQuerySyntax</a> page on the Apache wiki, the <code>[* TO *]</code> syntax just selects a range (in this case all values for a field)</li>
<li>So it seems to be:
<ul>
<li><code>type:0</code> is for bitstreams according to the DSpace Solr documentation</li>
<li><code>-(bundleName:[*+TO+*]-bundleName:ORIGINAL)</code> seems to be a <a href="https://wiki.apache.org/solr/NegativeQueryProblems">negative query starting with all documents</a>, subtracting those with <code>bundleName:ORIGINAL</code>, and then negating the whole thing&hellip; meaning only documents from <code>bundleName:ORIGINAL</code>?</li>
</ul></li>
<li>What the shit, I think I&rsquo;m right: the simplified logic in <em>this</em> query returns the same 889:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
</code></pre>
<ul>
<li>And if I simplify the <code>statistics_type</code> logic the same way, it still returns the same 889!</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view'
</code></pre>
<ul>
<li>As for item views, I suppose that&rsquo;s just the same query, minus the <code>bundleName:ORIGINAL</code>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view'
</code></pre>
<ul>
<li>That one returns 766, which is exactly 1655 minus 889&hellip;</li>
<li>Also, Solr&rsquo;s <code>fq</code> is similar to the regular <code>q</code> query parameter, but its results are cached separately in the Solr <code>filterCache</code>, so it should be faster for repeated queries</li>
</ul>
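<p>The two simplified queries above differ only in the sign of the <code>bundleName:ORIGINAL</code> filter, which is also why the results add up (889 + 766 = 1655). A small sketch of building both URLs in Python, assuming the same local Solr address used in the commands above; the function name is hypothetical:</p>

```python
from urllib.parse import urlencode

# Same local Solr endpoint as in the http commands above
SOLR_SELECT = "http://localhost:3000/solr/statistics/select"


def stats_url(item_id, downloads):
    """Build the Solr URL for an item's download count or view count."""
    # Downloads are hits on ORIGINAL bitstreams; everything else is a view
    bundle = "bundleName:ORIGINAL" if downloads else "-bundleName:ORIGINAL"
    params = [
        ("indent", "on"),
        ("rows", "0"),
        ("q", f"type:0 owningItem:{item_id}"),
        ("fq", "isBot:false"),
        ("fq", bundle),
        ("fq", "statistics_type:view"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)
```

<p>With <code>rows=0</code> the count of interest is the <code>numFound</code> value in the response, so no documents need to be transferred.</p>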
<h2 id="2018-09-18">2018-09-18</h2>
<ul>
<li>I managed to create a simple proof of concept REST API to expose item view and download statistics: <a href="https://github.com/alanorth/cgspace-statistics-api">cgspace-statistics-api</a></li>
<li>It uses the Python-based <a href="https://falcon.readthedocs.io">Falcon</a> web framework and talks to Solr directly using the <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> library (which seems to have issues in Python 3.7 currently)</li>
<li>After deploying on DSpace Test I can then get the stats for an item using its ID:</li>
</ul>
<pre><code>$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
{
&quot;downloads&quot;: 2,
&quot;id&quot;: 110988,
&quot;views&quot;: 15
}
</code></pre>
<ul>
<li>The numbers are different than those that come from Atmire&rsquo;s statlets for some reason, but as I&rsquo;m querying Solr directly, I have no idea where their numbers come from!</li>
<li>Moayad from CodeObia asked if I could make the API paginate over all items, for example: <code>/statistics?limit=100&amp;page=1</code></li>
<li>Getting all the item IDs from PostgreSQL is certainly easy:</li>
</ul>
<pre><code>dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
</code></pre>
<ul>
<li>The rest of the Falcon tooling will be more difficult&hellip;</li>
</ul>
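<p>The paging arithmetic itself is simple. Here is a minimal sketch of how <code>limit</code> and <code>page</code> could map onto a list of item IDs; it assumes zero-based page numbers, and the function name and response keys are hypothetical rather than the API&rsquo;s actual format:</p>

```python
import math


def paginate(item_ids, limit=100, page=0):
    """Return one page of item IDs plus paging metadata."""
    if limit < 1 or page < 0:
        raise ValueError("limit must be >= 1 and page must be >= 0")
    offset = limit * page
    return {
        "currentPage": page,
        "totalPages": math.ceil(len(item_ids) / limit),
        "limit": limit,
        "ids": item_ids[offset:offset + limit],
    }
```

<p>In a real deployment the slicing would more likely be pushed down to the database with <code>LIMIT</code> and <code>OFFSET</code> instead of loading every ID first.</p>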
<h2 id="2018-09-19">2018-09-19</h2>
<ul>
<li>I emailed Jane Poole to ask if there is some money we can use from the Big Data Platform (BDP) to fund the purchase of some Atmire credits for CGSpace</li>
<li>I learned that there is an efficient way to do <a href="http://yonik.com/solr/paging-and-deep-paging/">&ldquo;deep paging&rdquo; in large Solr result sets by using <code>cursorMark</code></a>, but it doesn&rsquo;t work with faceting</li>
</ul>
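<p>A sketch of the <code>cursorMark</code> loop: start with <code>cursorMark=*</code>, pass the returned <code>nextCursorMark</code> into the next request, and stop when it no longer changes. The <code>fetch_page</code> callable is a hypothetical stand-in for whatever sends the request to Solr and returns the parsed JSON, and the default sort field is an assumption (the sort must include the core&rsquo;s unique key field).</p>

```python
def iterate_with_cursor(fetch_page, query, sort="id asc", rows=100):
    """Yield every document matching a query using Solr cursorMark paging.

    fetch_page is a caller-supplied function (hypothetical here) that takes
    a dict of Solr parameters and returns the parsed JSON response.
    """
    cursor = "*"  # Solr's marker for the start of the result set
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        )
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # An unchanged cursor means there are no more results
            break
        cursor = next_cursor
```

<p>Keeping the HTTP call outside the loop makes the paging logic easy to test without a running Solr instance.</p>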
<h2 id="2018-09-20">2018-09-20</h2>
<ul>
<li>Contact Atmire to ask how we can buy more credits for future development</li>
<li>I researched the Solr <code>filterCache</code> size and found out that each entry in the cache can use up to <code>(maxDoc/8) + 128</code> bytes, so the formula for the potential memory use of a <strong>full cache</strong> is:</li>
</ul>
<pre><code>((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
</code></pre>
<ul>
<li>Which means that, for our statistics core with <em>149 million</em> documents, a full <code>filterCache</code> of 512 entries would use 8.9 GB!</li>
</ul>
<pre><code>((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
</code></pre>
<ul>
<li>So I think we can forget about tuning this for now!</li>
<li><a href="http://lucene.472066.n3.nabble.com/Calculating-filterCache-size-td4142526.html">Discussion on the mailing list about <code>filterCache</code> size</a></li>
<li><a href="https://docs.google.com/document/d/1vl-nmlprSULvNZKQNrqp65eLnLhG9s_ydXQtg9iML10/edit">Article discussing testing methodology for different <code>filterCache</code> sizes</a></li>
<li>Discuss Handle links on Twitter with IWMI</li>
</ul>
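<p>The <code>filterCache</code> arithmetic above can be checked with a few lines of Python. The function name is hypothetical; the inputs are the document count and cache size from the calculation above.</p>

```python
def filter_cache_bytes(max_doc, size):
    """Return (bytes per entry, bytes for a full cache) for a filterCache."""
    # One bit per document in the index, plus a fixed overhead per entry
    per_entry = max_doc // 8 + 128
    return per_entry, per_entry * size


per_entry, total = filter_cache_bytes(149374568, 512)
print(f"{per_entry / 2**20:.1f} MiB per entry")
print(f"{total / 2**30:.1f} GiB for 512 entries")
```

<p>Each entry comes to roughly 17.8 MiB, and it is the full cache of 512 entries that reaches about 8.9 GiB.</p>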
<h2 id="2018-09-21">2018-09-21</h2>
<ul>
<li>I see that there was a nice optimization to the ImageMagick PDF CMYK detection in the upstream <code>dspace-5_x</code> branch: <a href="https://github.com/DSpace/DSpace/pull/2204">DS-3664</a></li>
<li>The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I&rsquo;ll cherry-pick that fix into our <code>5_x-prod</code> branch:
<ul>
<li>4e8c7b578bdbe26ead07e36055de6896bbf02f83: ImageMagick: Only execute &ldquo;identify&rdquo; on first page</li>
</ul></li>
<li>I think it would also be nice to cherry-pick the fixes for <a href="https://github.com/DSpace/DSpace/pull/2020">DS-3883</a>, which is related to optimizing the XMLUI item display of items with many bitstreams
<ul>
<li>a0ea20bd1821720b111e2873b08e03ce2bf93307: DS-3883: Don&rsquo;t loop through original bitstreams if only displaying thumbnails</li>
<li>8d81e825dee62c2aa9d403a505e4a4d798964e8d: DS-3883: If only including thumbnails, only load the main item thumbnail.</li>
</ul></li>
</ul>
<h2 id="2019-09-23">2018-09-23</h2>
<ul>
<li>I did more work on my <a href="https://github.com/alanorth/cgspace-statistics-api">cgspace-statistics-api</a>, fixing some item view counts and adding indexing via SQLite (I&rsquo;m trying to avoid having to set up <em>yet another</em> database, user, password, etc. during deployment)</li>
<li>I created a new branch called <code>5_x-upstream-cherry-picks</code> to test and track those cherry-picks from the upstream 5.x branch</li>
<li>Also, I need to test the new LDAP server, so I will deploy that on DSpace Test today</li>
<li>Rename my cgspace-statistics-api to <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> on GitHub</li>
</ul>
<h2 id="2018-09-24">2018-09-24</h2>
<ul>
<li>Trying to figure out how to get item views and downloads from SQLite in a join</li>
<li>It appears SQLite doesn&rsquo;t support <code>FULL OUTER JOIN</code> so some people on StackOverflow have emulated it with <code>LEFT JOIN</code> and <code>UNION</code>:</li>
</ul>
<pre><code>&gt; SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
LEFT JOIN itemdownloads downloads USING(id)
UNION ALL
SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloads downloads
LEFT JOIN itemviews views USING(id)
WHERE views.id IS NULL;
</code></pre>
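<ul>
<li>A rough Python sketch of that emulation using the standard library&rsquo;s <code>sqlite3</code> module and made-up sample rows. It uses explicit <code>ON</code> conditions instead of <code>USING(id)</code> to keep the column references unambiguous, and the <code>merged</code> dictionary is just one assumed way of tidying the rows afterwards:</li>
</ul>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE itemviews(id INT PRIMARY KEY, views INT);
CREATE TABLE itemdownloads(id INT PRIMARY KEY, downloads INT);
-- Sample data (invented): item 1 has only views, item 3 has only downloads
INSERT INTO itemviews VALUES (1, 10), (2, 20);
INSERT INTO itemdownloads VALUES (2, 5), (3, 7);
""")

# FULL OUTER JOIN emulated with two LEFT JOINs glued together by UNION ALL.
# The second half keeps only the download rows that have no matching views.
FULL_JOIN_SQL = """
SELECT v.views, v.id, d.downloads, d.id
FROM itemviews v LEFT JOIN itemdownloads d ON d.id = v.id
UNION ALL
SELECT v.views, v.id, d.downloads, d.id
FROM itemdownloads d LEFT JOIN itemviews v ON v.id = d.id
WHERE v.id IS NULL
"""
rows = conn.execute(FULL_JOIN_SQL).fetchall()

# The raw rows contain NULLs (None) on whichever side is missing,
# so pick the non-null id and default the missing counter to zero.
merged = {}
for views, view_id, downloads, download_id in rows:
    item_id = view_id if view_id is not None else download_id
    merged[item_id] = {"views": views or 0, "downloads": downloads or 0}
```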
<ul>
<li>This &ldquo;works&rdquo; but the resulting rows are kinda messy so I&rsquo;d have to do extra logic in Python</li>
<li>Maybe we can use one &ldquo;items&rdquo; table with default values and UPSERT (aka insert&hellip; on conflict &hellip; do update):</li>
</ul>
<pre><code>sqlite&gt; CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
sqlite&gt; INSERT INTO items(id, views) VALUES(0, 52);
sqlite&gt; INSERT INTO items(id, downloads) VALUES(1, 171);
sqlite&gt; INSERT INTO items(id, downloads) VALUES(1, 176) ON CONFLICT(id) DO UPDATE SET downloads=176;
sqlite&gt; INSERT INTO items(id, views) VALUES(0, 78) ON CONFLICT(id) DO UPDATE SET views=78;
sqlite&gt; INSERT INTO items(id, views) VALUES(0, 3) ON CONFLICT(id) DO UPDATE SET views=3;
sqlite&gt; INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE SET views=excluded.views;
</code></pre>
<ul>
<li>This totally works!</li>
<li>Note the special <code>excluded.views</code> form! See <a href="https://www.sqlite.org/lang_UPSERT.html">SQLite&rsquo;s lang_UPSERT documentation</a></li>
<li>Oh nice, I finally finished the Falcon API route to page through all the results using SQLite&rsquo;s amazing <code>LIMIT</code> and <code>OFFSET</code> support</li>
<li>But when I deployed it on my Ubuntu 16.04 environment I realized Ubuntu&rsquo;s SQLite is old and doesn&rsquo;t support <code>UPSERT</code>, so my indexing doesn&rsquo;t work&hellip;</li>
<li>Apparently <code>UPSERT</code> came in SQLite 3.24.0 (2018-06-04), and Ubuntu 16.04 has 3.11.0</li>
<li>Ok this is hilarious, I manually downloaded the <a href="https://packages.ubuntu.com/cosmic/libsqlite3-0">libsqlite3 3.24.0 deb from Ubuntu 18.10 &ldquo;cosmic&rdquo;</a> and installed it in Ubuntu 16.04 and now the Python <code>indexer.py</code> works</li>
</ul>
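<ul>
<li>For reference, a sketch of how the UPSERT indexing and the <code>LIMIT</code>/<code>OFFSET</code> paging could look from Python. This is not the actual <code>indexer.py</code> or Falcon route: <code>set_counter</code> and <code>page_items</code> are hypothetical helper names, and the <code>INSERT OR IGNORE</code> plus <code>UPDATE</code> fallback for SQLite older than 3.24.0 is an assumed workaround that I did not use above:</li>
</ul>

```python
import sqlite3

UPSERT_MIN_VERSION = (3, 24, 0)   # first SQLite release with ON CONFLICT ... DO UPDATE
COUNTERS = ("views", "downloads")  # whitelist, because column names cannot be bound as parameters


def set_counter(conn, item_id, column, value):
    """Insert the item or overwrite one of its counters."""
    if column not in COUNTERS:
        raise ValueError(f"unknown counter: {column}")
    if sqlite3.sqlite_version_info >= UPSERT_MIN_VERSION:
        # excluded.<column> refers to the value the INSERT tried to write
        conn.execute(
            f"INSERT INTO items(id, {column}) VALUES(?, ?) "
            f"ON CONFLICT(id) DO UPDATE SET {column}=excluded.{column}",
            (item_id, value),
        )
    else:
        # Assumed fallback for old SQLite: create the row if missing, then update it
        conn.execute("INSERT OR IGNORE INTO items(id) VALUES(?)", (item_id,))
        conn.execute(f"UPDATE items SET {column}=? WHERE id=?", (value, item_id))


def page_items(conn, limit, page):
    """Return one page of items; pages are numbered from zero."""
    return conn.execute(
        "SELECT id, views, downloads FROM items ORDER BY id LIMIT ? OFFSET ?",
        (limit, limit * page),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)"
)
set_counter(conn, 0, "views", 52)
set_counter(conn, 1, "downloads", 171)
set_counter(conn, 1, "downloads", 176)  # conflict on id 1: downloads is overwritten
set_counter(conn, 0, "views", 78)       # conflict on id 0: views is overwritten
```

<ul>
<li>Checking <code>sqlite3.sqlite_version_info</code> reports the version of the SQLite library that Python is linked against, which is what matters here rather than the version of the <code>sqlite3</code> command line tool</li>
</ul>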
<!-- vim: set sw=2 ts=2: -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-09/">September, 2018</a></li>
<li><a href="/cgspace-notes/2018-08/">August, 2018</a></li>
<li><a href="/cgspace-notes/2018-07/">July, 2018</a></li>
<li><a href="/cgspace-notes/2018-06/">June, 2018</a></li>
<li><a href="/cgspace-notes/2018-05/">May, 2018</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>