<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="January, 2019" /> <meta property="og:description" content="2019-01-02 Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning I don’t see anything interesting in the web server logs around that time though: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 99 210.7.29.100 120 38.126.157.45 177 35.237.175.180 177 40.77.167.32 216 66.249.75.219 225 18.203.76.93 261 46.101.86.248 357 207.46.13.1 903 54.70.40.11 " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-01/" /> <meta property="article:published_time" content="2019-01-02T09:48:30+02:00" /> <meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="January, 2019"/> <meta name="twitter:description" content="2019-01-02 Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning I don’t see anything interesting in the web server logs around that time though: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 99 210.7.29.100 120 38.126.157.45 177 35.237.175.180 177 40.77.167.32 216 66.249.75.219 225 18.203.76.93 261 46.101.86.248 357 207.46.13.1 903 54.70.40.11 "/> <meta name="generator" content="Hugo 0.73.0" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "January, 2019", "url": "https://alanorth.github.io/cgspace-notes/2019-01/", "wordCount": "5532", "datePublished": 
"2019-01-02T09:48:30+02:00", "dateModified": "2019-10-28T13:39:25+02:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-01/"> <title>January, 2019 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-01/">January, 2019</a></h2> <p class="blog-post-meta"><time datetime="2019-01-02T09:48:30+02:00">Wed Jan 02, 2019</time> by Alan Orth in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2019-01-02">2019-01-02</h2> <ul> <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate 
than normal early this morning</li> <li>I don’t see anything interesting in the web server logs around that time though:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 92 40.77.167.4 99 210.7.29.100 120 38.126.157.45 177 35.237.175.180 177 40.77.167.32 216 66.249.75.219 225 18.203.76.93 261 46.101.86.248 357 207.46.13.1 903 54.70.40.11 </code></pre><ul> <li>Analyzing the types of requests made by the top few IPs during that time:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 30 bitstream 534 discover 352 handle # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 207.46.13.1 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 194 bitstream 345 handle # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 261 handle </code></pre><ul> <li>It’s not clear to me what was causing the outbound traffic spike</li> <li>Oh nice! 
The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):</li> </ul> <pre><code>Moving: 81742 into core statistics-2010 Moving: 1837285 into core statistics-2011 Moving: 3764612 into core statistics-2012 Moving: 4557946 into core statistics-2013 Moving: 5483684 into core statistics-2014 Moving: 2941736 into core statistics-2015 Moving: 5926070 into core statistics-2016 Moving: 10562554 into core statistics-2017 Moving: 18497180 into core statistics-2018 </code></pre><ul> <li>This could be why the outbound traffic rate was high, due to the S3 backup that runs at 3:30AM…</li> <li>Run all system updates on DSpace Test (linode19) and reboot the server</li> </ul> <h2 id="2019-01-03">2019-01-03</h2> <ul> <li>Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:</li> </ul> <pre><code>$ sudo docker pull postgres:9.6-alpine $ sudo docker rm dspacedb $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine </code></pre><ul> <li>Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire’s Listings and Reports still doesn’t work <ul> <li>After logging in via XMLUI and clicking the Listings and Reports link from the sidebar it redirects me to a JSPUI login page</li> <li>If I log in again there, Listings and Reports works… hmm.</li> </ul> </li> <li>The JSPUI application, which Listings and Reports depends upon, also does not load, though the error is perhaps unrelated:</li> </ul> <pre><code>2019-01-03 14:45:21,727 INFO org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini: 2019-01-03 14:45:21,971 INFO org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23 2019-01-03 14:45:22,115 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ 
:session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error -- Method: GET -- Parameters were: org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discovery/static-tagcloud-facet.jsp (line: [57], column: [8]) No tag [tagcloud] defined in tag library imported with prefix [dspace] at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41) at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291) at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97) at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347) at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380) at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481) at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445) at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683) at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016) at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291) at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470) at org.apache.jasper.compiler.Parser.parse(Parser.java:144) at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244) at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105) at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202) at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373) at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350) at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334) at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330) at 
javax.servlet.http.HttpServlet.service(HttpServlet.java:742) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728) at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470) at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395) at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316) at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60) at org.apache.jsp.index_jsp._jspService(index_jsp.java:191) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:742) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330) at javax.servlet.http.HttpServlet.service(HttpServlet.java:742) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at 
org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81) at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234) at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342) at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) </code></pre><ul> <li>I notice that I get different JSESSIONID cookies for <code>/</code> (XMLUI) and <code>/jspui</code> (JSPUI) on Tomcat 8.5.37, I wonder if it’s the same on Tomcat 7.0.92… yes I do.</li> 
<li>Hmm, on Tomcat 7.0.92 I see that I get a <code>dspace.current.user.id</code> session cookie after logging into XMLUI, and then when I browse to JSPUI I am still logged in… <ul> <li>I didn’t see that cookie being set on Tomcat 8.5.37</li> </ul> </li> <li>I sent a message to the dspace-tech mailing list to ask</li> </ul> <h2 id="2019-01-04">2019-01-04</h2> <ul> <li>Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 189 207.46.13.192 217 31.6.77.23 340 66.249.70.29 349 40.77.167.86 417 34.218.226.147 630 207.46.13.173 710 35.237.175.180 790 40.77.167.87 1776 66.249.70.27 2099 54.70.40.11 </code></pre><ul> <li>I’m thinking about trying to validate our <code>dc.subject</code> terms against <a href="http://aims.fao.org/agrovoc/webservices">AGROVOC webservices</a></li> <li>There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for <code>SOIL</code>:</li> </ul> <pre><code>$ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en HTTP/1.1 200 OK Access-Control-Allow-Origin: * Connection: Keep-Alive Content-Length: 493 Content-Type: application/json; charset=utf-8 Date: Fri, 04 Jan 2019 13:44:27 GMT Keep-Alive: timeout=5, max=100 Server: Apache Strict-Transport-Security: max-age=63072000; includeSubdomains Vary: Accept X-Content-Type-Options: nosniff X-Frame-Options: ALLOW-FROM http://aims.fao.org { "@context": { "@language": "en", "altLabel": "skos:altLabel", "hiddenLabel": "skos:hiddenLabel", "isothes": "http://purl.org/iso25964/skos-thes#", "onki": "http://schema.onki.fi/onki#", "prefLabel": "skos:prefLabel", "results": { "@container": "@list", "@id": "onki:results" }, "skos": 
"http://www.w3.org/2004/02/skos/core#", "type": "@type", "uri": "@id" }, "results": [ { "lang": "en", "prefLabel": "soil", "type": [ "skos:Concept" ], "uri": "http://aims.fao.org/aos/agrovoc/c_7156", "vocab": "agrovoc" } ], "uri": "" } </code></pre><ul> <li>The API does not appear to be case sensitive (searches for <code>SOIL</code> and <code>soil</code> return the same thing)</li> <li>I’m a bit confused that there’s no obvious return code or status when a term is not found, for example <code>SOILS</code>:</li> </ul> <pre><code>HTTP/1.1 200 OK Access-Control-Allow-Origin: * Connection: Keep-Alive Content-Length: 367 Content-Type: application/json; charset=utf-8 Date: Fri, 04 Jan 2019 13:48:31 GMT Keep-Alive: timeout=5, max=100 Server: Apache Strict-Transport-Security: max-age=63072000; includeSubdomains Vary: Accept X-Content-Type-Options: nosniff X-Frame-Options: ALLOW-FROM http://aims.fao.org { "@context": { "@language": "en", "altLabel": "skos:altLabel", "hiddenLabel": "skos:hiddenLabel", "isothes": "http://purl.org/iso25964/skos-thes#", "onki": "http://schema.onki.fi/onki#", "prefLabel": "skos:prefLabel", "results": { "@container": "@list", "@id": "onki:results" }, "skos": "http://www.w3.org/2004/02/skos/core#", "type": "@type", "uri": "@id" }, "results": [], "uri": "" } </code></pre><ul> <li>I guess the <code>results</code> object will just be empty…</li> <li>Another way would be to try with SPARQL, perhaps using the Python 2.7 <a href="https://pypi.org/project/sparql-client/">sparql-client</a>:</li> </ul> <pre><code>$ python2.7 -m virtualenv /tmp/sparql $ . /tmp/sparql/bin/activate $ pip install sparql-client ipython $ ipython In [10]: import sparql In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET") In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> ' ...: 'SELECT ' ...: '?label ' ...: 'WHERE { ' ...: '{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . 
} ' ...: 'FILTER regex(str(?label), "^fish", "i") . ' ...: '} LIMIT 10') In [13]: result = s.query(statement) In [14]: for row in result.fetchone(): ...: print(row) ...: (<Literal "fish catching"@en>,) (<Literal "fish harvesting"@en>,) (<Literal "fish meat"@en>,) (<Literal "fish roe"@en>,) (<Literal "fish conversion"@en>,) (<Literal "fisheries catches (composition)"@en>,) (<Literal "fishtail palm"@en>,) (<Literal "fishflies"@en>,) (<Literal "fishery biology"@en>,) (<Literal "fish production"@en>,) </code></pre><ul> <li>The SPARQL query comes from my notes in <a href="/cgspace-notes/2017-08/">2017-08</a></li> </ul> <h2 id="2019-01-06">2019-01-06</h2> <ul> <li>I built a clean DSpace 5.8 installation from the upstream <code>dspace-5.8</code> tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37 <ul> <li>If I log into XMLUI and then navigate to JSPUI I need to log in again</li> <li>XMLUI does not set the <code>dspace.current.user.id</code> session cookie in Tomcat 8.5.37 for some reason</li> <li>I sent an update to the dspace-tech mailing list to ask for more help troubleshooting</li> </ul> </li> </ul> <h2 id="2019-01-07">2019-01-07</h2> <ul> <li>I built a clean DSpace 6.3 installation from the upstream <code>dspace-6.3</code> tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37 <ul> <li>If I log into XMLUI and then navigate to JSPUI I need to log in again</li> <li>XMLUI does not set the <code>dspace.current.user.id</code> session cookie in Tomcat 8.5.37 for some reason</li> <li>I sent an update to the dspace-tech mailing list to ask for more help troubleshooting</li> </ul> </li> </ul> <h2 id="2019-01-08">2019-01-08</h2> <ul> <li>Tim Donohue responded to my thread about the cookies on the dspace-tech mailing list <ul> <li>He suspects it’s a change of behavior in Tomcat 8.5, and indeed I see a mention of new cookie processing in the <a href="https://tomcat.apache.org/migration-85.html#Cookies">Tomcat 8.5 migration 
guide</a></li> <li>I tried to switch my XMLUI and JSPUI contexts to use the <code>LegacyCookieProcessor</code>, but it didn’t seem to help</li> <li>I <a href="https://jira.duraspace.org/browse/DS-4140">filed DS-4140 on the DSpace issue tracker</a></li> </ul> </li> </ul> <h2 id="2019-01-11">2019-01-11</h2> <ul> <li>Tezira wrote to say she has stopped receiving the <code>DSpace Submission Approved and Archived</code> emails from CGSpace as of January 2nd <ul> <li>I told her that I haven’t done anything to disable it lately, but that I would check</li> <li>Bizu also says she hasn’t received them lately</li> </ul> </li> </ul> <h2 id="2019-01-14">2019-01-14</h2> <ul> <li>Day one of CGSpace AReS meeting in Amman</li> </ul> <h2 id="2019-01-15">2019-01-15</h2> <ul> <li>Day two of CGSpace AReS meeting in Amman <ul> <li>Discuss possibly extending the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to make community and collection statistics available</li> <li>Discuss new “final” CG Core document and some changes that we’ll need to do on CGSpace and other repositories</li> <li>We agreed to try to stick to pure Dublin Core where possible, then use fields that exist in standard DSpace, and use “cg” namespace for everything else</li> <li>Major changes are to move <code>dc.contributor.author</code> to <code>dc.creator</code> (which MELSpace and WorldFish are already using in their DSpace repositories)</li> </ul> </li> <li>I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li> </ul> <pre><code>$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0' 0.16s user 0.03s system 3% cpu 5.185 total 0.17s user 0.02s system 2% cpu 7.123 total 0.18s user 0.02s system 6% cpu 3.047 total </code></pre><ul> <li>In other news, Linode sent a mail last 
night that the CPU load on CGSpace (linode18) was high; here are the top IPs in the logs around those few hours:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 157 31.6.77.23 192 54.70.40.11 202 66.249.64.157 207 40.77.167.204 220 157.55.39.140 326 197.156.105.116 385 207.46.13.158 1211 35.237.175.180 1830 66.249.64.155 2482 45.5.186.2 </code></pre><h2 id="2019-01-16">2019-01-16</h2> <ul> <li>Day three of CGSpace AReS meeting in Amman <ul> <li>We discussed CG Core 2.0 metadata and decided on some action points</li> <li>We discussed branding of AReS tool</li> </ul> </li> <li>Notes from our CG Core 2.0 metadata discussion: <ul> <li>Not Dublin Core: <ul> <li>dc.subtype</li> <li>dc.peer-reviewed</li> </ul> </li> <li>Dublin Core, possible action for CGSpace: <ul> <li>dc.description: <ul> <li>We use dc.description.abstract, dc.description (Notes), dc.description.version (Peer review status), dc.description.sponsorship (Funder)</li> <li>Maybe move abstract to dc.description</li> <li>Maybe move notes to cg.description.notes???</li> <li>Maybe move dc.description.version to cg.peer-reviewed or cg.peer-review-status???</li> <li>Move dc.description.sponsorship to cg.contributor.donor???</li> </ul> </li> <li>dc.subject: <ul> <li>Wait for guidance, evaluate technical implications (Google indexing, OAI, etc)</li> </ul> </li> <li>Move dc.contributor.author to dc.creator</li> <li>dc.contributor Project <ul> <li>Recommend against creating new fields for all projects</li> <li>We use collections for projects/themes/etc</li> </ul> </li> <li>dc.contributor Project Lead Center <ul> <li>MELSpace uses cg.contributor.project-lead-institute (institute is more generic than center)</li> <li>Maybe we use this too?</li> </ul> </li> <li>dc.contributor Partner <ul> <li>Wait for guidance</li> <li>MELSpace uses cg.contributor.center (?)</li> </ul> </li> <li>dc.contributor Donor <ul> 
<li>Use cg.contributor.donor</li> </ul> </li> <li>dc.date <ul> <li>Wait for guidance, maybe move dc.date.issued?</li> <li>dc.date.accessioned and dc.date.available are automatic in DSpace</li> </ul> </li> <li>dc.language <ul> <li>Move dc.language.iso to dc.language</li> </ul> </li> <li>dc.identifier <ul> <li>Move cg.identifier.url to dc.identifier</li> </ul> </li> <li>dc.identifier bibliographicCitation <ul> <li>dc.identifier.citation should move to dc.bibliographicCitation</li> </ul> </li> <li>dc.description.notes <ul> <li>Wait for guidance, maybe move to cg.description.notes ???</li> </ul> </li> <li>dc.relation <ul> <li>Maybe move cg.link.reference</li> <li>Perhaps consolidate cg.link.audio etc there…?</li> </ul> </li> <li>dc.relation.isPartOf <ul> <li>Move dc.relation.ispartofseries to dc.relation.isPartOf</li> </ul> </li> <li>dc.audience <ul> <li>Move cg.targetaudience to dc.audience</li> </ul> </li> </ul> </li> </ul> </li> <li>Something happened to the Solr usage statistics on CGSpace <ul> <li>I looked on the server and the Solr cores are there (56GB!), and I don’t see any obvious errors in dmesg or anything</li> <li>I see that the server hasn’t been rebooted in 26 days so I rebooted it</li> </ul> </li> <li>After reboot the Solr stats are still messed up in the Atmire Usage Stats module, it only shows 2019-01!</li> </ul> <p><img src="/cgspace-notes/2019/01/solr-stats-incorrect.png" alt="Solr stats fucked up"></p> <ul> <li>In the Solr admin UI I see the following error:</li> </ul> <pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher </code></pre><ul> <li>Looking in the Solr log I see this:</li> </ul> <pre><code>2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873) at 
org.apache.solr.core.SolrCore.<init>(SolrCore.java:646) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466) at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956) at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845) ... 31 more Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:89) at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753) at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77) at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64) at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279) at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528) ... 
33 more 2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103) at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180) at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.solr.common.SolrException: Unable to create core [statistics-2018] at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466) at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575) ... 27 more Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491) ... 29 more Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845) ... 
31 more Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:89) at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753) at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77) at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64) at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279) at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528) ... 33 more </code></pre><ul> <li>I found some threads on StackOverflow etc. discussing this, and several suggested increasing the address space for the shell with <code>ulimit</code></li> <li>I added <code>ulimit -v unlimited</code> to <code>/etc/default/tomcat7</code> and restarted Tomcat, and now Solr is working again:</li> </ul> <p><img src="/cgspace-notes/2019/01/solr-stats-incorrect.png" alt="Solr stats working"></p> <ul> <li>Some StackOverflow discussions related to this: <ul> <li><a href="https://stackoverflow.com/questions/2895417/solrexception-internal-server-error/3035916#3035916">https://stackoverflow.com/questions/2895417/solrexception-internal-server-error/3035916#3035916</a></li> <li><a href="https://stackoverflow.com/questions/11683850/how-much-memory-could-vm-use">https://stackoverflow.com/questions/11683850/how-much-memory-could-vm-use</a></li> <li><a href="https://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed/8893684#8893684">https://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed/8893684#8893684</a></li> </ul> </li> <li>Abenet was asking if the Atmire Usage Stats are correct because they are over 2 million the last few months…</li> <li>For 2019-01 alone the Usage Stats are already around 1.2 
million</li> <li>I tried to look in the nginx logs to see how many raw requests there are so far this month and it’s about 1.4 million:</li> </ul> <pre><code># time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019" 1442874 real 0m17.161s user 0m16.205s sys 0m2.396s </code></pre><h2 id="2019-01-17">2019-01-17</h2> <ul> <li>Send reminder to Atmire about purchasing the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657">MQM module</a></li> <li>Trying to decide the solid action points for CGSpace on the CG Core 2.0 metadata…</li> <li>It’s difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!</li> <li>Also, there is not a good Dublin Core reference (or maybe I just don’t understand?)</li> <li>Several authoritative documents on Dublin Core appear to be: <ul> <li><a href="http://dublincore.org/documents/dces/">Dublin Core Metadata Element Set, Version 1.1: Reference Description</a></li> <li><a href="http://www.dublincore.org/documents/dcmi-terms/">DCMI Metadata Terms</a></li> </ul> </li> <li>And what is the relationship between DC and DCTERMS?</li> <li>DSpace uses DCTERMS in the metadata it embeds in XMLUI item views!</li> <li>We really need to look at this more carefully and see what impact switching core fields like languages, abstract, authors, etc. might have</li> <li>We can check WorldFish and MELSpace repositories to see what effects these changes have had on theirs because they have already adopted some of these changes…</li> <li>I think I understand the difference between DC and DCTERMS finally: DC is the original set of fifteen elements and DCTERMS is the newer version that was supposed to address many of the drawbacks of the original with regard to digital content</li> <li>We might be able to use some proper fields for citation, abstract, etc. that are part of DCTERMS</li> <li>To make matters more confusing, there is also “qualified Dublin Core” 
that uses the original fifteen elements of legacy DC and qualifies them, like <code>dc.date.accessioned</code> <ul> <li>According to Wikipedia <a href="https://en.wikipedia.org/wiki/Dublin_Core">Qualified Dublin Core was superseded by DCTERMS in 2008</a>!</li> </ul> </li> <li>So we should be trying to use DCTERMS where possible, unless it is some internal thing that might mess up DSpace (like dates)</li> <li>“Elements 1.1” means legacy DC</li> <li>Possible action list for CGSpace: <ul> <li>dc.description.abstract → dcterms.abstract</li> <li>dc.description.version → cg.peer-reviewed (or cg.peer-review-status?)</li> <li>dc.description.sponsorship → cg.contributor.donor</li> <li>dc.contributor.author → dc.creator</li> <li>dc.language.iso → dcterms.language</li> <li>cg.identifier.url → dcterms.identifier</li> <li>dc.identifier.citation → dcterms.bibliographicCitation</li> <li>dc.relation.ispartofseries → dcterms.isPartOf</li> <li>cg.targetaudience → dcterms.audience</li> </ul> </li> </ul> <h2 id="2019-01-19">2019-01-19</h2> <ul> <li> <p>There’s no official set of Dublin Core qualifiers so I can’t tell if things like <code>dc.contributor.author</code> that are used by DSpace are official</p> </li> <li> <p>I found a great <a href="https://www.dri.ie/sites/default/files/files/qualified-dublin-core-metadata-guidelines.pdf">presentation from 2015 by the Digital Repository of Ireland</a> that discusses using MARC Relator Terms with Dublin Core elements</p> </li> <li> <p>It seems that <code>dc.contributor.author</code> would be a supported term according to this <a href="https://memory.loc.gov/diglib/loc.terms/relators/dc-contributor.html">Library of Congress list</a> linked from the <a href="http://dublincore.org/usage/documents/relators/">Dublin Core website</a></p> </li> <li> <p>The Library of Congress document specifically says:</p> <p>These terms conform with the DCMI Abstract Model and may be used in DCMI application profiles. 
DCMI endorses their use with Dublin Core elements as indicated.</p> </li> </ul> <h2 id="2019-01-20">2019-01-20</h2> <ul> <li>That’s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:</li> </ul> <pre><code># w 04:46:14 up 213 days, 7:25, 4 users, load average: 1.94, 1.50, 1.35 </code></pre><ul> <li>I’ve definitely rebooted it several times in the past few months… according to <code>journalctl -b</code> it was a few weeks ago on 2019-01-02</li> <li>I re-ran the Ansible DSpace tag, ran all system updates, and rebooted the host</li> <li>After rebooting I notice that the Linode kernel went down from 4.19.8 to 4.18.16…</li> <li>Atmire sent a quote on our <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657">ticket about purchasing the Metadata Quality Module (MQM) for DSpace 5.8</a></li> <li>Abenet asked me for an <a href="https://cgspace.cgiar.org/open-search/discover?query=crpsubject:Livestock&sort_by=3&order=DESC">OpenSearch query that could generate an RSS feed for items in the Livestock CRP</a></li> <li>According to my notes, <code>sort_by=3</code> is accession date (as configured in <code>dspace.cfg</code>)</li> <li>The query currently shows 3023 items, but a <a href="https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&filter_relational_operator_1=equals&filter_1=Livestock&submit_apply_filter=&query=">Discovery search for Livestock CRP only returns 858 items</a></li> <li>That query seems to return items tagged with <code>Livestock and Fish</code> CRP as well… hmm.</li> </ul> <h2 id="2019-01-21">2019-01-21</h2> <ul> <li>Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04’s Tomcat 8.5</li> <li>I could either run with a simple <code>tomcat7.service</code> like this:</li> </ul> <pre><code>[Unit] Description=Apache Tomcat 7 Web Application Container After=network.target [Service] Type=forking 
ExecStart=/path/to/apache-tomcat-7.0.92/bin/startup.sh ExecStop=/path/to/apache-tomcat-7.0.92/bin/shutdown.sh User=aorth Group=aorth [Install] WantedBy=multi-user.target </code></pre><ul> <li>Or try to adapt a real systemd service like Arch Linux’s:</li> </ul> <pre><code>[Unit] Description=Tomcat 7 servlet container After=network.target [Service] Type=forking PIDFile=/var/run/tomcat7.pid Environment=CATALINA_PID=/var/run/tomcat7.pid Environment=TOMCAT_JAVA_HOME=/usr/lib/jvm/default-runtime Environment=CATALINA_HOME=/usr/share/tomcat7 Environment=CATALINA_BASE=/usr/share/tomcat7 Environment=CATALINA_OPTS= Environment=ERRFILE=SYSLOG Environment=OUTFILE=SYSLOG ExecStart=/usr/bin/jsvc \ -Dcatalina.home=${CATALINA_HOME} \ -Dcatalina.base=${CATALINA_BASE} \ -Djava.io.tmpdir=/var/tmp/tomcat7/temp \ -cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \ -user tomcat7 \ -java-home ${TOMCAT_JAVA_HOME} \ -pidfile /var/run/tomcat7.pid \ -errfile ${ERRFILE} \ -outfile ${OUTFILE} \ $CATALINA_OPTS \ org.apache.catalina.startup.Bootstrap ExecStop=/usr/bin/jsvc \ -pidfile /var/run/tomcat7.pid \ -stop \ org.apache.catalina.startup.Bootstrap [Install] WantedBy=multi-user.target </code></pre><ul> <li>I see that <code>jsvc</code> and <code>libcommons-daemon-java</code> are both available on Ubuntu so that should be easy to port</li> <li>We probably don’t need Eclipse Java Bytecode Compiler (ecj)</li> <li>I tested Tomcat 7.0.92 on Arch Linux using the <code>tomcat7.service</code> with <code>jsvc</code> and it works… nice!</li> <li>I think I might manage this the same way I do the restic releases in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>, where I download a specific version and symlink to some generic location without the version number</li> <li>I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will 
cause inaccurate results in the dspace-statistics-api:</li> </ul> <pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound <result name="response" numFound="33" start="0"> $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound <result name="response" numFound="241" start="0"> </code></pre><ul> <li>I opened an issue on the GitHub issue tracker (<a href="https://github.com/ilri/dspace-statistics-api/issues/10">#10</a>)</li> <li>I don’t think the <a href="https://solrclient.readthedocs.io/en/latest/">SolrClient library</a> we are currently using supports this type of query so we might have to just do raw queries with requests</li> <li>The <a href="https://github.com/django-haystack/pysolr">pysolr</a> library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):</li> </ul> <pre><code>import pysolr solr = pysolr.Solr('http://localhost:3000/solr/statistics') results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0}) print(results.facets['facet_fields']) {'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]} </code></pre><ul> <li>If I double check one item from above, for example <code>77572</code>, it appears this is only working on the current statistics core and not the shards:</li> </ul> <pre><code>import pysolr solr = pysolr.Solr('http://localhost:3000/solr/statistics') results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'}) print(results.hits) 646 solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/') results = solr.search('type:2 id:77572', **{'fq': 'isBot:false 
AND statistics_type:view'}) print(results.hits) 595 </code></pre><ul> <li>So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON</li> <li>This enumerates the list of Solr cores and returns JSON format:</li> </ul> <pre><code>http://localhost:3000/solr/admin/cores?action=STATUS&wt=json </code></pre><ul> <li>I think I figured out how to search across shards: I needed to give the whole URL of each of the other cores</li> <li>Now I get more results when I start adding the other statistics cores:</li> </ul> <pre><code>$ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound <result name="response" numFound="2061320" start="0"> $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound <result name="response" numFound="16280292" start="0" maxScore="1.0"> $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&indent=on&rows=0&q=*:*' | grep numFound <result name="response" numFound="25606142" start="0" maxScore="1.0"> $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&indent=on&rows=0&q=*:*' | grep numFound <result name="response" numFound="31532212" start="0" maxScore="1.0"> </code></pre><ul> <li>I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the <code>shards</code> parameter to each query to make the search distributed among the cores</li> <li>I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a <code>shards</code> query string</li> <li>A few things I noticed: <ul> <li>Solr doesn’t mind if you use an empty <code>shards</code> parameter</li> <li>Solr doesn’t mind if you have an extra comma at the end of the 
<code>shards</code> parameter</li> <li>If you are searching multiple cores, you need to include the base core in the <code>shards</code> parameter as well</li> <li>For example, compare the following two queries, first including the base core and the shard in the <code>shards</code> parameter, and then only including the shard:</li> </ul> </li> </ul> <pre><code>$ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound <result name="response" numFound="275" start="0" maxScore="12.205825"> $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound <result name="response" numFound="241" start="0" maxScore="12.205825"> </code></pre><h2 id="2019-01-22">2019-01-22</h2> <ul> <li>Release <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v0.9.0">version 0.9.0 of the dspace-statistics-api</a> to address the issue of querying multiple Solr statistics shards</li> <li>I deployed it on DSpace Test (linode19) and restarted the indexer and now it shows all the stats from 2018 as well (756 pages of views, instead of 6)</li> <li>I deployed it on CGSpace (linode18) and restarted the indexer as well</li> <li>Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 155 40.77.167.106 176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8 189 107.21.16.70 217 54.83.93.85 310 46.174.208.142 346 83.103.94.48 360 45.5.186.2 595 154.113.73.30 716 196.191.127.37 915 35.237.175.180 </code></pre><ul> <li>35.237.175.180 is known to us</li> <li>I don’t think we’ve seen 
196.191.127.37 before. Its user agent is:</li> </ul> <pre><code>Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36 </code></pre><ul> <li>Interestingly this IP is located in Addis Ababa…</li> <li>Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:</li> </ul> <pre><code>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 </code></pre><h2 id="2019-01-23">2019-01-23</h2> <ul> <li>Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:</li> </ul> <blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/ILRI?src=hash&ref_src=twsrc%5Etfw">#ILRI</a> research: Towards unlocking the potential of the hides and skins value chain in Somaliland <a href="https://t.co/EZH7ALW4dp">https://t.co/EZH7ALW4dp</a></p>— ILRI Communications (@ILRI) <a href="https://twitter.com/ILRI/status/1086330519904673793?ref_src=twsrc%5Etfw">January 18, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <ul> <li>The shortened link is <a href="https://goo.gl/fb/VRj9Gq">goo.gl/fb/VRj9Gq</a> and it shows a “Dynamic Link not found” error from Firebase:</li> </ul> <p><img src="/cgspace-notes/2019/01/firebase-link-not-found.png" alt="Dynamic Link not found"></p> <ul> <li> <p>Apparently Google announced last year that they plan to <a href="https://developers.googleblog.com/2018/03/transitioning-google-url-shortener.html">discontinue the shortener and transition to Firebase Dynamic Links in March, 2019</a>, so maybe this is related…</p> </li> <li> <p>Very interesting discussion of methods for <a href="https://jdebp.eu/FGA/systemd-house-of-horror/tomcat.html">running Tomcat under systemd</a></p> </li> <li> <p>We can set the ulimit options that used to be in 
<code>/etc/default/tomcat7</code> with systemd’s <code>LimitNOFILE</code> and <code>LimitAS</code> (see the <code>systemd.exec</code> man page)</p> <ul> <li>Note that we need to use <code>infinity</code> instead of <code>unlimited</code> for the address space</li> </ul> </li> <li> <p>Create accounts for Bosun from IITA and Valerio from ICARDA / CGMEL on DSpace Test</p> </li> <li> <p>Maria Garruccio asked me for a list of author affiliations from all of their submitted items so she can clean them up</p> </li> <li> <p>I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:</p> </li> </ul> <pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv; COPY 1109 </code></pre><ul> <li>Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP</li> <li>Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 222 54.226.25.74 241 40.77.167.13 272 46.101.86.248 297 35.237.175.180 332 45.5.184.72 355 34.218.226.147 404 66.249.64.155 4637 205.186.128.185 4637 
70.32.83.92 9265 45.5.186.2 </code></pre><ul> <li> <p>I think it’s the usual IPs:</p> <ul> <li>45.5.186.2 is CIAT</li> <li>70.32.83.92 is CCAFS</li> <li>205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)</li> </ul> </li> <li> <p>Following up on the thumbnail issue that we had in <a href="/cgspace-notes/2018-12/">2018-12</a></p> </li> <li> <p>It looks like the two items with problematic PDFs both have thumbnails now:</p> <ul> <li><a href="https://hdl.handle.net/10568/98390">10568/98390</a></li> <li><a href="https://hdl.handle.net/10568/98391">10568/98391</a></li> </ul> </li> <li> <p>Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace’s <code>filter-media</code>:</p> </li> </ul> <pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390 $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391 </code></pre><ul> <li>Both of these were successful, so there must have been an update to ImageMagick or Ghostscript in Ubuntu since early 2018-12</li> <li>Looking at the apt history logs I see that on 2018-12-07 a security update for Ghostscript was installed (version 9.26~dfsg+0-0ubuntu0.16.04.3)</li> <li>I think this Launchpad discussion is relevant: <a href="https://bugs.launchpad.net/ubuntu/+source/ghostscript/+bug/1806517">https://bugs.launchpad.net/ubuntu/+source/ghostscript/+bug/1806517</a></li> <li>As well as the original Ghostscript bug report: <a href="https://bugs.ghostscript.com/show_bug.cgi?id=699815">https://bugs.ghostscript.com/show_bug.cgi?id=699815</a></li> </ul> <h2 id="2019-01-24">2019-01-24</h2> <ul> <li>I noticed Ubuntu’s Ghostscript 9.26 works on some troublesome PDFs where Arch’s Ghostscript 9.26 doesn’t, so the fix for the first/last page crash is not the patch I found yesterday</li> <li>Ubuntu’s Ghostscript uses another <a 
href="http://git.ghostscript.com/?p=ghostpdl.git;h=fae21f1668d2b44b18b84cf0923a1d5f3008a696">patch from Ghostscript git</a> (<a href="https://bugs.ghostscript.com/show_bug.cgi?id=700315">upstream bug report</a>)</li> <li>I re-compiled Arch’s ghostscript with the patch and then I was able to generate a thumbnail from one of the <a href="https://cgspace.cgiar.org/handle/10568/98390">troublesome PDFs</a></li> <li>Before and after:</li> </ul> <pre><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\] zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\] $ identify Food\ safety\ Kenya\ fruits.pdf\[0\] Food safety Kenya fruits.pdf[0]=>Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000 identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747. </code></pre><ul> <li>I reported it to the Arch Linux bug tracker (<a href="https://bugs.archlinux.org/task/61513">61513</a>)</li> <li>I told Atmire to go ahead with the Metadata Quality Module addition based on our <code>5_x-dev</code> branch (<a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657">657</a>)</li> <li>Linode sent alerts last night saying that CGSpace (linode18) was using high CPU, here are the top ten IPs from the nginx logs around that time:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 305 3.81.136.184 306 3.83.14.11 306 52.54.252.47 325 54.221.57.180 378 66.249.64.157 424 54.70.40.11 497 47.29.247.74 783 35.237.175.180 1108 66.249.64.155 2378 45.5.186.2 </code></pre><ul> <li>45.5.186.2 is CIAT and 66.249.64.155 is Google… hmmm.</li> <li>Linode sent another alert this morning, here are the top ten IPs active during that time:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort 
-n | tail -n 10 360 3.89.134.93 362 34.230.15.139 366 100.24.48.177 369 18.212.208.240 377 3.81.136.184 404 54.221.57.180 506 66.249.64.155 4642 70.32.83.92 4643 205.186.128.185 8593 45.5.186.2 </code></pre><ul> <li>Just double checking what CIAT is doing, they are mainly hitting the REST API:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n </code></pre><ul> <li>CIAT’s community currently has 12,000 items in it so this is normal</li> <li>The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again…</li> <li>For example: <a href="https://goo.gl/fb/VRj9Gq">https://goo.gl/fb/VRj9Gq</a></li> <li>The full <a href="http://id.loc.gov/vocabulary/relators.html">list of MARC Relators on the Library of Congress website</a> linked from the <a href="http://dublincore.org/usage/documents/relators/">DCMI relators page</a> is very confusing</li> <li>Looking at the default DSpace XMLUI crosswalk in <a href="https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/config/crosswalks/xhtml-head-item.properties">xhtml-head-item.properties</a> I see a very complete mapping of DSpace DC and QDC fields to DCTERMS <ul> <li>This is good for standards-compliant web crawlers, but what about for those harvesting via REST or OAI APIs?</li> </ul> </li> <li>I sent a message titled “<a href="https://groups.google.com/forum/#!topic/dspace-tech/phV_t51TGuE">DC, QDC, and DCTERMS: reviewing our metadata practices</a>” to the dspace-tech mailing list to ask about some of this</li> </ul> <h2 id="2019-01-25">2019-01-25</h2> <ul> <li>A little bit more work on getting Tomcat to run from a tarball on our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> <ul> <li>I tested by doing a Tomcat 7.0.91 installation, then switching it to 7.0.92 and it worked… nice!</li> <li>I refined the 
tasks so much that I was confident enough to deploy them on DSpace Test and it went very well</li> <li>Basically I just stopped tomcat7, created a dspace user, removed tomcat7, chown’d everything to the dspace user, then ran the playbook</li> <li>So now DSpace Test (linode19) is running Tomcat 7.0.92… w00t</li> <li>Now we need to monitor it for a few weeks to see if there is anything we missed, and then I can change CGSpace (linode18) as well, and we’re ready for Ubuntu 18.04 too!</li> </ul> </li> </ul> <h2 id="2019-01-27">2019-01-27</h2> <ul> <li>Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 189 40.77.167.108 191 157.55.39.2 263 34.218.226.147 283 45.5.184.2 332 45.5.184.72 608 5.9.6.51 679 66.249.66.223 1116 66.249.66.219 4644 205.186.128.185 4644 70.32.83.92 </code></pre><ul> <li>I think it’s the usual IPs: <ul> <li>70.32.83.92 is CCAFS</li> <li>205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)</li> </ul> </li> </ul> <h2 id="2019-01-28">2019-01-28</h2> <ul> <li>Udana from WLE asked me about the interaction between their publication website and their items on CGSpace <ul> <li>There is an item that is mapped into their collection from IWMI and is missing their <code>cg.identifier.wletheme</code> metadata</li> <li>I told him that, as far as I remember, when WLE introduced Phase II research themes in 2017 we decided to infer theme ownership from the collection hierarchy and we created a <a href="https://cgspace.cgiar.org/handle/10568/81268">WLE Phase II Research Themes</a> subCommunity</li> <li>Perhaps they need to ask Macaroni Bros about the mapping</li> </ul> </li> <li>Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the 
active IPs from the web server log at the time:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 67 207.46.13.50 105 41.204.190.40 117 34.218.226.147 126 35.237.175.180 203 213.55.99.121 332 45.5.184.72 377 5.9.6.51 512 45.5.184.2 4644 205.186.128.185 4644 70.32.83.92 </code></pre><ul> <li>There seems to be a pattern with <code>70.32.83.92</code> and <code>205.186.128.185</code> lately!</li> <li>Every morning at 8AM they are the top users… I should tell them to stagger their requests…</li> <li>I signed up for a <a href="https://visualping.io/">VisualPing</a> of the <a href="https://jdbc.postgresql.org/download.html">PostgreSQL JDBC driver download page</a> to my CGIAR email address <ul> <li>Hopefully this will one day alert me that a new driver is released!</li> </ul> </li> <li>Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 310 45.5.184.2 425 5.143.231.39 526 54.70.40.11 1003 199.47.87.141 1374 35.237.175.180 1455 5.9.6.51 1501 66.249.66.223 1771 66.249.66.219 2107 199.47.87.140 2540 45.5.186.2 </code></pre><ul> <li>Of course there is CIAT’s <code>45.5.186.2</code>, but also <code>45.5.184.2</code> appears to be CIAT… I wonder why they have two harvesters?</li> <li><code>199.47.87.140</code> and <code>199.47.87.141</code> are TurnItIn with the following user agent:</li> </ul> <pre><code>TurnitinBot (https://turnitin.com/robot/crawlerinfo.html) </code></pre><h2 id="2019-01-29">2019-01-29</h2> <ul> <li>Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the 
alert:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 334 45.5.184.72 429 66.249.66.223 522 35.237.175.180 555 34.218.226.147 655 66.249.66.221 844 5.9.6.51 2507 66.249.66.219 4645 70.32.83.92 4646 205.186.128.185 9329 45.5.186.2 </code></pre><ul> <li><code>45.5.186.2</code> is CIAT as usual…</li> <li><code>70.32.83.92</code> and <code>205.186.128.185</code> are CCAFS as usual…</li> <li><code>66.249.66.219</code> is Google…</li> <li>I’m thinking it might finally be time to increase the threshold of the Linode CPU alerts <ul> <li>I adjusted the alert threshold from 250% to 275%</li> </ul> </li> </ul> <h2 id="2019-01-30">2019-01-30</h2> <ul> <li>Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 273 46.101.86.248 301 35.237.175.180 334 45.5.184.72 387 5.9.6.51 527 2a01:4f8:13b:1296::2 1021 34.218.226.147 1448 66.249.66.219 4649 205.186.128.185 4649 70.32.83.92 5163 45.5.184.2 </code></pre><ul> <li>I might need to adjust the threshold again, because the load average this morning was 296% and the activity looks pretty normal (as always recently)</li> </ul> <h2 id="2019-01-31">2019-01-31</h2> <ul> <li>Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 436 18.196.196.108 460 157.55.39.168 460 207.46.13.96 500 197.156.105.116 728 54.70.40.11 1560 5.9.6.51 1562 35.237.175.180 1601 85.25.237.71 1894 66.249.66.219 2610 45.5.184.2 # zcat 
--force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "31/Jan/2019:0(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 318 207.46.13.242 334 45.5.184.72 486 35.237.175.180 609 34.218.226.147 620 66.249.66.219 1054 5.9.6.51 4391 70.32.83.92 4428 205.186.128.185 6758 85.25.237.71 9239 45.5.186.2 </code></pre><ul> <li><code>45.5.186.2</code> and <code>45.5.184.2</code> are CIAT as always</li> <li><code>85.25.237.71</code> is some new server in Germany that I’ve never seen before with the user agent:</li> </ul> <pre><code>Linguee Bot (http://www.linguee.com/bot; bot@linguee.com) </code></pre> </article> </div> <!-- /.blog-main --> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> </footer> </body> </html>