<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "January, 2019" / >
< meta property = "og:description" content = "2019-01-02
Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
177 35.237.175.180
177 40.77.167.32
216 66.249.75.219
225 18.203.76.93
261 46.101.86.248
357 207.46.13.1
903 54.70.40.11
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2019-01/" / >
< meta property = "article:published_time" content = "2019-01-02T09:48:30+02:00" / >
< meta property = "article:modified_time" content = "2019-10-28T13:39:25+02:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "January, 2019" / >
< meta name = "twitter:description" content = "2019-01-02
Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
177 35.237.175.180
177 40.77.167.32
216 66.249.75.219
225 18.203.76.93
261 46.101.86.248
357 207.46.13.1
903 54.70.40.11
"/>
< meta name = "generator" content = "Hugo 0.74.3" / >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2019",
"url": "https://alanorth.github.io/cgspace-notes/2019-01/",
"wordCount": "5532",
"datePublished": "2019-01-02T09:48:30+02:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2019-01/" >
< title > January, 2019 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel = "stylesheet" integrity = "sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity = "sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2019-01/" > January, 2019< / a > < / h2 >
< p class = "blog-post-meta" > < time datetime = "2019-01-02T09:48:30+02:00" > Wed Jan 02, 2019< / time > by Alan Orth in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/cgspace-notes/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2019-01-02" > 2019-01-02< / h2 >
< ul >
< li > Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning< / li >
< li > I don’t see anything interesting in the web server logs around that time though:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
177 35.237.175.180
177 40.77.167.32
216 66.249.75.219
225 18.203.76.93
261 46.101.86.248
357 207.46.13.1
903 54.70.40.11
< / code > < / pre > < ul >
< li > Analyzing the types of requests made by the top few IPs during that time:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
30 bitstream
534 discover
352 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 207.46.13.1 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
194 bitstream
345 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
261 handle
< / code > < / pre > < ul >
< li > It’s not clear to me what was causing the outbound traffic spike< / li >
< li > Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):< / li >
< / ul >
< pre > < code > Moving: 81742 into core statistics-2010
Moving: 1837285 into core statistics-2011
Moving: 3764612 into core statistics-2012
Moving: 4557946 into core statistics-2013
Moving: 5483684 into core statistics-2014
Moving: 2941736 into core statistics-2015
Moving: 5926070 into core statistics-2016
Moving: 10562554 into core statistics-2017
Moving: 18497180 into core statistics-2018
< / code > < / pre > < ul >
< li > This could be why the outbound traffic rate was high, due to the S3 backup that runs at 3:30AM…< / li >
< li > Run all system updates on DSpace Test (linode19) and reboot the server< / li >
< / ul >
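< ul >
< li > For reference, the two log tallies above could also be done in Python. This is only a minimal sketch, not what I ran: the function names and the sample lines (with made-up documentation IP addresses) are assumptions, and it expects the client IP to be the first field of each nginx log line, like the < code > awk< / code > step does:< / li >
< / ul >

```python
import re
from collections import Counter

# Hypothetical sample lines in nginx "combined" style; the IPs are reserved
# documentation addresses, not clients from the real logs.
SAMPLE = [
    '192.0.2.1 - - [02/Jan/2019:01:15:02 +0100] "GET /handle/10568/1 HTTP/1.1" 200 512',
    '192.0.2.1 - - [02/Jan/2019:02:20:10 +0100] "GET /discover?query=soil HTTP/1.1" 200 2048',
    '198.51.100.7 - - [02/Jan/2019:03:05:44 +0100] "GET /bitstream/handle/10568/1/a.pdf HTTP/1.1" 200 4096',
    '198.51.100.7 - - [02/Jan/2019:09:00:00 +0100] "GET /handle/10568/2 HTTP/1.1" 200 512',
]


def top_clients(lines, when_pattern, n=10):
    """Count requests per client IP for lines matching when_pattern (most frequent first)."""
    when = re.compile(when_pattern)
    counts = Counter()
    for line in lines:
        if when.search(line):
            fields = line.split()
            if fields:
                # First whitespace-separated field, same as awk '{print $1}'
                counts[fields[0]] += 1
    return counts.most_common(n)


def request_types(lines, when_pattern, ip):
    """Tally bitstream/discover/handle occurrences for one IP, like grep -o | sort | uniq -c."""
    when = re.compile(when_pattern)
    kinds = re.compile(r"bitstream|discover|handle")
    counts = Counter()
    for line in lines:
        # Plain substring test for the IP, then count every match on the line
        if when.search(line) and ip in line:
            counts.update(kinds.findall(line))
    return dict(counts)


if __name__ == "__main__":
    print(top_clients(SAMPLE, r"02/Jan/2019:0[123]"))
    print(request_types(SAMPLE, r"02/Jan/2019:0[123]", "198.51.100.7"))
```

< ul >
< li > Note that, like < code > grep -o< / code > , this counts every match on a line, so a < code > /bitstream/handle/…< / code > request adds to both the bitstream and handle tallies< / li >
< / ul >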
< h2 id = "2019-01-03" > 2019-01-03< / h2 >
< ul >
< li > Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:< / li >
< / ul >
< pre > < code > $ sudo docker pull postgres:9.6-alpine
$ sudo docker rm dspacedb
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
< / code > < / pre > < ul >
< li > Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire’s Listings and Reports still doesn’t work
< ul >
< li > After logging in via XMLUI and clicking the Listings and Reports link from the sidebar it redirects me to a JSPUI login page< / li >
< li > If I log in again there, Listings and Reports works… hmm.< / li >
< / ul >
< / li >
< li > The JSPUI application—which Listings and Reports depends upon—also does not load, though the error is perhaps unrelated:< / li >
< / ul >
< pre > < code > 2019-01-03 14:45:21,727 INFO org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
2019-01-03 14:45:21,971 INFO org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
2019-01-03 14:45:22,115 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error
-- Method: GET
-- Parameters were:
org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discovery/static-tagcloud-facet.jsp (line: [57], column: [8]) No tag [tagcloud] defined in tag library imported with prefix [dspace]
at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41)
at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291)
at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97)
at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347)
at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380)
at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481)
at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445)
at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683)
at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016)
at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291)
at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470)
at org.apache.jasper.compiler.Parser.parse(Parser.java:144)
at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244)
at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105)
at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350)
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334)
at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728)
at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470)
at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395)
at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316)
at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60)
at org.apache.jsp.index_jsp._jspService(index_jsp.java:191)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
< / code > < / pre > < ul >
< li > I notice that I get different JSESSIONID cookies for < code > /< / code > (XMLUI) and < code > /jspui< / code > (JSPUI) on Tomcat 8.5.37. I wonder if it’s the same on Tomcat 7.0.92… yes, it is.< / li >
< li > Hmm, on Tomcat 7.0.92 I see that I get a < code > dspace.current.user.id< / code > session cookie after logging into XMLUI, and then when I browse to JSPUI I am still logged in…
< ul >
< li > I didn’t see that cookie being set on Tomcat 8.5.37< / li >
< / ul >
< / li >
< li > I sent a message to the dspace-tech mailing list to ask< / li >
< / ul >
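< ul >
< li > One way to compare which cookies each Tomcat version sets would be to diff the cookie names found in the < code > Set-Cookie< / code > response headers. A rough sketch, assuming the headers have already been collected by hand; the header strings below are placeholders I made up for illustration (including the < code > Path< / code > attributes and values), not captured output:< / li >
< / ul >

```python
from http.cookies import SimpleCookie


def cookie_names(set_cookie_headers):
    """Return the set of cookie names found in a list of Set-Cookie header values."""
    names = set()
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)  # parses "name=value; attr=..." into morsels keyed by name
        names.update(jar.keys())
    return names


# Placeholder header values (assumptions), standing in for what each server sent
TOMCAT_7_HEADERS = [
    "JSESSIONID=AAAA; Path=/; HttpOnly",
    "dspace.current.user.id=1; Path=/",
]
TOMCAT_85_HEADERS = [
    "JSESSIONID=BBBB; Path=/; HttpOnly",
]

# Cookie names present on Tomcat 7 but absent on Tomcat 8.5
missing_on_85 = cookie_names(TOMCAT_7_HEADERS) - cookie_names(TOMCAT_85_HEADERS)

if __name__ == "__main__":
    print(sorted(missing_on_85))
```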
< h2 id = "2019-01-04" > 2019-01-04< / h2 >
< ul >
< li > Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
189 207.46.13.192
217 31.6.77.23
340 66.249.70.29
349 40.77.167.86
417 34.218.226.147
630 207.46.13.173
710 35.237.175.180
790 40.77.167.87
1776 66.249.70.27
2099 54.70.40.11
< / code > < / pre > < ul >
< li > I’m thinking about trying to validate our < code > dc.subject< / code > terms against < a href = "http://aims.fao.org/agrovoc/webservices" > AGROVOC webservices< / a > < / li >
< li > There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for < code > SOIL< / code > :< / li >
< / ul >
< pre > < code > $ http 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en'
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Length: 493
Content-Type: application/json; charset=utf-8
Date: Fri, 04 Jan 2019 13:44:27 GMT
Keep-Alive: timeout=5, max=100
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubdomains
Vary: Accept
X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org
{
    "@context": {
        "@language": "en",
        "altLabel": "skos:altLabel",
        "hiddenLabel": "skos:hiddenLabel",
        "isothes": "http://purl.org/iso25964/skos-thes#",
        "onki": "http://schema.onki.fi/onki#",
        "prefLabel": "skos:prefLabel",
        "results": {
            "@container": "@list",
            "@id": "onki:results"
        },
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "type": "@type",
        "uri": "@id"
    },
    "results": [
        {
            "lang": "en",
            "prefLabel": "soil",
            "type": [
                "skos:Concept"
            ],
            "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
            "vocab": "agrovoc"
        }
    ],
    "uri": ""
}
< / code > < / pre > < ul >
< li > The API does not appear to be case sensitive (searches for < code > SOIL< / code > and < code > soil< / code > return the same thing)< / li >
< li > I’m a bit confused that there’s no obvious return code or status when a term is not found, for example < code > SOILS< / code > :< / li >
< / ul >
< pre > < code > HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Length: 367
Content-Type: application/json; charset=utf-8
Date: Fri, 04 Jan 2019 13:48:31 GMT
Keep-Alive: timeout=5, max=100
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubdomains
Vary: Accept
X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org
{
    "@context": {
        "@language": "en",
        "altLabel": "skos:altLabel",
        "hiddenLabel": "skos:hiddenLabel",
        "isothes": "http://purl.org/iso25964/skos-thes#",
        "onki": "http://schema.onki.fi/onki#",
        "prefLabel": "skos:prefLabel",
        "results": {
            "@container": "@list",
            "@id": "onki:results"
        },
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "type": "@type",
        "uri": "@id"
    },
    "results": [],
    "uri": ""
}
< / code > < / pre > < ul >
< li > I guess the < code > results< / code > object will just be empty… < / li >
< li > Another way would be to try with SPARQL, perhaps using the Python 2.7 < a href = "https://pypi.org/project/sparql-client/" > sparql-client< / a > :< / li >
< / ul >
< pre > < code > $ python2.7 -m virtualenv /tmp/sparql
$ . /tmp/sparql/bin/activate
$ pip install sparql-client ipython
$ ipython
In [10]: import sparql
In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET")
In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
    ...: 'SELECT '
    ...: '?label '
    ...: 'WHERE { '
    ...: '{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } '
    ...: 'FILTER regex(str(?label), "^fish", "i") . '
    ...: '} LIMIT 10')
In [13]: result = s.query(statement)
In [14]: for row in result.fetchone():
    ...:     print(row)
    ...:
(<Literal "fish catching"@en>,)
(<Literal "fish harvesting"@en>,)
(<Literal "fish meat"@en>,)
(<Literal "fish roe"@en>,)
(<Literal "fish conversion"@en>,)
(<Literal "fisheries catches (composition)"@en>,)
(<Literal "fishtail palm"@en>,)
(<Literal "fishflies"@en>,)
(<Literal "fishery biology"@en>,)
(<Literal "fish production"@en>,)
< / code > < / pre > < ul >
< li > The SPARQL query comes from my notes in < a href = "/cgspace-notes/2017-08/" > 2017-08< / a > < / li >
< / ul >
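< ul >
< li > Since the REST search returns HTTP 200 either way, a validation script would have to look at the < code > results< / code > list itself. A minimal sketch of that check, using trimmed copies of the two responses above as inline strings instead of live requests (the function names are my own, and the < code > @context< / code > block is left out for brevity):< / li >
< / ul >

```python
import json

# Trimmed copies of the SOIL and SOILS response bodies shown above
FOUND = (
    '{"results": [{"lang": "en", "prefLabel": "soil", "type": ["skos:Concept"], '
    '"uri": "http://aims.fao.org/aos/agrovoc/c_7156", "vocab": "agrovoc"}], "uri": ""}'
)
NOT_FOUND = '{"results": [], "uri": ""}'


def matched_labels(response_text):
    """Return the prefLabel of every entry in the response's "results" list."""
    payload = json.loads(response_text)
    return [hit["prefLabel"] for hit in payload.get("results", [])]


def term_exists(response_text):
    """A term counts as found only when the "results" list is non-empty."""
    return bool(matched_labels(response_text))


if __name__ == "__main__":
    print(matched_labels(FOUND), term_exists(FOUND))
    print(matched_labels(NOT_FOUND), term_exists(NOT_FOUND))
```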
< h2 id = "2019-01-06" > 2019-01-06< / h2 >
< ul >
< li > I built a clean DSpace 5.8 installation from the upstream < code > dspace-5.8< / code > tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37
< ul >
< li > If I log into XMLUI and then navigate to JSPUI I need to log in again< / li >
< li > XMLUI does not set the < code > dspace.current.user.id< / code > session cookie in Tomcat 8.5.37 for some reason< / li >
< li > I sent an update to the dspace-tech mailing list to ask for more help troubleshooting< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-01-07" > 2019-01-07< / h2 >
< ul >
< li > I built a clean DSpace 6.3 installation from the upstream < code > dspace-6.3< / code > tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37
< ul >
< li > If I log into XMLUI and then navigate to JSPUI I need to log in again< / li >
< li > XMLUI does not set the < code > dspace.current.user.id< / code > session cookie in Tomcat 8.5.37 for some reason< / li >
< li > I sent an update to the dspace-tech mailing list to ask for more help troubleshooting< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-01-08" > 2019-01-08< / h2 >
< ul >
< li > Tim Donohue responded to my thread about the cookies on the dspace-tech mailing list
< ul >
< li > He suspects it’s a change of behavior in Tomcat 8.5, and indeed I see a mention of new cookie processing in the < a href = "https://tomcat.apache.org/migration-85.html#Cookies" > Tomcat 8.5 migration guide< / a > < / li >
< li > I tried to switch my XMLUI and JSPUI contexts to use the < code > LegacyCookieProcessor< / code > , but it didn’t seem to help< / li >
< li > I < a href = "https://jira.duraspace.org/browse/DS-4140" > filed DS-4140 on the DSpace issue tracker< / a > < / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-01-11" > 2019-01-11< / h2 >
< ul >
< li > Tezira wrote to say she has stopped receiving the < code > DSpace Submission Approved and Archived< / code > emails from CGSpace as of January 2nd
< ul >
< li > I told her that I haven’t done anything to disable it lately, but that I would check< / li >
< li > Bizu also says she hasn’t received them lately< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-01-14" > 2019-01-14< / h2 >
< ul >
< li > Day one of CGSpace AReS meeting in Amman< / li >
< / ul >
< h2 id = "2019-01-15" > 2019-01-15< / h2 >
< ul >
< li > Day two of CGSpace AReS meeting in Amman
< ul >
< li > Discuss possibly extending the < a href = "https://github.com/ilri/dspace-statistics-api" > dspace-statistics-api< / a > to make community and collection statistics available< / li >
< li > Discuss new “final” CG Core document and some changes that we’ll need to do on CGSpace and other repositories< / li >
< li > We agreed to try to stick to pure Dublin Core where possible, then use fields that exist in standard DSpace, and use the “cg” namespace for everything else< / li >
< li > Major changes are to move < code > dc.contributor.author< / code > to < code > dc.creator< / code > (which MELSpace and WorldFish are already using in their DSpace repositories)< / li >
< / ul >
< / li >
< li > I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in < a href = "/cgspace-notes/2018-10/" > 2018-10< / a > :< / li >
< / ul >
< pre > < code > $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
0.16s user 0.03s system 3% cpu 5.185 total
0.17s user 0.02s system 2% cpu 7.123 total
0.18s user 0.02s system 6% cpu 3.047 total
< / code > < / pre > < ul >
< li > In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
157 31.6.77.23
192 54.70.40.11
202 66.249.64.157
207 40.77.167.204
220 157.55.39.140
326 197.156.105.116
385 207.46.13.158
1211 35.237.175.180
1830 66.249.64.155
2482 45.5.186.2
< / code > < / pre > < h2 id = "2019-01-16" > 2019-01-16< / h2 >
< ul >
< li > Day three of CGSpace AReS meeting in Amman
< ul >
< li > We discussed CG Core 2.0 metadata and decided on some action points< / li >
< li > We discussed branding of AReS tool< / li >
< / ul >
< / li >
< li > Notes from our CG Core 2.0 metadata discussion:
< ul >
< li > Not Dublin Core:
< ul >
< li > dc.subtype< / li >
< li > dc.peer-reviewed< / li >
< / ul >
< / li >
< li > Dublin Core, possible action for CGSpace:
< ul >
< li > dc.description:
< ul >
< li > We use dc.description.abstract, dc.description (Notes), dc.description.version (Peer review status), dc.description.sponsorship (Funder)< / li >
< li > Maybe move abstract to dc.description< / li >
< li > Maybe notes moves to cg.description.notes???< / li >
< li > Maybe move dc.description.version to cg.peer-reviewed or cg.peer-review-status???< / li >
< li > Move dc.description.sponsorship to cg.contributor.donor???< / li >
< / ul >
< / li >
< li > dc.subject:
< ul >
< li > Wait for guidance, evaluate technical implications (Google indexing, OAI, etc)< / li >
< / ul >
< / li >
< li > Move dc.contributor.author to dc.creator< / li >
< li > dc.contributor Project
< ul >
< li > Recommend against creating new fields for all projects< / li >
< li > We use collections projects/themes/etc< / li >
< / ul >
< / li >
< li > dc.contributor Project Lead Center
< ul >
< li > MELSpace uses cg.contributor.project-lead-institute (institute is more generic than center)< / li >
< li > Maybe we use?< / li >
< / ul >
< / li >
< li > dc.contributor Partner
< ul >
< li > Wait for guidance< / li >
< li > MELSpace uses cg.contibutor.center (?)< / li >
< / ul >
< / li >
< li > dc.contributor Donor
< ul >
< li > Use cg.contributor.donor< / li >
< / ul >
< / li >
< li > dc.date
< ul >
< li > Wait for guidance, maybe move dc.date.issued?< / li >
< li > dc.date.accessioned and dc.date.available are automatic in DSpace< / li >
< / ul >
< / li >
< li > dc.language
< ul >
< li > Move dc.language.iso to dc.language< / li >
< / ul >
< / li >
< li > dc.identifier
< ul >
< li > Move cg.identifier.url to dc.identifier< / li >
< / ul >
< / li >
< li > dc.identifier bibliographicCitation
< ul >
< li > dc.identifier.citation should move to dc.bibliographicCitation< / li >
< / ul >
< / li >
< li > dc.description.notes
< ul >
< li > Wait for guidance, maybe move to cg.description.notes ???< / li >
< / ul >
< / li >
< li > dc.relation
< ul >
< li > Maybe move cg.link.reference< / li >
< li > Perhaps consolidate cg.link.audio etc there…?< / li >
< / ul >
< / li >
< li > dc.relation.isPartOf
< ul >
< li > Move dc.relation.ispartofseries to dc.relation.isPartOf< / li >
< / ul >
< / li >
< li > dc.audience
< ul >
< li > Move cg.targetaudience to dc.audience< / li >
< / ul >
< / li >
< / ul >
< / li >
< / ul >
< / li >
< li > Something happened to the Solr usage statistics on CGSpace
< ul >
< li > I looked on the server and the Solr cores are there (56GB!), and I don’t see any obvious errors in dmesg or anything< / li >
< li > I see that the server hasn’t been rebooted in 26 days so I rebooted it< / li >
< / ul >
< / li >
< li > After reboot the Solr stats are still messed up in the Atmire Usage Stats module, it only shows 2019-01!< / li >
< / ul >
< p > < img src = "/cgspace-notes/2019/01/solr-stats-incorrect.png" alt = "Solr stats fucked up" > < / p >
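< ul >
< li > Going back to the CG Core 2.0 notes from earlier today, the field moves that were written down without a question mark could be kept as a simple lookup table for a future migration script. This is just a sketch: the names < code > PROPOSED_MOVES< / code > and < code > rename_fields< / code > and the sample record are hypothetical, and the moves still marked “maybe” or “wait for guidance” are deliberately left out:< / li >
< / ul >

```python
# Old field -> new field, only for the moves stated outright in the meeting notes
PROPOSED_MOVES = {
    "dc.contributor.author": "dc.creator",
    "dc.language.iso": "dc.language",
    "cg.identifier.url": "dc.identifier",
    "dc.identifier.citation": "dc.bibliographicCitation",
    "dc.relation.ispartofseries": "dc.relation.isPartOf",
    "cg.targetaudience": "dc.audience",
}


def rename_fields(metadata):
    """Return a copy of a {field: [values]} dict with the proposed moves applied.

    Fields without a proposed move are kept under their current name; if two
    fields end up under the same name their value lists are concatenated.
    """
    renamed = {}
    for field, values in metadata.items():
        target = PROPOSED_MOVES.get(field, field)
        renamed.setdefault(target, []).extend(values)
    return renamed


if __name__ == "__main__":
    # Hypothetical record, not taken from CGSpace
    record = {
        "dc.contributor.author": ["Doe, Jane"],
        "dc.language.iso": ["en"],
        "dc.title": ["An example title"],
    }
    print(rename_fields(record))
```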
< ul >
< li > In the Solr admin UI I see the following error:< / li >
< / ul >
< pre > < code > statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
< / code > < / pre > < ul >
< li > Looking in the Solr log I see this:< / li >
< / ul >
< pre > < code > 2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
at org.apache.solr.core.SolrCore.< init> (SolrCore.java:845)
... 31 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:89)
at org.apache.lucene.index.IndexWriter.< init> (IndexWriter.java:753)
at org.apache.solr.update.SolrIndexWriter.< init> (SolrIndexWriter.java:77)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 33 more
2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Unable to create core [statistics-2018]
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
... 27 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.< init> (SolrCore.java:873)
at org.apache.solr.core.SolrCore.< init> (SolrCore.java:646)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
... 29 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
at org.apache.solr.core.SolrCore.< init> (SolrCore.java:845)
... 31 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:89)
at org.apache.lucene.index.IndexWriter.< init> (IndexWriter.java:753)
at org.apache.solr.update.SolrIndexWriter.< init> (SolrIndexWriter.java:77)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 33 more
< / code > < / pre > < ul >
< li > I found some threads on StackOverflow etc. discussing this, and several suggested increasing the address space for the shell with ulimit< / li >
< li > I added < code > ulimit -v unlimited< / code > to < code > /etc/default/tomcat7< / code > and restarted Tomcat, and now Solr is working again:< / li >
2019-01-16 16:47:30 +02:00
< / ul >
2019-11-28 17:30:45 +02:00
< p > < img src = "/cgspace-notes/2019/01/solr-stats-incorrect.png" alt = "Solr stats working" > < / p >
2019-01-16 17:10:50 +02:00
< ul >
2019-01-17 13:28:41 +02:00
< li > Some StackOverflow discussions related to this:
< ul >
< li > < a href = "https://stackoverflow.com/questions/2895417/solrexception-internal-server-error/3035916#3035916" > https://stackoverflow.com/questions/2895417/solrexception-internal-server-error/3035916#3035916< / a > < / li >
< li > < a href = "https://stackoverflow.com/questions/11683850/how-much-memory-could-vm-use" > https://stackoverflow.com/questions/11683850/how-much-memory-could-vm-use< / a > < / li >
< li > < a href = "https://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed/8893684#8893684" > https://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed/8893684#8893684< / a > < / li >
2019-11-28 17:30:45 +02:00
< / ul >
< / li >
2019-01-16 17:10:50 +02:00
< li > Abenet was asking if the Atmire Usage Stats are correct because they are over 2 million the last few months… < / li >
< li > For 2019-01 alone the Usage Stats are already around 1.2 million< / li >
2020-01-27 16:20:44 +02:00
< li > I tried to look in the nginx logs to see how many raw requests there are so far this month and it’s about 1.4 million:< / li >
2019-11-28 17:30:45 +02:00
< / ul >
2019-01-16 17:10:50 +02:00
< pre > < code > # time zcat --force /var/log/nginx/* | grep -cE " [0-9]{1,2}/Jan/2019"
1442874
real 0m17.161s
user 0m16.205s
sys 0m2.396s
< / code > < / pre > < h2 id = "2019-01-17" > 2019-01-17< / h2 >
2019-01-17 13:28:41 +02:00
< ul >
< li > Send reminder to Atmire about purchasing the < a href = "https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657" > MQM module< / a > < / li >
2019-01-17 19:53:00 +02:00
< li > Trying to decide the solid action points for CGSpace on the CG Core 2.0 metadata… < / li >
2020-01-27 16:20:44 +02:00
< li > It’s difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!< / li >
< li > Also, there is not a good Dublin Core reference (or maybe I just don’t understand?)< / li >
2019-01-17 13:28:41 +02:00
< li > Several authoritative documents on Dublin Core appear to be:
< ul >
< li > < a href = "http://dublincore.org/documents/dces/" > Dublin Core Metadata Element Set, Version 1.1: Reference Description< / a > < / li >
< li > < a href = "http://www.dublincore.org/documents/dcmi-terms/" > DCMI Metadata Terms< / a > < / li >
2019-11-28 17:30:45 +02:00
< / ul >
< / li >
2019-01-17 13:28:41 +02:00
< li > And what is the relationship between DC and DCTERMS?< / li >
< li > DSpace uses DCTERMS in the metadata it embeds in XMLUI item views!< / li >
< li > We really need to look at this more carefully and see what impact switching core fields like languages, abstract, authors, etc. might have< / li >
< li > We can check WorldFish and MELSpace repositories to see what effects these changes have had on theirs because they have already adopted some of these changes… < / li >
2019-01-17 19:53:00 +02:00
< li > I think I finally understand the difference between DC and DCTERMS: DC is the original set of fifteen elements and DCTERMS is the newer version that was supposed to address many of the drawbacks of the original with regard to digital content< / li >
< li > We might be able to use some proper fields for citation, abstract, etc that are part of DCTERMS< / li >
< li > To make matters more confusing, there is also “qualified Dublin Core” that uses the original fifteen elements of legacy DC and qualifies them, like < code > dc.date.accessioned< / code >
< ul >
< li > According to Wikipedia < a href = "https://en.wikipedia.org/wiki/Dublin_Core" > Qualified Dublin Core was superseded by DCTERMS in 2008< / a > !< / li >
2019-11-28 17:30:45 +02:00
< / ul >
< / li >
2019-01-17 19:53:00 +02:00
< li > So we should be trying to use DCTERMS where possible, unless it is some internal thing that might mess up DSpace (like dates)< / li >
< li > “Elements 1.1” means legacy DC< / li >
< li > Possible action list for CGSpace:
< ul >
< li > dc.description.abstract → dcterms.abstract< / li >
< li > dc.description.version → cg.peer-reviewed (or cg.peer-review-status?)< / li >
< li > dc.description.sponsorship → cg.contributor.donor< / li >
< li > dc.contributor.author → dc.creator< / li >
< li > dc.language.iso → dcterms.language< / li >
< li > cg.identifier.url → dcterms.identifier< / li >
< li > dc.identifier.citation → dcterms.bibliographicCitation< / li >
< li > dc.relation.ispartofseries → dcterms.isPartOf< / li >
< li > cg.targetaudience → dcterms.audience< / li >
2019-01-17 13:28:41 +02:00
< / ul >
2019-11-28 17:30:45 +02:00
< / li >
2019-01-20 15:48:52 +02:00
< / ul >
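< p > A minimal sketch of how the possible action list above could be written down as data, so that a proposed rename can be tried on sample records before anything is decided. The left-to-right pairs come from the list; the < code > migrate_fields< / code > helper and the dict-of-lists record shape are only assumptions for illustration, not how DSpace stores metadata.< / p >

```python
# Proposed field moves from the possible action list (nothing here is final).
FIELD_MIGRATIONS = {
    "dc.description.abstract": "dcterms.abstract",
    "dc.description.version": "cg.peer-reviewed",  # or cg.peer-review-status?
    "dc.description.sponsorship": "cg.contributor.donor",
    "dc.contributor.author": "dc.creator",
    "dc.language.iso": "dcterms.language",
    "cg.identifier.url": "dcterms.identifier",
    "dc.identifier.citation": "dcterms.bibliographicCitation",
    "dc.relation.ispartofseries": "dcterms.isPartOf",
    "cg.targetaudience": "dcterms.audience",
}


def migrate_fields(record):
    """Return a copy of a record with field names renamed per FIELD_MIGRATIONS.

    `record` is a hypothetical mapping of field name -> list of values.
    Fields without a proposed move are kept under their current name.
    """
    migrated = {}
    for field, values in record.items():
        new_field = FIELD_MIGRATIONS.get(field, field)
        # Merge values if two old fields end up under the same new name.
        migrated.setdefault(new_field, []).extend(values)
    return migrated
```

< p > Keeping the mapping in one place would also make it easy to compare against what WorldFish and MELSpace ended up using.< / p >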
2019-12-17 14:49:24 +02:00
< h2 id = "2019-01-19" > 2019-01-19< / h2 >
2019-11-28 17:30:45 +02:00
< ul >
< li >
2020-01-27 16:20:44 +02:00
< p > There’s no official set of Dublin Core qualifiers so I can’t tell if things like < code > dc.contributor.author< / code > that are used by DSpace are official< / p >
2019-11-28 17:30:45 +02:00
< / li >
< li >
< p > I found a great < a href = "https://www.dri.ie/sites/default/files/files/qualified-dublin-core-metadata-guidelines.pdf" > presentation from 2015 by the Digital Repository of Ireland< / a > that discusses using MARC Relator Terms with Dublin Core elements< / p >
< / li >
< li >
< p > It seems that < code > dc.contributor.author< / code > would be a supported term according to this < a href = "https://memory.loc.gov/diglib/loc.terms/relators/dc-contributor.html" > Library of Congress list< / a > linked from the < a href = "http://dublincore.org/usage/documents/relators/" > Dublin Core website< / a > < / p >
< / li >
< li >
< p > The Library of Congress document specifically says:< / p >
2019-01-20 15:48:52 +02:00
< p > These terms conform with the DCMI Abstract Model and may be used in DCMI application profiles. DCMI endorses their use with Dublin Core elements as indicated.< / p >
2019-11-28 17:30:45 +02:00
< / li >
< / ul >
2019-12-17 14:49:24 +02:00
< h2 id = "2019-01-20" > 2019-01-20< / h2 >
2019-01-20 15:48:52 +02:00
< ul >
2020-01-27 16:20:44 +02:00
< li > That’s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:< / li >
2019-11-28 17:30:45 +02:00
< / ul >
2019-01-20 15:48:52 +02:00
< pre > < code > # w
04:46:14 up 213 days, 7:25, 4 users, load average: 1.94, 1.50, 1.35
< / code > < / pre > < ul >
2020-01-27 16:20:44 +02:00
< li > I’ve definitely rebooted it several times in the past few months… according to < code > journalctl -b< / code > it was a few weeks ago on 2019-01-02< / li >
2019-11-28 17:30:45 +02:00
< li > I re-ran the Ansible DSpace tag, ran all system updates, and rebooted the host< / li >
< li > After rebooting I notice that the Linode kernel went down from 4.19.8 to 4.18.16… < / li >
< li > Atmire sent a quote on our < a href = "https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657" > ticket about purchasing the Metadata Quality Module (MQM) for DSpace 5.8< / a > < / li >
< li > Abenet asked me for an < a href = "https://cgspace.cgiar.org/open-search/discover?query=crpsubject:Livestock&sort_by=3&order=DESC" > OpenSearch query that could generate an RSS feed for items in the Livestock CRP< / a > < / li >
< li > According to my notes, < code > sort_by=3< / code > is accession date (as configured in < code > dspace.cfg< / code > )< / li >
< li > The query currently shows 3023 items, but a < a href = "https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&filter_relational_operator_1=equals&filter_1=Livestock&submit_apply_filter=&query=" > Discovery search for Livestock CRP only returns 858 items< / a > < / li >
< li > That query seems to return items tagged with < code > Livestock and Fish< / code > CRP as well… hmm.< / li >
2019-01-20 15:48:52 +02:00
< / ul >
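< p > A small sketch of building that OpenSearch URL programmatically, in case we want feeds for other CRPs. The parameter names (< code > query< / code > , < code > sort_by< / code > , < code > order< / code > ) are taken from the link above; the < code > opensearch_url< / code > helper is hypothetical, and this does nothing about the query also matching < code > Livestock and Fish< / code > .< / p >

```python
from urllib.parse import urlencode


def opensearch_url(base, query, sort_by=3, order="DESC"):
    """Build a DSpace OpenSearch discover URL.

    sort_by=3 is accession date on CGSpace according to the notes above;
    it depends on the sort options configured in dspace.cfg.
    """
    params = {"query": query, "sort_by": sort_by, "order": order}
    # urlencode percent-encodes the colon in "field:value" queries.
    return f"{base}/open-search/discover?{urlencode(params)}"


if __name__ == "__main__":
    print(opensearch_url("https://cgspace.cgiar.org", "crpsubject:Livestock"))
```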
2019-12-17 14:49:24 +02:00
< h2 id = "2019-01-21" > 2019-01-21< / h2 >
2019-01-21 12:54:29 +02:00
< ul >
2020-01-27 16:20:44 +02:00
< li > Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to become compatible with Ubuntu 18.04’s Tomcat 8.5< / li >
2019-11-28 17:30:45 +02:00
< li > I could either run with a simple < code > tomcat7.service< / code > like this:< / li >
< / ul >
2019-01-21 12:54:29 +02:00
< pre > < code > [Unit]
Description=Apache Tomcat 7 Web Application Container
After=network.target
[Service]
Type=forking
ExecStart=/path/to/apache-tomcat-7.0.92/bin/startup.sh
ExecStop=/path/to/apache-tomcat-7.0.92/bin/shutdown.sh
User=aorth
Group=aorth
[Install]
WantedBy=multi-user.target
< / code > < / pre > < ul >
2020-01-27 16:20:44 +02:00
< li > Or try to adapt a real systemd service like Arch Linux’s:< / li >
2019-11-28 17:30:45 +02:00
< / ul >
2019-01-21 12:54:29 +02:00
< pre > < code > [Unit]
Description=Tomcat 7 servlet container
After=network.target
[Service]
Type=forking
PIDFile=/var/run/tomcat7.pid
Environment=CATALINA_PID=/var/run/tomcat7.pid
Environment=TOMCAT_JAVA_HOME=/usr/lib/jvm/default-runtime
Environment=CATALINA_HOME=/usr/share/tomcat7
Environment=CATALINA_BASE=/usr/share/tomcat7
Environment=CATALINA_OPTS=
Environment=ERRFILE=SYSLOG
Environment=OUTFILE=SYSLOG
ExecStart=/usr/bin/jsvc \
-Dcatalina.home=${CATALINA_HOME} \
-Dcatalina.base=${CATALINA_BASE} \
-Djava.io.tmpdir=/var/tmp/tomcat7/temp \
-cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \
-user tomcat7 \
-java-home ${TOMCAT_JAVA_HOME} \
-pidfile /var/run/tomcat7.pid \
-errfile ${ERRFILE} \
-outfile ${OUTFILE} \
$CATALINA_OPTS \
org.apache.catalina.startup.Bootstrap
ExecStop=/usr/bin/jsvc \
-pidfile /var/run/tomcat7.pid \
-stop \
org.apache.catalina.startup.Bootstrap
[Install]
WantedBy=multi-user.target
< / code > < / pre > < ul >
< li > I see that < code > jsvc< / code > and < code > libcommons-daemon-java< / code > are both available on Ubuntu so that should be easy to port< / li >
2020-01-27 16:20:44 +02:00
< li > We probably don’t need Eclipse Java Bytecode Compiler (ecj)< / li >
2019-11-28 17:30:45 +02:00
< li > I tested Tomcat 7.0.92 on Arch Linux using the < code > tomcat7.service< / code > with < code > jsvc< / code > and it works… nice!< / li >
< li > I think I might manage this the same way I do the restic releases in the < a href = "https://github.com/ilri/rmg-ansible-public" > Ansible infrastructure scripts< / a > , where I download a specific version and symlink to some generic location without the version number< / li >
< li > I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:< / li >
< / ul >
2019-01-21 12:54:29 +02:00
< pre > < code > $ http 'http://localhost:3000/solr/statistics/select?indent=on& rows=0& q=type:2+id:11576& fq=isBot:false& fq=statistics_type:view' | grep numFound
< result name=" response" numFound=" 33" start=" 0" >
$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on& rows=0& q=type:2+id:11576& fq=isBot:false& fq=statistics_type:view' | grep numFound
< result name=" response" numFound=" 241" start=" 0" >
< / code > < / pre > < ul >
< li > I opened an issue on the GitHub issue tracker (< a href = "https://github.com/ilri/dspace-statistics-api/issues/10" > #10< / a > )< / li >
2020-01-27 16:20:44 +02:00
< li > I don’t think the < a href = "https://solrclient.readthedocs.io/en/latest/" > SolrClient library< / a > we are currently using supports these types of queries so we might have to just do raw queries with requests< / li >
2019-11-28 17:30:45 +02:00
< li > The < a href = "https://github.com/django-haystack/pysolr" > pysolr< / a > library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):< / li >
< / ul >
2019-01-21 14:16:56 +02:00
< pre > < code > import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
print(results.facets['facet_fields'])
{'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
< / code > < / pre > < ul >
< li > If I double check one item from above, for example < code > 77572< / code > , it appears this is only working on the current statistics core and not the shards:< / li >
< / ul >
2019-01-21 14:16:56 +02:00
< pre > < code > import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
print(results.hits)
646
solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
print(results.hits)
595
< / code > < / pre > < ul >
< li > So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON< / li >
< li > This enumerates the list of Solr cores and returns JSON format:< / li >
< / ul >
2019-01-21 23:54:39 +02:00
< pre > < code > http://localhost:3000/solr/admin/cores?action=STATUS& wt=json
< / code > < / pre > < ul >
< li > I think I figured out how to search across shards: I needed to give the full URL of each additional core< / li >
< li > Now I get more results when I start adding the other statistics cores:< / li >
< / ul >
2019-01-21 23:54:39 +02:00
< pre > < code > $ http 'http://localhost:3000/solr/statistics/select?& indent=on& rows=0& q=*:*' | grep numFound
< result name=" response" numFound=" 2061320" start=" 0" >
$ http 'http://localhost:3000/solr/statistics/select?& shards=localhost:8081/solr/statistics-2018& indent=on& rows=0& q=*:*' | grep numFound
< result name=" response" numFound=" 16280292" start=" 0" maxScore=" 1.0" >
$ http 'http://localhost:3000/solr/statistics/select?& shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017& indent=on& rows=0& q=*:*' | grep numFound
< result name=" response" numFound=" 25606142" start=" 0" maxScore=" 1.0" >
$ http 'http://localhost:3000/solr/statistics/select?& shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016& indent=on& rows=0& q=*:*' | grep numFound
< result name=" response" numFound=" 31532212" start=" 0" maxScore=" 1.0" >
< / code > < / pre > < ul >
< li > I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the < code > shards< / code > parameter to each query to make the search distributed among the cores< / li >
< li > I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a < code > shards< / code > query string< / li >
< li > A few things I noticed:
2019-01-21 23:54:39 +02:00
< ul >
2020-01-27 16:20:44 +02:00
< li > Solr doesn’t mind if you use an empty < code > shards< / code > parameter< / li >
< li > Solr doesn’t mind if you have an extra comma at the end of the < code > shards< / code > parameter< / li >
2019-01-21 23:54:39 +02:00
< li > If you are searching multiple cores, you need to include the base core in the < code > shards< / code > parameter as well< / li >
2019-11-28 17:30:45 +02:00
< li > For example, compare the following two queries, first including the base core and the shard in the < code > shards< / code > parameter, and then only including the shard:< / li >
< / ul >
< / li >
< / ul >
2019-01-21 23:54:39 +02:00
< pre > < code > $ http 'http://localhost:8081/solr/statistics/select?indent=on& rows=0& q=type:2+id:11576& fq=isBot:false& fq=statistics_type:view& shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
< result name=" response" numFound=" 275" start=" 0" maxScore=" 12.205825" >
$ http 'http://localhost:8081/solr/statistics/select?indent=on& rows=0& q=type:2+id:11576& fq=isBot:false& fq=statistics_type:view& shards=localhost:8081/solr/statistics-2018' | grep numFound
< result name=" response" numFound=" 241" start=" 0" maxScore=" 12.205825" >
< / code > < / pre > < h2 id = "2019-01-22" > 2019-01-22< / h2 >
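< p > A rough sketch of the shards idea from yesterday's proof of concept: take the parsed JSON from the Solr core STATUS request, pick out the yearly statistics cores, and join them into a < code > shards< / code > value that also includes the base core. The < code > build_shards_param< / code > name, the default host, and the assumption that the STATUS response has a top-level < code > status< / code > object keyed by core name are mine; this is not necessarily how the dspace-statistics-api does it.< / p >

```python
import re


def build_shards_param(status, solr_host="localhost:8081"):
    """Build a value for Solr's `shards` parameter from a core STATUS response.

    `status` is the parsed JSON of /solr/admin/cores?action=STATUS&wt=json,
    assumed to look like {"status": {"statistics": {...}, "statistics-2018": {...}}}.
    """
    # The base core has to be listed as well when searching several cores,
    # otherwise only the yearly shards are counted.
    cores = ["statistics"]
    for name in sorted(status.get("status", {})):
        # Only yearly shards such as statistics-2018; skip unrelated cores.
        if re.fullmatch(r"statistics-[0-9]{4}", name):
            cores.append(name)
    return ",".join(f"{solr_host}/solr/{name}" for name in cores)


if __name__ == "__main__":
    example = {"status": {"search": {}, "statistics": {}, "statistics-2018": {}}}
    print(build_shards_param(example))
```

< p > The resulting string can be passed as the < code > shards< / code > query parameter on any select request against the base core.< / p >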
2019-01-22 09:16:11 +02:00
< ul >
< li > Release < a href = "https://github.com/ilri/dspace-statistics-api/releases/tag/v0.9.0" > version 0.9.0 of the dspace-statistics-api< / a > to address the issue of querying multiple Solr statistics shards< / li >
< li > I deployed it on DSpace Test (linode19) and restarted the indexer and now it shows all the stats from 2018 as well (756 pages of views, instead of 6)< / li >
2019-01-23 10:46:23 +02:00
< li > I deployed it on CGSpace (linode18) and restarted the indexer as well< / li >
2019-11-28 17:30:45 +02:00
< li > Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:< / li >
< / ul >
2019-01-23 17:27:09 +02:00
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E " 22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
155 40.77.167.106
176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
189 107.21.16.70
217 54.83.93.85
310 46.174.208.142
346 83.103.94.48
360 45.5.186.2
595 154.113.73.30
716 196.191.127.37
915 35.237.175.180
< / code > < / pre > < ul >
< li > 35.237.175.180 is known to us< / li >
2020-01-27 16:20:44 +02:00
< li > I don’t think we’ve seen 196.191.127.37 before. Its user agent is:< / li >
2019-11-28 17:30:45 +02:00
< / ul >
2019-01-23 17:27:09 +02:00
< pre > < code > Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
< / code > < / pre > < ul >
< li > Interestingly this IP is located in Addis Ababa… < / li >
< li > Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:< / li >
2019-05-05 16:45:12 +03:00
< / ul >
2019-11-28 17:30:45 +02:00
< pre > < code > Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
< / code > < / pre > < h2 id = "2019-01-23" > 2019-01-23< / h2 >
2019-01-23 10:46:23 +02:00
< ul >
< li > Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:< / li >
< / ul >
< blockquote class = "twitter-tweet" > < p lang = "en" dir = "ltr" > < a href = "https://twitter.com/hashtag/ILRI?src=hash&ref_src=twsrc%5Etfw" > #ILRI< / a > research: Towards unlocking the potential of the hides and skins value chain in Somaliland < a href = "https://t.co/EZH7ALW4dp" > https://t.co/EZH7ALW4dp< / a > < / p > — ILRI Communications (@ILRI) < a href = "https://twitter.com/ILRI/status/1086330519904673793?ref_src=twsrc%5Etfw" > January 18, 2019< / a > < / blockquote >
< script async src = "https://platform.twitter.com/widgets.js" charset = "utf-8" > < / script >
< ul >
< li > The shortened link is < a href = "goo.gl/fb/VRj9Gq" > goo.gl/fb/VRj9Gq< / a > and it shows a “Dynamic Link not found” error from Firebase:< / li >
< / ul >
2019-11-28 17:30:45 +02:00
< p > < img src = "/cgspace-notes/2019/01/firebase-link-not-found.png" alt = "Dynamic Link not found" > < / p >
2019-01-23 10:46:23 +02:00
< ul >
2019-11-28 17:30:45 +02:00
< li >
< p > Apparently Google announced last year that they plan to < a href = "https://developers.googleblog.com/2018/03/transitioning-google-url-shortener.html" > discontinue the shortener and transition to Firebase Dynamic Links in March, 2019< / a > , so maybe this is related… < / p >
< / li >
< li >
< p > Very interesting discussion of methods for < a href = "https://jdebp.eu/FGA/systemd-house-of-horror/tomcat.html" > running Tomcat under systemd< / a > < / p >
< / li >
< li >
2020-01-27 16:20:44 +02:00
< p > We can set the ulimit options that used to be in < code > /etc/default/tomcat7< / code > with systemd’s < code > LimitNOFILE< / code > and < code > LimitAS< / code > (see the < code > systemd.exec< / code > man page)< / p >
2019-01-23 13:38:00 +02:00
< ul >
< li > Note that we need to use < code > infinity< / code > instead of < code > unlimited< / code > for the address space< / li >
2019-11-28 17:30:45 +02:00
< / ul >
< / li >
< li >
< p > Create accounts for Bosun from IITA and Valerio from ICARDA / CGMEL on DSpace Test< / p >
< / li >
< li >
< p > Maria Garruccio asked me for a list of author affiliations from all of their submitted items so she can clean them up< / p >
< / li >
< li >
< p > I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:< / p >
< / li >
< / ul >
2019-01-23 17:27:09 +02:00
< pre > < code > dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
COPY 1109
< / code > < / pre > < ul >
< li > Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP< / li >
< li > Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:< / li >
< / ul >
2019-01-23 17:27:09 +02:00
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E " 23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
222 54.226.25.74
241 40.77.167.13
272 46.101.86.248
297 35.237.175.180
332 45.5.184.72
355 34.218.226.147
404 66.249.64.155
4637 205.186.128.185
4637 70.32.83.92
9265 45.5.186.2
< / code > < / pre > < ul >
< li >
2020-01-27 16:20:44 +02:00
< p > I think it’s the usual IPs:< / p >
2019-01-23 17:27:09 +02:00
< ul >
< li > 45.5.186.2 is CIAT< / li >
< li > 70.32.83.92 is CCAFS< / li >
< li > 205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)< / li >
2019-11-28 17:30:45 +02:00
< / ul >
< / li >
< li >
< p > Following up on the thumbnail issue that we had in < a href = "/cgspace-notes/2018-12/" > 2018-12< / a > < / p >
< / li >
< li >
< p > It looks like the two items with problematic PDFs both have thumbnails now:< / p >
< ul >
< li > < a href = "https://hdl.handle.net/10568/98390" > 10568/98390< / a > < / li >
< li > < a href = "https://hdl.handle.net/10568/98391" > 10568/98391< / a > < / li >
< / ul >
< / li >
< li >
2020-01-27 16:20:44 +02:00
< p > Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace’s < code > filter-media< / code > :< / p >
2019-11-28 17:30:45 +02:00
< / li >
< / ul >
< pre > < code > $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391
< / code > < / pre > < ul >
< li > Both of these were successful, so there must have been an update to ImageMagick or Ghostscript in Ubuntu since early 2018-12< / li >
< li > Looking at the apt history logs I see that on 2018-12-07 a security update for Ghostscript was installed (version 9.26~dfsg+0-0ubuntu0.16.04.3)< / li >
< li > I think this Launchpad discussion is relevant: < a href = "https://bugs.launchpad.net/ubuntu/+source/ghostscript/+bug/1806517" > https://bugs.launchpad.net/ubuntu/+source/ghostscript/+bug/1806517< / a > < / li >
< li > As well as the original Ghostscript bug report: < a href = "https://bugs.ghostscript.com/show_bug.cgi?id=699815" > https://bugs.ghostscript.com/show_bug.cgi?id=699815< / a > < / li >
< / ul >
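< p > The apt history check mentioned above could be scripted roughly as follows. This is a minimal sketch and not the command that was actually run: the helper name < code > apt_history_for< / code > is hypothetical, and it assumes the usual < code > Start-Date:< / code > and < code > Install:< / code > / < code > Upgrade:< / code > layout of Ubuntu's apt history log.< / p >

```shell
# Hypothetical helper: read apt history log text on stdin and print the date
# and action of every transaction that installed or upgraded the package.
# The match is a plain substring test on "<package>:", so it is deliberately
# loose and may also report packages whose names end with the same string.
apt_history_for() {
  awk -v pkg="$1" '
    /^Start-Date:/ { date = $2 }
    /^(Install|Upgrade):/ && index($0, pkg ":") { print date, $1, pkg }
  '
}

# Example (the path assumes the default log location on Ubuntu):
# zcat --force /var/log/apt/history.log* | apt_history_for ghostscript
```

< p > Reading from standard input keeps the helper independent of log rotation, so compressed and uncompressed logs can be fed in the same way as the nginx logs elsewhere in these notes.< / p >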
< h2 id = "2019-01-24" > 2019-01-24< / h2 >
< ul >
< li > I noticed Ubuntu’s Ghostscript 9.26 works on some troublesome PDFs where Arch’s Ghostscript 9.26 doesn’t, so the fix for the first/last page crash is not the patch I found yesterday< / li >
< li > Ubuntu’s Ghostscript uses another < a href = "http://git.ghostscript.com/?p=ghostpdl.git;h=fae21f1668d2b44b18b84cf0923a1d5f3008a696" > patch from Ghostscript git< / a > (< a href = "https://bugs.ghostscript.com/show_bug.cgi?id=700315" > upstream bug report< / a > )< / li >
< li > I re-compiled Arch’s Ghostscript with the patch and then I was able to generate a thumbnail from one of the < a href = "https://cgspace.cgiar.org/handle/10568/98390" > troublesome PDFs< / a > < / li >
< li > Before and after:< / li >
< / ul >
< pre > < code > $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
Food safety Kenya fruits.pdf[0]=> Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
< / code > < / pre > < ul >
< li > I reported it to the Arch Linux bug tracker (< a href = "https://bugs.archlinux.org/task/61513" > 61513< / a > )< / li >
< li > I told Atmire to go ahead with the Metadata Quality Module addition based on our < code > 5_x-dev< / code > branch (< a href = "https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657" > 657< / a > )< / li >
< li > Linode sent alerts last night to say that CGSpace (linode18) was using high CPU, here are the top ten IPs from the nginx logs around that time:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
305 3.81.136.184
306 3.83.14.11
306 52.54.252.47
325 54.221.57.180
378 66.249.64.157
424 54.70.40.11
497 47.29.247.74
783 35.237.175.180
1108 66.249.64.155
2378 45.5.186.2
< / code > < / pre > < ul >
< li > 45.5.186.2 is CIAT and 66.249.64.155 is Google… hmmm.< / li >
< li > Linode sent another alert this morning, here are the top ten IPs active during that time:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
360 3.89.134.93
362 34.230.15.139
366 100.24.48.177
369 18.212.208.240
377 3.81.136.184
404 54.221.57.180
506 66.249.64.155
4642 70.32.83.92
4643 205.186.128.185
8593 45.5.186.2
< / code > < / pre > < ul >
< li > Just double checking what CIAT is doing, they are mainly hitting the REST API:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
< / code > < / pre > < ul >
< li > CIAT’s community currently has 12,000 items in it so this is normal< / li >
< li > The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again… < / li >
< li > For example: < a href = "https://goo.gl/fb/VRj9Gq" > https://goo.gl/fb/VRj9Gq< / a > < / li >
< li > The full < a href = "http://id.loc.gov/vocabulary/relators.html" > list of MARC Relators on the Library of Congress website< / a > linked from the < a href = "http://dublincore.org/usage/documents/relators/" > DCMI relators page< / a > is very confusing< / li >
< li > Looking at the default DSpace XMLUI crosswalk in < a href = "https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/config/crosswalks/xhtml-head-item.properties" > xhtml-head-item.properties< / a > I see a very complete mapping of DSpace DC and QDC fields to DCTERMS
< ul >
< li > This is good for standards-compliant web crawlers, but what about for those harvesting via REST or OAI APIs?< / li >
< / ul >
< / li >
< li > I sent a message titled “ < a href = "https://groups.google.com/forum/#!topic/dspace-tech/phV_t51TGuE" > DC, QDC, and DCTERMS: reviewing our metadata practices< / a > ” to the dspace-tech mailing list to ask about some of this< / li >
< / ul >
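< p > The per-endpoint breakdown for CIAT above can be wrapped in a small helper. This is only a sketch: the function name < code > endpoints_for_ip< / code > is hypothetical, and it assumes the client address is the first field of each log line, as in nginx's default combined format. Comparing the first field exactly avoids a pitfall of a bare < code > grep 45.5.186.2< / code > , where each dot matches any character and the pattern can also match inside a longer address.< / p >

```shell
# Hypothetical helper: read access log lines on stdin and count requests from
# one client IP per top-level DSpace endpoint.
# Assumes the client IP is the first whitespace-separated field of each line.
endpoints_for_ip() {
  awk -v ip="$1" '$1 == ip' \
    | grep -Eo 'GET /(handle|bitstream|rest|oai)/' \
    | sort | uniq -c | sort -n
}

# Example:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 \
#   | grep -F '24/Jan/2019:' | endpoints_for_ip 45.5.186.2
```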
< h2 id = "2019-01-25" > 2019-01-25< / h2 >
< ul >
< li > A little bit more work on getting Tomcat to run from a tarball in our < a href = "https://github.com/ilri/rmg-ansible-public" > Ansible infrastructure playbooks< / a >
< ul >
< li > I tested by doing a Tomcat 7.0.91 installation, then switching it to 7.0.92 and it worked… nice!< / li >
< li > I refined the tasks so much that I was confident enough to deploy them on DSpace Test and it went very well< / li >
< li > Basically I just stopped tomcat7, created a dspace user, removed tomcat7, chown’d everything to the dspace user, then ran the playbook< / li >
< li > So now DSpace Test (linode19) is running Tomcat 7.0.92… w00t< / li >
< li > Now we need to monitor it for a few weeks to see if there is anything we missed, and then I can change CGSpace (linode18) as well, and we’re ready for Ubuntu 18.04 too!< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-01-27" > 2019-01-27< / h2 >
< ul >
< li > Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
189 40.77.167.108
191 157.55.39.2
263 34.218.226.147
283 45.5.184.2
332 45.5.184.72
608 5.9.6.51
679 66.249.66.223
1116 66.249.66.219
4644 205.186.128.185
4644 70.32.83.92
< / code > < / pre > < ul >
< li > I think it’s the usual IPs:
< ul >
< li > 70.32.83.92 is CCAFS< / li >
< li > 205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)< / li >
< / ul >
< / li >
< / ul >
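< p > Since the same pipeline is run after nearly every Linode alert, it could be wrapped in a function. A minimal sketch, assuming the client address is the first field of each log line; the name < code > top_ips< / code > and its optional second argument are hypothetical and not part of any existing tooling here.< / p >

```shell
# Hypothetical helper: read access log lines on stdin, keep those matching an
# extended regular expression (typically a date and hour range), and print the
# most active client IPs with their request counts, busiest last.
# $1: ERE to select lines, $2: how many IPs to show (defaults to 10).
top_ips() {
  grep -E "$1" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n "${2:-10}"
}

# Example:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 \
#   | top_ips '27/Jan/2019:0(6|7|8)'
```

< p > Keeping the log selection outside the function means the same helper works on rotated, compressed, or live logs without changes.< / p >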
< h2 id = "2019-01-28" > 2019-01-28< / h2 >
< ul >
< li > Udana from WLE asked me about the interaction between their publication website and their items on CGSpace
< ul >
< li > There is an item that is mapped into their collection from IWMI and is missing their < code > cg.identifier.wletheme< / code > metadata< / li >
< li > I told him that, as far as I remember, when WLE introduced Phase II research themes in 2017 we decided to infer theme ownership from the collection hierarchy and we created a < a href = "https://cgspace.cgiar.org/handle/10568/81268" > WLE Phase II Research Themes< / a > subCommunity< / li >
< li > Perhaps they need to ask Macaroni Bros about the mapping< / li >
< / ul >
< / li >
< li > Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
67 207.46.13.50
105 41.204.190.40
117 34.218.226.147
126 35.237.175.180
203 213.55.99.121
332 45.5.184.72
377 5.9.6.51
512 45.5.184.2
4644 205.186.128.185
4644 70.32.83.92
< / code > < / pre > < ul >
< li > There seems to be a pattern with < code > 70.32.83.92< / code > and < code > 205.186.128.185< / code > lately!< / li >
< li > Every morning at 8AM they are the top users… I should tell them to stagger their requests… < / li >
< li > I set up a < a href = "https://visualping.io/" > VisualPing< / a > watch on the < a href = "https://jdbc.postgresql.org/download.html" > PostgreSQL JDBC driver download page< / a > , with alerts going to my CGIAR email address
< ul >
< li > Hopefully this will one day alert me when a new driver is released!< / li >
< / ul >
< / li >
< li > Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
310 45.5.184.2
425 5.143.231.39
526 54.70.40.11
1003 199.47.87.141
1374 35.237.175.180
1455 5.9.6.51
1501 66.249.66.223
1771 66.249.66.219
2107 199.47.87.140
2540 45.5.186.2
< / code > < / pre > < ul >
< li > Of course there is CIAT’s < code > 45.5.186.2< / code > , but also < code > 45.5.184.2< / code > appears to be CIAT… I wonder why they have two harvesters?< / li >
< li > < code > 199.47.87.140< / code > and < code > 199.47.87.141< / code > are Turnitin, with the following user agent:< / li >
< / ul >
< pre > < code > TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
< / code > < / pre > < h2 id = "2019-01-29" > 2019-01-29< / h2 >
< ul >
< li > Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
334 45.5.184.72
429 66.249.66.223
522 35.237.175.180
555 34.218.226.147
655 66.249.66.221
844 5.9.6.51
2507 66.249.66.219
4645 70.32.83.92
4646 205.186.128.185
9329 45.5.186.2
< / code > < / pre > < ul >
< li > < code > 45.5.186.2< / code > is CIAT as usual… < / li >
< li > < code > 70.32.83.92< / code > and < code > 205.186.128.185< / code > are CCAFS as usual… < / li >
< li > < code > 66.249.66.219< / code > is Google… < / li >
< li > I’m thinking it might finally be time to increase the threshold of the Linode CPU alerts
< ul >
< li > I adjusted the alert threshold from 250% to 275%< / li >
< / ul >
< / li >
< / ul >
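< p > To check whether harvesters such as < code > 70.32.83.92< / code > really do arrive at the same hour every day, the requests from one IP can be bucketed by hour. This is a sketch under the assumption that the logs use nginx's default combined format, where the fourth field looks like < code > [29/Jan/2019:05:12:01< / code > ; the function name < code > requests_per_hour< / code > is hypothetical.< / p >

```shell
# Hypothetical helper: read access log lines on stdin and count the requests
# made by one client IP in each day-and-hour bucket.
# Assumes field 1 is the client IP and field 4 is "[dd/Mon/yyyy:HH:MM:SS".
# Buckets are sorted as text, so they are not chronological across months.
requests_per_hour() {
  awk -v ip="$1" '$1 == ip {
    split($4, t, ":")                    # t[1] = "[dd/Mon/yyyy", t[2] = "HH"
    print substr(t[1], 2), t[2] ":00"    # drop the leading "["
  }' | sort | uniq -c
}

# Example:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 \
#   | requests_per_hour 70.32.83.92
```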
< h2 id = "2019-01-30" > 2019-01-30< / h2 >
< ul >
< li > Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
273 46.101.86.248
301 35.237.175.180
334 45.5.184.72
387 5.9.6.51
527 2a01:4f8:13b:1296::2
1021 34.218.226.147
1448 66.249.66.219
4649 205.186.128.185
4649 70.32.83.92
5163 45.5.184.2
< / code > < / pre > < ul >
< li > I might need to adjust the threshold again, because the CPU usage this morning was 296% and the activity looks pretty normal (as always recently)< / li >
< / ul >
< h2 id = "2019-01-31" > 2019-01-31< / h2 >
< ul >
< li > Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
436 18.196.196.108
460 157.55.39.168
460 207.46.13.96
500 197.156.105.116
728 54.70.40.11
1560 5.9.6.51
1562 35.237.175.180
1601 85.25.237.71
1894 66.249.66.219
2610 45.5.184.2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "31/Jan/2019:0(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
318 207.46.13.242
334 45.5.184.72
486 35.237.175.180
609 34.218.226.147
620 66.249.66.219
1054 5.9.6.51
4391 70.32.83.92
4428 205.186.128.185
6758 85.25.237.71
9239 45.5.186.2
< / code > < / pre > < ul >
< li > < code > 45.5.186.2< / code > and < code > 45.5.184.2< / code > are CIAT as always< / li >
< li > < code > 85.25.237.71< / code > is some new server in Germany that I’ve never seen before with the user agent:< / li >
< / ul >
< pre > < code > Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
< / code > < / pre > <!-- raw HTML omitted -->
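< p > Looking up the user agent for an unfamiliar IP, as done above for Turnitin and Linguee, can be sketched like this. It assumes nginx's default combined log format, in which the user agent is the last double-quoted field on each line; the function name < code > user_agents_for_ip< / code > is hypothetical.< / p >

```shell
# Hypothetical helper: read access log lines on stdin and list the user agents
# sent by one client IP, most frequent first.
# Assumes the combined format: IP - user [time] "request" status bytes "referer" "agent"
# Splitting on double quotes makes the user agent field number 6.
user_agents_for_ip() {
  awk -F'"' -v ip="$1" '
    { split($1, f, " ") }      # f[1] is the client IP
    f[1] == ip { print $6 }
  ' | sort | uniq -c | sort -rn
}

# Example:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 \
#   | user_agents_for_ip 85.25.237.71
```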
< / article >
< / div > <!-- /.blog - main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2020-08/" > August, 2020< / a > < / li >
< li > < a href = "/cgspace-notes/2020-07/" > July, 2020< / a > < / li >
< li > < a href = "/cgspace-notes/2020-06/" > June, 2020< / a > < / li >
< li > < a href = "/cgspace-notes/2020-05/" > May, 2020< / a > < / li >
< li > < a href = "/cgspace-notes/2020-04/" > April, 2020< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >