CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

January, 2019

2019-01-02

  • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
  • I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     92 40.77.167.4
     99 210.7.29.100
    120 38.126.157.45
    177 35.237.175.180
    177 40.77.167.32
    216 66.249.75.219
    225 18.203.76.93
    261 46.101.86.248
    357 207.46.13.1
    903 54.70.40.11
  • Analyzing the types of requests made by the top few IPs during that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
     30 bitstream
    534 discover
    352 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 207.46.13.1 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
    194 bitstream
    345 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
    261 handle
  • It’s not clear to me what was causing the outbound traffic spike
  • Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):
Moving: 81742 into core statistics-2010
Moving: 1837285 into core statistics-2011
Moving: 3764612 into core statistics-2012
Moving: 4557946 into core statistics-2013
Moving: 5483684 into core statistics-2014
Moving: 2941736 into core statistics-2015
Moving: 5926070 into core statistics-2016
Moving: 10562554 into core statistics-2017
Moving: 18497180 into core statistics-2018
  • This could by why the outbound traffic rate was high, due to the S3 backup that run at 3:30AM…
  • Run all system updates on DSpace Test (linode19) and reboot the server

2019-01-03

  • Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:
$ sudo docker pull postgres:9.6-alpine
$ sudo docker rm dspacedb
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
  • Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire’s Listings and Reports still doesn’t work
    • After logging in via XMLUI and clicking the Listings and Reports link from the sidebar it redirects me to a JSPUI login page
    • If I log in again there the Listings and Reports work… hmm.
  • The JSPUI application—which Listings and Reports depends upon—also does not load, though the error is perhaps unrelated:
2019-01-03 14:45:21,727 INFO  org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
2019-01-03 14:45:21,971 INFO  org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
2019-01-03 14:45:22,115 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error
-- Method: GET
-- Parameters were:

org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discovery/static-tagcloud-facet.jsp (line: [57], column: [8]) No tag [tagcloud] defined in tag library imported with prefix [dspace]
    at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41)
    at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291)
    at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97)
    at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347)
    at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380)
    at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481)
    at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445)
    at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683)
    at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016)
    at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291)
    at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470)
    at org.apache.jasper.compiler.Parser.parse(Parser.java:144)
    at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244)
    at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105)
    at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202)
    at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373)
    at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350)
    at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334)
    at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728)
    at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470)
    at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395)
    at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316)
    at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60)
    at org.apache.jsp.index_jsp._jspService(index_jsp.java:191)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
  • I notice that I get different JSESSIONID cookies for / (XMLUI) and /jspui (JSPUI) on Tomcat 8.5.37, I wonder if it’s the same on Tomcat 7.0.92… yes I do.
  • Hmm, on Tomcat 7.0.92 I see that I get a dspace.current.user.id session cookie after logging into XMLUI, and then when I browse to JSPUI I am still logged in…
    • I didn’t see that cookie being set on Tomcat 8.5.37
  • I sent a message to the dspace-tech mailing list to ask

2019-01-04

  • Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    189 207.46.13.192
    217 31.6.77.23
    340 66.249.70.29
    349 40.77.167.86
    417 34.218.226.147
    630 207.46.13.173
    710 35.237.175.180
    790 40.77.167.87
   1776 66.249.70.27
   2099 54.70.40.11
  • I’m thinking about trying to validate our dc.subject terms against AGROVOC webservices
  • There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for SOIL:
$ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Length: 493
Content-Type: application/json; charset=utf-8
Date: Fri, 04 Jan 2019 13:44:27 GMT
Keep-Alive: timeout=5, max=100
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubdomains
Vary: Accept
X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org

{
    "@context": {
        "@language": "en",
        "altLabel": "skos:altLabel",
        "hiddenLabel": "skos:hiddenLabel",
        "isothes": "http://purl.org/iso25964/skos-thes#",
        "onki": "http://schema.onki.fi/onki#",
        "prefLabel": "skos:prefLabel",
        "results": {
            "@container": "@list",
            "@id": "onki:results"
        },
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "type": "@type",
        "uri": "@id"
    },
    "results": [
        {
            "lang": "en",
            "prefLabel": "soil",
            "type": [
                "skos:Concept"
            ],
            "uri": "http://aims.fao.org/aos/agrovoc/c_7156",
            "vocab": "agrovoc"
        }
    ],
    "uri": ""
}
  • The API does not appear to be case sensitive (searches for SOIL and soil return the same thing)
  • I’m a bit confused that there’s no obvious return code or status when a term is not found, for example SOILS:
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Length: 367
Content-Type: application/json; charset=utf-8
Date: Fri, 04 Jan 2019 13:48:31 GMT
Keep-Alive: timeout=5, max=100
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubdomains
Vary: Accept
X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org

{
    "@context": {
        "@language": "en",
        "altLabel": "skos:altLabel",
        "hiddenLabel": "skos:hiddenLabel",
        "isothes": "http://purl.org/iso25964/skos-thes#",
        "onki": "http://schema.onki.fi/onki#",
        "prefLabel": "skos:prefLabel",
        "results": {
            "@container": "@list",
            "@id": "onki:results"
        },
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "type": "@type",
        "uri": "@id"
    },
    "results": [],
    "uri": ""
}
  • I guess the results object will just be empty…
  • Another way would be to try with SPARQL, perhaps using the Python 2.7 sparql-client:
$ python2.7 -m virtualenv /tmp/sparql
$ . /tmp/sparql/bin/activate
$ pip install sparql-client ipython
$ ipython
In [10]: import sparql
In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET")
In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
    ...: 'SELECT '
    ...: '?label '
    ...: 'WHERE { '
    ...: '{  ?concept  skos:altLabel ?label . } UNION {  ?concept  skos:prefLabel ?label . } '
    ...: 'FILTER regex(str(?label), "^fish", "i") . '
    ...: '} LIMIT 10')
In [13]: result = s.query(statement)
In [14]: for row in result.fetchone():
   ...:     print(row)
   ...:
(<Literal "fish catching"@en>,)
(<Literal "fish harvesting"@en>,)
(<Literal "fish meat"@en>,)
(<Literal "fish roe"@en>,)
(<Literal "fish conversion"@en>,)
(<Literal "fisheries catches (composition)"@en>,)
(<Literal "fishtail palm"@en>,)
(<Literal "fishflies"@en>,)
(<Literal "fishery biology"@en>,)
(<Literal "fish production"@en>,)
  • The SPARQL query comes from my notes in 2017-08

2019-01-06

  • I built a clean DSpace 5.8 installation from the upstream dspace-5.8 tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37
    • If I log into XMLUI and then nagivate to JSPUI I need to log in again
    • XMLUI does not set the dspace.current.user.id session cookie in Tomcat 8.5.37 for some reason
    • I sent an update to the dspace-tech mailing list to ask for more help troubleshooting

2019-01-07

  • I built a clean DSpace 6.3 installation from the upstream dspace-6.3 tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37
    • If I log into XMLUI and then nagivate to JSPUI I need to log in again
    • XMLUI does not set the dspace.current.user.id session cookie in Tomcat 8.5.37 for some reason
    • I sent an update to the dspace-tech mailing list to ask for more help troubleshooting

2019-01-08

  • Tim Donohue responded to my thread about the cookies on the dspace-tech mailing list

2019-01-11

  • Tezira wrote to say she has stopped receiving the DSpace Submission Approved and Archived emails from CGSpace as of January 2nd
    • I told her that I haven’t done anything to disable it lately, but that I would check
    • Bizu also says she hasn’t received them lately

2019-01-14

  • Day one of CGSpace AReS meeting in Amman

2019-01-15

  • Day two of CGSpace AReS meeting in Amman
    • Discuss possibly extending the dspace-statistics-api to make community and collection statistics available
    • Discuss new “final” CG Core document and some changes that we’ll need to do on CGSpace and other repositories
    • We agreed to try to stick to pure Dublin Core where possible, then use fields that exist in standard DSpace, and use “cg” namespace for everything else
    • Major changes are to move dc.contributor.author to dc.creator (which MELSpace and WorldFish are already using in their DSpace repositories)
  • I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in 2018-10:
$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'

0.16s user 0.03s system 3% cpu 5.185 total
0.17s user 0.02s system 2% cpu 7.123 total
0.18s user 0.02s system 6% cpu 3.047 total
  • In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    157 31.6.77.23
    192 54.70.40.11
    202 66.249.64.157
    207 40.77.167.204
    220 157.55.39.140
    326 197.156.105.116
    385 207.46.13.158
   1211 35.237.175.180
   1830 66.249.64.155
   2482 45.5.186.2

2019-01-16

  • Day three of CGSpace AReS meeting in Amman
    • We discussed CG Core 2.0 metadata and decided some action points
    • We discussed branding of AReS tool
  • Notes from our CG Core 2.0 metadata discussion:
    • Not Dublin Core:
    • dc.subtype
    • dc.peer-reviewed
    • Dublin Core, possible action for CGSpace:
    • dc.description:
      • We use dc.description.abstract, dc.description (Notes), dc.description.version (Peer review status), dc.description.sponsorship (Funder)
      • Maybe move abstract to dc.description
      • Maybe notes moves to cg.description.notes???
      • Maybe move dc.description.version to cg.peer-reviewed or cg.peer-review-status???
      • Move dc.description.sponsorship to cg.contributor.donor???
    • dc.subject:
      • Wait for guidance, evaluate technical implications (Google indexing, OAI, etc)
    • Move dc.contributor.author to dc.creator
    • dc.contributor Project
      • Recommend against creating new fields for all projects
      • We use collections projects/themes/etc
    • dc.contributor Project Lead Center
      • MELSpace uses cg.contributor.project-lead-institute (institute is more generic than center)
      • Maybe we use?
    • dc.contributor Partner
      • Wait for guidance
      • MELSpace uses cg.contibutor.center (?)
    • dc.contributor Donor
      • Use cg.contributor.donor
    • dc.date
      • Wait for guidance, maybe move dc.date.issued?
      • dc.date.accessioned and dc.date.available are automatic in DSpace
    • dc.language
      • Move dc.language.iso to dc.language
    • dc.identifier
      • Move cg.identifier.url to dc.identifier
    • dc.identifier bibliographicCitation
      • dc.identifier.citation should move to dc.bibliographicCitation
    • dc.description.notes
      • Wait for guidance, maybe move to cg.description.notes ???
    • dc.relation
      • Maybe move cg.link.reference
      • Perhaps consolodate cg.link.audio etc there…?
    • dc.relation.isPartOf
      • Move dc.relation.ispartofseries to dc.relation.isPartOf
    • dc.audience
      • Move cg.targetaudience to dc.audience
  • Something happened to the Solr usage statistics on CGSpace
    • I looked on the server and the Solr cores are there (56GB!), and I don’t see any obvious errors in dmesg or anything
    • I see that the server hasn’t been rebooted in 26 days so I rebooted it
  • After reboot the Solr stats are still messed up in the Atmire Usage Stats module, it only shows 2019-01!

Solr stats fucked up

  • In the Solr admin UI I see the following error:
statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
  • Looking in the Solr log I see this:
2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    ... 31 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    ... 33 more
2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Unable to create core [statistics-2018]
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
    at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
    ... 27 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
    ... 29 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
    ... 31 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
    at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
    ... 33 more
  • I found some threads on StackOverflow etc discussing this and several suggested increasing the address space for the shell with ulimit
  • I added ulimit -v unlimited to the /etc/default/tomcat7 and restarted Tomcat and now Solr is working again:

Solr stats working

# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
1442874

real    0m17.161s
user    0m16.205s
sys     0m2.396s

2019-01-17

  • Send reminder to Atmire about purchasing the MQM module
  • Trying to decide the solid action points for CGSpace on the CG Core 2.0 metadata…
  • It’s difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!
  • Also, there is not a good Dublin Core reference (or maybe I just don’t understand?)
  • Several authoritative documents on Dublin Core appear to be:
  • And what is the relationship between DC and DCTERMS?
  • DSpace uses DCTERMS in the metadata it embeds in XMLUI item views!
  • We really need to look at this more carefully and see the impacts that might be made from switching core fields like languages, abstract, authors, etc
  • We can check WorldFish and MELSpace repositories to see what effects these changes have had on theirs because they have already adopted some of these changes…
  • I think I understand the difference between DC and DCTERMS finally: DC is the original set of fifteen elements and DCTERMS is the newer version that was supposed to address much of the drawbacks of the original with regards to digital content
  • We might be able to use some proper fields for citation, abstract, etc that are part of DCTERMS
  • To make matters more confusing, there is also “qualified Dublin Core” that uses the original fifteen elements of legacy DC and qualifies them, like dc.date.accessioned
  • So we should be trying to use DCTERMS where possible, unless it is some internal thing that might mess up DSpace (like dates)
  • “Elements 1.1” means legacy DC
  • Possible action list for CGSpace:
    • dc.description.abstract → dcterms.abstract
    • dc.description.version → cg.peer-reviewed (or cg.peer-review-status?)
    • dc.description.sponsorship → cg.contributor.donor
    • dc.contributor.author → dc.creator
    • dc.language.iso → dcterms.language
    • cg.identifier.url → dcterms.identifier
    • dc.identifier.citation → dcterms.bibliographicCitation
    • dc.relation.ispartofseries → dcterms.isPartOf
    • cg.targetaudience → dcterms.audience

2019-01-19

These terms conform with the DCMI Abstract Model and may be used in DCMI application profiles. DCMI endorses their use with Dublin Core elements as indicated.

2019-01-20

  • That’s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:
# w
 04:46:14 up 213 days,  7:25,  4 users,  load average: 1.94, 1.50, 1.35

2019-01-21

  • Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04’s Tomcat 8.5
  • I could either run with a simple tomcat7.service like this:
[Unit]
Description=Apache Tomcat 7 Web Application Container
After=network.target
[Service]
Type=forking
ExecStart=/path/to/apache-tomcat-7.0.92/bin/startup.sh
ExecStop=/path/to/apache-tomcat-7.0.92/bin/shutdown.sh
User=aorth
Group=aorth
[Install]
WantedBy=multi-user.target
  • Or try to use adapt a real systemd service like Arch Linux’s:
[Unit]
Description=Tomcat 7 servlet container
After=network.target

[Service]
Type=forking
PIDFile=/var/run/tomcat7.pid
Environment=CATALINA_PID=/var/run/tomcat7.pid
Environment=TOMCAT_JAVA_HOME=/usr/lib/jvm/default-runtime
Environment=CATALINA_HOME=/usr/share/tomcat7
Environment=CATALINA_BASE=/usr/share/tomcat7
Environment=CATALINA_OPTS=
Environment=ERRFILE=SYSLOG
Environment=OUTFILE=SYSLOG

ExecStart=/usr/bin/jsvc \
            -Dcatalina.home=${CATALINA_HOME} \
            -Dcatalina.base=${CATALINA_BASE} \
            -Djava.io.tmpdir=/var/tmp/tomcat7/temp \
            -cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \
            -user tomcat7 \
            -java-home ${TOMCAT_JAVA_HOME} \
            -pidfile /var/run/tomcat7.pid \
            -errfile ${ERRFILE} \
            -outfile ${OUTFILE} \
            $CATALINA_OPTS \
            org.apache.catalina.startup.Bootstrap

ExecStop=/usr/bin/jsvc \
            -pidfile /var/run/tomcat7.pid \
            -stop \
            org.apache.catalina.startup.Bootstrap

[Install]
WantedBy=multi-user.target
  • I see that jsvc and libcommons-daemon-java are both available on Ubuntu so that should be easy to port
  • We probably don’t need Eclipse Java Bytecode Compiler (ecj)
  • I tested Tomcat 7.0.92 on Arch Linux using the tomcat7.service with jsvc and it works… nice!
  • I think I might manage this the same way I do the restic releases in the Ansible infrastructure scripts, where I download a specific version and symlink to some generic location without the version number
  • I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
<result name="response" numFound="33" start="0">
$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
<result name="response" numFound="241" start="0">
  • I opened an issue on the GitHub issue tracker (#10)
  • I don’t think the SolrClient library we are currently using supports these type of queries so we might have to just do raw queries with requests
  • The pysolr library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):
import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
print(results.facets['facet_fields'])
{'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
  • If I double check one item from above, for example 77572, it appears this is only working on the current statistics core and not the shards:
import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
print(results.hits)
646
solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
print(results.hits)
595
  • So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON
  • This enumerates the list of Solr cores and returns JSON format:
http://localhost:3000/solr/admin/cores?action=STATUS&wt=json
  • I think I figured out how to search across shards, I needed to give the whole URL to each other core
  • Now I get more results when I start adding the other statistics cores:
$ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound<result name="response" numFound="2061320" start="0">
$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound
<result name="response" numFound="16280292" start="0" maxScore="1.0">
$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&indent=on&rows=0&q=*:*' | grep numFound
<result name="response" numFound="25606142" start="0" maxScore="1.0">
$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&indent=on&rows=0&q=*:*' | grep numFound
<result name="response" numFound="31532212" start="0" maxScore="1.0">
  • I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the shards parameter to each query to make the search distributed among the cores
  • I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a shards query string
  • A few things I noticed:
    • Solr doesn’t mind if you use an empty shards parameter
    • Solr doesn’t mind if you have an extra comma at the end of the shards parameter
    • If you are searching multiple cores, you need to include the base core in the shards parameter as well
    • For example, compare the following two queries, first including the base core and the shard in the shards parameter, and then only including the shard:
$ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
<result name="response" numFound="275" start="0" maxScore="12.205825">
$ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound
<result name="response" numFound="241" start="0" maxScore="12.205825">

2019-01-22

  • Release version 0.9.0 of the dspace-statistics-api to address the issue of querying multiple Solr statistics shards
  • I deployed it on DSpace Test (linode19) and restarted the indexer and now it shows all the stats from 2018 as well (756 pages of views, intead of 6)
  • I deployed it on CGSpace (linode18) and restarted the indexer as well
  • Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    155 40.77.167.106
    176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
    189 107.21.16.70
    217 54.83.93.85
    310 46.174.208.142
    346 83.103.94.48
    360 45.5.186.2
    595 154.113.73.30
    716 196.191.127.37
    915 35.237.175.180
  • 35.237.175.180 is known to us
  • I don’t think we’ve seen 196.191.127.37 before. Its user agent is:
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
  • Interestingly this IP is located in Addis Ababa…
  • Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

2019-01-23

  • Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:
  • The shortened link is goo.gl/fb/VRj9Gq and it shows a “Dynamic Link not found” error from Firebase:

Dynamic Link not found

  • Apparently Google announced last year that they plan to discontinue the shortner and transition to Firebase Dynamic Links in March, 2019, so maybe this is related…

  • Very interesting discussion of methods for running Tomcat under systemd

  • We can set the ulimit options that used to be in /etc/default/tomcat7 with systemd’s LimitNOFILE and LimitAS (see the systemd.exec man page)

    • Note that we need to use infinity instead of unlimited for the address space
  • Create accounts for Bosun from IITA and Valerio from ICARDA / CGMEL on DSpace Test

  • Maria Garruccio asked me for a list of author affiliations from all of their submitted items so she can clean them up

  • I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:

dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
COPY 1109
  • Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP
  • Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 54.226.25.74
    241 40.77.167.13
    272 46.101.86.248
    297 35.237.175.180
    332 45.5.184.72
    355 34.218.226.147
    404 66.249.64.155
   4637 205.186.128.185
   4637 70.32.83.92
   9265 45.5.186.2
  • I think it’s the usual IPs:

    • 45.5.186.2 is CIAT
    • 70.32.83.92 is CCAFS
    • 205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)
  • Following up on the thumbnail issue that we had in 2018-12

  • It looks like the two items with problematic PDFs both have thumbnails now:

  • Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace’s filter-media:

$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391

2019-01-24

  • I noticed Ubuntu’s Ghostscript 9.26 works on some troublesome PDFs where Arch’s Ghostscript 9.26 doesn’t, so the fix for the first/last page crash is not the patch I found yesterday
  • Ubuntu’s Ghostscript uses another patch from Ghostscript git (upstream bug report)
  • I re-compiled Arch’s ghostscript with the patch and then I was able to generate a thumbnail from one of the troublesome PDFs
  • Before and after:
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
Food safety Kenya fruits.pdf[0]=>Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
  • I reported it to the Arch Linux bug tracker (61513)
  • I told Atmire to go ahead with the Metadata Quality Module addition based on our 5_x-dev branch (657)
  • Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    305 3.81.136.184
    306 3.83.14.11
    306 52.54.252.47
    325 54.221.57.180
    378 66.249.64.157
    424 54.70.40.11
    497 47.29.247.74
    783 35.237.175.180
   1108 66.249.64.155
   2378 45.5.186.2
  • 45.5.186.2 is CIAT and 66.249.64.155 is Google… hmmm.
  • Linode sent another alert this morning, here are the top ten IPs active during that time:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    360 3.89.134.93
    362 34.230.15.139
    366 100.24.48.177
    369 18.212.208.240
    377 3.81.136.184
    404 54.221.57.180
    506 66.249.64.155
   4642 70.32.83.92
   4643 205.186.128.185
   8593 45.5.186.2
  • Just double checking what CIAT is doing, they are mainly hitting the REST API:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n