mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-01 09:45:58 +01:00
1063 lines
56 KiB
Markdown
1063 lines
56 KiB
Markdown
---
|
|
title: "January, 2019"
|
|
date: 2019-01-02T09:48:30+02:00
|
|
author: "Alan Orth"
|
|
categories: ["Notes"]
|
|
---
|
|
|
|
## 2019-01-02
|
|
|
|
- Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
|
|
- I don't see anything interesting in the web server logs around that time though:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
92 40.77.167.4
|
|
99 210.7.29.100
|
|
120 38.126.157.45
|
|
177 35.237.175.180
|
|
177 40.77.167.32
|
|
216 66.249.75.219
|
|
225 18.203.76.93
|
|
261 46.101.86.248
|
|
357 207.46.13.1
|
|
903 54.70.40.11
|
|
```
|
|
|
|
<!--more-->
|
|
|
|
- Analyzing the types of requests made by the top few IPs during that time:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
|
|
30 bitstream
|
|
534 discover
|
|
352 handle
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 207.46.13.1 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
|
|
194 bitstream
|
|
345 handle
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 46.101.86.248 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
|
|
261 handle
|
|
```
|
|
|
|
- It's not clear to me what was causing the outbound traffic spike
|
|
- Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):
|
|
|
|
```
|
|
Moving: 81742 into core statistics-2010
|
|
Moving: 1837285 into core statistics-2011
|
|
Moving: 3764612 into core statistics-2012
|
|
Moving: 4557946 into core statistics-2013
|
|
Moving: 5483684 into core statistics-2014
|
|
Moving: 2941736 into core statistics-2015
|
|
Moving: 5926070 into core statistics-2016
|
|
Moving: 10562554 into core statistics-2017
|
|
Moving: 18497180 into core statistics-2018
|
|
```
|
|
|
|
- This could by why the outbound traffic rate was high, due to the S3 backup that run at 3:30AM...
|
|
- Run all system updates on DSpace Test (linode19) and reboot the server
|
|
|
|
## 2019-01-03
|
|
|
|
- Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:
|
|
|
|
```
|
|
$ sudo docker pull postgres:9.6-alpine
|
|
$ sudo docker rm dspacedb
|
|
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
|
|
```
|
|
|
|
- Testing DSpace 5.9 with Tomcat 8.5.37 on my local machine and I see that Atmire's Listings and Reports still doesn't work
|
|
- After logging in via XMLUI and clicking the Listings and Reports link from the sidebar it redirects me to a JSPUI login page
|
|
- If I log in again there the Listings and Reports work... hmm.
|
|
- The JSPUI application—which Listings and Reports depends upon—also does not load, though the error is perhaps unrelated:
|
|
|
|
```
|
|
2019-01-03 14:45:21,727 INFO org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
|
|
2019-01-03 14:45:21,971 INFO org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
|
|
2019-01-03 14:45:22,115 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error
|
|
-- Method: GET
|
|
-- Parameters were:
|
|
|
|
org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discovery/static-tagcloud-facet.jsp (line: [57], column: [8]) No tag [tagcloud] defined in tag library imported with prefix [dspace]
|
|
at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:41)
|
|
at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:291)
|
|
at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:97)
|
|
at org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:347)
|
|
at org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:380)
|
|
at org.apache.jasper.compiler.Parser.parseDirective(Parser.java:481)
|
|
at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1445)
|
|
at org.apache.jasper.compiler.Parser.parseBody(Parser.java:1683)
|
|
at org.apache.jasper.compiler.Parser.parseOptionalBody(Parser.java:1016)
|
|
at org.apache.jasper.compiler.Parser.parseCustomTag(Parser.java:1291)
|
|
at org.apache.jasper.compiler.Parser.parseElements(Parser.java:1470)
|
|
at org.apache.jasper.compiler.Parser.parse(Parser.java:144)
|
|
at org.apache.jasper.compiler.ParserController.doParse(ParserController.java:244)
|
|
at org.apache.jasper.compiler.ParserController.parse(ParserController.java:105)
|
|
at org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:202)
|
|
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:373)
|
|
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:350)
|
|
at org.apache.jasper.compiler.Compiler.compile(Compiler.java:334)
|
|
at org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:595)
|
|
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:399)
|
|
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
|
|
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
|
|
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
|
|
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
|
|
at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:728)
|
|
at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:470)
|
|
at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:395)
|
|
at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:316)
|
|
at org.dspace.app.webui.util.JSPManager.showJSP(JSPManager.java:60)
|
|
at org.apache.jsp.index_jsp._jspService(index_jsp.java:191)
|
|
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
|
|
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
|
|
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:476)
|
|
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:386)
|
|
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:330)
|
|
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
|
|
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
|
|
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
|
|
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
|
|
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
|
|
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493)
|
|
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
|
|
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
|
|
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
|
|
at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:650)
|
|
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
|
|
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
|
|
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800)
|
|
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
|
|
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
|
|
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
|
|
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
|
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
|
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
|
|
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
|
|
at java.lang.Thread.run(Thread.java:748)
|
|
```
|
|
|
|
- I notice that I get different JSESSIONID cookies for `/` (XMLUI) and `/jspui` (JSPUI) on Tomcat 8.5.37, I wonder if it's the same on Tomcat 7.0.92... yes I do.
|
|
- Hmm, on Tomcat 7.0.92 I see that I get a `dspace.current.user.id` session cookie after logging into XMLUI, and then when I browse to JSPUI I am still logged in...
|
|
- I didn't see that cookie being set on Tomcat 8.5.37
|
|
- I sent a message to the dspace-tech mailing list to ask
|
|
|
|
## 2019-01-04
|
|
|
|
- Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don't see anything around that time in the web server logs:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
189 207.46.13.192
|
|
217 31.6.77.23
|
|
340 66.249.70.29
|
|
349 40.77.167.86
|
|
417 34.218.226.147
|
|
630 207.46.13.173
|
|
710 35.237.175.180
|
|
790 40.77.167.87
|
|
1776 66.249.70.27
|
|
2099 54.70.40.11
|
|
```
|
|
|
|
- I'm thinking about trying to validate our `dc.subject` terms against [AGROVOC webservices](http://aims.fao.org/agrovoc/webservices)
|
|
- There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for `SOIL`:
|
|
|
|
```
|
|
$ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en
|
|
HTTP/1.1 200 OK
|
|
Access-Control-Allow-Origin: *
|
|
Connection: Keep-Alive
|
|
Content-Length: 493
|
|
Content-Type: application/json; charset=utf-8
|
|
Date: Fri, 04 Jan 2019 13:44:27 GMT
|
|
Keep-Alive: timeout=5, max=100
|
|
Server: Apache
|
|
Strict-Transport-Security: max-age=63072000; includeSubdomains
|
|
Vary: Accept
|
|
X-Content-Type-Options: nosniff
|
|
X-Frame-Options: ALLOW-FROM http://aims.fao.org
|
|
|
|
{
|
|
"@context": {
|
|
"@language": "en",
|
|
"altLabel": "skos:altLabel",
|
|
"hiddenLabel": "skos:hiddenLabel",
|
|
"isothes": "http://purl.org/iso25964/skos-thes#",
|
|
"onki": "http://schema.onki.fi/onki#",
|
|
"prefLabel": "skos:prefLabel",
|
|
"results": {
|
|
"@container": "@list",
|
|
"@id": "onki:results"
|
|
},
|
|
"skos": "http://www.w3.org/2004/02/skos/core#",
|
|
"type": "@type",
|
|
"uri": "@id"
|
|
},
|
|
"results": [
|
|
{
|
|
"lang": "en",
|
|
"prefLabel": "soil",
|
|
"type": [
|
|
"skos:Concept"
|
|
],
|
|
"uri": "http://aims.fao.org/aos/agrovoc/c_7156",
|
|
"vocab": "agrovoc"
|
|
}
|
|
],
|
|
"uri": ""
|
|
}
|
|
```
|
|
|
|
- The API does not appear to be case sensitive (searches for `SOIL` and `soil` return the same thing)
|
|
- I'm a bit confused that there's no obvious return code or status when a term is not found, for example `SOILS`:
|
|
|
|
```
|
|
HTTP/1.1 200 OK
|
|
Access-Control-Allow-Origin: *
|
|
Connection: Keep-Alive
|
|
Content-Length: 367
|
|
Content-Type: application/json; charset=utf-8
|
|
Date: Fri, 04 Jan 2019 13:48:31 GMT
|
|
Keep-Alive: timeout=5, max=100
|
|
Server: Apache
|
|
Strict-Transport-Security: max-age=63072000; includeSubdomains
|
|
Vary: Accept
|
|
X-Content-Type-Options: nosniff
|
|
X-Frame-Options: ALLOW-FROM http://aims.fao.org
|
|
|
|
{
|
|
"@context": {
|
|
"@language": "en",
|
|
"altLabel": "skos:altLabel",
|
|
"hiddenLabel": "skos:hiddenLabel",
|
|
"isothes": "http://purl.org/iso25964/skos-thes#",
|
|
"onki": "http://schema.onki.fi/onki#",
|
|
"prefLabel": "skos:prefLabel",
|
|
"results": {
|
|
"@container": "@list",
|
|
"@id": "onki:results"
|
|
},
|
|
"skos": "http://www.w3.org/2004/02/skos/core#",
|
|
"type": "@type",
|
|
"uri": "@id"
|
|
},
|
|
"results": [],
|
|
"uri": ""
|
|
}
|
|
```
|
|
|
|
- I guess the `results` object will just be empty...
|
|
- Another way would be to try with SPARQL, perhaps using the Python 2.7 [sparql-client](https://pypi.org/project/sparql-client/):
|
|
|
|
```
|
|
$ python2.7 -m virtualenv /tmp/sparql
|
|
$ . /tmp/sparql/bin/activate
|
|
$ pip install sparql-client ipython
|
|
$ ipython
|
|
In [10]: import sparql
|
|
In [11]: s = sparql.Service("http://agrovoc.uniroma2.it:3030/agrovoc/sparql", "utf-8", "GET")
|
|
In [12]: statement=('PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
|
|
...: 'SELECT '
|
|
...: '?label '
|
|
...: 'WHERE { '
|
|
...: '{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } '
|
|
...: 'FILTER regex(str(?label), "^fish", "i") . '
|
|
...: '} LIMIT 10')
|
|
In [13]: result = s.query(statement)
|
|
In [14]: for row in result.fetchone():
|
|
...: print(row)
|
|
...:
|
|
(<Literal "fish catching"@en>,)
|
|
(<Literal "fish harvesting"@en>,)
|
|
(<Literal "fish meat"@en>,)
|
|
(<Literal "fish roe"@en>,)
|
|
(<Literal "fish conversion"@en>,)
|
|
(<Literal "fisheries catches (composition)"@en>,)
|
|
(<Literal "fishtail palm"@en>,)
|
|
(<Literal "fishflies"@en>,)
|
|
(<Literal "fishery biology"@en>,)
|
|
(<Literal "fish production"@en>,)
|
|
```
|
|
|
|
- The SPARQL query comes from my notes in [2017-08]({{< relref "2017-08.md" >}})
|
|
|
|
## 2019-01-06
|
|
|
|
- I built a clean DSpace 5.8 installation from the upstream `dspace-5.8` tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37
|
|
- If I log into XMLUI and then nagivate to JSPUI I need to log in again
|
|
- XMLUI does not set the `dspace.current.user.id` session cookie in Tomcat 8.5.37 for some reason
|
|
- I sent an update to the dspace-tech mailing list to ask for more help troubleshooting
|
|
|
|
## 2019-01-07
|
|
|
|
- I built a clean DSpace 6.3 installation from the upstream `dspace-6.3` tag and the issue with the XMLUI/JSPUI login is still there with Tomcat 8.5.37
|
|
- If I log into XMLUI and then nagivate to JSPUI I need to log in again
|
|
- XMLUI does not set the `dspace.current.user.id` session cookie in Tomcat 8.5.37 for some reason
|
|
- I sent an update to the dspace-tech mailing list to ask for more help troubleshooting
|
|
|
|
## 2019-01-08
|
|
|
|
- Tim Donohue responded to my thread about the cookies on the dspace-tech mailing list
|
|
- He suspects it's a change of behavior in Tomcat 8.5, and indeed I see a mention of new cookie processing in the [Tomcat 8.5 migration guide](https://tomcat.apache.org/migration-85.html#Cookies)
|
|
- I tried to switch my XMLUI and JSPUI contexts to use the `LegacyCookieProcessor`, but it didn't seem to help
|
|
- I [filed DS-4140 on the DSpace issue tracker](https://jira.duraspace.org/browse/DS-4140)
|
|
|
|
## 2019-01-11
|
|
|
|
- Tezira wrote to say she has stopped receiving the `DSpace Submission Approved and Archived` emails from CGSpace as of January 2nd
|
|
- I told her that I haven't done anything to disable it lately, but that I would check
|
|
- Bizu also says she hasn't received them lately
|
|
|
|
## 2019-01-14
|
|
|
|
- Day one of CGSpace AReS meeting in Amman
|
|
|
|
## 2019-01-15
|
|
|
|
- Day two of CGSpace AReS meeting in Amman
|
|
- Discuss possibly extending the [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) to make community and collection statistics available
|
|
- Discuss new "final" CG Core document and some changes that we'll need to do on CGSpace and other repositories
|
|
- We agreed to try to stick to pure Dublin Core where possible, then use fields that exist in standard DSpace, and use "cg" namespace for everything else
|
|
- Major changes are to move `dc.contributor.author` to `dc.creator` (which MELSpace and WorldFish are already using in their DSpace repositories)
|
|
- I am testing the speed of the WorldFish DSpace repository's REST API and it's five to ten times faster than CGSpace as I tested in [2018-10]({{< relref "2018-10.md" >}}):
|
|
|
|
```
|
|
$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
|
|
|
|
0.16s user 0.03s system 3% cpu 5.185 total
|
|
0.17s user 0.02s system 2% cpu 7.123 total
|
|
0.18s user 0.02s system 6% cpu 3.047 total
|
|
```
|
|
|
|
- In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
157 31.6.77.23
|
|
192 54.70.40.11
|
|
202 66.249.64.157
|
|
207 40.77.167.204
|
|
220 157.55.39.140
|
|
326 197.156.105.116
|
|
385 207.46.13.158
|
|
1211 35.237.175.180
|
|
1830 66.249.64.155
|
|
2482 45.5.186.2
|
|
```
|
|
|
|
## 2019-01-16
|
|
|
|
- Day three of CGSpace AReS meeting in Amman
|
|
- We discussed CG Core 2.0 metadata and decided some action points
|
|
- We discussed branding of AReS tool
|
|
- Notes from our CG Core 2.0 metadata discussion:
|
|
- Not Dublin Core:
|
|
- dc.subtype
|
|
- dc.peer-reviewed
|
|
- Dublin Core, possible action for CGSpace:
|
|
- dc.description:
|
|
- We use dc.description.abstract, dc.description (Notes), dc.description.version (Peer review status), dc.description.sponsorship (Funder)
|
|
- Maybe move abstract to dc.description
|
|
- Maybe notes moves to cg.description.notes???
|
|
- Maybe move dc.description.version to cg.peer-reviewed or cg.peer-review-status???
|
|
- Move dc.description.sponsorship to cg.contributor.donor???
|
|
- dc.subject:
|
|
- Wait for guidance, evaluate technical implications (Google indexing, OAI, etc)
|
|
- Move dc.contributor.author to dc.creator
|
|
- dc.contributor Project
|
|
- Recommend against creating new fields for all projects
|
|
- We use collections projects/themes/etc
|
|
- dc.contributor Project Lead Center
|
|
- MELSpace uses cg.contributor.project-lead-institute (institute is more generic than center)
|
|
- Maybe we use?
|
|
- dc.contributor Partner
|
|
- Wait for guidance
|
|
- MELSpace uses cg.contibutor.center (?)
|
|
- dc.contributor Donor
|
|
- Use cg.contributor.donor
|
|
- dc.date
|
|
- Wait for guidance, maybe move dc.date.issued?
|
|
- dc.date.accessioned and dc.date.available are automatic in DSpace
|
|
- dc.language
|
|
- Move dc.language.iso to dc.language
|
|
- dc.identifier
|
|
- Move cg.identifier.url to dc.identifier
|
|
- dc.identifier bibliographicCitation
|
|
- dc.identifier.citation should move to dc.bibliographicCitation
|
|
- dc.description.notes
|
|
- Wait for guidance, maybe move to cg.description.notes ???
|
|
- dc.relation
|
|
- Maybe move cg.link.reference
|
|
- Perhaps consolodate cg.link.audio etc there...?
|
|
- dc.relation.isPartOf
|
|
- Move dc.relation.ispartofseries to dc.relation.isPartOf
|
|
- dc.audience
|
|
- Move cg.targetaudience to dc.audience
|
|
- Something happened to the Solr usage statistics on CGSpace
|
|
- I looked on the server and the Solr cores are there (56GB!), and I don't see any obvious errors in dmesg or anything
|
|
- I see that the server hasn't been rebooted in 26 days so I rebooted it
|
|
- After reboot the Solr stats are still messed up in the Atmire Usage Stats module, it only shows 2019-01!
|
|
|
|
![Solr stats fucked up](/cgspace-notes/2019/01/solr-stats-incorrect.png)
|
|
|
|
- In the Solr admin UI I see the following error:
|
|
|
|
```
|
|
statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
|
|
```
|
|
|
|
- Looking in the Solr log I see this:
|
|
|
|
```
|
|
2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
|
|
org.apache.solr.common.SolrException: Error opening new searcher
|
|
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
|
|
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
|
|
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
|
|
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
|
|
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
|
|
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
|
|
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
|
|
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
|
|
at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
|
|
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
|
|
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
|
|
at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
|
|
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
|
|
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
|
|
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
|
|
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
|
|
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
|
|
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
|
|
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
|
|
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
|
|
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
|
|
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
|
|
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
|
|
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
|
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
|
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
|
|
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
|
|
at java.lang.Thread.run(Thread.java:748)
|
|
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
|
|
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
|
|
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
|
|
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
|
|
... 31 more
|
|
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
|
|
at org.apache.lucene.store.Lock.obtain(Lock.java:89)
|
|
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
|
|
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
|
|
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
|
|
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
|
|
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
|
|
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
|
|
... 33 more
|
|
2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
|
|
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
|
|
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
|
|
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
|
|
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
|
|
at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)
|
|
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:258)
|
|
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
|
|
at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
|
|
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
|
|
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
|
|
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
|
|
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
|
|
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
|
|
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
|
|
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
|
|
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
|
|
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
|
|
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
|
|
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
|
|
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
|
|
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
|
|
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
|
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
|
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
|
|
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
|
|
at java.lang.Thread.run(Thread.java:748)
|
|
Caused by: org.apache.solr.common.SolrException: Unable to create core [statistics-2018]
|
|
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:507)
|
|
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
|
|
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
|
|
... 27 more
|
|
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
|
|
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
|
|
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
|
|
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
|
|
... 29 more
|
|
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
|
|
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
|
|
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
|
|
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
|
|
... 31 more
|
|
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
|
|
at org.apache.lucene.store.Lock.obtain(Lock.java:89)
|
|
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
|
|
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
|
|
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
|
|
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
|
|
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
|
|
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
|
|
... 33 more
|
|
```
|
|
|
|
- I found some threads on StackOverflow etc discussing this and several suggested increasing the address space for the shell with ulimit
|
|
- I added `ulimit -v unlimited` to the `/etc/default/tomcat7` and restarted Tomcat and now Solr is working again:
|
|
|
|
![Solr stats working](/cgspace-notes/2019/01/solr-stats-incorrect.png)
|
|
|
|
- Some StackOverflow discussions related to this:
|
|
- https://stackoverflow.com/questions/2895417/solrexception-internal-server-error/3035916#3035916
|
|
- https://stackoverflow.com/questions/11683850/how-much-memory-could-vm-use
|
|
- https://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed/8893684#8893684
|
|
- Abenet was asking if the Atmire Usage Stats are correct because they are over 2 million the last few months...
|
|
- For 2019-01 alone the Usage Stats are already around 1.2 million
|
|
- I tried to look in the nginx logs to see how many raw requests there are so far this month and it's about 1.4 million:
|
|
|
|
```
|
|
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
|
|
1442874
|
|
|
|
real 0m17.161s
|
|
user 0m16.205s
|
|
sys 0m2.396s
|
|
```
|
|
|
|
## 2019-01-17
|
|
|
|
- Send reminder to Atmire about purchasing the [MQM module](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657)
|
|
- Trying to decide the solid action points for CGSpace on the CG Core 2.0 metadata...
|
|
- It's difficult to decide some of these because the current CG Core 2.0 document does not provide guidance or rationale (yet)!
|
|
- Also, there is not a good Dublin Core reference (or maybe I just don't understand?)
|
|
- Several authoritative documents on Dublin Core appear to be:
|
|
- [Dublin Core Metadata Element Set, Version 1.1: Reference Description](http://dublincore.org/documents/dces/)
|
|
- [DCMI Metadata Terms](http://www.dublincore.org/documents/dcmi-terms/)
|
|
- And what is the relationship between DC and DCTERMS?
|
|
- DSpace uses DCTERMS in the metadata it embeds in XMLUI item views!
|
|
- We really need to look at this more carefully and see the impacts that might be made from switching core fields like languages, abstract, authors, etc
|
|
- We can check WorldFish and MELSpace repositories to see what effects these changes have had on theirs because they have already adopted some of these changes...
|
|
- I think I understand the difference between DC and DCTERMS finally: DC is the original set of fifteen elements and DCTERMS is the newer version that was supposed to address much of the drawbacks of the original with regards to digital content
|
|
- We might be able to use some proper fields for citation, abstract, etc that are part of DCTERMS
|
|
- To make matters more confusing, there is also "qualified Dublin Core" that uses the original fifteen elements of legacy DC and qualifies them, like `dc.date.accessioned`
|
|
- According to Wikipedia [Qualified Dublin Core was superseded by DCTERMS in 2008](https://en.wikipedia.org/wiki/Dublin_Core)!
|
|
- So we should be trying to use DCTERMS where possible, unless it is some internal thing that might mess up DSpace (like dates)
|
|
- "Elements 1.1" means legacy DC
|
|
- Possible action list for CGSpace:
|
|
- dc.description.abstract → dcterms.abstract
|
|
- dc.description.version → cg.peer-reviewed (or cg.peer-review-status?)
|
|
- dc.description.sponsorship → cg.contributor.donor
|
|
- dc.contributor.author → dc.creator
|
|
- dc.language.iso → dcterms.language
|
|
- cg.identifier.url → dcterms.identifier
|
|
- dc.identifier.citation → dcterms.bibliographicCitation
|
|
- dc.relation.ispartofseries → dcterms.isPartOf
|
|
- cg.targetaudience → dcterms.audience
|
|
|
|
## 2019-01-19
|
|
|
|
- There's no official set of Dublin Core qualifiers so I can't tell if things like `dc.contributor.author` that are used by DSpace are official
|
|
- I found a great [presentation from 2015 by the Digital Repository of Ireland](https://www.dri.ie/sites/default/files/files/qualified-dublin-core-metadata-guidelines.pdf) that discusses using MARC Relator Terms with Dublin Core elements
|
|
- It seems that `dc.contributor.author` would be a supported term according to this [Library of Congress list](https://memory.loc.gov/diglib/loc.terms/relators/dc-contributor.html) linked from the [Dublin Core website](http://dublincore.org/usage/documents/relators/)
|
|
- The Library of Congress document specifically says:
|
|
|
|
|
|
These terms conform with the DCMI Abstract Model and may be used in DCMI application profiles. DCMI endorses their use with Dublin Core elements as indicated.
|
|
|
|
## 2019-01-20
|
|
|
|
- That's weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:
|
|
|
|
```
|
|
# w
|
|
04:46:14 up 213 days, 7:25, 4 users, load average: 1.94, 1.50, 1.35
|
|
```
|
|
|
|
- I've definitely rebooted it several times in the past few months... according to `journalctl -b` it was a few weeks ago on 2019-01-02
|
|
- I re-ran the Ansible DSpace tag, ran all system updates, and rebooted the host
|
|
- After rebooting I notice that the Linode kernel went down from 4.19.8 to 4.18.16...
|
|
- Atmire sent a quote on our [ticket about purchasing the Metadata Quality Module (MQM) for DSpace 5.8](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657)
|
|
- Abenet asked me for an [OpenSearch query that could generate and RSS feed for items in the Livestock CRP](https://cgspace.cgiar.org/open-search/discover?query=crpsubject:Livestock&sort_by=3&order=DESC)
|
|
- According to my notes, `sort_by=3` is accession date (as configured in `dspace.cfg`)
|
|
- The query currently shows 3023 items, but a [Discovery search for Livestock CRP only returns 858 items](https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&filter_relational_operator_1=equals&filter_1=Livestock&submit_apply_filter=&query=)
|
|
- That query seems to return items tagged with `Livestock and Fish` CRP as well... hmm.
|
|
|
|
## 2019-01-21
|
|
|
|
- Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04's Tomcat 8.5
|
|
- I could either run with a simple `tomcat7.service` like this:
|
|
|
|
```
|
|
[Unit]
|
|
Description=Apache Tomcat 7 Web Application Container
|
|
After=network.target
|
|
[Service]
|
|
Type=forking
|
|
ExecStart=/path/to/apache-tomcat-7.0.92/bin/startup.sh
|
|
ExecStop=/path/to/apache-tomcat-7.0.92/bin/shutdown.sh
|
|
User=aorth
|
|
Group=aorth
|
|
[Install]
|
|
WantedBy=multi-user.target
|
|
```
|
|
|
|
- Or try to use adapt a real systemd service like Arch Linux's:
|
|
|
|
```
|
|
[Unit]
|
|
Description=Tomcat 7 servlet container
|
|
After=network.target
|
|
|
|
[Service]
|
|
Type=forking
|
|
PIDFile=/var/run/tomcat7.pid
|
|
Environment=CATALINA_PID=/var/run/tomcat7.pid
|
|
Environment=TOMCAT_JAVA_HOME=/usr/lib/jvm/default-runtime
|
|
Environment=CATALINA_HOME=/usr/share/tomcat7
|
|
Environment=CATALINA_BASE=/usr/share/tomcat7
|
|
Environment=CATALINA_OPTS=
|
|
Environment=ERRFILE=SYSLOG
|
|
Environment=OUTFILE=SYSLOG
|
|
|
|
ExecStart=/usr/bin/jsvc \
|
|
-Dcatalina.home=${CATALINA_HOME} \
|
|
-Dcatalina.base=${CATALINA_BASE} \
|
|
-Djava.io.tmpdir=/var/tmp/tomcat7/temp \
|
|
-cp /usr/share/java/commons-daemon.jar:/usr/share/java/eclipse-ecj.jar:${CATALINA_HOME}/bin/bootstrap.jar:${CATALINA_HOME}/bin/tomcat-juli.jar \
|
|
-user tomcat7 \
|
|
-java-home ${TOMCAT_JAVA_HOME} \
|
|
-pidfile /var/run/tomcat7.pid \
|
|
-errfile ${ERRFILE} \
|
|
-outfile ${OUTFILE} \
|
|
$CATALINA_OPTS \
|
|
org.apache.catalina.startup.Bootstrap
|
|
|
|
ExecStop=/usr/bin/jsvc \
|
|
-pidfile /var/run/tomcat7.pid \
|
|
-stop \
|
|
org.apache.catalina.startup.Bootstrap
|
|
|
|
[Install]
|
|
WantedBy=multi-user.target
|
|
```
|
|
|
|
- I see that `jsvc` and `libcommons-daemon-java` are both available on Ubuntu so that should be easy to port
|
|
- We probably don't need Eclipse Java Bytecode Compiler (ecj)
|
|
- I tested Tomcat 7.0.92 on Arch Linux using the `tomcat7.service` with `jsvc` and it works... nice!
|
|
- I think I might manage this the same way I do the restic releases in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public), where I download a specific version and symlink to some generic location without the version number
|
|
- I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:
|
|
|
|
```
|
|
$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
|
|
<result name="response" numFound="33" start="0">
|
|
$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
|
|
<result name="response" numFound="241" start="0">
|
|
```
|
|
|
|
- I opened an issue on the GitHub issue tracker ([#10](https://github.com/ilri/dspace-statistics-api/issues/10))
|
|
- I don't think the [SolrClient library](https://solrclient.readthedocs.io/en/latest/) we are currently using supports these type of queries so we might have to just do raw queries with requests
|
|
- The [pysolr](https://github.com/django-haystack/pysolr) library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):
|
|
|
|
```
|
|
import pysolr
|
|
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
|
|
results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
|
|
print(results.facets['facet_fields'])
|
|
{'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
|
|
```
|
|
|
|
- If I double check one item from above, for example `77572`, it appears this is only working on the current statistics core and not the shards:
|
|
|
|
```
|
|
import pysolr
|
|
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
|
|
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
|
|
print(results.hits)
|
|
646
|
|
solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
|
|
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
|
|
print(results.hits)
|
|
595
|
|
```
|
|
|
|
- So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON
|
|
- This enumerates the list of Solr cores and returns JSON format:
|
|
|
|
```
|
|
http://localhost:3000/solr/admin/cores?action=STATUS&wt=json
|
|
```
|
|
|
|
- I think I figured out how to search across shards, I needed to give the whole URL to each other core
|
|
- Now I get more results when I start adding the other statistics cores:
|
|
|
|
```
|
|
$ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound<result name="response" numFound="2061320" start="0">
|
|
$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound
|
|
<result name="response" numFound="16280292" start="0" maxScore="1.0">
|
|
$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&indent=on&rows=0&q=*:*' | grep numFound
|
|
<result name="response" numFound="25606142" start="0" maxScore="1.0">
|
|
$ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&indent=on&rows=0&q=*:*' | grep numFound
|
|
<result name="response" numFound="31532212" start="0" maxScore="1.0">
|
|
```
|
|
|
|
- I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the `shards` parameter to each query to make the search distributed among the cores
|
|
- I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a `shards` query string
|
|
- A few things I noticed:
|
|
- Solr doesn't mind if you use an empty `shards` parameter
|
|
- Solr doesn't mind if you have an extra comma at the end of the `shards` parameter
|
|
- If you are searching multiple cores, you need to include the base core in the `shards` parameter as well
|
|
- For example, compare the following two queries, first including the base core and the shard in the `shards` parameter, and then only including the shard:
|
|
|
|
```
|
|
$ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
|
|
<result name="response" numFound="275" start="0" maxScore="12.205825">
|
|
$ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound
|
|
<result name="response" numFound="241" start="0" maxScore="12.205825">
|
|
```
|
|
|
|
## 2019-01-22
|
|
|
|
- Release [version 0.9.0 of the dspace-statistics-api](https://github.com/ilri/dspace-statistics-api/releases/tag/v0.9.0) to address the issue of querying multiple Solr statistics shards
|
|
- I deployed it on DSpace Test (linode19) and restarted the indexer and now it shows all the stats from 2018 as well (756 pages of views, intead of 6)
|
|
- I deployed it on CGSpace (linode18) and restarted the indexer as well
|
|
- Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
155 40.77.167.106
|
|
176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
|
|
189 107.21.16.70
|
|
217 54.83.93.85
|
|
310 46.174.208.142
|
|
346 83.103.94.48
|
|
360 45.5.186.2
|
|
595 154.113.73.30
|
|
716 196.191.127.37
|
|
915 35.237.175.180
|
|
```
|
|
|
|
- 35.237.175.180 is known to us
|
|
- I don't think we've seen 196.191.127.37 before. Its user agent is:
|
|
|
|
```
|
|
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
|
|
```
|
|
|
|
- Interestingly this IP is located in Addis Ababa...
|
|
- Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:
|
|
|
|
```
|
|
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
|
|
```
|
|
|
|
## 2019-01-23
|
|
|
|
- Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:
|
|
|
|
{{< tweet 1086330519904673793 >}}
|
|
|
|
- The shortened link is [goo.gl/fb/VRj9Gq](goo.gl/fb/VRj9Gq) and it shows a "Dynamic Link not found" error from Firebase:
|
|
|
|
![Dynamic Link not found](/cgspace-notes/2019/01/firebase-link-not-found.png)
|
|
|
|
- Apparently Google announced last year that they plan to [discontinue the shortner and transition to Firebase Dynamic Links in March, 2019](https://developers.googleblog.com/2018/03/transitioning-google-url-shortener.html), so maybe this is related...
|
|
|
|
- Very interesting discussion of methods for [running Tomcat under systemd](https://jdebp.eu/FGA/systemd-house-of-horror/tomcat.html)
|
|
- We can set the ulimit options that used to be in `/etc/default/tomcat7` with systemd's `LimitNOFILE` and `LimitAS` (see the `systemd.exec` man page)
|
|
- Note that we need to use `infinity` instead of `unlimited` for the address space
|
|
- Create accounts for Bosun from IITA and Valerio from ICARDA / CGMEL on DSpace Test
|
|
- Maria Garruccio asked me for a list of author affiliations from all of their submitted items so she can clean them up
|
|
- I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:
|
|
|
|
```
|
|
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
|
|
COPY 1109
|
|
```
|
|
|
|
- Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP
|
|
- Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
222 54.226.25.74
|
|
241 40.77.167.13
|
|
272 46.101.86.248
|
|
297 35.237.175.180
|
|
332 45.5.184.72
|
|
355 34.218.226.147
|
|
404 66.249.64.155
|
|
4637 205.186.128.185
|
|
4637 70.32.83.92
|
|
9265 45.5.186.2
|
|
```
|
|
|
|
- I think it's the usual IPs:
|
|
- 45.5.186.2 is CIAT
|
|
- 70.32.83.92 is CCAFS
|
|
- 205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)
|
|
|
|
- Following up on the thumbnail issue that we had in [2018-12]({{< relref "2018-12.md" >}})
|
|
- It looks like the two items with problematic PDFs both have thumbnails now:
|
|
- [10568/98390](https://hdl.handle.net/10568/98390)
|
|
- [10568/98391](https://hdl.handle.net/10568/98391)
|
|
- Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace's `filter-media`:
|
|
|
|
```
|
|
$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
|
|
$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391
|
|
```
|
|
|
|
- Both of these were successful, so there must have been an update to ImageMagick or Ghostscript in Ubuntu since early 2018-12
|
|
- Looking at the apt history logs I see that on 2018-12-07 a security update for Ghostscript was installed (version 9.26~dfsg+0-0ubuntu0.16.04.3)
|
|
- I think this Launchpad discussion is relevant: https://bugs.launchpad.net/ubuntu/+source/ghostscript/+bug/1806517
|
|
- As well as the original Ghostscript bug report: https://bugs.ghostscript.com/show_bug.cgi?id=699815
|
|
|
|
## 2019-01-24
|
|
|
|
- I noticed Ubuntu's Ghostscript 9.26 works on some troublesome PDFs where Arch's Ghostscript 9.26 doesn't, so the fix for the first/last page crash is not the patch I found yesterday
|
|
- Ubuntu's Ghostscript uses another [patch from Ghostscript git](http://git.ghostscript.com/?p=ghostpdl.git;h=fae21f1668d2b44b18b84cf0923a1d5f3008a696) ([upstream bug report](https://bugs.ghostscript.com/show_bug.cgi?id=700315))
|
|
- I re-compiled Arch's ghostscript with the patch and then I was able to generate a thumbnail from one of the [troublesome PDFs](https://cgspace.cgiar.org/handle/10568/98390)
|
|
- Before and after:
|
|
|
|
```
|
|
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
|
|
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
|
|
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
|
|
Food safety Kenya fruits.pdf[0]=>Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
|
|
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
|
|
```
|
|
|
|
- I reported it to the Arch Linux bug tracker ([61513](https://bugs.archlinux.org/task/61513))
|
|
- I told Atmire to go ahead with the Metadata Quality Module addition based on our `5_x-dev` branch ([657](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657))
|
|
- Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
305 3.81.136.184
|
|
306 3.83.14.11
|
|
306 52.54.252.47
|
|
325 54.221.57.180
|
|
378 66.249.64.157
|
|
424 54.70.40.11
|
|
497 47.29.247.74
|
|
783 35.237.175.180
|
|
1108 66.249.64.155
|
|
2378 45.5.186.2
|
|
```
|
|
|
|
- 45.5.186.2 is CIAT and 66.249.64.155 is Google... hmmm.
|
|
- Linode sent another alert this morning, here are the top ten IPs active during that time:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
360 3.89.134.93
|
|
362 34.230.15.139
|
|
366 100.24.48.177
|
|
369 18.212.208.240
|
|
377 3.81.136.184
|
|
404 54.221.57.180
|
|
506 66.249.64.155
|
|
4642 70.32.83.92
|
|
4643 205.186.128.185
|
|
8593 45.5.186.2
|
|
```
|
|
|
|
- Just double checking what CIAT is doing, they are mainly hitting the REST API:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
|
|
```
|
|
|
|
- CIAT's community currently has 12,000 items in it so this is normal
|
|
- The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again...
|
|
- For example: https://goo.gl/fb/VRj9Gq
|
|
- The full [list of MARC Relators on the Library of Congress website](http://id.loc.gov/vocabulary/relators.html) linked from the [DMCI relators page](http://dublincore.org/usage/documents/relators/) is very confusing
|
|
- Looking at the default DSpace XMLUI crosswalk in [xhtml-head-item.properties](https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/config/crosswalks/xhtml-head-item.properties) I see a very complete mapping of DSpace DC and QDC fields to DCTERMS
|
|
- This is good for standards-compliant web crawlers, but what about for those harvesting via REST or OAI APIs?
|
|
- I sent a message titled "[DC, QDC, and DCTERMS: reviewing our metadata practices](https://groups.google.com/forum/#!topic/dspace-tech/phV_t51TGuE)" to the dspace-tech mailing list to ask about some of this
|
|
|
|
## 2019-01-25
|
|
|
|
- A little bit more work on getting Tomcat to run from a tarball on our [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)
|
|
- I tested by doing a Tomcat 7.0.91 installation, then switching it to 7.0.92 and it worked... nice!
|
|
- I refined the tasks so much that I was confident enough to deploy them on DSpace Test and it went very well
|
|
- Basically I just stopped tomcat7, created a dspace user, removed tomcat7, chown'd everything to the dspace user, then ran the playbook
|
|
- So now DSpace Test (linode19) is running Tomcat 7.0.92... w00t
|
|
- Now we need to monitor it for a few weeks to see if there is anything we missed, and then I can change CGSpace (linode18) as well, and we're ready for Ubuntu 18.04 too!
|
|
|
|
## 2019-01-27
|
|
|
|
- Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
189 40.77.167.108
|
|
191 157.55.39.2
|
|
263 34.218.226.147
|
|
283 45.5.184.2
|
|
332 45.5.184.72
|
|
608 5.9.6.51
|
|
679 66.249.66.223
|
|
1116 66.249.66.219
|
|
4644 205.186.128.185
|
|
4644 70.32.83.92
|
|
```
|
|
|
|
- I think it's the usual IPs:
|
|
- 70.32.83.92 is CCAFS
|
|
- 205.186.128.185 is CCAFS or perhaps another Macaroni Bros harvester (new ILRI website?)
|
|
|
|
## 2019-01-28
|
|
|
|
- Udana from WLE asked me about the interaction between their publication website and their items on CGSpace
|
|
- There is an item that is mapped into their collection from IWMI and is missing their `cg.identifier.wletheme` metadata
|
|
- I told him that, as far as I remember, when WLE introduced Phase II research themes in 2017 we decided to infer theme ownership from the collection hierarchy and we created a [WLE Phase II Research Themes](https://cgspace.cgiar.org/handle/10568/81268) subCommunity
|
|
- Perhaps they need to ask Macaroni Bros about the mapping
|
|
- Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
67 207.46.13.50
|
|
105 41.204.190.40
|
|
117 34.218.226.147
|
|
126 35.237.175.180
|
|
203 213.55.99.121
|
|
332 45.5.184.72
|
|
377 5.9.6.51
|
|
512 45.5.184.2
|
|
4644 205.186.128.185
|
|
4644 70.32.83.92
|
|
```
|
|
|
|
- There seems to be a pattern with `70.32.83.92` and `205.186.128.185` lately!
|
|
- Every morning at 8AM they are the top users... I should tell them to stagger their requests...
|
|
- I signed up for a [VisualPing](https://visualping.io/) of the [PostgreSQL JDBC driver download page](https://jdbc.postgresql.org/download.html) to my CGIAR email address
|
|
- Hopefully this will one day alert me that a new driver is released!
|
|
- Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
310 45.5.184.2
|
|
425 5.143.231.39
|
|
526 54.70.40.11
|
|
1003 199.47.87.141
|
|
1374 35.237.175.180
|
|
1455 5.9.6.51
|
|
1501 66.249.66.223
|
|
1771 66.249.66.219
|
|
2107 199.47.87.140
|
|
2540 45.5.186.2
|
|
```
|
|
|
|
- Of course there is CIAT's `45.5.186.2`, but also `45.5.184.2` appears to be CIAT... I wonder why they have two harvesters?
|
|
- `199.47.87.140` and `199.47.87.141` is TurnItIn with the following user agent:
|
|
|
|
```
|
|
TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
|
|
```
|
|
|
|
## 2019-01-29
|
|
|
|
- Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
334 45.5.184.72
|
|
429 66.249.66.223
|
|
522 35.237.175.180
|
|
555 34.218.226.147
|
|
655 66.249.66.221
|
|
844 5.9.6.51
|
|
2507 66.249.66.219
|
|
4645 70.32.83.92
|
|
4646 205.186.128.185
|
|
9329 45.5.186.2
|
|
```
|
|
|
|
- `45.5.186.2` is CIAT as usual...
|
|
- `70.32.83.92` and `205.186.128.185` are CCAFS as usual...
|
|
- `66.249.66.219` is Google...
|
|
- I'm thinking it might finally be time to increase the threshold of the Linode CPU alerts
|
|
- I adjusted the alert threshold from 250% to 275%
|
|
|
|
## 2019-01-30
|
|
|
|
- Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
273 46.101.86.248
|
|
301 35.237.175.180
|
|
334 45.5.184.72
|
|
387 5.9.6.51
|
|
527 2a01:4f8:13b:1296::2
|
|
1021 34.218.226.147
|
|
1448 66.249.66.219
|
|
4649 205.186.128.185
|
|
4649 70.32.83.92
|
|
5163 45.5.184.2
|
|
```
|
|
|
|
- I might need to adjust the threshold again, because the load average this morning was 296% and the activity looks pretty normal (as always recently)
|
|
|
|
## 2019-01-31
|
|
|
|
- Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:
|
|
|
|
```
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
436 18.196.196.108
|
|
460 157.55.39.168
|
|
460 207.46.13.96
|
|
500 197.156.105.116
|
|
728 54.70.40.11
|
|
1560 5.9.6.51
|
|
1562 35.237.175.180
|
|
1601 85.25.237.71
|
|
1894 66.249.66.219
|
|
2610 45.5.184.2
|
|
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "31/Jan/2019:0(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
318 207.46.13.242
|
|
334 45.5.184.72
|
|
486 35.237.175.180
|
|
609 34.218.226.147
|
|
620 66.249.66.219
|
|
1054 5.9.6.51
|
|
4391 70.32.83.92
|
|
4428 205.186.128.185
|
|
6758 85.25.237.71
|
|
9239 45.5.186.2
|
|
```
|
|
|
|
- `45.5.186.2` and `45.5.184.2` are CIAT as always
|
|
- `85.25.237.71` is some new server in Germany that I've never seen before with the user agent:
|
|
|
|
```
|
|
Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
|
|
```
|
|
|
|
<!-- vim: set sw=2 ts=2: -->
|