From 2589683597eaa463c96fc488d4de549b838dc6fa Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Mon, 29 Jan 2018 09:46:48 +0200
Subject: [PATCH] Add notes for 2018-01-29

---
 content/post/2018-01.md   | 65 +++++++++++++++++++++++++++++++
 public/2018-01/index.html | 80 +++++++++++++++++++++++++++++++++++++--
 public/sitemap.xml        | 10 ++---
 3 files changed, 147 insertions(+), 8 deletions(-)

diff --git a/content/post/2018-01.md b/content/post/2018-01.md
index 5f09d5373..550984728 100644
--- a/content/post/2018-01.md
+++ b/content/post/2018-01.md
@@ -1174,3 +1174,68 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
 - It reduces the start up time of Catalina by using multiple threads to start web applications in parallel
 - On my local test machine the startup time went from 70 to 30 seconds
 - See: https://tomcat.apache.org/tomcat-7.0-doc/config/host.html
+
+## 2018-01-29
+
+- CGSpace went down this morning for a few minutes, according to UptimeRobot
+- Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:
+
+```
+2018-01-29 05:30:22,226 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
+2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
+Was expecting one of:
+    "TO" ...
+    <RANGE_QUOTED> ...
+    <RANGE_GOOP> ...
+
+org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
+Was expecting one of:
+    "TO" ...
+    <RANGE_QUOTED> ...
+    <RANGE_GOOP> ...
+```
+
+- So is this an error caused by this particular client (which happens to be `Yahoo! Slurp`)?
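+- My reading of that parser error (an assumption on my part, not something I verified against the raw request) is that the `+` characters are URL-encoded spaces that were never decoded, so Solr never sees the whitespace-separated `TO` that a range query needs. The first line below is the query as Solr received it, the second is what a well-formed date range query looks like:
+
+```
+dateIssued_keyword:[1994+TO+1999]
+dateIssued_keyword:[1994 TO 1999]
+```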
+- I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 just means that nginx saw the client close the request before the response was finished
+- Perhaps this from the nginx error log is relevant?
+
+```
+2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: "GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1", upstream: "http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12", host: "cgspace.cgiar.org"
+```
+
+- I think that must be unrelated; probably the client closed the request to nginx because DSpace (Tomcat) was taking too long
+- An interesting [snippet to get the maximum and average nginx responses](https://gist.github.com/magnetikonline/11312172):
+
+```
+# awk '($9 ~ /200/) { i++;sum+=$10;max=$10>max?$10:max; } END { printf("Maximum: %d\nAverage: %d\n",max,i?sum/i:0); }' /var/log/nginx/access.log
+Maximum: 2771268
+Average: 210483
+```
+
+- I guess responses that don't fit in nginx's proxy buffers get saved to a temporary file on disk (`proxy_max_temp_file_size` defaults to 1024M), so this is definitely not the issue here, and that warning is unrelated
+- My best guess is that the Solr search error is related somehow, but I can't figure it out
+- We definitely have enough database connections, as I haven't seen a pool error in weeks:
+
+```
+$ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
+dspace.log.2018-01-20:0
+dspace.log.2018-01-21:0
+dspace.log.2018-01-22:0
+dspace.log.2018-01-23:0
+dspace.log.2018-01-24:0
+dspace.log.2018-01-25:0
+dspace.log.2018-01-26:0
+dspace.log.2018-01-27:0
+dspace.log.2018-01-28:0
+dspace.log.2018-01-29:0
+```
+
+- Adam Hunt from WLE complained that pages each take "1-2 minutes" to load, from both France and Sri Lanka
+- I asked him which particular pages, as right now pages load in 2 or 3 seconds for me
+- UptimeRobot said CGSpace went down again, and I looked at PostgreSQL and saw 211 active database connections
+- If it's not memory and it's not database, it's gotta be Tomcat threads...?
+- I decided to change the Tomcat thread settings on CGSpace (a sketch of the resulting connector configuration is below):
+  - `maxThreads` from 200 (default) to 400
+  - `processorCache` from 200 (default) to 400, [recommended to be the same as `maxThreads`](https://tomcat.apache.org/tomcat-7.0-doc/config/http.html)
+  - `minSpareThreads` from 10 (default) to 20
+  - `acceptorThreadCount` from 1 (default) to 2, [recommended to be 2 on multi-CPU systems](https://tomcat.apache.org/tomcat-7.0-doc/config/http.html)
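+- For reference, a minimal sketch of what the HTTP connector in Tomcat's `server.xml` would look like with those values (the port and the other attributes here are illustrative, not copied from CGSpace's actual config):
+
+```
+<Connector port="8443" protocol="HTTP/1.1"
+           maxThreads="400"
+           processorCache="400"
+           minSpareThreads="20"
+           acceptorThreadCount="2"
+           connectionTimeout="20000"
+           URIEncoding="UTF-8" />
+```
+
+- Connector attributes are only read at startup, so Tomcat needs a restart for these to take effect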
diff --git a/public/2018-01/index.html b/public/2018-01/index.html
index 337e1a8bb..c2ccd00ba 100644
--- a/public/2018-01/index.html
+++ b/public/2018-01/index.html
[generated Hugo output omitted: the 2018-01-29 notes above rendered to HTML, plus updated page metadata (wordCount 6793 to 7218, dateModified 2018-01-28T18:34:24+02:00)]

diff --git a/public/sitemap.xml b/public/sitemap.xml
index f3ee06fd4..482652030 100644
--- a/public/sitemap.xml
+++ b/public/sitemap.xml
[generated Hugo output omitted: <lastmod> timestamps for the affected pages updated from 2018-01-28T12:49:00+02:00 to 2018-01-28T18:34:24+02:00]