cgspace-notes/public/2018-01/index.html

<!DOCTYPE html>
<html lang="en">

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="January, 2018" />
<meta property="og:description" content="2018-01-02


Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary
The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 &#43;0000 when Uptime Robot got an HTTP 500
In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;
And just before that I see this:


Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].


Ah hah! So the pool was actually empty!
I need to increase that, let&rsquo;s try to bump it up from 50 to 75
After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw
I notice this error quite a few times in dspace.log:


2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.


And there are many of these errors every day for the past month:


$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34


Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains


" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-01/" />


<meta property="article:published_time" content="2018-01-02T08:35:54-08:00"/>

<meta property="article:modified_time" content="2018-01-11T10:42:35&#43;02:00"/>


<meta name="twitter:card" content="summary"/><meta name="twitter:title" content="January, 2018"/>
<meta name="twitter:description" content="2018-01-02


Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary
The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 &#43;0000 when Uptime Robot got an HTTP 500
In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;
And just before that I see this:


Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].


Ah hah! So the pool was actually empty!
I need to increase that, let&rsquo;s try to bump it up from 50 to 75
After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw
I notice this error quite a few times in dspace.log:


2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.


And there are many of these errors every day for the past month:


$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34


Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains


"/>
<meta name="generator" content="Hugo 0.32.4" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "January, 2018",
  "url": "https://alanorth.github.io/cgspace-notes/2018-01/",
  "wordCount": "2467",
  "datePublished": "2018-01-02T08:35:54-08:00",
  "dateModified": "2018-01-11T10:42:35&#43;02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-01/">

    <title>January, 2018 | CGSpace Notes</title>

    <!-- combined, minified CSS -->
    <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-O8wjsnz02XiyrPxnhfF6AVOv6YLBaEGRCnVF&#43;DL3gCPBy9cieyHcpixIrVyD2JS5" crossorigin="anonymous">

    
  </head>

  <body>

    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
          
          
        </nav>
      </div>
    </div>

    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>

    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">

          
<article class="blog-post">
  <header>
    <h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-01/">January, 2018</a></h2>
    <p class="blog-post-meta"><time datetime="2018-01-02T08:35:54-08:00">Tue Jan 02, 2018</time> by Alan Orth in 

<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
  </header>
  <h2 id="2018-01-02">2018-01-02</h2>

<ul>
<li>Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time</li>
<li>I didn&rsquo;t get any load alerts from Linode and the REST and XMLUI logs don&rsquo;t show anything out of the ordinary</li>
<li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li>
<li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
<li>And just before that I see this:</li>
</ul>

<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre>

<ul>
<li>Ah hah! So the pool was actually empty!</li>
<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>

<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
</code></pre>

<ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>

<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
</code></pre>

<ul>
<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains</li>
</ul>

<p></p>

<h2 id="2018-01-03">2018-01-03</h2>

<ul>
<li>I woke up to more up and down of CGSpace, this time UptimeRobot noticed a few rounds of up and down of a few minutes each and Linode also notified of high CPU load from 12 to 2 PM</li>
<li>Looks like I need to increase the database pool size again:</li>
</ul>

<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
</code></pre>

<ul>
<li>For some reason there were a lot of &ldquo;active&rdquo; connections last night:</li>
</ul>

<p><img src="/cgspace-notes/2018/01/postgres_connections-day.png" alt="CGSpace PostgreSQL connections" /></p>

<ul>
<li>The active IPs in XMLUI are:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    607 40.77.167.141
    611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
    663 188.226.169.37
    759 157.55.39.245
    887 68.180.229.254
   1037 157.55.39.175
   1068 216.244.66.245
   1495 66.249.64.91
   1934 104.196.152.243
   2219 134.155.96.78
</code></pre>

<ul>
<li>134.155.96.78 appears to be at the University of Mannheim in Germany</li>
<li>They identify as: Mozilla/5.0 (compatible; heritrix/3.2.0 +<a href="http://ifm.uni-mannheim.de">http://ifm.uni-mannheim.de</a>)</li>
<li>This appears to be the <a href="https://github.com/internetarchive/heritrix3">Internet Archive&rsquo;s open source bot</a></li>
<li>They seem to be re-using their Tomcat session so I don&rsquo;t need to do anything to them just yet:</li>
</ul>

<pre><code>$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2
</code></pre>

<ul>
<li>The API logs show the normal users:</li>
</ul>

<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     32 207.46.13.182
     38 40.77.167.132
     38 68.180.229.254
     43 66.249.64.91
     46 40.77.167.141
     49 157.55.39.245
     79 157.55.39.175
   1533 50.116.102.77
   4069 70.32.83.92
   9355 45.5.184.196
</code></pre>

<ul>
<li>In other related news I see a sizeable amount of requests coming from python-requests</li>
<li>For example, just in the last day there were 1700!</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
1773
</code></pre>

<ul>
<li>But they come from hundreds of IPs, many of which are 54.x.x.x:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
      9 54.144.87.92
      9 54.146.222.143
      9 54.146.249.249
      9 54.158.139.206
      9 54.161.235.224
      9 54.163.41.19
      9 54.163.4.51
      9 54.196.195.107
      9 54.198.89.134
      9 54.80.158.113
     10 54.198.171.98
     10 54.224.53.185
     10 54.226.55.207
     10 54.227.8.195
     10 54.242.234.189
     10 54.242.238.209
     10 54.80.100.66
     11 54.161.243.121
     11 54.205.154.178
     11 54.234.225.84
     11 54.87.23.173
     11 54.90.206.30
     12 54.196.127.62
     12 54.224.242.208
     12 54.226.199.163
     13 54.162.149.249
     13 54.211.182.255
     19 50.17.61.150
     21 54.211.119.107
    139 164.39.7.62
</code></pre>

<ul>
<li>I have no idea what these are but they seem to be coming from Amazon&hellip;</li>
<li>I guess for now I just have to increase the database connection pool&rsquo;s max active</li>
<li>It&rsquo;s currently 75 and normally I&rsquo;d just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling</li>
</ul>

<h2 id="2018-01-04">2018-01-04</h2>

<ul>
<li>CGSpace went down and up a bunch of times last night and ILRI staff were complaining a lot last night</li>
<li>The XMLUI logs show this activity:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;4/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    968 197.211.63.81
    981 213.55.99.121
   1039 66.249.64.93
   1258 157.55.39.175
   1273 207.46.13.182
   1311 157.55.39.191
   1319 157.55.39.197
   1775 66.249.64.78
   2216 104.196.152.243
   3366 66.249.64.91
</code></pre>

<ul>
<li>Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!</li>
</ul>

<pre><code>2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
</code></pre>

<ul>
<li>So for this week that is the number one problem!</li>
</ul>

<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559
</code></pre>

<ul>
<li>I will just bump the connection limit to 300 because I&rsquo;m fucking fed up with this shit</li>
<li>Once I get back to Amman I will have to try to create different database pools for different web applications, like recently discussed on the dspace-tech mailing list</li>
<li>Create accounts on CGSpace for two CTA staff km4ard@cta.int and bheenick@cta.int</li>
</ul>

<h2 id="2018-01-05">2018-01-05</h2>

<ul>
<li>Peter said that CGSpac was down last night and Tsega restarted Tomcat</li>
<li>I don&rsquo;t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:</li>
</ul>

<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559
dspace.log.2018-01-05:0
</code></pre>

<ul>
<li>Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space</li>
<li>I had a look and there is one Apache 2 log file that is 73GB, with lots of this:</li>
</ul>

<pre><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for &quot;9-16-1-RV.doc&quot; in &quot;/home/files/journals/6//articles/9/&quot;. Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
</code></pre>

<ul>
<li>I will delete the log file for now and tell Danny</li>
<li>Also, I&rsquo;m still seeing a hundred or so of the &ldquo;ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer&rdquo; errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is</li>
<li>I will run a full Discovery reindex in the mean time to see if it&rsquo;s something wrong with the Discovery Solr core</li>
</ul>

<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    110m43.985s
user    15m24.960s
sys     3m14.890s
</code></pre>

<ul>
<li>Reboot CGSpace and DSpace Test for new kernels (4.14.12-x86_64-linode92) that partially mitigate the <a href="https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/">Spectre and Meltdown CPU vulnerabilities</a></li>
</ul>

<h2 id="2018-01-06">2018-01-06</h2>

<ul>
<li>I&rsquo;m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:</li>
</ul>

<pre><code>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
</code></pre>

<ul>
<li>I posted a message to the dspace-tech mailing list to see if anyone can help</li>
</ul>

<h2 id="2018-01-09">2018-01-09</h2>

<ul>
<li>Advise Sisay about blank lines in some IITA records</li>
<li>Generate a list of author affiliations for Peter to clean up:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4515
</code></pre>

<h2 id="2018-01-10">2018-01-10</h2>

<ul>
<li>I looked to see what happened to this year&rsquo;s Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:</li>
</ul>

<pre><code>Moving: 81742 into core statistics-2010
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
        at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        ... 14 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:159)
        at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:179)
        at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
        at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
        at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
        at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
        at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
        at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
        at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
        at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        ... 16 more
</code></pre>

<ul>
<li>DSpace Test has the same error but with creating the 2017 core:</li>
</ul>

<pre><code>Moving: 2243021 into core statistics-2017
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
        at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 10 more
</code></pre>

<ul>
<li>There is interesting documentation about this on the DSpace Wiki: <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear">https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear</a></li>
<li>I&rsquo;m looking to see maybe if we&rsquo;re hitting the issues mentioned in <a href="https://jira.duraspace.org/browse/DS-2212">DS-2212</a> that were apparently fixed in DSpace 5.2</li>
<li>I can apparently search for records in the Solr stats core that have an empty <code>owningColl</code> field using this in the Solr admin query: <code>-owningColl:*</code></li>
<li>On CGSpace I see 48,000,000 records that have an <code>owningColl</code> field and 34,000,000 that don&rsquo;t:</li>
</ul>

<pre><code>$ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound 
  &quot;response&quot;:{&quot;numFound&quot;:48476327,&quot;start&quot;:0,&quot;docs&quot;:[
$ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
  &quot;response&quot;:{&quot;numFound&quot;:34879872,&quot;start&quot;:0,&quot;docs&quot;:[
</code></pre>

<ul>
<li>I tested the <code>dspace stats-util -s</code> process on my local machine and it failed the same way</li>
<li>It doesn&rsquo;t seem to be helpful, but the dspace log shows this:</li>
</ul>

<pre><code>2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
</code></pre>

<ul>
<li>Terry Brady has written some notes on the DSpace Wiki about Solr sharing issues: <a href="https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues">https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues</a></li>
<li>Uptime Robot said that CGSpace went down at around 9:43 AM</li>
<li>I looked at PostgreSQL&rsquo;s <code>pg_stat_activity</code> table and saw 161 active connections, but no pool errors in the DSpace logs:</li>
</ul>

<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-10 
0
</code></pre>

<ul>
<li>The XMLUI logs show quite a bit of activity today:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    951 207.46.13.159
    954 157.55.39.123
   1217 95.108.181.88
   1503 104.196.152.243
   6455 70.36.107.50
  11412 70.36.107.190
  16730 70.36.107.49
  17386 2607:fa98:40:9:26b6:fdff:feff:1c96
  21566 2607:fa98:40:9:26b6:fdff:feff:195d
  45384 2607:fa98:40:9:26b6:fdff:feff:1888
</code></pre>

<ul>
<li>The user agent for the top six or so IPs are all the same:</li>
</ul>

<pre><code>&quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot;
</code></pre>

<ul>
<li><code>whois</code> says they come from <a href="http://www.perfectip.net/">Perfect IP</a></li>
<li>I&rsquo;ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:</li>
</ul>

<pre><code>$ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                                                                                                                                  
49096
</code></pre>

<ul>
<li>Rather than blocking their IPs, I think I might just add their user agent to the &ldquo;badbots&rdquo; zone with Baidu, because they seem to be the only ones using that user agent:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
/537.36&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
   6796 70.36.107.50
  11870 70.36.107.190
  17323 70.36.107.49
  19204 2607:fa98:40:9:26b6:fdff:feff:1c96
  23401 2607:fa98:40:9:26b6:fdff:feff:195d 
  47875 2607:fa98:40:9:26b6:fdff:feff:1888
</code></pre>

<ul>
<li>I added the user agent to nginx&rsquo;s badbots limit req zone but upon testing the config I got an error:</li>
</ul>

<pre><code># nginx -t
nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
nginx: configuration file /etc/nginx/nginx.conf test failed
</code></pre>

<ul>
<li>According to nginx docs the <a href="https://nginx.org/en/docs/hash.html">bucket size should be a multiple of the CPU&rsquo;s cache alignment</a>, which is 64 for us:</li>
</ul>

<pre><code># cat /proc/cpuinfo | grep cache_alignment | head -n1
cache_alignment : 64
</code></pre>

<ul>
<li>On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx</li>
<li>Almost immediately the PostgreSQL connections dropped back down to 40 or so, and UptimeRobot said the site was back up</li>
<li>So that&rsquo;s interesting that we&rsquo;re not out of PostgreSQL connections (current pool maxActive is 300!) but the system is &ldquo;down&rdquo; to UptimeRobot and very slow to use</li>
<li>Linode continues to test mitigations for Meltdown and Spectre: <a href="https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/">https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/</a></li>
<li>I rebooted DSpace Test to see if the kernel will be updated (currently Linux 4.14.12-x86_64-linode92)&hellip; nope.</li>
<li>It looks like Linode will reboot the KVM hosts later this week, though</li>
<li>Udana from WLE asked if we could give him permission to upload CSVs to CGSpace (which would require super admin access)</li>
<li>Citing concerns with metadata quality, I suggested adding him on DSpace Test first</li>
<li>I opened a ticket with Atmire to ask them about DSpace 5.8 compatibility: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560</a></li>
</ul>

<h2 id="2018-01-11">2018-01-11</h2>

<ul>
<li>The PostgreSQL and firewall graphs from this week show clearly the load from the new bot from PerfectIP.net yesterday:</li>
</ul>

<p><img src="/cgspace-notes/2018/01/postgres_connections-day-perfectip.png" alt="PostgreSQL load" />
<img src="/cgspace-notes/2018/01/firewall-perfectip.png" alt="Firewall load" /></p>

<ul>
<li>Linode rebooted DSpace Test and CGSpace for their host hypervisor kernel updates</li>
<li>Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat <code>localhost_access_log</code> at the time of my sharding attempt on my test machine:</li>
</ul>

<pre><code>127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-18YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 447
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 76
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&quot; 200 2137630
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 16253
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 409 156
</code></pre>

<ul>
<li>The new core is created but when DSpace attempts to POST to it there is an HTTP 409 error</li>
<li>This is apparently a common Solr error code that means &ldquo;version conflict&rdquo;: <a href="http://yonik.com/solr/optimistic-concurrency/">http://yonik.com/solr/optimistic-concurrency/</a></li>
<li>Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot; | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
  21572 70.36.107.50
  30722 70.36.107.190
  34566 70.36.107.49
 101829 2607:fa98:40:9:26b6:fdff:feff:195d
 111535 2607:fa98:40:9:26b6:fdff:feff:1c96
 161797 2607:fa98:40:9:26b6:fdff:feff:1888
</code></pre>

<ul>
<li>Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat&rsquo;s <code>server.xml</code>:</li>
</ul>

<pre><code>&lt;Resource name=&quot;jdbc/dspaceWeb&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
          driverClassName=&quot;org.postgresql.Driver&quot;
          url=&quot;jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb&quot;
          username=&quot;dspace&quot;
          password=&quot;dspace&quot;
          initialSize='5'
          maxActive='75'
          maxIdle='15'
          minIdle='5'
          maxWait='5000'
          validationQuery='SELECT 1'
          testOnBorrow='true' /&gt;
</code></pre>

<ul>
<li>So theoretically I could name each connection &ldquo;xmlui&rdquo; or &ldquo;dspaceWeb&rdquo; or something meaningful and it would show up in PostgreSQL&rsquo;s <code>pg_stat_activity</code> table!</li>
<li>This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)</li>
<li>Also, I realized that the <code>db.jndi</code> parameter in dspace.cfg needs to match the <code>name</code> value in your applicaiton&rsquo;s context—not the <code>global</code> one</li>
<li>Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:</li>
</ul>

<pre><code>db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
</code></pre>

<ul>
<li>With that it is super easy to see where PostgreSQL connections are coming from in <code>pg_stat_activity</code></li>
</ul>

  
</article> 


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">
  

        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2018-01/">January, 2018</a></li>

<li><a href="/cgspace-notes/2017-12/">December, 2017</a></li>

<li><a href="/cgspace-notes/2017-11/">November, 2017</a></li>

<li><a href="/cgspace-notes/2017-10/">October, 2017</a></li>

<li><a href="/cgspace-notes/cgiar-library-migration/">CGIAR Library Migration</a></li>

    </ol>
  </section>

  
  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">
      
      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
      
      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
      
      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
      
    </ol>
  </section>
  
</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->

    <footer class="blog-footer">
      <p>
      
      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
      
      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>

  </body>

</html>