cgspace-notes/2018-01.md at 09f8bab513166ae4eae50513c04d752d643ec897

alanorth/cgspace-notes

Fork 0

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-25 16:08:19 +01:00

Alan Orth d3378fd7f3

Add notes

2018-03-28 09:48:08 +03:00

78 KiB

Raw Blame History

title

date

author

2018-01-02

Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
In dspace.log around that time I see many errors like "Client closed the connection before file download was complete"
And just before that I see this:

Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].

Ah hah! So the pool was actually empty!
I need to increase that, let's try to bump it up from 50 to 75
After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
I notice this error quite a few times in dspace.log:

2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.

And there are many of these errors every day for the past month:

$ grep -c "Error while searching for sidebar facets" dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34

Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains

2018-01-03

I woke up to more up and down of CGSpace, this time UptimeRobot noticed a few rounds of up and down of a few minutes each and Linode also notified of high CPU load from 12 to 2 PM
Looks like I need to increase the database pool size again:

$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909

For some reason there were a lot of "active" connections last night:

The active IPs in XMLUI are:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    607 40.77.167.141
    611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
    663 188.226.169.37
    759 157.55.39.245
    887 68.180.229.254
   1037 157.55.39.175
   1068 216.244.66.245
   1495 66.249.64.91
   1934 104.196.152.243
   2219 134.155.96.78

134.155.96.78 appears to be at the University of Mannheim in Germany
They identify as: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://ifm.uni-mannheim.de)
This appears to be the Internet Archive's open source bot
They seem to be re-using their Tomcat session so I don't need to do anything to them just yet:

$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2

The API logs show the normal users:

# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     32 207.46.13.182
     38 40.77.167.132
     38 68.180.229.254
     43 66.249.64.91
     46 40.77.167.141
     49 157.55.39.245
     79 157.55.39.175
   1533 50.116.102.77
   4069 70.32.83.92
   9355 45.5.184.196

In other related news I see a sizeable amount of requests coming from python-requests
For example, just in the last day there were 1700!

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
1773

But they come from hundreds of IPs, many of which are 54.x.x.x:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
      9 54.144.87.92
      9 54.146.222.143
      9 54.146.249.249
      9 54.158.139.206
      9 54.161.235.224
      9 54.163.41.19
      9 54.163.4.51
      9 54.196.195.107
      9 54.198.89.134
      9 54.80.158.113
     10 54.198.171.98
     10 54.224.53.185
     10 54.226.55.207
     10 54.227.8.195
     10 54.242.234.189
     10 54.242.238.209
     10 54.80.100.66
     11 54.161.243.121
     11 54.205.154.178
     11 54.234.225.84
     11 54.87.23.173
     11 54.90.206.30
     12 54.196.127.62
     12 54.224.242.208
     12 54.226.199.163
     13 54.162.149.249
     13 54.211.182.255
     19 50.17.61.150
     21 54.211.119.107
    139 164.39.7.62

I have no idea what these are but they seem to be coming from Amazon...
I guess for now I just have to increase the database connection pool's max active
It's currently 75 and normally I'd just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling

2018-01-04

CGSpace went down and up a bunch of times last night and ILRI staff were complaining a lot last night
The XMLUI logs show this activity:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    968 197.211.63.81
    981 213.55.99.121
   1039 66.249.64.93
   1258 157.55.39.175
   1273 207.46.13.182
   1311 157.55.39.191
   1319 157.55.39.197
   1775 66.249.64.78
   2216 104.196.152.243
   3366 66.249.64.91

Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!

2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].

So for this week that is the number one problem!

$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559

I will just bump the connection limit to 300 because I'm fucking fed up with this shit
Once I get back to Amman I will have to try to create different database pools for different web applications, like recently discussed on the dspace-tech mailing list
Create accounts on CGSpace for two CTA staff km4ard@cta.int and bheenick@cta.int

2018-01-05

Peter said that CGSpace was down last night and Tsega restarted Tomcat
I don't see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:

$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559
dspace.log.2018-01-05:0

Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space
I had a look and there is one Apache 2 log file that is 73GB, with lots of this:

[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade

I will delete the log file for now and tell Danny
Also, I'm still seeing a hundred or so of the "ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer" errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is
I will run a full Discovery reindex in the mean time to see if it's something wrong with the Discovery Solr core

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    110m43.985s
user    15m24.960s
sys     3m14.890s

Reboot CGSpace and DSpace Test for new kernels (4.14.12-x86_64-linode92) that partially mitigate the Spectre and Meltdown CPU vulnerabilities

2018-01-06

I'm still seeing Solr errors in the DSpace logs even after the full reindex yesterday:

org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.

I posted a message to the dspace-tech mailing list to see if anyone can help

2018-01-09

Advise Sisay about blank lines in some IITA records
Generate a list of author affiliations for Peter to clean up:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4515

2018-01-10

I looked to see what happened to this year's Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:

Moving: 81742 into core statistics-2010
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
        at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        ... 14 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:159)
        at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:179)
        at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
        at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
        at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
        at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
        at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
        at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
        at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
        at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        ... 16 more

DSpace Test has the same error but with creating the 2017 core:

Moving: 2243021 into core statistics-2017
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2243)
        at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.apache.http.client.ClientProtocolException
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 10 more

There is interesting documentation about this on the DSpace Wiki: https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear
I'm looking to see maybe if we're hitting the issues mentioned in DS-2212 that were apparently fixed in DSpace 5.2
I can apparently search for records in the Solr stats core that have an empty owningColl field using this in the Solr admin query: -owningColl:*
On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don't:

$ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound 
  "response":{"numFound":48476327,"start":0,"docs":[
$ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=json&indent=true' | grep numFound
  "response":{"numFound":34879872,"start":0,"docs":[

I tested the dspace stats-util -s process on my local machine and it failed the same way
It doesn't seem to be helpful, but the dspace log shows this:

2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016

Terry Brady has written some notes on the DSpace Wiki about Solr sharing issues: https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues
Uptime Robot said that CGSpace went down at around 9:43 AM
I looked at PostgreSQL's pg_stat_activity table and saw 161 active connections, but no pool errors in the DSpace logs:

$ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
0

The XMLUI logs show quite a bit of activity today:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    951 207.46.13.159
    954 157.55.39.123
   1217 95.108.181.88
   1503 104.196.152.243
   6455 70.36.107.50
  11412 70.36.107.190
  16730 70.36.107.49
  17386 2607:fa98:40:9:26b6:fdff:feff:1c96
  21566 2607:fa98:40:9:26b6:fdff:feff:195d
  45384 2607:fa98:40:9:26b6:fdff:feff:1888

The user agent for the top six or so IPs are all the same:

"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"

whois says they come from Perfect IP
I've never seen those top IPs before, but they have created 50,000 Tomcat sessions today:

$ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                                                                                                                                  
49096

Rather than blocking their IPs, I think I might just add their user agent to the "badbots" zone with Baidu, because they seem to be the only ones using that user agent:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
/537.36" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
   6796 70.36.107.50
  11870 70.36.107.190
  17323 70.36.107.49
  19204 2607:fa98:40:9:26b6:fdff:feff:1c96
  23401 2607:fa98:40:9:26b6:fdff:feff:195d 
  47875 2607:fa98:40:9:26b6:fdff:feff:1888

I added the user agent to nginx's badbots limit req zone but upon testing the config I got an error:

# nginx -t
nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
nginx: configuration file /etc/nginx/nginx.conf test failed

According to nginx docs the bucket size should be a multiple of the CPU's cache alignment, which is 64 for us:

# cat /proc/cpuinfo | grep cache_alignment | head -n1
cache_alignment : 64

On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx
Almost immediately the PostgreSQL connections dropped back down to 40 or so, and UptimeRobot said the site was back up
So that's interesting that we're not out of PostgreSQL connections (current pool maxActive is 300!) but the system is "down" to UptimeRobot and very slow to use
Linode continues to test mitigations for Meltdown and Spectre: https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/
I rebooted DSpace Test to see if the kernel will be updated (currently Linux 4.14.12-x86_64-linode92)... nope.
It looks like Linode will reboot the KVM hosts later this week, though
Udana from WLE asked if we could give him permission to upload CSVs to CGSpace (which would require super admin access)
Citing concerns with metadata quality, I suggested adding him on DSpace Test first
I opened a ticket with Atmire to ask them about DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560

2018-01-11

The PostgreSQL and firewall graphs from this week show clearly the load from the new bot from PerfectIP.net yesterday:

Linode rebooted DSpace Test and CGSpace for their host hypervisor kernel updates
Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat localhost_access_log at the time of my sharding attempt on my test machine:

127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-18YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 447
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 76
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 2137630
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16253
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156

The new core is created but when DSpace attempts to POST to it there is an HTTP 409 error
This is apparently a common Solr error code that means "version conflict": http://yonik.com/solr/optimistic-concurrency/
Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
  21572 70.36.107.50
  30722 70.36.107.190
  34566 70.36.107.49
 101829 2607:fa98:40:9:26b6:fdff:feff:195d
 111535 2607:fa98:40:9:26b6:fdff:feff:1c96
 161797 2607:fa98:40:9:26b6:fdff:feff:1888

Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat's server.xml:

<Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb"
          username="dspace"
          password="dspace"
          initialSize='5'
          maxActive='75'
          maxIdle='15'
          minIdle='5'
          maxWait='5000'
          validationQuery='SELECT 1'
          testOnBorrow='true' />

So theoretically I could name each connection "xmlui" or "dspaceWeb" or something meaningful and it would show up in PostgreSQL's pg_stat_activity table!
This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)
Also, I realized that the db.jndi parameter in dspace.cfg needs to match the name value in your applicaiton's context—not the global one
Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:

db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault

With that it is super easy to see where PostgreSQL connections are coming from in pg_stat_activity

2018-01-12

I'm looking at the DSpace 6.0 Install docs and notice they tweak the number of threads in their Tomcat connector:

<!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
<Connector port="8080"
           maxThreads="150"
           minSpareThreads="25"
           maxSpareThreads="75"
           enableLookups="false"
           redirectPort="8443"
           acceptCount="100"
           connectionTimeout="20000"
           disableUploadTimeout="true"
           URIEncoding="UTF-8"/>

In Tomcat 8.5 the maxThreads defaults to 200 which is probably fine, but tweaking minSpareThreads could be good
I don't see a setting for maxSpareThreads in the docs so that might be an error
Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don't need to specify that manually anymore: https://tomcat.apache.org/tomcat-8.5-doc/config/http.html
Ooh, I just saw the acceptorThreadCount setting (in Tomcat 7 and 8.5):

The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.

That could be very interesting

2018-01-13

Still testing DSpace 6.2 on Tomcat 8.5.24
Catalina errors at Tomcat 8.5 startup:

13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.

I looked in my Tomcat 7.0.82 logs and I don't see anything about DBCP2 errors, so I guess this a Tomcat 8.0.x or 8.5.x thing
DBCP2 appears to be Tomcat 8.0.x and up according to the Tomcat 8.0 migration guide
I have updated our Ansible infrastructure scripts so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)
When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:

13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
 java.lang.ExceptionInInitializerError
        at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
        at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
        at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1839)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:547)
        at org.dspace.core.Context.<clinit>(Context.java:103)
        ... 15 more

Interesting blog post benchmarking Tomcat JDBC vs Apache Commons DBCP2, with configuration snippets: http://www.tugay.biz/2016/07/tomcat-connection-pool-vs-apache.html
The Tomcat vs Apache pool thing is confusing, but apparently we're using Apache Commons DBCP2 because we don't specify factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" in our global resource
So at least I know that I'm not looking for documentation or troubleshooting on the Tomcat JDBC pool!
I looked at pg_stat_activity during Tomcat's startup and I see that the pool created in server.xml is indeed connecting, just that nothing uses it
Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
I'll comment on that issue

2018-01-14

Looking at the authors Peter had corrected
Some had multiple and he's corrected them by adding || in the correction column, but I can't process those this way so I will just have to flag them and do those manually later
Also, I can flag the values that have "DELETE"
Then I need to facet the correction column on isBlank(value) and not flagged

2018-01-15

Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
I'm going to apply these ~130 corrections on CGSpace:

update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';

Continue proofing Peter's author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names

Apply corrections using fix-metadata-values.py:

$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace-u dspace -p 'fuuu'

In looking at some of the values to delete or check I found some metadata values that I could not resolve their handle via SQL:

dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
(1 row)

dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
 handle
--------
(0 rows)

Even searching in the DSpace advanced search for author equals "Tarawali" produces nothing...
Otherwise, the DSpace 5 SQL Helper Functions provide ds5_item2itemhandle(), which is much easier than my long query above that I always have to go search for
For example, to find the Handle for an item that has the author "Erni":

dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id 
-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
(1 row)
dspace=# select ds5_item2itemhandle(70308);
 ds5_item2itemhandle 
---------------------
 10568/68609
(1 row)

Next I apply the author deletions:

$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'

Now working on the affiliation corrections from Peter:

$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'

Now I made a new list of affiliations for Peter to look through:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552

Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
So some submitters don't know to use the controlled vocabulary lookup
Help Sisay with some thumbnails for book chapters in Open Refine and SAFBuilder
CGSpace users were having problems logging in, I think something's wrong with LDAP because I see this in the logs:

2018-01-15 12:53:15,810 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]

Looks like we processed 2.9 million requests on CGSpace in 2017-12:

# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Dec/2017"
2890041

real    0m25.756s
user    0m28.016s
sys     0m2.210s

2018-01-16

Meeting with CGSpace team, a few action items:
- Discuss standardized names for CRPs and centers with ICARDA (don't wait for CG Core)
- Re-send DC rights implementation and forward to everyone so we can move forward with it (without the URI field for now)
- Start looking at where I was with the AGROVOC API
- Have a controlled vocabulary for CGIAR authors' names and ORCIDs? Perhaps values like: Orth, Alan S. (0000-0002-1735-7458)
- Need to find the metadata field name that ICARDA is using for their ORCIDs
- Update text for DSpace version plan on wiki
- Come up with an SLA, something like: In return for your contribution we will, to the best of our ability, ensure 99.5% ("two and a half nines") uptime of CGSpace, ensure data is stored in open formats and safely backed up, follow CG Core metadata standards, ...
- Add Sisay and Danny to Uptime Robot and allow them to restart Tomcat on CGSpace ✔
I removed Tsega's SSH access to the web and DSpace servers, and asked Danny to check whether there is anything he needs from Tsega's home directories so we can delete the accounts completely
I removed Tsega's access to Linode dashboard as well
I ended up creating a Jira issue for my db.jndi documentation fix: DS-3803
The DSpace developers said they wanted each pull request to be associated with a Jira issue

2018-01-17

Abenet asked me to proof and upload 54 records for LIVES
A few records were missing countries (even though they're all from Ethiopia)
Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses
In any case, importing them like this:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &> lives.log

And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload
When I looked there were 210 PostgreSQL connections!
I don't see any high load in XMLUI or REST/OAI:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    381 40.77.167.124
    403 213.55.99.121
    431 207.46.13.60
    445 157.55.39.113
    445 157.55.39.231
    449 95.108.181.88
    453 68.180.229.254
    593 54.91.48.104
    757 104.196.152.243
    776 66.249.66.90
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     11 205.201.132.14
     11 40.77.167.124
     15 35.226.23.240
     16 157.55.39.231
     16 66.249.64.155
     18 66.249.66.90
     22 95.108.181.88
     58 104.196.152.243
   4106 70.32.83.92
   9229 45.5.184.196

But I do see this strange message in the dspace log:

2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8081: The target server failed to respond
2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081

I have NEVER seen this error before, and there is no error before or after that in DSpace's solr.log
Tomcat's catalina.out does show something interesting, though, right at that time:

[====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
[====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
[====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
[====================>                              ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.FixedBitSet.clone(FixedBitSet.java:576)
        at org.apache.solr.search.BitDocSet.andNot(BitDocSet.java:222)
        at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1067)
        at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1557)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
        at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:514)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:485)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.dspace.solr.filters.LocalHostRestrictionFilter.doFilter(LocalHostRestrictionFilter.java:50)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318) 
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

You can see the timestamp above, which is some Atmire nightly task I think, but I can't figure out which one
So I restarted Tomcat and tried the import again, which finished very quickly and without errors!

$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &> lives2.log

Looking at the JVM graphs from Munin it does look like the heap ran out of memory (see the blue dip just before the green spike when I restarted Tomcat):

I'm playing with maven repository caching using Artifactory in a Docker instance: https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker

$ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
$ docker volume create --name artifactory5_data
$ docker network create dspace-build
$ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss:latest

Then configure the local maven to use it in settings.xml with the settings from "Set Me Up": https://www.jfrog.com/confluence/display/RTF/Using+Artifactory
This could be a game changer for testing and running the Docker DSpace image
Wow, I even managed to add the Atmire repository as a remote and map it into the libs-release virtual repository, then tell maven to use it for atmire.com-releases in settings.xml!
Hmm, some maven dependencies for the SWORDv2 web application in DSpace 5.5 are broken:

[ERROR] Failed to execute goal on project dspace-swordv2: Could not resolve dependencies for project org.dspace:dspace-swordv2:war:5.5: Failed to collect dependencies at org.swordapp:sword2-server:jar:classes:1.0 -> org.apache.abdera:abdera-client:jar:1.1.1 -> org.apache.abdera:abdera-core:jar:1.1.1 -> org.apache.abdera:abdera-i18n:jar:1.1.1 -> org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Failed to read artifact descriptor for org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Could not find artifact org.apache.geronimo.specs:specs:pom:1.1 in central (http://localhost:8081/artifactory/libs-release) -> [Help 1]

I never noticed because I build with that web application disabled:

$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package

UptimeRobot said CGSpace went down for a few minutes
I didn't do anything but it came back up on its own
I don't see anything unusual in the XMLUI or REST/OAI logs
Now Linode alert says the CPU load is high, sigh
Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I'm not sure how far these logs go back, as they are not strictly daily):

# zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
/var/log/tomcat7/catalina.out:2
/var/log/tomcat7/catalina.out.10.gz:7
/var/log/tomcat7/catalina.out.11.gz:1
/var/log/tomcat7/catalina.out.12.gz:2
/var/log/tomcat7/catalina.out.15.gz:1
/var/log/tomcat7/catalina.out.17.gz:2
/var/log/tomcat7/catalina.out.18.gz:3
/var/log/tomcat7/catalina.out.20.gz:1
/var/log/tomcat7/catalina.out.21.gz:4
/var/log/tomcat7/catalina.out.25.gz:1
/var/log/tomcat7/catalina.out.28.gz:1
/var/log/tomcat7/catalina.out.2.gz:6
/var/log/tomcat7/catalina.out.30.gz:2
/var/log/tomcat7/catalina.out.31.gz:1
/var/log/tomcat7/catalina.out.34.gz:1
/var/log/tomcat7/catalina.out.38.gz:1
/var/log/tomcat7/catalina.out.39.gz:1
/var/log/tomcat7/catalina.out.4.gz:3
/var/log/tomcat7/catalina.out.6.gz:2
/var/log/tomcat7/catalina.out.7.gz:14

Overall the heap space usage in the munin graph seems ok, though I usually increase it by 512MB over the average a few times per year as usage grows
But maybe I should increase it by more, like 1024MB, to give a bit more head room

2018-01-18

UptimeRobot said CGSpace was down for 1 minute last night
I don't see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499
I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b

Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the Bioversity Journal Articles collection
It's easy enough to do in OpenRefine, but you have to be careful to only get those items that are uploaded into Bioversity's collection, not the ones that are mapped from others!
Use this GREL in OpenRefine after isolating all the Limited Access items: value.startsWith("10568/35501")
UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me or each other!

Jan 18 07:01:22 linode18 sudo[10805]: dhmichael : TTY=pts/5 ; PWD=/home/dhmichael ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
Jan 18 07:01:22 linode18 sudo[10805]: pam_unix(sudo:session): session opened for user root by dhmichael(uid=0)
Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat....
Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for user root by swebshet(uid=0)

I had to cancel the Discovery indexing and I'll have to re-try it another time when the server isn't so busy (it had already taken two hours and wasn't even close to being done)
For now I've increased the Tomcat JVM heap from 5632 to 6144m, to give ~1GB of free memory over the average usage to hopefully account for spikes caused by load or background jobs

2018-01-19

Linode alerted and said that the CPU load was 264.1% on CGSpace
Start the Discovery indexing again:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b

Linode alerted again and said that CGSpace was using 301% CPU
Peter emailed to ask why this item doesn't have an Altmetric badge on CGSpace but does have one on the Altmetric dashboard
Looks like our badge code calls the handle endpoint which doesn't exist:

https://api.altmetric.com/v1/handle/10568/88090

I told Peter we should keep an eye out and try again next week

2018-01-20

Run the authority indexing script on CGSpace and of course it died:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-authority 
Retrieving all data 
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer 
Exception: null
java.lang.NullPointerException
        at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
        at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
        at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
        at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
 
real    7m2.241s
user    1m33.198s
sys     0m12.317s

I tested the abstract cleanups on Bioversity's Journal Articles collection again that I had started a few days ago
In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts
I want to document the workflow of adding a production PostgreSQL database to a development instance of DSpace in Docker:

$ docker exec dspace_db dropdb -U postgres dspace
$ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
$ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace createuser;'
$ docker cp test.dump dspace_db:/tmp/test.dump
$ docker exec dspace_db pg_restore -U postgres -d dspace /tmp/test.dump
$ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace nocreateuser;'
$ docker exec dspace_db vacuumdb -U postgres dspace
$ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db:/tmp
$ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace

2018-01-22

Look over Udana's CSV of 25 WLE records from last week
I sent him some corrections:
- The file encoding is Windows-1252
- There were whitespace issues in the dc.identifier.citation field (spaces at the beginning and end, and multiple spaces in between some words)
- Also, the authors listed in the citation need to be in normal format, separated by commas or colons (however you prefer), not with ||
- There were spaces in the beginning and end of some cg.identifier.doi fields
- Make sure that the cg.coverage.countries field is just countries: ie, no "SOUTH ETHIOPIA" or "EAST AFRICA" (the first should just be ETHIOPIA, the second should be in cg.coverage.region instead)
- The current list of regions we use is here: https://github.com/ilri/DSpace/blob/5_x-prod/dspace/config/input-forms.xml#L5162
- You have a syntax error in your cg.coverage.regions (extra ||)
- The value of dc.identifier.issn should just be the ISSN but you have: eISSN: 1479-487X
I wrote a quick Python script to use the DSpace REST API to find all collections under a given community
The source code is here: rest-find-collections.py
Peter had said that found a bunch of ILRI collections that were called "untitled", but I don't see any:

$ ./rest-find-collections.py 10568/1 | wc -l
308
$ ./rest-find-collections.py 10568/1 | grep -i untitled

Looking at the Tomcat connector docs I think we really need to increase maxThreads
The default is 200, which can easily be taken up by bots considering that Google and Bing each browse with fifty (50) connections each sometimes!
Before I increase this I want to see if I can measure and graph this, and then benchmark
I'll probably also increase minSpareThreads to 20 (its default is 10)
I still want to bump up acceptorThreadCount from 1 to 2 as well, as the documentation says this should be increased on multi-core systems
I spent quite a bit of time looking at jvisualvm and jconsole today
Run system updates on DSpace Test and reboot it
I see I can monitor the number of Tomcat threads and some detailed JVM memory stuff if I install munin-plugins-java
I'd still like to get arbitrary mbeans like activeSessions etc, though
I can't remember if I had to configure the jmx settings in /etc/munin/plugin-conf.d/munin-node or not—I think all I did was re-run the munin-node-configure script and of course enable JMX in Tomcat's JVM options

2018-01-23

Thinking about generating a jmeter test plan for DSpace, along the lines of Georgetown's dspace-performance-test
I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:

# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
56405

Apparently about 28% of these requests were for bitstreams, 30% for the REST API, and 30% for handles:

# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo "^/(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
     38 /oai/
  14406 /bitstream/
  15179 /rest/
  15191 /handle/

And 3% were to the homepage or search:

# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
   1050 /
    413 /discover
    170 /open-search

The last 10% or so seem to be for static assets that would be served by nginx anyways:

# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
      2 .gif
      7 .css
     84 .js
    433 .php
    882 .txt
   2551 .png

I can definitely design a test plan on this!

2018-01-24

Looking at the REST requests, most of them are to expand all or metadata, but 5% are for retrieving bitstreams:

# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -E "^/rest" | grep -Eo "(retrieve|expand=[a-z].*)" | sort | uniq -c | sort -n
      1 expand=collections
     16 expand=all&limit=1
     45 expand=items
    775 retrieve
   5675 expand=all
   8633 expand=metadata

I finished creating the test plan for DSpace Test and ran it from my Linode with:

$ jmeter -n -t DSpacePerfTest-dspacetest.cgiar.org.jmx -l 2018-01-24-1.jtl

Atmire responded to my issue from two weeks ago and said they will start looking into DSpace 5.8 compatibility for CGSpace
I set up a new Arch Linux Linode instance with 8192 MB of RAM and ran the test plan a few times to get a baseline:

# lscpu
# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           4
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:            2
CPU MHz:             2499.994
BogoMIPS:            5001.32
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti retpoline fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
# free -m
              total        used        free      shared  buff/cache   available
Mem:           7970         107        7759           1         103        7771
Swap:           255           0         255
# pacman -Syu
# pacman -S git wget jre8-openjdk-headless mosh htop tmux
# useradd -m test
# su - test
$ git clone -b ilri https://github.com/alanorth/dspace-performance-test.git
$ wget http://www-us.apache.org/dist//jmeter/binaries/apache-jmeter-3.3.tgz
$ tar xf apache-jmeter-3.3.tgz
$ cd apache-jmeter-3.3/bin
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline2.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-24-linode5451120-baseline3.jtl -j ~/dspace-performance-test/2018-01-24-linode5451120-baseline3.log

Then I generated reports for these runs like this:

$ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline

2018-01-25

Run another round of tests on DSpace Test with jmeter after changing Tomcat's minSpareThreads to 20 (default is 10) and acceptorThreadCount to 2 (default is 1):

$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.log

I changed the parameters back to the baseline ones and switched the Tomcat JVM garbage collector to G1GC and re-ran the tests
JVM options for Tomcat changed from -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC to -Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem

$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log

I haven't had time to look at the results yet

2018-01-26

Peter followed up about some of the points from the Skype meeting last week
Regarding the ORCID field issue, I see ICARDA's MELSpace is using cg.creator.ID: 0000-0001-9156-7691
I had floated the idea of using a controlled vocabulary with values formatted something like: Orth, Alan S. (0000-0002-1735-7458)
Update PostgreSQL JDBC driver version from 42.1.4 to 42.2.1 on DSpace Test, see: https://jdbc.postgresql.org/
Reboot DSpace Test to get new Linode kernel (Linux 4.14.14-x86_64-linode94)
I am testing my old work on the dc.rights field, I had added a branch for it a few months ago
I added a list of Creative Commons and other licenses in input-forms.xml
The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn't possible (?)
So I used some creativity and made several fields display values, but not store any, ie:

<pair>
  <displayed-value>For products published by another party:</displayed-value>
  <stored-value></stored-value>
</pair>

I was worried that if a user selected this field for some reason that DSpace would store an empty value, but it simply doesn't register that as a valid option:

I submitted a test item with ORCiDs and dc.rights from a controlled vocabulary on DSpace Test: https://dspacetest.cgiar.org/handle/10568/97703
I will send it to Peter to check and give feedback (ie, about the ORCiD field name as well as allowing users to add ORCiDs manually or not)

2018-01-28

Assist Udana from WLE again to proof his 25 records and upload them to DSpace Test
I am playing with the startStopThreads="0" parameter in Tomcat <Engine> and <Host> configuration
It reduces the start up time of Catalina by using multiple threads to start web applications in parallel
On my local test machine the startup time went from 70 to 30 seconds
See: https://tomcat.apache.org/tomcat-7.0-doc/config/host.html

2018-01-29

CGSpace went down this morning for a few minutes, according to UptimeRobot
Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:

2018-01-29 05:30:22,226 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
Was expecting one of:
    "TO" ...
    <RANGE_QUOTED> ...
    <RANGE_GOOP> ...
    
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
Was expecting one of:
    "TO" ...
    <RANGE_QUOTED> ...
    <RANGE_GOOP> ...

So is this an error caused by this particular client (which happens to be Yahoo! Slurp)?
I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early
Perhaps this from the nginx error log is relevant?

2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: "GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1", upstream: "http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12", host: "cgspace.cgiar.org"

I think that must be unrelated, probably the client closed the request to nginx because DSpace (Tomcat) was taking too long
An interesting snippet to get the maximum and average nginx responses:

# awk '($9 ~ /200/) { i++;sum+=$10;max=$10>max?$10:max; } END { printf("Maximum: %d\nAverage: %d\n",max,i?sum/i:0); }' /var/log/nginx/access.log
Maximum: 2771268
Average: 210483

I guess responses that don't fit in RAM get saved to disk (a default of 1024M), so this is definitely not the issue here, and that warning is totally unrelated
My best guess is that the Solr search error is related somehow but I can't figure it out
We definitely have enough database connections, as I haven't seen a pool error in weeks:

$ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
dspace.log.2018-01-20:0
dspace.log.2018-01-21:0
dspace.log.2018-01-22:0
dspace.log.2018-01-23:0
dspace.log.2018-01-24:0
dspace.log.2018-01-25:0
dspace.log.2018-01-26:0
dspace.log.2018-01-27:0
dspace.log.2018-01-28:0
dspace.log.2018-01-29:0

Adam Hunt from WLE complained that pages take "1-2 minutes" to load each, from France and Sri Lanka
I asked him which particular pages, as right now pages load in 2 or 3 seconds for me
UptimeRobot said CGSpace went down again, and I looked at PostgreSQL and saw 211 active database connections
If it's not memory and it's not database, it's gotta be Tomcat threads, seeing as the default maxThreads is 200 anyways, it actually makes sense
I decided to change the Tomcat thread settings on CGSpace:
- maxThreads from 200 (default) to 400
- processorCache from 200 (default) to 400, recommended to be the same as maxThreads
- minSpareThreads from 10 (default) to 20
- acceptorThreadCount from 1 (default) to 2, recommended to be 2 on multi-CPU systems
Looks like I only enabled the new thread stuff on the connector used internally by Solr, so I probably need to match that by increasing them on the other connector that nginx proxies to
Jesus Christ I need to fucking fix the Munin monitoring so that I can tell how many fucking threads I have running
Wow, so apparently you need to specify which connector to check if you want any of the Munin Tomcat plugins besides "tomcat_jvm" to work (the connector name can be seen in the Catalina logs)
I modified /etc/munin/plugin-conf.d/tomcat to add the connector (with surrounding quotes!) and now the other plugins work (obviously the credentials are incorrect):

[tomcat_*]
    env.host 127.0.0.1
    env.port 8081
    env.connector "http-bio-127.0.0.1-8443"
    env.user munin
    env.password munin

For example, I can see the threads:

# munin-run tomcat_threads
busy.value 0
idle.value 20
max.value 400

Apparently you can't monitor more than one connector, so I guess the most important to monitor would be the one that nginx is sending stuff to
So for now I think I'll just monitor these and skip trying to configure the jmx plugins
Although following the logic of /usr/share/munin/plugins/jmx_tomcat_dbpools could be useful for getting the active Tomcat sessions
From debugging the jmx_tomcat_db_pools script from the munin-plugins-java package, I see that this is how you call arbitrary mbeans:

# port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"  maxActive       300

More notes here: https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx
Looking at the Munin graphs, I that the load is 200% every morning from 03:00 to almost 08:00
Tomcat's catalina.out log file is full of spam from this thing too, with lines like this

[===================>                               ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16

There are millions of these status lines, for example in just this one log file:

# zgrep -c "time remaining" /var/log/tomcat7/catalina.out.1.gz
1084741

I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566

2018-01-31

UptimeRobot says CGSpace went down at 7:57 AM, and indeed I see a lot of HTTP 499 codes in nginx logs
PostgreSQL activity shows 222 database connections
Now PostgreSQL activity shows 265 database connections!
I don't see any errors anywhere...
Now PostgreSQL activity shows 308 connections!
Well this is interesting, there are 400 Tomcat threads busy:

# munin-run tomcat_threads
busy.value 400
idle.value 0
max.value 400

And wow, we finally exhausted the database connections, from dspace.log:

2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].

Now even the nightly Atmire background thing is getting HTTP 500 error:

Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException

For now I will restart Tomcat to clear this shit and bring the site back up
The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     67 66.249.66.70
     70 207.46.13.12
     71 197.210.168.174
     83 207.46.13.13
     85 157.55.39.79
     89 207.46.13.14
    123 68.180.228.157
    198 66.249.66.90
    219 41.204.190.40
    255 2405:204:a208:1e12:132:2a8e:ad28:46c0
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 65.55.210.187
      2 66.249.66.90
      3 157.55.39.79
      4 197.232.39.92
      4 34.216.252.127
      6 104.196.152.243
      6 213.55.85.89
     15 122.52.115.13
     16 213.55.107.186
    596 45.5.184.196

This looks reasonable to me, so I have no idea why we ran out of Tomcat threads

We need to start graphing the Tomcat sessions as well, though that requires JMX
Also, I wonder if I could disable the nightly Atmire thing
God, I don't know where this load is coming from
Since I bumped up the Tomcat threads from 200 to 400 the load on the server has been sustained at about 200% for almost a whole day:

I should make separate database pools for the web applications and the API applications like REST and OAI
Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat's activeSessions from JMX (using munin-plugins-java):

# port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
Catalina:type=Manager,context=/,host=localhost  activeSessions  8

If you connect to Tomcat in jvisualvm it's pretty obvious when you hover over the elements

78 KiB Raw Blame History

2018-01-02

2018-01-03

2018-01-04

2018-01-05

2018-01-06

2018-01-09

2018-01-10

2018-01-11

2018-01-12

2018-01-13

2018-01-14

2018-01-15

2018-01-16

2018-01-17

2018-01-18

2018-01-19

2018-01-20

2018-01-22

2018-01-23

2018-01-24

2018-01-25

2018-01-26

2018-01-28

2018-01-29

2018-01-31

78 KiB

Raw Blame History