The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
And just before that I see this:
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
Ah hah! So the pool was actually empty!
I need to increase that, let’s try to bump it up from 50 to 75
After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered ""]""] "" at line 1, column 32.
And there are many of these errors every day for the past month:
$ grep -c "Error while searching for sidebar facets" dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
And just before that I see this:
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
Ah hah! So the pool was actually empty!
I need to increase that, let’s try to bump it up from 50 to 75
After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered ""]""] "" at line 1, column 32.
And there are many of these errors every day for the past month:
$ grep -c "Error while searching for sidebar facets" dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
<li>The nginx logs show HTTP 200s until <code>02/Jan/2018:11:27:17 +0000</code> when Uptime Robot got an HTTP 500</li>
<li>In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”</li>
<li>And just before that I see this:</li>
</ul>
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre>
<ul>
<li>Ah hah! So the pool was actually empty!</li>
<li>I need to increase that, let’s try to bump it up from 50 to 75</li>
<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered ""]""] "" at line 1, column 32.
</code></pre>
<ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre><code>$ grep -c "Error while searching for sidebar facets" dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
</code></pre>
<ul>
<li>Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains</li>
<li>I woke up to more up and down of CGSpace, this time UptimeRobot noticed a few rounds of up and down of a few minutes each and Linode also notified of high CPU load from 12 to 2 PM</li>
<li>Looks like I need to increase the database pool size again:</li>
</ul>
<pre><code>$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
</code></pre>
<ul>
<li>For some reason there were a lot of “active” connections last night:</li>
<li>I have no idea what these are but they seem to be coming from Amazon…</li>
<li>I guess for now I just have to increase the database connection pool’s max active</li>
<li>It’s currently 75 and normally I’d just bump it by 25 but let me be a bit daring and push it by 50 to 125, because I used to see at least 121 connections in pg_stat_activity before when we were using the shitty default pooling</li>
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
</code></pre>
<ul>
<li>So for this week that is the number one problem!</li>
</ul>
<pre><code>$ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
dspace.log.2018-01-04:1559
</code></pre>
<ul>
<li>I will just bump the connection limit to 300 because I’m fucking fed up with this shit</li>
<li>Once I get back to Amman I will have to try to create different database pools for different web applications, like recently discussed on the dspace-tech mailing list</li>
<li>Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space</li>
<li>I had a look and there is one Apache 2 log file that is 73GB, with lots of this:</li>
</ul>
<pre><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
</code></pre>
<ul>
<li>I will delete the log file for now and tell Danny</li>
<li>Also, I’m still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is</li>
<li>Reboot CGSpace and DSpace Test for new kernels (4.14.12-x86_64-linode92) that partially mitigate the <ahref="https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/">Spectre and Meltdown CPU vulnerabilities</a></li>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request failed.
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
... 10 more
</code></pre>
<ul>
<li>There is interesting documentation about this on the DSpace Wiki: <ahref="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear">https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-SolrShardingByYear</a></li>
<li>I’m looking to see maybe if we’re hitting the issues mentioned in <ahref="https://jira.duraspace.org/browse/DS-2212">DS-2212</a> that were apparently fixed in DSpace 5.2</li>
<li>I can apparently search for records in the Solr stats core that have an empty <code>owningColl</code> field using this in the Solr admin query: <code>-owningColl:*</code></li>
<li>On CGSpace I see 48,000,000 records that have an <code>owningColl</code> field and 34,000,000 that don’t:</li>
<li>I tested the <code>dspace stats-util -s</code> process on my local machine and it failed the same way</li>
<li>It doesn’t seem to be helpful, but the dspace log shows this:</li>
</ul>
<pre><code>2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
</code></pre>
<ul>
<li>Terry Brady has written some notes on the DSpace Wiki about Solr sharing issues: <ahref="https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues">https://wiki.duraspace.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues</a></li>
<li>Uptime Robot said that CGSpace went down at around 9:43 AM</li>
<li>I looked at PostgreSQL’s <code>pg_stat_activity</code> table and saw 161 active connections, but no pool errors in the DSpace logs:</li>
</ul>
<pre><code>$ grep -c "Timeout: Pool empty." dspace.log.2018-01-10
0
</code></pre>
<ul>
<li>The XMLUI logs show quite a bit of activity today:</li>
<li>Rather than blocking their IPs, I think I might just add their user agent to the “badbots” zone with Baidu, because they seem to be the only ones using that user agent:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
<li>I added the user agent to nginx’s badbots limit req zone but upon testing the config I got an error:</li>
</ul>
<pre><code># nginx -t
nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
nginx: configuration file /etc/nginx/nginx.conf test failed
</code></pre>
<ul>
<li>According to nginx docs the <ahref="https://nginx.org/en/docs/hash.html">bucket size should be a multiple of the CPU’s cache alignment</a>, which is 64 for us:</li>
</ul>
<pre><code># cat /proc/cpuinfo | grep cache_alignment | head -n1
cache_alignment : 64
</code></pre>
<ul>
<li>On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx</li>
<li>Almost immediately the PostgreSQL connections dropped back down to 40 or so, and UptimeRobot said the site was back up</li>
<li>So that’s interesting that we’re not out of PostgreSQL connections (current pool maxActive is 300!) but the system is “down” to UptimeRobot and very slow to use</li>
<li>Linode continues to test mitigations for Meltdown and Spectre: <ahref="https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/">https://blog.linode.com/2018/01/03/cpu-vulnerabilities-meltdown-spectre/</a></li>
<li>I rebooted DSpace Test to see if the kernel will be updated (currently Linux 4.14.12-x86_64-linode92)… nope.</li>
<li>It looks like Linode will reboot the KVM hosts later this week, though</li>
<li>I opened a ticket with Atmire to ask them about DSpace 5.8 compatibility: <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560</a></li>
</ul>
<h2id="2018-01-11">2018-01-11</h2>
<ul>
<li>The PostgreSQL and firewall graphs from this week show clearly the load from the new bot from PerfectIP.net yesterday:</li>
<li>Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat <code>localhost_access_log</code> at the time of my sharding attempt on my test machine:</li>
<li>The new core is created but when DSpace attempts to POST to it there is an HTTP 409 error</li>
<li>This is apparently a common Solr error code that means “version conflict”: <ahref="http://yonik.com/solr/optimistic-concurrency/">http://yonik.com/solr/optimistic-concurrency/</a></li>
<li>Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:</li>
<li>So theoretically I could name each connection “xmlui” or “dspaceWeb” or something meaningful and it would show up in PostgreSQL’s <code>pg_stat_activity</code> table!</li>
<li>This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)</li>
<li>Also, I realized that the <code>db.jndi</code> parameter in dspace.cfg needs to match the <code>name</code> value in your applicaiton’s context—not the <code>global</code> one</li>
<li>Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:</li>
<li>I’m looking at the <ahref="https://wiki.duraspace.org/display/DSDOC6x/Installing+DSpace#InstallingDSpace-ServletEngine(ApacheTomcat7orlater,Jetty,CauchoResinorequivalent)">DSpace 6.0 Install docs</a> and notice they tweak the number of threads in their Tomcat connector:</li>
</ul>
<pre><code><!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
<Connector port="8080"
maxThreads="150"
minSpareThreads="25"
maxSpareThreads="75"
enableLookups="false"
redirectPort="8443"
acceptCount="100"
connectionTimeout="20000"
disableUploadTimeout="true"
URIEncoding="UTF-8"/>
</code></pre>
<ul>
<li>In Tomcat 8.5 the <code>maxThreads</code> defaults to 200 which is probably fine, but tweaking <code>minSpareThreads</code> could be good</li>
<li>I don’t see a setting for <code>maxSpareThreads</code> in the docs so that might be an error</li>
<li>Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don’t need to specify that manually anymore: <ahref="https://tomcat.apache.org/tomcat-8.5-doc/config/http.html">https://tomcat.apache.org/tomcat-8.5-doc/config/http.html</a></li>
<li></li>
<li>Ooh, I just the <code>acceptorThreadCount</code> setting (in Tomcat 7 and 8.5):</li>
</ul>
<pre><code>The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
<li>Still testing DSpace 6.2 on Tomcat 8.5.24</li>
<li>Catalina errors at Tomcat 8.5 startup:</li>
</ul>
<pre><code>13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
</code></pre>
<ul>
<li>I looked in my Tomcat 7.0.82 logs and I don’t see anything about DBCP2 errors, so I guess this a Tomcat 8.0.x or 8.5.x thing</li>
<li>DBCP2 appears to be Tomcat 8.0.x and up according to the <ahref="https://tomcat.apache.org/migration-8.html">Tomcat 8.0 migration guide</a></li>
<li>I have updated our <ahref="https://github.com/ilri/rmg-ansible-public/commit/246f9d7b06d53794f189f0cc57ad5ddd80f0b014">Ansible infrastructure scripts</a> so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)</li>
<li>When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:</li>
</ul>
<pre><code>13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
java.lang.ExceptionInInitializerError
at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1839)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:547)
at org.dspace.core.Context.<clinit>(Context.java:103)
... 15 more
</code></pre>
<ul>
<li>Interesting blog post benchmarking Tomcat JDBC vs Apache Commons DBCP2, with configuration snippets: <ahref="http://www.tugay.biz/2016/07/tomcat-connection-pool-vs-apache.html">http://www.tugay.biz/2016/07/tomcat-connection-pool-vs-apache.html</a></li>
<li>The Tomcat vs Apache pool thing is confusing, but apparently we’re using Apache Commons DBCP2 because we don’t specify <code>factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"</code> in our global resource</li>
<li>So at least I know that I’m not looking for documentation or troubleshooting on the Tomcat JDBC pool!</li>
<li>I looked at <code>pg_stat_activity</code> during Tomcat’s startup and I see that the pool created in server.xml is indeed connecting, just that nothing uses it</li>
<li>Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used</li>
<li>Shit, this might actually be a DSpace error: <ahref="https://jira.duraspace.org/browse/DS-3434">https://jira.duraspace.org/browse/DS-3434</a></li>
<li>Looking at the authors Peter had corrected</li>
<li>Some had multiple and he’s corrected them by adding <code>||</code> in the correction column, but I can’t process those this way so I will just have to flag them and do those manually later</li>
<li>Also, I can flag the values that have “DELETE”</li>
<li>Then I need to facet the correction column on isBlank(value) and not flagged</li>
</ul>
<h2id="2018-01-15">2018-01-15</h2>
<ul>
<li>Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload</li>
<li>I’m going to apply these ~130 corrections on CGSpace:</li>
</ul>
<pre><code>update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
</code></pre>
<ul>
<li>Continue proofing Peter’s author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names</li>
dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
handle
--------
(0 rows)
</code></pre>
<ul>
<li>Even searching in the DSpace advanced search for author equals “Tarawali” produces nothing…</li>
<li>Otherwise, the <ahref="https://wiki.duraspace.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL Helper Functions</a> provide <code>ds5_item2itemhandle()</code>, which is much easier than my long query above that I always have to go search for</li>
<li>For example, to find the Handle for an item that has the author “Erni”:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
<li>Now I made a new list of affiliations for Peter to look through:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552
</code></pre>
<ul>
<li>Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)</li>
<li>For example, this one is from just last month: <ahref="https://cgspace.cgiar.org/handle/10568/89930">https://cgspace.cgiar.org/handle/10568/89930</a></li>
<li>Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture</li>
<li>So some submitters don’t know to use the controlled vocabulary lookup</li>