cgspace-notes/2017-12.md at 39d8d0876c25280f78bd16e86b3e2015ab1b0d70

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-09 16:45:45 +01:00

They moved to Lyrasis in 2019.

2020-04-13 15:30:34 +03:00

36 KiB

Raw Blame History

title

date

author

2017-12-01

Uptime Robot noticed that CGSpace went down
The logs say "Timeout waiting for idle object"
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:

# cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    763 2.86.122.76
    907 207.46.13.94
   1018 157.55.39.206
   1021 157.55.39.235
   1407 66.249.66.70
   1411 104.196.152.243
   1503 50.116.102.77
   1805 66.249.66.90
   4007 70.32.83.92
   6061 45.5.184.196

The number of DSpace sessions isn't even that high:

$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
5815

Connections in the last two hours:

# cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail                                                      
     78 93.160.60.22
    101 40.77.167.122
    113 66.249.66.70
    129 157.55.39.206
    130 157.55.39.235
    135 40.77.167.58
    164 68.180.229.254
    177 87.100.118.220
    188 66.249.66.90
    314 2.86.122.76

What the fuck is going on?
I've never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:

$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
822

Appears to be some new bot:

2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)"

I restarted Tomcat and everything came back up
I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the useragent in nginx
I will also add 'Drupal' to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots

# cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      3 54.75.205.145
      6 70.32.83.92
     14 2a01:7e00::f03c:91ff:fe18:7396
     46 2001:4b99:1:1:216:3eff:fe2c:dc6c
    319 2001:4b99:1:1:216:3eff:fe76:205b

2017-12-03

Linode alerted that CGSpace's load was 327.5% from 6 to 8 AM again

2017-12-04

Linode alerted that CGSpace's load was 255.5% from 8 to 10 AM again
I looked at the Munin stats on DSpace Test (linode02) again to see how the PostgreSQL tweaks from a few weeks ago were holding up:

The results look fantastic! So the random_page_cost tweak is massively important for informing the PostgreSQL scheduler that there is no "cost" to accessing random pages, as we're on an SSD!
I guess we could probably even reduce the PostgreSQL connections in DSpace / PostgreSQL after using this
Run system updates on DSpace Test (linode02) and reboot it
I'm going to enable the PostgreSQL random_page_cost tweak on CGSpace
For reference, here is the past month's connections:

2017-12-05

Linode alerted again that the CPU usage on CGSpace was high this morning from 8 to 10 AM
CORE updated the entry for CGSpace on their index: https://core.ac.uk/search?q=repositories.id:(1016)&fullTextOnly=false
Linode alerted again that the CPU usage on CGSpace was high this evening from 8 to 10 PM

2017-12-06

Linode alerted again that the CPU usage on CGSpace was high this morning from 6 to 8 AM
Uptime Robot alerted that the server went down and up around 8:53 this morning
Uptime Robot alerted that CGSpace was down and up again a few minutes later
I don't see any errors in the DSpace logs but I see in nginx's access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)
Looking at the REST API logs I see some new client IP I haven't noticed before:

# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     18 95.108.181.88
     19 68.180.229.254
     30 207.46.13.151
     33 207.46.13.110
     38 40.77.167.20
     41 157.55.39.223
     82 104.196.152.243
   1529 50.116.102.77
   4005 70.32.83.92
   6045 45.5.184.196

50.116.102.77 is apparently in the US on websitewelcome.com

2017-12-07

Uptime Robot reported a few times today that CGSpace was down and then up
At one point Tsega restarted Tomcat
I never got any alerts about high load from Linode though...
I looked just now and see that there are 121 PostgreSQL connections!
The top users right now are:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
    838 40.77.167.11
    939 66.249.66.223
   1149 66.249.66.206
   1316 207.46.13.110
   1322 207.46.13.151
   1323 2001:da8:203:2224:c912:1106:d94f:9189
   1414 157.55.39.223
   2378 104.196.152.243
   2662 66.249.66.219
   5110 124.17.34.60

We've never seen 124.17.34.60 yet, but it's really hammering us!
Apparently it is from China, and here is one of its user agents:

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)

It is responsible for 4,500 Tomcat sessions today alone:

$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
4574

I've adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it's the same bot on the same subnet
I was running the DSpace cleanup task manually and it hit an error:

$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
...
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle".

The solution is like I discovered in 2017-04, to set the primary_bitstream_id to null:

dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
UPDATE 1

2017-12-13

Linode alerted that CGSpace was using high CPU from 10:13 to 12:13 this morning

2017-12-16

Re-work the XMLUI base theme to allow child themes to override the header logo's image and link destination: #349
This required a little bit of work to restructure the XSL templates
Optimize PNG and SVG image assets in the CGIAR base theme using pngquant and svgo: #350

2017-12-17

Reboot DSpace Test to get new Linode Linux kernel
Looking at CCAFS bulk import for Magdalena Haman (she originally sent them in November but some of the thumbnails were missing and dates were messed up so she resent them now)
A few issues with the data and thumbnails:
- Her thumbnail files all use capital JPG so I had to rename them to lowercase: rename -fc *.JPG
- thumbnail20.jpg is 1.7MB so I have to resize it
- I also had to add the .jpg to the thumbnail string in the CSV
- The thumbnail11.jpg is missing
- The dates are in super long ISO8601 format (from Excel?) like 2016-02-07T00:00:00Z so I converted them to simpler forms in GREL: value.toString("yyyy-MM-dd")
- I trimmed the whitespaces in a few fields but it wasn't many
- Rename her thumbnail column to filename, and format it so SAFBuilder adds the files to the thumbnail bundle with this GREL in OpenRefine: value + "__bundle:THUMBNAIL"
- Rename dc.identifier.status and dc.identifier.url columns to cg.identifier.status and cg.identifier.url
- Item 4 has weird characters in citation, ie: Nagoya et de Trait
- Some author names need normalization, ie: Aggarwal, Pramod and Aggarwal, Pramod K.
- Something weird going on with duplicate authors that have the same text value, like Berto, Jayson C. and Balmeo, Katherine P.
- I will send her feedback on some author names like UNEP and ICRISAT and ask her for the missing thumbnail11.jpg
I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the collection field)

$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log

It's the same on DSpace Test, I can't import the SAF bundle without specifying the collection:

$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
No collections given. Assuming 'collections' file inside item directory
Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
Generating mapfile: /tmp/ccafs.map
Processing collections file: collections
Adding item from directory item_1
java.lang.NullPointerException
        at org.dspace.app.itemimport.ItemImport.addItem(ItemImport.java:865)
        at org.dspace.app.itemimport.ItemImport.addItems(ItemImport.java:736)
        at org.dspace.app.itemimport.ItemImport.main(ItemImport.java:498)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
java.lang.NullPointerException
Started: 1513521856014
Ended: 1513521858573
Elapsed time: 2 secs (2559 msecs)

I even tried to debug it by adding verbose logging to the JAVA_OPTS:

-Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true

... but the error message was the same, just with more INFO noise around it
For now I'll import into a collection in DSpace Test but I'm really not sure what's up with this!
Linode alerted that CGSpace was using high CPU from 4 to 6 PM
The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    671 66.249.66.70
    885 95.108.181.88
    904 157.55.39.96
    923 157.55.39.179
   1159 207.46.13.107
   1184 104.196.152.243
   1230 66.249.66.91
   1414 68.180.229.254
   4137 66.249.66.90
  46401 137.108.70.7

And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:

# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     33 68.180.229.254
     48 157.55.39.96
     51 157.55.39.179
     56 207.46.13.107
    102 104.196.152.243
    102 66.249.66.90
    691 137.108.70.7
   1531 50.116.102.77
   4014 70.32.83.92
  11030 45.5.184.196

That's probably ok, as I don't think the REST API connections use up a Tomcat session...
CIP emailed a few days ago to ask about unique IDs for authors and organizations, and if we can provide them via an API
Regarding the import issue above it seems to be a known issue that has a patch in DSpace 5.7:
- https://jira.duraspace.org/browse/DS-2633
- https://jira.duraspace.org/browse/DS-3583
We're on DSpace 5.5 but there is a one-word fix to the addItem() function here: https://github.com/DSpace/DSpace/pull/1731
I will apply it on our branch but I need to make a note to NOT cherry-pick it when I rebase on to the latest 5.x upstream later
Pull request: #351

2017-12-18

Linode alerted this morning that there was high outbound traffic from 6 to 8 AM
The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    190 207.46.13.146
    191 197.210.168.174
    202 86.101.203.216
    268 157.55.39.134
    297 66.249.66.91
    314 213.55.99.121
    402 66.249.66.90
    532 68.180.229.254
    644 104.196.152.243
  32220 137.108.70.7

On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:

# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      7 104.198.9.108
      8 185.29.8.111
      8 40.77.167.176
      9 66.249.66.91
      9 68.180.229.254
     10 157.55.39.134
     15 66.249.66.90
     59 104.196.152.243
   4014 70.32.83.92
   8619 45.5.184.196

I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
Update text on CGSpace about page to give some tips to developers about using the resources more wisely (#352)
Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM
The REST and OAI API logs look pretty much the same as earlier this morning, but there's a new IP harvesting XMLUI:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
    360 95.108.181.88
    477 66.249.66.90
    526 86.101.203.216
    691 207.46.13.13
    698 197.210.168.174
    819 207.46.13.146
    878 68.180.229.254
   1965 104.196.152.243
  17701 2.86.72.181
  52532 137.108.70.7

2.86.72.181 appears to be from Greece, and has the following user agent:

Mozilla/3.0 (compatible; Indy Library)

Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:

$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
1

I guess there's nothing I can do to them for now
In other news, I am curious how many PostgreSQL connection pool errors we've had in the last month:

$ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
dspace.log.2017-11-07:15695
dspace.log.2017-11-08:135
dspace.log.2017-11-17:1298
dspace.log.2017-11-26:4160
dspace.log.2017-11-28:107
dspace.log.2017-11-29:3972
dspace.log.2017-12-01:1601
dspace.log.2017-12-02:1274
dspace.log.2017-12-07:2769

I made a small fix to my move-collections.sh script so that it handles the case when a "to" or "from" community doesn't exist
The script lives here: https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515
Major reorganization of four of CTA's French collections
Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities
Move collection 10568/51821 from 10568/42212 to 10568/42211
Move collection 10568/51400 from 10568/42214 to 10568/42211
Move collection 10568/56992 from 10568/42216 to 10568/42211
Move collection 10568/42218 from 10568/42217 to 10568/42211
Export CSV of collection 10568/63484 and move items to collection 10568/51400
Export CSV of collection 10568/64403 and move items to collection 10568/56992
Export CSV of collection 10568/56994 and move items to collection 10568/42218
There are blank lines in this metadata, which causes DSpace to not detect changes in the CSVs
I had to use OpenRefine to remove all columns from the CSV except id and collection, and then update the collection field for the new mappings
Remove empty sub-communities: 10568/42212, 10568/42214, 10568/42216, 10568/42217
I was in the middle of applying the metadata imports on CGSpace and the system ran out of PostgreSQL connections...
There were 128 PostgreSQL connections at the time... grrrr.
So I restarted Tomcat 7 and restarted the imports
I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:

$ dspace index-discovery -r 10568/42211
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery

The PostgreSQL issues are getting out of control, I need to figure out how to enable connection pools in Tomcat!

2017-12-19

Briefly had PostgreSQL connection issues on CGSpace for the millionth time
I'm fucking sick of this!
The connection graph on CGSpace shows shit tons of connections idle

And I only now just realized that DSpace's db.maxidle parameter is not seconds, but number of idle connections to allow.
So theoretically, because each webapp has its own pool, this could be 20 per app—so no wonder we have 50 idle connections!
I notice that this number will be set to 10 by default in DSpace 6.1 and 7.0: https://jira.duraspace.org/browse/DS-3564
So I'm going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI
I re-deployed the 5_x-prod branch on CGSpace, applied all system updates, and restarted the server
Looking through the dspace.log I see this error:

2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock

I don't have time now to look into this but the Solr sharding has long been an issue!
Looking into using JDBC / JNDI to provide a database pool to DSpace
The DSpace 6.x configuration docs have more notes about setting up the database pool than the 5.x ones (which actually have none!)
First, I uncomment db.jndi in dspace/config/dspace.cfg
Then I create a global Resource in the main Tomcat server.xml (inside GlobalNamingResources):

<Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
	  driverClassName="org.postgresql.Driver"
	  url="jdbc:postgresql://localhost:5432/dspace"
	  username="dspace"
	  password="dspace"
      initialSize='5'
      maxActive='50'
      maxIdle='15'
      minIdle='5'
      maxWait='5000'
      validationQuery='SELECT 1'
      testOnBorrow='true' />

Most of the parameters are from comments by Mark Wood about his JNDI setup: https://jira.duraspace.org/browse/DS-3564
Then I add a ResourceLink to each web application context:

<ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>

I am not sure why several guides show configuration snippets for server.xml and web application contexts that use a Local and Global jdbc...
When DSpace can't find the JNDI context (for whatever reason) you will see this in the dspace logs:

2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
        at org.apache.naming.NamingContext.lookup(NamingContext.java:825)
        at org.apache.naming.NamingContext.lookup(NamingContext.java:173)
        at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1414)
        at org.dspace.storage.rdbms.DatabaseManager.initialize(DatabaseManager.java:1331)
        at org.dspace.storage.rdbms.DatabaseManager.getDataSource(DatabaseManager.java:648)
        at org.dspace.storage.rdbms.DatabaseManager.getConnection(DatabaseManager.java:627)
        at org.dspace.core.Context.init(Context.java:121)
        at org.dspace.core.Context.<init>(Context.java:95)
        at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:79)
        at org.dspace.app.util.DSpaceContextListener.contextInitialized(DSpaceContextListener.java:128)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5110)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5633)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:1015)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:991)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:652)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:712)
        at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:2002)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2017-12-19 13:12:08,798 INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
2017-12-19 13:12:08,798 INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool

And indeed the Catalina logs show that it failed to set up the JDBC driver:

org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'

There are several copies of the PostgreSQL driver installed by DSpace:

$ find ~/dspace/ -iname "postgresql*jdbc*.jar"
/Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/rest/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/lib/postgresql-9.1-901-1.jdbc4.jar

These apparently come from the main DSpace pom.xml:

<dependency>
   <groupId>postgresql</groupId>
   <artifactId>postgresql</artifactId>
   <version>9.1-901-1.jdbc4</version>
</dependency>

So WTF? Let's try copying one to Tomcat's lib folder and restarting Tomcat:

$ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib

Oh that's fantastic, now at least Tomcat doesn't print an error during startup so I guess it succeeds to create the JNDI pool
DSpace starts up but I have no idea if it's using the JNDI configuration because I see this in the logs:

2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
2017-12-19 13:26:54,277 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
2017-12-19 13:26:54,293 INFO  org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
2017-12-19 13:26:54,306 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)

Let's try again, but this time explicitly blank the PostgreSQL connection parameters in dspace.cfg and see if DSpace starts...
Wow, ok, that works, but having to copy the PostgreSQL JDBC JAR to Tomcat's lib folder totally blows
Also, it's likely this is only a problem on my local macOS + Tomcat test environment
Ubuntu's Tomcat distribution will probably handle this differently
So for reference I have:
- a <Resource> defined globally in server.xml
- a <ResourceLink> defined in each web application's context XML
- unset the db.url, db.username, and db.password parameters in dspace.cfg
- set the db.jndi in dspace.cfg to the name specified in the web application context
After adding the Resource to server.xml on Ubuntu I get this in Catalina's logs:

SEVERE: Unable to create initial connections of pool.
java.sql.SQLException: org.postgresql.Driver
...
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver

The username and password are correct, but maybe I need to copy the fucking lib there too?
I tried installing Ubuntu's libpostgresql-jdbc-java package but Tomcat still can't find the class
Let me try to symlink the lib into Tomcat's libs:

# ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib

Now Tomcat starts but the localhost container has errors:

SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract

Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace's are 9.1...
Let me try to remove it and copy in DSpace's:

# rm /usr/share/tomcat7/lib/postgresql.jar
# cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/

Wow, I think that actually works...
I wonder if I could get the JDBC driver from postgresql.org instead of relying on the one from the DSpace build: https://jdbc.postgresql.org/
I notice our version is 9.1-901, which isn't even available anymore! The latest in the archived versions is 9.1-903
Also, since I commented out all the db parameters in DSpace.cfg, how does the command line dspace tool work?
Let's try the upstream JDBC driver first:

# rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
# wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar

DSpace command line fails unless db settings are present in dspace.cfg:

$ dspace database info
Caught exception:
java.sql.SQLException: java.lang.ClassNotFoundException: 
        at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:171)
        at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1438)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: java.lang.ClassNotFoundException: 
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:41)
        ... 8 more

And in the logs:

2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file:  java.naming.factory.initial
        at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
        at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
        at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:350)
        at javax.naming.InitialContext.lookup(InitialContext.java:417)
        at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1413)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
2017-12-19 18:26:56,983 INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
2017-12-19 18:26:56,983 INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
2017-12-19 18:26:56,992 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxconnections
2017-12-19 18:26:56,992 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxwait
2017-12-19 18:26:56,993 WARN  org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxidle

If I add the db values back to dspace.cfg the dspace database info command succeeds but the log still shows errors retrieving the JNDI connection
Perhaps something to report to the dspace-tech mailing list when I finally send my comments
Oh cool! select * from pg_stat_activity shows "PostgreSQL JDBC Driver" for the application name! That's how you know it's working!
If you monitor the pg_stat_activity while you run dspace database info you can see that it doesn't use the JNDI and creates ~9 extra PostgreSQL connections!
And in the middle of all of this Linode sends an alert that CGSpace has high CPU usage from 2 to 4 PM

2017-12-20

The database connection pooling is definitely better!

Now there is only one set of idle connections shared among all the web applications, instead of 10+ per application
There are short bursts of connections up to 10, but it generally stays around 5
Test and import 13 records to CGSpace for Abenet:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log

The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.
Since I had to restart Tomcat anyways, I decided to just deploy the new JNDI connection pooling stuff on CGSpace
There was an initial connection storm of 50 PostgreSQL connections, but then it settled down to 7
After that CGSpace came up fine and I was able to import the 13 items just fine:

$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287

The final code for the JNDI work in the Ansible infrastructure scripts is here: 1959d9cb7a

2017-12-24

Linode alerted that CGSpace was using high CPU this morning around 6 AM
I'm playing with reading all of a month's nginx logs into goaccess:

# find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -

I can see interesting things using this approach, for example:
- 50.116.102.77 checked our status almost 40,000 times so far this month—I think it's the CGNet uptime tool
- Also, we've handled 2.9 million requests this month from 172,000 unique IP addresses!
- Total bandwidth so far this month is 640GiB
- The user that made the most requests so far this month is 45.5.184.196 (267,000 requests)

2017-12-25

The PostgreSQL connection pooling is much better when using the Tomcat JNDI pool
Here are the Munin stats for the past week on CGSpace:

2017-12-29

Looking at some old notes for metadata to clean up, I found a few hundred corrections in cg.fulltextstatus and dc.language.iso:

# update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
DELETE 17
# update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
UPDATE 49
# update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
UPDATE 4
# update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
UPDATE 16
# update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
UPDATE 9
# update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
UPDATE 1
# update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
DELETE 20

I need to figure out why we have records with language in because that's not a language!

2017-12-30

Linode alerted that CGSpace was using 259% CPU from 4 to 6 AM
Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM
Here's the XMLUI logs:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    637 207.46.13.106
    641 157.55.39.186
    715 68.180.229.254
    924 104.196.152.243
   1012 66.249.64.95
   1060 216.244.66.245
   1120 54.175.208.220
   1287 66.249.64.93
   1586 66.249.64.78
   3653 66.249.64.91

Looks pretty normal actually, but I don't know who 54.175.208.220 is
They identify as "com.plumanalytics", which Google says is associated with Elsevier
They only seem to have used one Tomcat session so that's good, I guess I don't need to add them to the Tomcat Crawler Session Manager valve:

$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l          
1

216.244.66.245 seems to be moz.com's DotBot

2017-12-31

I finished working on the 42 records for CCAFS after Magdalena sent the remaining corrections
After that I uploaded them to CGSpace:

$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &> ccafs.log

36 KiB Raw Blame History Unescape Escape

2017-12-01

2017-12-03

2017-12-04

2017-12-05

2017-12-06

2017-12-07

2017-12-13

2017-12-16

2017-12-17

2017-12-18

2017-12-19

2017-12-20

2017-12-24

2017-12-25

2017-12-29

2017-12-30

2017-12-31

36 KiB

Raw Blame History