<li>I restarted Tomcat and everything came back up</li>
<li>I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the user agent in nginx</li>
<li>I will also add ‘Drupal’ to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots</li>
<li>The results look fantastic! So the <code>random_page_cost</code> tweak is massively important for informing the PostgreSQL query planner that there is little extra “cost” to accessing random pages, as we’re on an SSD!</li>
<li>I guess we could probably even reduce the PostgreSQL connections in DSpace / PostgreSQL after using this</li>
<li>Linode alerted again that the CPU usage on CGSpace was high this morning from 8 to 10 AM</li>
<li>CORE updated the entry for CGSpace on their index: <a href="https://core.ac.uk/search?q=repositories.id:(1016)&fullTextOnly=false">https://core.ac.uk/search?q=repositories.id:(1016)&fullTextOnly=false</a></li>
<li>Uptime Robot alerted that the server went down and up around 8:53 this morning</li>
<li>Uptime Robot alerted that CGSpace was down and up again a few minutes later</li>
<li>I don’t see any errors in the DSpace logs, but nginx’s access.log shows that UptimeRobot’s request ended with an HTTP 499 status (Client Closed Request)</li>
<li>I’ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it’s the same bot on the same subnet</li>
<li>Re-work the XMLUI base theme to allow child themes to override the header logo’s image and link destination: <a href="https://github.com/ilri/DSpace/pull/349">#349</a></li>
<li>This required a little bit of work to restructure the XSL templates</li>
<li>Optimize PNG and SVG image assets in the CGIAR base theme using pngquant and svgo: <a href="https://github.com/ilri/DSpace/pull/350">#350</a></li>
<li>Looking at CCAFS bulk import for Magdalena Haman (she originally sent them in November but some of the thumbnails were missing and dates were messed up so she resent them now)</li>
<li>A few issues with the data and thumbnails:
<ul>
<li>Her thumbnail files all use capital JPG so I had to rename them to lowercase: <code>rename -fc *.JPG</code></li>
<li>thumbnail20.jpg is 1.7MB so I have to resize it</li>
<li>I also had to add the .jpg to the thumbnail string in the CSV</li>
<li>The thumbnail11.jpg is missing</li>
<li>The dates are in super long ISO8601 format (from Excel?) like <code>2016-02-07T00:00:00Z</code> so I converted them to simpler forms in GREL: <code>value.toString("yyyy-MM-dd")</code></li>
<li>I trimmed whitespace in a few fields, but there wasn’t much</li>
<li>Rename her thumbnail column to filename, and format it so SAFBuilder adds the files to the thumbnail bundle with this GREL in OpenRefine: <code>value + "__bundle:THUMBNAIL"</code></li>
<li>Rename dc.identifier.status and dc.identifier.url columns to cg.identifier.status and cg.identifier.url</li>
<li>Item 4 has weird characters in the citation, e.g.: Nagoya et de Trait</li>
<li>Some author names need normalization, e.g.: <code>Aggarwal, Pramod</code> and <code>Aggarwal, Pramod K.</code></li>
<li>Something weird going on with duplicate authors that have the same text value, like <code>Berto, Jayson C.</code> and <code>Balmeo, Katherine P.</code></li>
<li>I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the <code>collection</code> field)</li>
<li>We’re on DSpace 5.5 but there is a one-word fix to the addItem() function here: <a href="https://github.com/DSpace/DSpace/pull/1731">https://github.com/DSpace/DSpace/pull/1731</a></li>
<li>I will apply it on our branch but I need to make a note to NOT cherry-pick it when I rebase on to the latest 5.x upstream later</li>
<li>On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:</li>
<li>I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: <a href="https://jira.duraspace.org/browse/DS-3551">https://jira.duraspace.org/browse/DS-3551</a></li>
<li>Update text on CGSpace about page to give some tips to developers about using the resources more wisely (<a href="https://github.com/ilri/DSpace/pull/352">#352</a>)</li>
<li>I made a small fix to my <code>move-collections.sh</code> script so that it handles the case when a “to” or “from” community doesn’t exist</li>
<li>Major reorganization of four of CTA’s French collections</li>
<li>Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities</li>
<li>Move collection <sup>10568</sup>⁄<sub>51821</sub> from <sup>10568</sup>⁄<sub>42212</sub> to <sup>10568</sup>⁄<sub>42211</sub></li>
<li>Move collection <sup>10568</sup>⁄<sub>51400</sub> from <sup>10568</sup>⁄<sub>42214</sub> to <sup>10568</sup>⁄<sub>42211</sub></li>
<li>Move collection <sup>10568</sup>⁄<sub>56992</sub> from <sup>10568</sup>⁄<sub>42216</sub> to <sup>10568</sup>⁄<sub>42211</sub></li>
<li>Move collection <sup>10568</sup>⁄<sub>42218</sub> from <sup>10568</sup>⁄<sub>42217</sub> to <sup>10568</sup>⁄<sub>42211</sub></li>
<li>Export CSV of collection <sup>10568</sup>⁄<sub>63484</sub> and move items to collection <sup>10568</sup>⁄<sub>51400</sub></li>
<li>Export CSV of collection <sup>10568</sup>⁄<sub>64403</sub> and move items to collection <sup>10568</sup>⁄<sub>56992</sub></li>
<li>Export CSV of collection <sup>10568</sup>⁄<sub>56994</sub> and move items to collection <sup>10568</sup>⁄<sub>42218</sub></li>
<li>There are blank lines in this metadata, which causes DSpace to not detect changes in the CSVs</li>
<li>I had to use OpenRefine to remove all columns from the CSV except <code>id</code> and <code>collection</code>, and then update the <code>collection</code> field for the new mappings</li>
<li>I was in the middle of applying the metadata imports on CGSpace and the system ran out of PostgreSQL connections…</li>
<li>There were 128 PostgreSQL connections at the time… grrrr.</li>
<li>So I restarted Tomcat 7 and restarted the imports</li>
<li>I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:</li>
<li>Briefly had PostgreSQL connection issues on CGSpace for the millionth time</li>
<li>I’m fucking sick of this!</li>
<li>The connection graph on CGSpace shows shit tons of connections idle</li>
</ul>
<p><img src="/cgspace-notes/2017/12/postgres-connections-month-cgspace-2.png" alt="Idle PostgreSQL connections on CGSpace"/></p>
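<p>The thumbnail and date clean-up steps from the CCAFS bulk import notes above were done with <code>rename</code> and GREL in OpenRefine. As a rough illustration only, the same transformations could be sketched in Python. The column names <code>thumbnail</code> and <code>dc.date.issued</code> and all function names here are assumptions for the example, not the actual CSV headers or any tool I used.</p>

```python
from datetime import datetime

# Suffix that tells SAFBuilder to put the file in the THUMBNAIL bundle.
BUNDLE_SUFFIX = "__bundle:THUMBNAIL"


def clean_thumbnail(name):
    """Lowercase a .JPG extension, add a missing .jpg, append the bundle suffix."""
    name = (name or "").strip()
    if not name:
        # Leave empty cells alone (e.g. a missing thumbnail).
        return ""
    if name.lower().endswith(".jpg"):
        name = name[:-4] + ".jpg"
    else:
        name = name + ".jpg"
    return name + BUNDLE_SUFFIX


def shorten_date(value):
    """Turn 2016-02-07T00:00:00Z into 2016-02-07; leave other values untouched."""
    value = (value or "").strip()
    try:
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return value
    return parsed.strftime("%Y-%m-%d")


def clean_rows(rows, thumbnail_col="thumbnail", date_col="dc.date.issued"):
    """Yield cleaned copies of CSV rows (dicts, e.g. from csv.DictReader).

    The column names are hypothetical defaults for this sketch.
    """
    for row in rows:
        cleaned = {key: (val or "").strip() for key, val in row.items()}
        # Rename the thumbnail column to "filename", as SAFBuilder expects.
        cleaned["filename"] = clean_thumbnail(cleaned.pop(thumbnail_col, ""))
        cleaned[date_col] = shorten_date(cleaned.get(date_col, ""))
        yield cleaned
```

<p>Feeding the rows from <code>csv.DictReader</code> through <code>clean_rows()</code> and writing them back out with <code>csv.DictWriter</code> would give a CSV ready for SAFBuilder, assuming the real column names are substituted.</p>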
<ul>
<li>And I only just realized that DSpace’s <code>db.maxidle</code> parameter is not a number of seconds, but the number of idle connections to allow.</li>
<li>So theoretically, because each webapp has its own pool, this could be 20 per app—so no wonder we have 50 idle connections!</li>
<li>I notice that this number will be set to 10 by default in DSpace 6.1 and 7.0: <a href="https://jira.duraspace.org/browse/DS-3564">https://jira.duraspace.org/browse/DS-3564</a></li>
<li>So I’m going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI</li>
<li>I re-deployed the <code>5_x-prod</code> branch on CGSpace, applied all system updates, and restarted the server</li>
<li>Looking into using JDBC / JNDI to provide a database pool to DSpace</li>
<li>The <a href="https://wiki.duraspace.org/display/DSDOC6x/Configuration+Reference">DSpace 6.x configuration docs</a> have more notes about setting up the database pool than the 5.x ones (which actually have none!)</li>
<li>First, I uncomment <code>db.jndi</code> in <em>dspace/config/dspace.cfg</em></li>
<li>Then I create a global <code>Resource</code> in the main Tomcat <em>server.xml</em> (inside <code>GlobalNamingResources</code>):</li>
<li>Most of the parameters are from comments by Mark Wood about his JNDI setup: <a href="https://jira.duraspace.org/browse/DS-3564">https://jira.duraspace.org/browse/DS-3564</a></li>
<li>Then I add a <code>ResourceLink</code> to each web application context:</li>
<li>I am not sure why several guides show configuration snippets for <em>server.xml</em> and web application contexts that use a Local and Global jdbc…</li>
<li>When DSpace can’t find the JNDI context (for whatever reason) you will see this in the dspace logs:</li>
<li>I wonder if I could get the JDBC driver from postgresql.org instead of relying on the one from the DSpace build: <a href="https://jdbc.postgresql.org/">https://jdbc.postgresql.org/</a></li>
<li>I notice our version is 9.1-901, which isn’t even available anymore! The latest in the archived versions is 9.1-903</li>
<li>Also, since I commented out all the db parameters in dspace.cfg, how does the command line <code>dspace</code> tool work?</li>
<li>Let’s try the upstream JDBC driver first:</li>
</ul>
<pre><code>javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file: java.naming.factory.initial
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:350)
at javax.naming.InitialContext.lookup(InitialContext.java:417)
at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1413)
at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
2017-12-19 18:26:56,983 INFO org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
2017-12-19 18:26:56,983 INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
2017-12-19 18:26:56,992 WARN org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxconnections
2017-12-19 18:26:56,992 WARN org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxwait
2017-12-19 18:26:56,993 WARN org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxidle
</code></pre>
<ul>
<li>If I add the db values back to dspace.cfg the <code>dspace database info</code> command succeeds but the log still shows errors retrieving the JNDI connection</li>
<li>Perhaps something to report to the dspace-tech mailing list when I finally send my comments</li>
<li>Oh cool! <code>select * from pg_stat_activity</code> shows “PostgreSQL JDBC Driver” for the application name! That’s how you know it’s working!</li>
<li>If you monitor the <code>pg_stat_activity</code> while you run <code>dspace database info</code> you can see that it doesn’t use the JNDI and creates ~9 extra PostgreSQL connections!</li>
<li>And in the middle of all of this Linode sends an alert that CGSpace has high CPU usage from 2 to 4 PM</li>
<li>The final code for the JNDI work in the Ansible infrastructure scripts is here: <a href="https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b">https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b</a></li>
<li>Looking at some old notes for metadata to clean up, I found a few hundred corrections in <code>cg.fulltextstatus</code> and <code>dc.language.iso</code>:</li>
</ul>
<pre><code># update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
DELETE 17
# update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
UPDATE 49
# update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
UPDATE 4
# update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
UPDATE 16
# update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
UPDATE 9
# update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
UPDATE 1
# update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
DELETE 20
</code></pre>
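<p>One thing to keep in mind with the regex updates above: PostgreSQL’s <code>~</code> operator matches the pattern anywhere in the string, so an unanchored pattern like <code>(IN|In)</code> would also match a longer value that merely contains it. Whether any such values existed in this data is not something these notes record. The Python sketch below only illustrates the difference between an unanchored and an anchored match; the helper names are made up for the example, and <code>re.search</code> stands in for <code>~</code>.</p>

```python
import re


def matches_unanchored(pattern, value):
    """Behaves like PostgreSQL's `value ~ pattern`: a match anywhere counts."""
    return re.search(pattern, value) is not None


def matches_anchored(pattern, value):
    """Behaves like `value ~ '^(...)$'`: the whole value must match."""
    return re.fullmatch(pattern, value) is not None


def preview(pattern, values):
    """List values an unanchored pattern would touch but an anchored one would not."""
    return [
        value
        for value in values
        if matches_unanchored(pattern, value) and not matches_anchored(pattern, value)
    ]


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real metadatavalue table.
    sample = ["IN", "In", "Indonesian", "English", "en"]
    print(preview(r"(IN|In)", sample))
```

<p>In SQL the anchored form would be written as <code>text_value ~ '^(IN|In)$'</code>, and running the same condition in a <code>select</code> first shows which rows an <code>update</code> would change.</p>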
<ul>
<li>I need to figure out why we have records with language <code>in</code> because that’s not a language!</li>
<li>Looks pretty normal actually, but I don’t know who 54.175.208.220 is</li>
<li>They identify as “com.plumanalytics”, which Google says is associated with Elsevier</li>
<li>They only seem to have used one Tomcat session so that’s good, I guess I don’t need to add them to the Tomcat Crawler Session Manager valve:</li>