<li>I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
<li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…</li>
<li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li>
<li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li>
<li>Most worryingly, there are encoding errors in the abstracts for eleven items, for example:</li>
<li>68.15% <20> 9.45 instead of 68.15% ± 9.45</li>
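<p>A minimal sketch of how abstracts like this could be flagged in bulk. It assumes the damaged character arrives as the Unicode replacement character (U+FFFD), which may not be how it appears in the real export; the record IDs and the <code>find_encoding_errors</code> helper are hypothetical.</p>

```python
# Assumption: the damaged "±" shows up as U+FFFD, the Unicode replacement character.
REPLACEMENT_CHAR = "\ufffd"


def find_encoding_errors(abstracts):
    """Return the IDs of records whose abstract contains U+FFFD."""
    return [
        record_id
        for record_id, text in abstracts.items()
        if REPLACEMENT_CHAR in text
    ]


# Hypothetical records; a real check would read the abstracts from the metadata CSV.
abstracts = {
    "item-1": "68.15% \ufffd 9.45",
    "item-2": "68.15% \u00b1 9.45",  # the intended plus-minus sign
}

flagged = find_encoding_errors(abstracts)
print(flagged)
```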
<li>After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:</li>
<li>DSpace’s export function doesn’t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something</li>
<li>After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the <code>dspace cleanup</code> script</li>
<li>Merge the IITA research theme changes from last month to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/413">#413</a>)
<li>Generate a controlled vocabulary of 1187 AGROVOC subjects from the top 1500 that I checked last month, dumping the terms themselves using <code>csvcut</code> and then applying XML controlled vocabulary format in vim and then checking with tidy for good measure:</li>
<li>I tested the AGROVOC controlled vocabulary locally and will deploy it on DSpace Test soon so people can see it</li>
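<p>The csvcut and vim steps could also be scripted. This is only a sketch: the nested <code>node</code>/<code>isComposedBy</code> layout is my understanding of DSpace's controlled vocabulary XML, and the root <code>id</code> and <code>label</code> values, the sample terms, and the <code>build_vocabulary</code> helper are assumptions for illustration.</p>

```python
import xml.etree.ElementTree as ET


def build_vocabulary(terms, vocabulary_id="agrovoc", label="AGROVOC"):
    """Build a flat controlled vocabulary tree from a list of subject terms."""
    # Assumed layout: one root <node> whose <isComposedBy> child holds one <node> per term.
    root = ET.Element("node", {"id": vocabulary_id, "label": label})
    composed_by = ET.SubElement(root, "isComposedBy")
    # De-duplicate and sort so the output is stable between runs.
    for term in sorted(set(terms)):
        ET.SubElement(composed_by, "node", {"id": term, "label": term})
    return root


# Hypothetical input; the real terms would come from the first column of the CSV.
terms = ["MAIZE", "CASSAVA", "MAIZE"]
vocabulary = build_vocabulary(terms)
xml_text = ET.tostring(vocabulary, encoding="unicode")
print(xml_text)
```

<p>Running the result through tidy afterwards would still be a sensible check.</p>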
<li>Atmire noticed my message about the “solr_update_time_stamp” error on the dspace-tech mailing list and created an issue on their tracker to discuss it with me
<li><del>Interestingly, if I check an item in the REST API it is also mostly blank: only the title and the ID!</del> On second thought I realize I probably was just seeing the default view without any “expands”</li>
<li>We want to try to crowdsource the correction of invalid AGROVOC terms starting with the ~313 invalid ones from our top 1500</li>
<li>We will share a Google Docs spreadsheet with the partners and ask them to mark the deletions and corrections</li>
<li>Abenet and Alan to spend some time identifying correct DCTERMS fields to move to, with preference over CG Core 2.0 as we want to be globally compliant (use information from SEO crosswalks)</li>
<li>I need to follow up on the privacy page that Sisay worked on</li>
<li>We want to try to migrate the 600 <a href="https://livestock.cgiar.org">Livestock CRP blog posts</a> to CGSpace; Peter will try to export the XML from WordPress so I can try to parse it with a script</li>
<li>I shared a post on Yammer asking our editors to try the AGROVOC controlled list</li>
<li>The SPDX legal committee had a meeting and discussed the addition of CC-BY-ND-3.0-IGO and other IGO licenses to their list, but it seems unlikely (<a href="https://github.com/spdx/license-list-XML/issues/767#issuecomment-470709673">spdx/license-list-XML/issues/767</a>)</li>
<li>The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms</li>
<pre><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
<pre><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https://feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
UPDATE 43
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https://feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
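<p>Before running a rewrite like this against the database, the substitution can be checked on sample text. A minimal sketch, where the sample strings and the <code>upgrade_feedburner_links</code> helper are hypothetical, and the dots are escaped so they only match literal periods:</p>

```python
import re

# Match both the bare and the "feeds." Feedburner hosts over plain HTTP.
FEEDBURNER_HTTP = re.compile(r"http://((?:feeds\.)?feedburner\.)")


def upgrade_feedburner_links(text):
    """Rewrite HTTP Feedburner links to HTTPS, leaving other links alone."""
    return FEEDBURNER_HTTP.sub(r"https://\1", text)


# Hypothetical collection sidebar snippets.
samples = [
    '<form action="http://feedburner.google.com/fb/a/mailverify">',
    '<a href="http://feeds.feedburner.com/example">',
    '<a href="http://example.org/">',
]
upgraded = [upgrade_feedburner_links(sample) for sample in samples]
for line in upgraded:
    print(line)
```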
<li>Working on tagging IITA’s items with their new research theme (<code>cg.identifier.iitatheme</code>) based on their existing IITA subjects (see <a href="/cgspace-notes/2018-02/">notes from 2019-02</a>)</li>
<p>After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a <code>||</code>)</p>
</li>
<li>
<p>I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:</p>
<li>In total this would add research themes to 1,755 items</li>
<li>I want to double check one last time with Bosede that they would like to do this, because I also see that this will tag a few hundred items from the 1970s and 1980s</li>
<li>This is a bit ugly, but it works (using the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL helper function</a> to resolve ID to handle):</li>
<pre><code>for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do
<li>Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:</li>
<li>And looking at those items more closely, only one of them has an <em>issue date</em> after 2018-04, so I will only update that one (as the country’s name only changed in 2018-04)</li>
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
at org.dspace.authorize.AuthorizeManager.getPolicies(AuthorizeManager.java:612)
at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:154)
at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:300)
</code></pre><ul>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, <del>but spikes of over 1,000 today</del>, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
<li>I didn’t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today <em>might</em> be related?</li>
<pre><code>SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
<li>Last week Felix from Earlham said that they finished testing on DSpace Test (linode19) so I made backups of some things there and re-deployed the system on Ubuntu 18.04
<li>During re-deployment I hit a few issues with the <a href="https://github.com/ilri/rmg-ansible-public">Ansible playbooks</a> and made some minor improvements</li>
<li>There seems to be an <a href="https://bugs.launchpad.net/ubuntu/+source/nodejs/+bug/1794589">issue with nodejs’s dependencies now</a>, which causes npm to get uninstalled when installing the certbot dependencies (due to a conflict in libssl dependencies)</li>
<li>Re-sync DSpace Test with a fresh database snapshot and assetstore from CGSpace
<ul>
<li>After restarting Tomcat, Solr was giving the “Error opening new searcher” error for all cores</li>
<li>I stopped Tomcat, added <code>ulimit -v unlimited</code> to the <code>catalina.sh</code> script and deleted all old locks in the DSpace <code>solr</code> directory and then DSpace started up normally</li>
<li>I’m still not exactly sure why I see this error and if the <code>ulimit</code> trick actually helps, as the <code>tomcat7.service</code> has <code>LimitAS=infinity</code> anyways (and from checking the PID’s limits file in <code>/proc</code> it seems to be applied)</li>
<li>I’m not entirely sure if it’s related, but I tried to delete the old migrations and then force running the ignored ones like when we upgraded to <a href="/cgspace-notes/2018-06/">DSpace 5.8 in 2018-06</a> and then after restarting Tomcat I could see the item displays again</li>
<li>I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api</li>
<li>I noticed that the regular expression for validating lines from input files in my <code>agrovoc-lookup.py</code> script was skipping characters with accents, etc, so I changed it to use the <code>\w</code> character class for words instead of trying to match <code>[A-Z]</code> etc…
<ul>
<li>We have Spanish and French subjects so this is very important</li>
<li>Also there were some subjects with apostrophes, dashes, and periods… these are probably invalid AGROVOC subject terms, but we should save them to the rejects file instead of skipping them nevertheless</li>
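<p>A minimal sketch of the validation change described above, not the actual code from <code>agrovoc-lookup.py</code>: the patterns, sample terms, and the <code>split_terms</code> helper are assumptions. In Python 3 the <code>\w</code> class matches Unicode letters by default (as well as digits and underscores), so accented terms pass while terms with punctuation go to the rejects list instead of being silently skipped.</p>

```python
import re

# Old approach (assumed): ASCII letters and spaces only, which skips accented terms.
ASCII_ONLY = re.compile(r"^[A-Za-z ]+$")
# New approach: \w matches Unicode word characters in Python 3 str patterns.
WORD_CHARS = re.compile(r"^[\w ]+$")


def split_terms(lines):
    """Sort input lines into accepted terms and rejects rather than dropping any."""
    accepted, rejected = [], []
    for line in lines:
        term = line.strip()
        if not term:
            continue  # ignore blank lines
        if WORD_CHARS.match(term):
            accepted.append(term)
        else:
            rejected.append(term)
    return accepted, rejected


# Hypothetical input lines; escapes keep the accented letters precomposed.
lines = [
    "MAIZE",
    "PRODUCCI\u00d3N ANIMAL",         # Spanish, accented O
    "S\u00c9CURIT\u00c9 ALIMENTAIRE",  # French, accented E
    "FARMERS' RIGHTS",                 # apostrophe
    "AGRO-INDUSTRY",                   # dash
]
accepted, rejected = split_terms(lines)
print(accepted)
print(rejected)
```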
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
<li>So the new total of matched terms with the updated regex is 1317 and unmatched is 183 (previous number of matched terms was 1187)</li>
<li>Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (<a href="https://github.com/ilri/DSpace/pull/416">#416</a>)</li>
<li>We are getting the blank page issue on CGSpace again today and I see a <del>large number</del> of the “SQL QueryTable Error” in the DSpace log again (last time was 2019-03-15):</li>
<li>Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the “binary file matches” result with <code>-I</code>:</li>
<li>It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use <code>-I</code> to say binary files don’t match</li>
<li>Anyways, the full error in DSpace’s log is:</li>
<li>This is unrelated and apparently due to <a href="https://github.com/munin-monitoring/munin/issues/746">Munin checking a column that was changed in PostgreSQL 9.6</a></li>
<li>I found a handful of AGROVOC subjects that use a non-breaking space (0x00a0) instead of a regular space, which makes for pretty confusing debugging…</li>
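<p>Since the two kinds of space look identical on screen, a quick check like this sketch can flag them. The sample terms and helper names are hypothetical.</p>

```python
NBSP = "\u00a0"  # non-breaking space, 0x00a0


def find_nbsp_terms(terms):
    """Return the terms that contain a non-breaking space."""
    return [term for term in terms if NBSP in term]


def normalize_spaces(term):
    """Replace non-breaking spaces with regular spaces."""
    return term.replace(NBSP, " ")


# Hypothetical subjects: the second one hides a non-breaking space.
terms = ["CLIMATE CHANGE", "CLIMATE\u00a0CHANGE"]
suspicious = find_nbsp_terms(terms)
cleaned = [normalize_spaces(term) for term in terms]
print([term.encode("unicode_escape").decode("ascii") for term in suspicious])
```

<p>Printing the escaped form makes the hidden character visible as <code>\xa0</code>.</p>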
<li>CGSpace (linode18) is having problems with Solr again, I’m seeing “Error opening new searcher” in the Solr logs and there are no stats for previous years</li>
<li>Apparently the Solr statistics shards didn’t load properly when we restarted Tomcat <em>yesterday</em>:</li>
<li>For reference, I don’t see the <code>ulimit -v unlimited</code> in the <code>catalina.sh</code> script, though the <code>tomcat7</code> systemd service has <code>LimitAS=infinity</code></li>
<li>I will try to add <code>ulimit -v unlimited</code> to the Catalina startup script and check the output of the limits to see if it’s different in practice, as some wisdom on Stack Overflow says this solves the Solr core issues and I’ve superstitiously tried it various times in the past
<li>The result is the same before and after, so <em>adding the ulimit directly is unnecessary</em> (whether unlimited address space is useful is another question)</li>
<li>Another avenue might be to look at point releases in Solr 4.10.x, as we’re running 4.10.2 and they released 4.10.3 and 4.10.4 back in 2014 or 2015
<li>It’s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:</li>
<li>To investigate the Solr lock issue I added a <code>find</code> command to the Tomcat 7 service with <code>ExecStartPre</code> and <code>ExecStopPost</code> and noticed that the lock files are always there…
<li>In other news, I notice that systemd always thinks that Tomcat has failed when it stops because the JVM exits with code 143, which is apparently normal when processes gracefully receive a SIGTERM (128 + 15 == 143)
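<p>The arithmetic behind that exit code, as a small sketch. It follows the common shell convention of reporting 128 plus the signal number for a process ended by a signal; the <code>exit_code_for_signal</code> helper is hypothetical.</p>

```python
import signal


def exit_code_for_signal(signum):
    """Exit status conventionally reported for a process ended by a signal."""
    return 128 + int(signum)


sigterm_exit = exit_code_for_signal(signal.SIGTERM)
print(sigterm_exit)
```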
<pre><code>2019-03-22 21:21:34,378 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
<li>It’s hard to tell because we have <code>logAbandoned</code> enabled, but I don’t see anything in the <code>tomcat7</code> service logs in the systemd journal</li>
<p>I spent some time trying to test and debug the Tomcat connection pool’s settings, but for some reason our logs are either messed up or no connections are actually getting abandoned</p>
<p>I compiled this <a href="https://github.com/gnosly/TomcatJdbcConnectionTest">TomcatJdbcConnectionTest</a> and created a bunch of database connections and waited a few minutes but they never got abandoned until I created over <code>maxActive</code> (75), after which almost all were purged at once</p>
<li>I did some more tests with the <a href="https://github.com/gnosly/TomcatJdbcConnectionTest">TomcatJdbcConnectionTest</a> thing and while monitoring the number of active connections in jconsole and after adjusting the limits quite low I eventually saw some connections get abandoned</li>
<li>I forgot that to connect to a remote JMX session with jconsole you need to use a dynamic SSH SOCKS proxy (as I originally <a href="/cgspace-notes/2017-11/">discovered in 2017-11</a>):</li>
<li>This includes the new <code>testWhileIdle</code> and <code>testOnConnect</code> pool settings as well as the two new JDBC interceptors: <code>StatementFinalizer</code> and <code>ConnectionState</code> that should hopefully make sure our connections in the pool are valid</li>
<li>Looking at the DBCP status on CGSpace via jconsole and everything looks good, though I wonder why <code>timeBetweenEvictionRunsMillis</code> is -1, because the <a href="https://tomcat.apache.org/tomcat-7.0-doc/jdbc-pool.html">Tomcat 7.0 JDBC docs</a> say the default is 5000…
<ul>
<li>Could be an error in the docs, as I see the <a href="https://commons.apache.org/proper/commons-dbcp/configuration.html">Apache Commons DBCP</a> has -1 as the default</li>
<li>Now I see all my interceptor settings etc in jconsole, where I didn’t see them before (also a new <code>tomcat.jdbc</code> mbean)!</li>
<li>No wonder our settings didn’t quite match the ones in the <a href="https://tomcat.apache.org/tomcat-7.0-doc/jdbc-pool.html">Tomcat DBCP Pool docs</a></li>
<li>I need to watch this carefully though because I’ve read some places that Tomcat’s DBCP doesn’t track statements and might create memory leaks if an application doesn’t close statements before a connection gets returned back to the pool</li>
<li>The two IPv6 addresses are something called BLEXBot, which seems to check the robots.txt file and then completely ignore it by making thousands of requests to dynamic pages like Browse and Discovery</li>
<li><code>54.70.40.11</code> is SemanticScholarBot</li>
<li><code>216.244.66.198</code> is DotBot</li>
<li><code>93.179.69.74</code> is some IP in Ukraine, which I will add to the list of bot IPs in nginx</li>
<li>I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)</li>
<li>Looking at the nginx logs I don’t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:</li>
<li>So I’m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet</li>
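<p>A rough sketch of how requests to the dynamic pages could be tallied per user agent. It assumes nginx's combined log format; the path fragments, the sample log lines, the user agent strings, and the <code>count_dynamic_requests</code> helper are all made up for illustration.</p>

```python
import re
from collections import Counter

# Assumes the combined format: "METHOD path protocol" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# Assumed path fragments for the Discovery and Browse pages.
DYNAMIC_PATHS = ("/discover", "/browse")


def count_dynamic_requests(lines):
    """Count requests to dynamic pages, grouped by user agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and any(part in match.group("path") for part in DYNAMIC_PATHS):
            counts[match.group("agent")] += 1
    return counts


# Made-up log lines using documentation-only IP addresses.
lines = [
    '192.0.2.10 - - [25/Mar/2019:06:00:01 +0100] "GET /discover?filtertype=subject HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [25/Mar/2019:06:00:02 +0100] "GET /browse?type=subject HTTP/1.1" 200 4096 "-" "ExampleBot/1.0"',
    '192.0.2.20 - - [25/Mar/2019:06:00:03 +0100] "GET /handle/10568/89975 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
counts = count_dynamic_requests(lines)
print(counts.most_common(5))
```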
<li>I will add their IPs to the list of bot IPs in nginx so I can tag them as bots and let Tomcat’s Crawler Session Manager Valve force them to re-use their session</li>
<li>What’s clear from this is that some other VM on our host has heavy usage for about four hours at 6AM and 6PM and that during that time the load on our server spikes
<li>I restarted Tomcat to see if I could fix the missing pre-2019 statistics (yes it fixed it)
<ul>
<li>Though I looked in the Solr Admin UI and noticed a logging dashboard that shows warnings and errors, and the first one concerning Solr cores was at 3/27/2019, 8:50:35 AM so I should check the logs around that time to see if something happened</li>
<li>It is frustrating to see that the load spikes from our own legitimate load on the server were <em>very</em> aggravated and drawn out by the contention for CPU on this host</li>
<li>Interestingly, now that the CPU steal is not an issue the REST API is ten seconds faster than it was in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li>
<li>This <a href="https://www.hetzner.com/dedicated-rootserver/px62-nvme">PX62-NVME system</a> looks great and is half the price of our current Linode instance</li>
<li>It has 64GB of ECC RAM, a six-core Xeon processor from 2018, and 2x960GB NVMe storage</li>
<li>The alternative of staying with Linode and using dedicated CPU instances with added block storage gets expensive quickly if we want to keep more than 16GB of RAM (do we?)
<li>The item was added on 2019-03-13 and these three IPs have attempted to download the item’s bitstream 43,000 times since it was added eighteen days ago:</li>
<li>I will send a mail to CTA to ask if they know these IPs</li>
<li>I wonder if the Cocoon errors we had earlier this month were inadvertently related to the CPU steal issue… I see very low occurrences of the “Can not load requested doc” error in the Cocoon logs the past few days</li>
<li>Helping Perttu debug some issues with the REST API on DSpace Test
<pre><code>2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
<li>Interestingly, if I look at the network requests during page load for the first item I see the following response payload for the Altmetric API request:</li>
<pre><code>_altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. 
DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
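<p>To inspect a response like this outside the browser, the callback wrapper can be stripped and the rest parsed as JSON. A minimal sketch: the payload below is trimmed to a few fields from the response above, and the <code>parse_jsonp</code> helper is hypothetical.</p>

```python
import json
import re

# A JSONP response is a callback name wrapping a JSON document: callback({...})
JSONP = re.compile(r"^\s*[\w.$]+\((?P<json>.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(payload):
    """Strip the callback wrapper and decode the JSON inside it."""
    match = JSONP.match(payload)
    if match is None:
        raise ValueError("payload is not a JSONP callback")
    return json.loads(match.group("json"))


# Trimmed copy of the Altmetric response, keeping only a few fields.
payload = (
    '_altmetric.embed_callback({"handles":["10568/89975","10568/89846"],'
    '"handle":"10568/89975","score":208.65})'
)
data = parse_jsonp(payload)
print(data["handle"], data["handles"], data["score"])
```

<p>Parsed this way, it is easy to see that the response lists two handles under <code>handles</code> but only one under <code>handle</code>.</p>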