Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243
real 0m19.873s
user 0m22.203s
sys 0m1.979s
</code></pre>
<ul>
<li>Normally I’d say this was very high, but <ahref="/cgspace-notes/2018-02/">about this time last year</a> I remember thinking the same thing when we had 3.1 million…</li>
<li>I will have to keep an eye on this to see if there is some error in Solr…</li>
<li>Atmire sent their <ahref="https://github.com/ilri/DSpace/pull/407">pull request to re-enable the Metadata Quality Module (MQM) on our <code>5_x-dev</code> branch</a> today
<ul>
<li>I will test it next week and send them feedback</li>
<li>Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:</li>
<li><code>45.5.184.2</code> is CIAT and <code>85.25.237.71</code> is the new Linguee bot that I first noticed a few days ago</li>
<li>I will increase the Linode alert threshold from 275 to 300% because this is becoming too much!</li>
<li>I tested the Atmire Metadata Quality Module (MQM)’s duplicate checked on the some <ahref="https://dspacetest.cgiar.org/handle/10568/81268">WLE items</a> that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!</li>
<li><code>45.5.184.2</code> is CIAT, <code>70.32.83.92</code> and <code>205.186.128.185</code> are Macaroni Bros harvesters for CCAFS I think</li>
<li><code>195.201.104.240</code> is a new IP address in Germany with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre>
<ul>
<li>This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent!</li>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
<ul>
<li>I <ahref="https://github.com/ilri/rmg-ansible-public/commit/36dfb072d6724fb5cdc81ef79cab08ed9ce427ad">extended the existing <code>dynamicpages</code> (12/m) rate limit to <code>/browse</code> and <code>/discover</code></a> with an allowance for bursting of up to five requests for “real” users</li>
<li>Generate a list of CTA subjects from CGSpace for Peter:</li>
</ul>
<pre><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
<li>This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:</li>
<li>At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there’s nothing we can do to improve REST API performance!</li>
<li>Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?</li>
<li>Peter sent me corrections and deletions for the CTA subjects and as usual, there were encoding errors with some accentsÁ in his file</li>
<li>In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use <code>toString()</code>:</li>
</ul>
<pre><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b4.*/)),
isNotNull(value.match(/.*\u007e.*/))
).toString()
</code></pre>
<ul>
<li>Testing the corrections for sixty-five items and sixteen deletions using my <ahref="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> and <ahref="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a> scripts:</li>
<li>Peter had marked several terms with <code>||</code> to indicate multiple values in his corrections so I will have to go back and do those manually:</li>
<li>So it seems that the load issue comes from the REST API, not the XMLUI</li>
<li>I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don’t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)</li>
<li>Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
</code></pre>
<ul>
<li>Collection 1056 appears to be <ahref="https://cgspace.cgiar.org/handle/10568/68741">IITA Posters and Presentations</a> and I see that its workflow step 1 (Accept/Reject) is empty:</li>
</ul>
<p><imgsrc="/cgspace-notes/2019/02/iita-workflow-step1-empty.png"alt="IITA Posters and Presentations workflow step 1 empty"/></p>
<ul>
<li>IITA editors or approvers should be added to that step (though I’m curious why nobody is in that group currently)</li>
<li>CGNET said these servers were discontinued in 2018-01 and that I should use <ahref="https://docs.microsoft.com/en-us/exchange/mail-flow-best-practices/how-to-set-up-a-multifunction-device-or-application-to-send-email-using-office-3">Office 365</a></li>
</ul>
<h2id="2019-02-08">2019-02-08</h2>
<ul>
<li>I re-configured CGSpace to use the email/password for cgspace-support, but I get this error when I try the <code>test-email</code> script:</li>
</ul>
<pre><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
</code></pre>
<ul>
<li>I tried to log into Outlook 365 with the credentials but I think the ones I have must be wrong, so I will ask ICT to reset the password</li>
<li>Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:</li>
<li>I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it’s becoming a bit of <em>the boy who cried wolf</em> because it alerts like clockwork twice per day!</li>
<li>Add my Python- and shell-based metadata workflow helper scripts as well as the environment settings for pipenv to our DSpace repository (<ahref="https://github.com/ilri/DSpace/pull/408">#408</a>) so I can track changes and distribute them more formally instead of just keeping them <ahref="https://github.com/ilri/DSpace/wiki/Scripts">collected on the wiki</a></li>
<li>Started adding IITA research theme (<code>cg.identifier.iitatheme</code>) to CGSpace
<ul>
<li>I’m still waiting for feedback from IITA whether they actually want to use “SOCIAL SCIENCE & AGRIC BUSINESS” because it is listed as <ahref="http://www.iita.org/project-discipline/social-science-and-agribusiness/">“Social Science and Agribusiness”</a> on their website</li>
<li>Also, I think they want to do some mappings of items with existing subjects to these new themes</li>
</ul></li>
<li>Update ILRI author name style in the controlled vocabulary (Domelevo Entfellner, Jean-Baka) (<ahref="https://github.com/ilri/DSpace/pull/409">#409</a>)
<ul>
<li>I’m still waiting to hear from Bizuwork whether we’ll batch update all existing items with the old name style</li>
<li>Last week Hector Tobon from CCAFS asked me about the Creative Commons 3.0 Intergovernmental Organizations (IGO) license because it is not in the list of SPDX licenses
<li>Today I made <ahref="http://13.57.134.254/app/license_requests/15/">a request</a> to the <ahref="https://github.com/spdx/license-list-XML/blob/master/CONTRIBUTING.md">SPDX using their web form</a> to include this class of Creative Commons licenses](<ahref="https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations">https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations</a>)</li>
<li>Testing the <code>mail.server.disabled</code> property that I noticed in <code>dspace.cfg</code> recently
<ul>
<li>Setting it to true results in the following message when I try the <code>dspace test-email</code> helper on DSpace Test:</li>
</ul></li>
</ul>
<pre><code>Error sending email:
- Error: cannot test email because mail.server.disabled is set to true
</code></pre>
<ul>
<li>I’m not sure why I didn’t know about this configuration option before, and always maintained multiple configurations for development and production
<ul>
<li>I will modify the <ahref="https://github.com/ilri/rmg-ansible-public">Ansible DSpace role</a> to use this in its <code>build.properties</code> template</li>
</ul></li>
<li>I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:</li>
<li>For some reason my <code>mvn package</code> for DSpace is not working now… I might go back to <ahref="https://mjanja.ch/2018/02/cache-maven-artifacts-with-artifactory/">using Artifactory for caching</a> instead:</li>
<li>Bosede from IITA said we can use “SOCIAL SCIENCE & AGRIBUSINESS” in their new IITA theme field to be consistent with other places they are using it</li>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>I notice that <ahref="https://jira.duraspace.org/browse/DS-3052">DSpace 6 has included a new JAR-based PDF thumbnailer based on PDFBox</a>, I wonder how good its thumbnails are and how it handles CMYK PDFs</li>
<li>On a similar note, I wonder if we could use the performance-focused <ahref="https://libvips.github.io/libvips/">libvps</a> and the third-party <ahref="https://github.com/codecitizen/jlibvips/">jlibvips Java library</a> in DSpace</li>
<li>Testing the <code>vipsthumbnail</code> command line tool with <ahref="https://cgspace.cgiar.org/handle/10568/51999">this CGSpace item that uses CMYK</a>:</li>
<li>Thinking about making “top items” endpoints in my <ahref="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
<li>I could use the following SQL queries very easily to get the top items by views or downloads:</li>
</ul>
<pre><code>dspacestatistics=# SELECT * FROM items WHERE views > 0 ORDER BY views DESC LIMIT 10;
dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads DESC LIMIT 10;
</code></pre>
<ul>
<li>I’d have to think about what to make the REST API endpoints, perhaps: <code>/statistics/top/items?limit=10</code></li>
<li>But how do I do top items by views / downloads separately?</li>
<li>I re-deployed DSpace 6.3 locally to test the PDFBox thumbnails, especially to see if they handle CMYK files properly
<ul>
<li>The quality is JPEG 75 and I don’t see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:</li>
<li>ILRI ICT reset the password for the CGSpace mail account, but I still can’t get it to send mail from DSpace’s <code>test-email</code> utility</li>
<li>I even added extra mail properties to <code>dspace.cfg</code> as suggested by someone on the dspace-tech mailing list:</li>
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
</code></pre>
<ul>
<li>I tried to log into the Outlook 365 web mail and it doesn’t work so I’ve emailed ILRI ICT again</li>
<li>After reading the <ahref="https://javaee.github.io/javamail/FAQ#commonmistakes">common mistakes in the JavaMail FAQ</a> I reconfigured the extra properties in DSpace’s mail configuration to be simply:</li>
<li>… and then I was able to send a mail using my personal account where I know the credentials work</li>
<li>The CGSpace account still gets this error message:</li>
</ul>
<pre><code>Error sending email:
- Error: javax.mail.AuthenticationFailedException
</code></pre>
<ul>
<li>I updated the <ahref="https://github.com/ilri/DSpace/pull/410">DSpace SMTP settings in <code>dspace.cfg</code></a> as well as the <ahref="https://github.com/ilri/rmg-ansible-public/commit/ab5fe4d10e16413cd04ffb1bc3179dc970d6d47c">variables in the DSpace role of the Ansible infrastructure scripts</a></li>
<li>Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:</li>
</ul>
<pre><code>$ dspace user --delete --email blah@cta.int
<li>On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable <code>webui.user.assumelogin = true</code></li>
<li>I will enable this on CGSpace (<ahref="https://github.com/ilri/DSpace/pull/411">#411</a>)</li>