<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>Normally I’d say this was very high, but <a href="/cgspace-notes/2018-02/">about this time last year</a> I remember thinking the same thing when we had 3.1 million…</li>
<li>Atmire sent their <a href="https://github.com/ilri/DSpace/pull/407">pull request to re-enable the Metadata Quality Module (MQM) on our <code>5_x-dev</code> branch</a> today
<ul>
<li>I will test it next week and send them feedback</li>
<li>Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:</li>
<li>I tested the Atmire Metadata Quality Module (MQM)’s duplicate checker on some <a href="https://dspacetest.cgiar.org/handle/10568/81268">WLE items</a> that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!</li>
<li>This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent!</li>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
<li>I <a href="https://github.com/ilri/rmg-ansible-public/commit/36dfb072d6724fb5cdc81ef79cab08ed9ce427ad">extended the existing <code>dynamicpages</code> (12/m) rate limit to <code>/browse</code> and <code>/discover</code></a> with an allowance for bursting of up to five requests for “real” users</li>
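<li>To get a feel for what a 12 requests per minute limit with a burst of five means in practice, here is a simplified leaky-bucket model in Python. It is only a sketch: the class name <code>RateLimiter</code> is hypothetical, it ignores the queueing delay nginx can apply to burst requests, and the real settings live in the nginx template linked above.

```python
class RateLimiter:
    """Simplified leaky-bucket model of a "rate + burst" limit (hypothetical helper)."""

    def __init__(self, rate_per_minute, burst):
        self.rate = rate_per_minute / 60.0  # requests drained per second
        self.burst = burst                  # extra requests tolerated above the rate
        self.excess = 0.0                   # requests currently counted in the bucket
        self.last = None                    # timestamp of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last is None:
            self.last = now
            return True
        # Drain the bucket for the elapsed time, then count this request
        excess = max(self.excess - self.rate * (now - self.last) + 1, 0.0)
        if excess > self.burst:
            return False  # over the burst allowance, so the request is rejected
        self.excess = excess
        self.last = now
        return True


limiter = RateLimiter(rate_per_minute=12, burst=5)
# Ten requests arriving in the same instant: the first plus five burst requests pass
results = [limiter.allow(0.0) for _ in range(10)]
print(results.count(True), "accepted,", results.count(False), "rejected")
```

In this model a client that fires ten requests at once gets six through and four rejected, and it regains one slot every five seconds.</li>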
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
<li>Skype with Michael Victor about CKM and CGSpace</li>
<li>Discuss the new IITA research theme field with Abenet and decide that we should use <code>cg.identifier.iitatheme</code></li>
<li>This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:</li>
<li>At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there’s nothing we can do to improve REST API performance!</li>
<li>In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use <code>toString()</code>:</li>
<li>Testing the corrections for sixty-five items and sixteen deletions using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> and <a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a> scripts:</li>
<li>Peter had marked several terms with <code>||</code> to indicate multiple values in his corrections so I will have to go back and do those manually:</li>
<li>I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don’t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)</li>
<li>Collection 1056 appears to be <a href="https://cgspace.cgiar.org/handle/10568/68741">IITA Posters and Presentations</a> and I see that its workflow step 1 (Accept/Reject) is empty:</li>
<li>CGNET said these servers were discontinued in 2018-01 and that I should use <a href="https://docs.microsoft.com/en-us/exchange/mail-flow-best-practices/how-to-set-up-a-multifunction-device-or-application-to-send-email-using-office-3">Office 365</a></li>
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
</code></pre><ul>
<li>I tried to log into Outlook 365 with the credentials but I think the ones I have must be wrong, so I will ask ICT to reset the password</li>
<li>Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:</li>
<li>It’s very clear to me now that the API requests are the heaviest!</li>
<li>I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it’s becoming a bit of <em>the boy who cried wolf</em> because it alerts like clockwork twice per day!</li>
<li>Add my Python- and shell-based metadata workflow helper scripts as well as the environment settings for pipenv to our DSpace repository (<a href="https://github.com/ilri/DSpace/pull/408">#408</a>) so I can track changes and distribute them more formally instead of just keeping them <a href="https://github.com/ilri/DSpace/wiki/Scripts">collected on the wiki</a></li>
<li>Started adding IITA research theme (<code>cg.identifier.iitatheme</code>) to CGSpace
<li>I’m still waiting for feedback from IITA whether they actually want to use “SOCIAL SCIENCE & AGRIC BUSINESS” because it is listed as <a href="http://www.iita.org/project-discipline/social-science-and-agribusiness/">“Social Science and Agribusiness”</a> on their website</li>
<li>Update ILRI author name style in the controlled vocabulary (Domelevo Entfellner, Jean-Baka) (<a href="https://github.com/ilri/DSpace/pull/409">#409</a>)
<li>Last week Hector Tobon from CCAFS asked me about the Creative Commons 3.0 Intergovernmental Organizations (IGO) license because it is not in the list of SPDX licenses
<li>Today I made <a href="http://13.57.134.254/app/license_requests/15/">a request</a> to the <a href="https://github.com/spdx/license-list-XML/blob/master/CONTRIBUTING.md">SPDX using their web form</a> to include this <a href="https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations">class of Creative Commons licenses</a></li>
<li>I’m not sure why I didn’t know about this configuration option before, and always maintained multiple configurations for development and production
<li>I will modify the <a href="https://github.com/ilri/rmg-ansible-public">Ansible DSpace role</a> to use this in its <code>build.properties</code> template</li>
<li>For some reason my <code>mvn package</code> for DSpace is not working now… I might go back to <a href="https://mjanja.ch/2018/02/cache-maven-artifacts-with-artifactory/">using Artifactory for caching</a> instead:</li>
<li>Bosede from IITA said we can use “SOCIAL SCIENCE & AGRIBUSINESS” in their new IITA theme field to be consistent with other places they are using it</li>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>I notice that <a href="https://jira.duraspace.org/browse/DS-3052">DSpace 6 has included a new JAR-based PDF thumbnailer based on PDFBox</a>, I wonder how good its thumbnails are and how it handles CMYK PDFs</li>
<li>On a similar note, I wonder if we could use the performance-focused <a href="https://libvips.github.io/libvips/">libvips</a> and the third-party <a href="https://github.com/codecitizen/jlibvips/">jlibvips Java library</a> in DSpace</li>
<li>Testing the <code>vipsthumbnail</code> command line tool with <a href="https://cgspace.cgiar.org/handle/10568/51999">this CGSpace item that uses CMYK</a>:</li>
<li>(DSpace 5 appears to use JPEG 92 quality so I do the same)</li>
<li>Thinking about making “top items” endpoints in my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
<li>I could use the following SQL queries very easily to get the top items by views or downloads:</li>
<li>ILRI ICT reset the password for the CGSpace mail account, but I still can’t get it to send mail from DSpace’s <code>test-email</code> utility</li>
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
<li>I tried to log into the Outlook 365 web mail and it doesn’t work so I’ve emailed ILRI ICT again</li>
<li>After reading the <a href="https://javaee.github.io/javamail/FAQ#commonmistakes">common mistakes in the JavaMail FAQ</a> I reconfigured the extra properties in DSpace’s mail configuration to be simply:</li>
<li>I updated the <a href="https://github.com/ilri/DSpace/pull/410">DSpace SMTP settings in <code>dspace.cfg</code></a> as well as the <a href="https://github.com/ilri/rmg-ansible-public/commit/ab5fe4d10e16413cd04ffb1bc3179dc970d6d47c">variables in the DSpace role of the Ansible infrastructure scripts</a></li>
<li>Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:</li>
<li>On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable <code>webui.user.assumelogin = true</code></li>
<li>I will enable this on CGSpace (<a href="https://github.com/ilri/DSpace/pull/411">#411</a>)</li>
<li>Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):</li>
<li>Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:</li>
<li>After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I’m pretty sure that’s not supposed to be how locks work…</li>
<li>I want to try to normalize the <code>text_lang</code> values to make working with metadata easier</li>
<li>We currently have a bunch of weird values that DSpace uses like <code>NULL</code>, <code>en_US</code>, and <code>en</code> and others that have been entered manually by editors:</li>
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
<li>The majority are <code>NULL</code>, <code>en_US</code>, the blank string, and <code>en</code>—the rest are not enough to be significant</li>
<li>Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!</li>
<li>Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:</li>
<li>More on the <a href="https://podman.io/blogs/2018/10/03/podman-remove-content-homedir.html">subuid permissions issue with rootless containers here</a></li>
<li>I merged the Atmire Metadata Quality Module (MQM) changes to the <code>5_x-prod</code> branch and deployed it on CGSpace (<a href="https://github.com/ilri/DSpace/pull/407">#407</a>)</li>
<li>Then I ran all system updates on CGSpace server and rebooted it</li>
<p>The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent “Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0”:</p>
<li>I will add this user agent to the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2">“badbots” rate limiting in our nginx configuration</a></li>
<li>I realized that I had effectively only been applying the “badbots” rate limiting to requests at the root, so I added it to the other blocks that match Discovery, Browse, etc as well</li>
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
<li>I merged the changes to the <code>5_x-prod</code> branch and they will go live the next time we re-deploy CGSpace (<a href="https://github.com/ilri/DSpace/pull/412">#412</a>)</li>
<li>That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don’t know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this</li>
<li>The user is requesting only things like <code>/handle/10568/56199?show=full</code> so it’s nothing malicious, only annoying</li>
<li>Otherwise there are still shit loads of IPs from Amazon still hammering the server, though I see HTTP 503 errors now after yesterday’s nginx rate limiting updates
<li><code>2a01:7e00::f03c:91ff:fe0a:d645</code> is on Linode, and I can see from the XMLUI access logs that it is Drupal, so I assume it is part of the new ILRI website harvester…</li>
<li>Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I’m so fucking sick of this</li>
<li>I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate</li>
<li>I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from <a href="https://hdl.handle.net/10568/96140">10568/96140</a> almost 200 times:</li>
<li>I wrote a quick and dirty Python script called <code>resolve-addresses.py</code> to resolve IP addresses to their owning organization’s name, ASN, and country using the <a href="https://ipapi.co">IPAPI.co API</a></li>
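<li>A lookup like this can be sketched in a few lines of standard-library Python. This is not the actual <code>resolve-addresses.py</code>: the function names are hypothetical, and the URL layout and the <code>org</code>, <code>asn</code>, and <code>country</code> field names are assumptions about the API's JSON response that should be checked against its documentation.

```python
import json
from urllib.request import urlopen


def ipapi_url(ip):
    """Build the JSON lookup URL for one address (URL layout is an assumption)."""
    return f"https://ipapi.co/{ip}/json/"


def summarize(record):
    """Pick organization, ASN, and country out of a decoded JSON response."""
    return {
        "ip": record.get("ip"),
        "org": record.get("org"),
        "asn": record.get("asn"),
        "country": record.get("country"),
    }


def resolve(ip):
    """Query the API for one address (needs network access)."""
    with urlopen(ipapi_url(ip), timeout=10) as response:
        return summarize(json.load(response))


# Offline demonstration with a made-up record for a documentation-range address
sample = {"ip": "192.0.2.1", "org": "EXAMPLE-ORG", "asn": "AS64496", "country": "ZZ"}
summary = summarize(sample)
print(summary)
```

Keeping the parsing in <code>summarize()</code> separate from the network call makes it possible to try the script on saved responses without hitting the API.</li>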
<li>This returns six items for me, which is the <a href="https://cgspace.cgiar.org/discover?filtertype_1=orcid&filter_relational_operator_1=contains&filter_1=Alan+S.+Orth%3A+0000-0002-1735-7458&submit_apply_filter=&query=">same I see in a Discovery search</a></li>
<li>Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
<li>I was playing with <a href="http://yasgui.org/">YasGUI</a> to query AGROVOC’s SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually</li>
<li>There seems to be a REST API for AGROVOC here: <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=FISH&lang=en">http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=FISH&lang=en</a></li>
<li>See this <a href="https://jira.duraspace.org/browse/VIVO-1655">issue on the VIVO tracker</a> for more information about this endpoint</li>
<li>The old-school AGROVOC SOAP WSDL works with the <a href="https://python-zeep.readthedocs.io/en/master/">Zeep Python library</a>, but in my tests the results are way too broad despite trying to use an “exact match” search</li>
<li>I wrote a script <a href="https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py">agrovoc-lookup.py</a> to resolve subject terms against the public AGROVOC REST API</li>
<li>It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to</li>
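<li>The general shape of such a lookup, as a minimal sketch rather than the real <code>agrovoc-lookup.py</code>: the search URL is the one from the REST API link above, while the function names and the assumption that a hit shows up as a non-empty <code>results</code> list in the JSON response are mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the AGROVOC REST API link in these notes
AGROVOC_SEARCH = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/search"


def search_url(term, lang="en"):
    """Build the search URL for one term in one language."""
    return AGROVOC_SEARCH + "?" + urlencode({"query": term, "lang": lang})


def is_match(payload):
    """Assumption: a hit is a non-empty "results" list in the JSON payload."""
    return len(payload.get("results", [])) > 0


def lookup(term, lang="en"):
    """Query the API for one term (needs network access)."""
    with urlopen(search_url(term, lang), timeout=10) as response:
        return is_match(json.load(response))


def sort_terms(terms, lang, matched_path, unmatched_path, check=lookup):
    """Write each term to either the matched or the unmatched output file."""
    with open(matched_path, "w") as matched, open(unmatched_path, "w") as unmatched:
        for term in terms:
            target = matched if check(term, lang) else unmatched
            target.write(term + "\n")
```

The <code>check</code> parameter only exists so the file handling can be tried with a stand-in function instead of the live API.</li>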
<li>And then a list of all the unique <em>unmatched</em> terms using some utility I’ve never heard of before called <code>comm</code> or with <code>diff</code>:</li>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
<li>I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it’s almost ready so I created a pull request (<a href="https://github.com/ilri/DSpace/pull/413">#413</a>)</li>
<li>A few days ago he added the metadata to <a href="https://cgspace.cgiar.org/handle/10568/93011">10568/93011</a> and now I see that the item is present on the <a href="https://www.wle.cgiar.org/resource-recovery-waste-business-models-energy-nutrient-and-water-reuse-low-and-middle-income">WLE publications website</a></li>
<p>Start looking at IITA’s latest round of batch uploads called <a href="https://dspacetest.cgiar.org/handle/10568/108684">“IITA_Feb_14” on DSpace Test</a></p>
<li>A few dozen incorrect or inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)</li>
<li>One issue with smart quotes in countries</li>
<li>A few IITA subjects with syntax errors</li>
<li>Some whitespace and consistency issues in sponsorships</li>
<li>Eight items with invalid ISBN: 0-471-98560-3</li>
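<li>The notes don't say why this ISBN counts as invalid. Assuming it is the check digit, an ISBN-10 can be verified with the standard weighted sum (weights 10 down to 1, total divisible by 11). The helper name <code>is_valid_isbn10</code> is hypothetical:

```python
def is_valid_isbn10(isbn):
    """Check the ISBN-10 check digit: the weighted sum must be divisible by 11."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10 and is only allowed as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run from 10 for the first character down to 1 for the last
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("0-471-98560-3"))  # False: the weighted sum is 234, and 234 % 11 == 3
```

Under that assumption the value above really does fail the checksum, so it is worth checking against the source publication instead of just fixing the formatting.</li>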
<li>You have to make sure to URL encode the value with <code>quote_plus()</code> and it totally works, but it seems to refresh the facets (and therefore re-query everything) when you select a facet so that makes it basically unusable</li>
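<li>For reference, this is what the encoding step does to a value with spaces and an ampersand. The sketch uses Python 3's <code>urllib.parse</code>; in Jython, which follows Python 2, the same function is <code>urllib.quote_plus</code>. The sample term is just one of the subject values from these notes.

```python
from urllib.parse import quote_plus

term = "SOCIAL SCIENCE & AGRIBUSINESS"
# Spaces become "+" and the ampersand is percent-encoded, so the value is
# safe to use as a single query string parameter
encoded = quote_plus(term)
print(encoded)  # SOCIAL+SCIENCE+%26+AGRIBUSINESS
```

Without the encoding, the <code>&</code> would be read as the start of a new query parameter and the term would be cut off at “SOCIAL SCIENCE”.</li>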
<li>There is a <a href="https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json">good resource discussing OpenRefine, Jython, and web scraping</a></li>
<li>I decided to try to validate the AGROVOC subjects in IITA’s recent batch upload by dumping all their terms, checking them in en/es/fr with <code>agrovoc-lookup.py</code>, then reconciling against the final list using reconcile-csv with OpenRefine</li>
<li>I’m not sure how to deal with terms like “CORN” that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be “MAIZE”</li>
<li>There are dozens of other entries like “corn (soft wheat)”, “corn (zea)”, “corn bran”, “Cornales”, etc that could potentially match and to determine if they are related programmatically is difficult</li>
<li>Shit, and then there are terms like “GENETIC DIVERSITY” that should <a href="http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/page/c_33952">technically be</a> “genetic diversity (as resource)”</li>
<li>I applied all changes to the IITA Feb 14 batch data except the affiliations and sponsorships because I think I made some mistakes with the copying of reconciled values so I will try to look at those again separately</li>
<li>I went back and re-did the affiliations and sponsorships and then applied them on the IITA Feb 14 collection on DSpace Test</li>
<li>I did a duplicate check of the IITA Feb 14 records on DSpace Test and there were about fifteen or twenty items reported
<li>A few of them are actually in previous IITA batch updates, which means they have not been uploaded to CGSpace yet, so I worry that there would be many more</li>
<li>I want to re-synchronize CGSpace to DSpace Test to make sure that the duplicate checking is accurate, but I’m not sure I can because the Earlham guys are still testing COPO actively on DSpace Test</li>
<li>There seems to be something going on with Solr on CGSpace (linode18) because statistics on communities and collections are blank for January and February this year</li>
Feb 25 21:09:29 linode18 tomcat7[1015]: java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger$9.visit(SourceFile:1241)
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger.visitEachStatisticShard(SourceFile:268)
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger.update(SourceFile:1225)
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger.update(SourceFile:1220)
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:103)
<li>In the Solr admin GUI I see we have the following error: “statistics-2011: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”</li>
<li>I restarted Tomcat and upon startup I see lots of errors in the systemd journal, like:</li>
<li>Also, now the Solr admin UI says “statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”</li>
<li>I can see that there are still hits being recorded for items (in the Solr admin UI as well as my statistics API), so the main stats core is working at least!</li>
<li>On a hunch I tried adding <code>ulimit -v unlimited</code> to the Tomcat <code>catalina.sh</code> and now Solr starts up with no core errors and I actually have statistics for January and February on <a href="https://cgspace.cgiar.org/handle/10568/16814">some communities</a>, but not <a href="https://cgspace.cgiar.org/handle/10568/1">others</a></li>
<li>I wonder if the address space limits that I added via <code>LimitAS=infinity</code> in the systemd service are somehow not working?</li>
<li>I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the <code>LimitAS</code> setting does work, and the <code>infinity</code> setting in systemd does get translated to “unlimited” on the service</li>
<li>Update DSpace Test to <a href="https://tomcat.apache.org/tomcat-7.0-doc/changelog.html#Tomcat_7.0.93_(violetagg)">Tomcat 7.0.93</a></li>
<li>Something seems to have happened (some Atmire scheduled task, perhaps the CUA one at 7AM?) on CGSpace because I checked a few communities and collections on CGSpace and there are now statistics for January and February</li>
<li>According to the <a href="https://cgspace.cgiar.org/rest/collections/1021">REST API</a> collection 1021 appears to be <a href="https://cgspace.cgiar.org/handle/10568/66581">CCAFS Tools, Maps, Datasets and Models</a></li>
<li>I looked at the <code>WORKFLOW_STEP_1</code> (Accept/Reject) and the group is of course empty</li>
<li>He asked me to upload the files for him via the command line, but the file he referenced (<code>Thumbnails_feb_2019.zip</code>) doesn’t exist</li>
<li>I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file’s name:</li>
<li>Why don’t they just derive the directory from the path to the zip file?</li>
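<li>Deriving both values from a single path is a one-liner in most languages. A sketch in Python, where the <code>/tmp</code> location is a made-up example and <code>split_zip_path</code> is a hypothetical helper, not anything in DSpace:

```python
import os.path


def split_zip_path(zip_path):
    """Return (directory, file name) for a path to a zip file."""
    directory, filename = os.path.split(zip_path)
    return directory, filename


# Hypothetical location; only the file name comes from these notes
directory, filename = split_zip_path("/tmp/Thumbnails_feb_2019.zip")
print(directory)  # /tmp
print(filename)   # Thumbnails_feb_2019.zip
```

A small wrapper script could use this to fill in both import arguments from one path.</li>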
<li>Working on Udana’s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
<li>I imported the records into my local environment with a fresh snapshot of the CGSpace database and ran the Atmire duplicate checker against them and it didn’t find any</li>
<li>I uploaded fifty-two records to the <a href="https://cgspace.cgiar.org/handle/10568/81592">Restoring Degraded Landscapes collection</a> on CGSpace</li>
<li>I helped Sisay upload the nineteen CTA records from last week via the command line because they required mappings (which is not possible to do via the batch upload web interface)</li>