IWMI notified me that AReS was down with an HTTP 502 error
Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn’t running
IWMI notified me that AReS was down with an HTTP 502 error
Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn’t running
<li>IWMI notified me that AReS was down with an HTTP 502 error
<ul>
<li>Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification</li>
<li>I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the <code>angular_nginx</code> container isn’t running</li>
<li>I simply started it and AReS was running again:</li>
<li>Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI’s content into the new AMCOW Knowledge Hub
<ul>
<li>At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time</li>
<li>That’s when I thought they could perhaps use the OpenSearch, but I can’t remember if it’s possible to limit by community, or only collection…</li>
<li>Looking now, I see there is a “scope” parameter that can be used for community or collection, for example:</li>
<li>The harvesting on AReS completed successfully</li>
<li>Provide feedback to FAO on how we use AGROVOC for their “AGROVOC call for use cases”</li>
</ul>
<h2id="2021-06-10">2021-06-10</h2>
<ul>
<li>Skype with Moayad to discuss AReS harvesting improvements
<ul>
<li>He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not</li>
</ul>
</li>
</ul>
<h2id="2021-06-14">2021-06-14</h2>
<ul>
<li>Dump and re-create indexes on AReS (as above) so I can do a harvest</li>
</ul>
<h2id="2021-06-16">2021-06-16</h2>
<ul>
<li>Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot’s DNS
<li>For example, user agent <code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)</code> with DNS <code>msnbot-131-253-25-91.search.msn.com.</code></li>
<li>I queried Solr for all hits using the MSN bot DNS (<code>dns:*msnbot* AND dns:*.msn.com.</code>) and found 457,706</li>
<li>I extracted their IPs using Solr’s CSV format and ran them through my <code>resolve-addresses.py</code> script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)</li>
<li>Note that <ahref="https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26">Microsoft’s docs say that reverse lookups on Bingbot IPs will always have “search.msn.com”</a> so it is safe to purge these as non-human traffic</li>
<li>I purged the hits with <code>ilri/check-spider-ip-hits.sh</code> (though I had to do it in 3 batches because I forgot to increase the <code>facet.limit</code> so I was only getting them 100 at a time)</li>
<li>It will hopefully also fix the duplicate and missing items issues</li>
<li>I had a Skype with him to discuss</li>
<li>I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:</li>
<li>Then I uploaded the resulting CSV to CGSpace, updating 161 items</li>
<li>Start a harvest on AReS</li>
<li>I found <ahref="https://jira.lyrasis.org/browse/DS-1977">a bug</a> and <ahref="https://github.com/DSpace/DSpace/pull/2584">a patch</a> for the private items showing up in the DSpace sitemap bug
<ul>
<li>The fix is super simple, I should try to apply it</li>
</ul>
</li>
</ul>
<h2id="2021-06-21">2021-06-21</h2>
<ul>
<li>The AReS harvesting finished, but the indexes got messed up again</li>
<li>I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
<ul>
<li>We have 90,000+ items, but only 85,000 unique:</li>
<li>Some appear twice in the Elasticsearch index, but appear in only one collection</li>
<li>Some appear twice in the Elasticsearch index, and appear in <em>two</em> collections</li>
<li>Some appear twice in the Elasticsearch index, but appear in three collections (!)</li>
</ul>
</li>
<li>So really we need to just check whether a handle exists before we insert it</li>
<li>I tested the <ahref="https://github.com/DSpace/DSpace/pull/2584">pull request for DS-1977</a> that adjusts the sitemap generation code to exclude private items
<ul>
<li>It applies cleanly and seems to work, but we don’t actually have any private items</li>
<li>The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private</li>
</ul>
</li>
<li>Testing the <ahref="https://github.com/DSpace/DSpace/pull/2275">pull request for DS-4065</a> where the REST API’s <code>/rest/items</code> endpoint is not aware of private items and returns an incorrect number of items
<ul>
<li>This is most easily seen by setting a low limit in <code>/rest/items</code>, making one of the items private, and requesting items again with the same limit</li>
<li>I confirmed the issue on the current DSpace 6 Demo:</li>
</span></span><spanstyle="display:flex;"><span># log into DSpace Demo XMLUI as admin and make one item private <spanstyle="color:#f92672">(</span><spanstyle="color:#66d9ef">for</span> example 10673/6<spanstyle="color:#f92672">)</span>
<li>I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira</li>
<li>Last week I noticed that the Gender Platform website is using “cgspace.cgiar.org” links for CGSpace, instead of handles
<ul>
<li>I emailed Fabio and Marianne to ask them to please use the Handle links</li>
</ul>
</li>
<li>I tested the <ahref="https://github.com/DSpace/DSpace/pull/2543">pull request for DS-4271</a> where Discovery filters of type “contains” don’t work as expected when the user’s search term has spaces
<ul>
<li>I tested with filter “farmer managed irrigation systems” on DSpace Test</li>
<li>Before the patch I got 293 results, and the few I checked didn’t have the expected metadata value</li>
<li>After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting</li>
</ul>
</li>
<li>I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results:
<ul>
<li>90459 in final from last harvesting</li>
<li>90307 in temp after new harvest</li>
<li>90327 in temp after start plugins</li>
</ul>
</li>
<li>The 90327 number seems closer to the “real” number of items on CGSpace…
<ul>
<li>Seems close, but not entirely correct yet:</li>
<li>Make a <ahref="https://github.com/atmire/COUNTER-Robots/pull/43">pull request</a> to the COUNTER-Robots project to add two new user agents: crusty and newspaper
<ul>
<li>These two bots have made ~3,000 requests on CGSpace</li>
<li>Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:</li>
<li>These bots account for ~42,000 hits in our statistics… I will just purge them and add them to our local override, but I can’t be bothered to submit them to COUNTER-Robots since I’d have to look up the information for each one</li>
<li>I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)</li>
<li>I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now</li>
<li>In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)</li>
<li>I also applied the following patches from the 6.4 milestone to our <code>6_x-prod</code> branch:
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
<li>I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots</li>
</ul>
</li>
<li>I merged <ahref="https://github.com/ilri/OpenRXV/pull/96">Moayad’s health check pull request in AReS</a> and I will deploy it on the production server soon</li>
<li>I deployed the new OpenRXV code on CGSpace but I’m having problems with the indexing, something about missing the mappings on the <code>openrxv-items-temp</code> index
<li>I extracted the mappings from my local instance using <code>elasticdump</code> and after putting them on CGSpace I was able to harvest…</li>
<li>But still, there are way too many duplicates and I’m not sure what the actual number of items should be</li>
<li>According to the OAI ListRecords for each of our repositories, we should have about:
<li>So the harvest on the live site is missing items, then why didn’t the add missing items plugin find them?!
<ul>
<li>I notice that we are missing the <code>type</code> in the metadata structure config for each repository on the production site, and we are using <code>type</code> for item type in the actual schema… so maybe there is a conflict there</li>
<li>I will rename type to <code>item_type</code> and add it back to the metadata structure</li>
<li>The add missing items definitely checks this field…</li>
<li>I modified my local backup to add <code>type: item</code> and uploaded it to the temp index on production</li>
<li>Oh! nginx is blocking OpenRXV’s attempt to read the sitemap:</li>
<li>I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins… now it’s checking 180,000+ handles to see if they are collections or items…
<li>We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs</li>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>127.0.0.1:6379> TYPE bull:plugins:401827
<li>I whipped up a one liner to get the keys for all plugin jobs, convert to redis <code>HGET</code> commands to extract the value of the name field, and then sort them by their counts:</li>
<li>Note that this uses <code>ncat</code> to send commands directly to redis all at once instead of one at a time (<code>netcat</code> didn’t work here, as it doesn’t know when our input is finished and never quits)
<ul>
<li>I thought of using <code>redis-cli --pipe</code> but then you have to construct the commands in the redis protocol format with the number of args and length of each command</li>
</ul>
</li>
<li>There is clearly something wrong with the new DSpace health check plugin, as it creates WAY too many jobs every time we run the plugins</li>
</ul>
<h2id="2021-06-27">2021-06-27</h2>
<ul>
<li>Looking into the spike in PostgreSQL connections last week
<ul>
<li>I see the same things that I always see (large number of connections waiting for lock, large number of threads, high CPU usage, etc), but I also see almost 10,000 DSpace sessions on 2021-06-25</li>
<li>Annoyingly I found 37,000 more hits from Bing using <code>dns:*msnbot* AND dns:*.msn.com.</code> as a Solr filter
<ul>
<li>WTF, they are using a normal user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li>
<li>I will purge the IPs and add this user agent to the nginx config so that we can rate limit it</li>
</ul>
</li>
<li>I signed up for Bing Webmaster Tools and verified cgspace.cgiar.org with the BingSiteAuth.xml file
<ul>
<li>Also I adjusted the nginx config to explicitly allow access to <code>robots.txt</code> even when bots are rate limited</li>
<li>Also I found that Bing was auto discovering all our RSS and Atom feeds as “sitemaps” so I deleted 750 of them and submitted the real sitemap</li>
<li>I need to see if I can adjust the nginx config further to map the <code>bot</code> user agent to DNS like msnbot…</li>
</ul>
</li>
<li>Review Abdullah’s filter on click pull request
<ul>
<li>I rebased his code on the latest master branch and tested adding filter on click to the map and list components, and it works fine</li>
<li>There seems to be a bug that breaks scrolling on the page though…</li>
<li>Abdullah fixed the bug in the filter on click branch</li>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
<li>The first one of these I see is from last night at 2021-06-29 at 10:47 PM</li>
<li>I restarted Tomcat 7 and CGSpace came back up…</li>
<li>I didn’t see that Atmire had responded last week (on 2021-06-23) about the issues we had
<ul>
<li>He said they had to do the same thing that they did last time: switch to the postgres user and kill all activity</li>
<li>He said they found tons of connections to the REST API, like 3-4 per second, and asked if that was normal</li>
<li>I pointed him to our Tomcat server.xml configuration, saying that we purposefully isolated the Tomcat connection pools between the API and XMLUI for this purpose…</li>
</ul>
</li>
<li>Export a list of all CGSpace’s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:</li>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
<li><code>/dev/random</code> blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
<ul>
<li>Now Tomcat starts much faster and no warning is printed so I’m going to add this to our Ansible infrastructure playbooks</li>
</ul>
</li>
<li>Interesting resource about the lore behind the <code>/dev/./urandom</code> workaround that is posted all over the Internet, apparently due to a bug in early JVMs: <ahref="https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6202721">https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6202721</a></li>
<li>I’m experimenting with using PgBouncer for pooling instead of Tomcat’s JDBC</li>