<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <title>June, 2021 | CGSpace Notes</title> </head> <body> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-06/">June, 2021</a></h2> <p class="blog-post-meta"> <time datetime="2021-06-01T10:51:07+03:00">Tue Jun 01, 2021</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 
id="2021-06-01">2021-06-01</h2> <ul> <li>IWMI notified me that AReS was down with an HTTP 502 error <ul> <li>Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification</li> <li>I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the <code>angular_nginx</code> container isn’t running</li> <li>I simply started it and AReS was running again:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml start angular_nginx </span></span></code></pre></div><ul> <li>Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately <ul> <li>I guess this is related to the SMTP issues last week</li> <li>I had fixed the config, but didn’t restart Tomcat so DSpace didn’t load the new variables</li> <li>I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers</li> </ul> </li> </ul> <h2 id="2021-06-03">2021-06-03</h2> <ul> <li>Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI’s content into the new AMCOW Knowledge Hub <ul> <li>At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time</li> <li>That’s when I thought they could perhaps use the OpenSearch, but I can’t remember if it’s possible to limit by community, or only collection…</li> <li>Looking now, I see there is a “scope” parameter that can be used for community or collection, for example:</li> </ul> </li> </ul> <pre tabindex="0"><code>https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1 </code></pre><ul> <li>That will sort by date issued (see: <code>webui.itemlist.sort-option.2</code> in dspace.cfg), give 100 results per page, and start on item 1</li> <li>Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week</li> <li>Fill out the <em>CGIAR-AGROVOC Task Group: Survey on the current CGIAR use of AGROVOC</em> survey on behalf of CGSpace</li> </ul> <h2 id="2021-06-06">2021-06-06</h2> <ul> <li>The Elasticsearch indexes are messed up so I dumped and re-created them correctly:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>curl -XDELETE 'http://localhost:9200/openrxv-items-final' </span></span><span style="display:flex;"><span>curl -XDELETE 'http://localhost:9200/openrxv-items-temp' </span></span><span style="display:flex;"><span>curl -XPUT 'http://localhost:9200/openrxv-items-final' </span></span><span style="display:flex;"><span>curl -XPUT 'http://localhost:9200/openrxv-items-temp' </span></span><span style="display:flex;"><span>curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}' </span></span><span style="display:flex;"><span>elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping </span></span><span 
style="display:flex;"><span>elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000 </span></span></code></pre></div><ul> <li>Then I started a harvesting on AReS</li> </ul> <h2 id="2021-06-07">2021-06-07</h2> <ul> <li>The harvesting on AReS completed successfully</li> <li>Provide feedback to FAO on how we use AGROVOC for their “AGROVOC call for use cases”</li> </ul> <h2 id="2021-06-10">2021-06-10</h2> <ul> <li>Skype with Moayad to discuss AReS harvesting improvements <ul> <li>He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not</li> </ul> </li> </ul> <h2 id="2021-06-14">2021-06-14</h2> <ul> <li>Dump and re-create indexes on AReS (as above) so I can do a harvest</li> </ul> <h2 id="2021-06-16">2021-06-16</h2> <ul> <li>Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot’s DNS <ul> <li>For example, user agent <code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)</code> with DNS <code>msnbot-131-253-25-91.search.msn.com.</code></li> <li>I queried Solr for all hits using the MSN bot DNS (<code>dns:*msnbot* AND dns:*.msn.com.</code>) and found 457,706</li> <li>I extracted their IPs using Solr’s CSV format and ran them through my <code>resolve-addresses.py</code> script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)</li> <li>Note that <a href="https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26">Microsoft’s docs say that reverse lookups on Bingbot IPs will always have “search.msn.com”</a> so it is safe to purge these as non-human traffic</li> <li>I purged the hits with <code>ilri/check-spider-ip-hits.sh</code> (though I had to do it in 3 batches because I forgot to increase the <code>facet.limit</code> so I was only getting them 100 at a time)</li> </ul> </li> <li>Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV <ul> <li>It will hopefully also fix the duplicate and missing items issues</li> <li>I had a Skype with him to discuss</li> <li>I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data </span></span></code></pre></div><ul> <li>The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster <ul> <li>I harvested 90,000+ items from DSpace Test in ~3 hours</li> <li>There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times…</li> </ul> </li> </ul> <h2 id="2021-06-17">2021-06-17</h2> <ul> <li>I ported my ilri/resolve-addresses.py script that uses IPAPI.co to use the local GeoIP2 databases <ul> <li>The new script is ilri/resolve-addresses-geoip2.py and it is much faster and works offline with no API rate limits</li> </ul> </li> <li>Teams meeting with the CGIAR Metadata Working group to discuss CGSpace and open repositories and the way forward</li> <li>More work with Moayad on OpenRXV harvesting issues <ul> <li>Using a JSON 
export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:digit:]]+"'</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">'{print $2}'</span> | wc -l </span></span><span style="display:flex;"><span>90459 </span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:digit:]]+"'</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">'{print $2}'</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>90380 </span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:digit:]]+"'</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">'{print $2}'</span> | sort | uniq -c | sort -h </span></span><span style="display:flex;"><span>... </span></span><span style="display:flex;"><span> 2 "10568/99409" </span></span><span style="display:flex;"><span> 2 "10568/99410" </span></span><span style="display:flex;"><span> 2 "10568/99411" </span></span><span style="display:flex;"><span> 2 "10568/99516" </span></span><span style="display:flex;"><span> 3 "10568/102093" </span></span><span style="display:flex;"><span> 3 "10568/103524" </span></span><span style="display:flex;"><span> 3 "10568/106664" </span></span><span style="display:flex;"><span> 3 "10568/106940" </span></span><span style="display:flex;"><span> 3 "10568/107195" </span></span><span style="display:flex;"><span> 3 "10568/96546" </span></span></code></pre></div><h2 id="2021-06-20">2021-06-20</h2> <ul> <li>Udana asked me to update their IWMI subjects from <code>farmer managed irrigation systems</code> to <code>farmer-led irrigation</code> <ul> <li>First I extracted the IWMI community from CGSpace:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv </span></span></code></pre></div><ul> <li>Then I used <code>csvcut</code> to extract just the columns I needed and do the replacement into a new CSV:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'id,dcterms.subject[],dcterms.subject[en_US]'</span> /tmp/2021-06-20-IWMI.csv | sed <span style="color:#e6db74">'s/farmer managed irrigation systems/farmer-led irrigation/'</span> > /tmp/2021-06-20-IWMI-new-subjects.csv </span></span></code></pre></div><ul> <li>Then I uploaded the resulting CSV to CGSpace, updating 161 items</li> <li>Start a harvest on AReS</li> <li>I found <a href="https://jira.lyrasis.org/browse/DS-1977">a bug</a> and <a href="https://github.com/DSpace/DSpace/pull/2584">a patch</a> for the private items showing up in the DSpace sitemap bug <ul> <li>The fix is super simple, I should try to apply it</li> </ul> </li> </ul> <h2 id="2021-06-21">2021-06-21</h2> <ul> <li>The AReS harvesting finished, but the indexes got 
messed up again</li> <li>I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates <ul> <li>We have 90,000+ items, but only 85,000 unique:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">'"repo":"CGSpace"'</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:alnum:]]+"'</span> | wc -l </span></span><span style="display:flex;"><span>90937 </span></span><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">'"repo":"CGSpace"'</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:alnum:]]+"'</span> | sort -u | wc -l </span></span><span style="display:flex;"><span>85709 </span></span></code></pre></div><ul> <li>So those could be duplicates from the way we harvest pages, but they could also be from mappings… <ul> <li>Manually inspecting the duplicates where handles appear more than once:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">'"repo":"CGSpace"'</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:alnum:]]+"'</span> | sort | uniq -c | sort -h </span></span></code></pre></div><ul> <li>Unfortunately I found no pattern: <ul> <li>Some appear twice in the Elasticsearch index, but appear in only one collection</li> <li>Some appear twice in the Elasticsearch index, and appear in <em>two</em> collections</li> <li>Some appear twice in the Elasticsearch index, but appear in three collections (!)</li> </ul> </li> <li>So really we need to just check whether a handle exists before we insert it</li> <li>I tested the <a href="https://github.com/DSpace/DSpace/pull/2584">pull request for DS-1977</a> that adjusts the sitemap generation code to exclude private items <ul> <li>It applies cleanly and seems to work, but we don’t actually have any private items</li> <li>The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private</li> </ul> </li> <li>Testing the <a href="https://github.com/DSpace/DSpace/pull/2275">pull request for DS-4065</a> where the REST API’s <code>/rest/items</code> endpoint is not aware of private items and returns an incorrect number of items <ul> <li>This is most easily seen by setting a low limit in <code>/rest/items</code>, making one of the items private, and requesting items again with the same limit</li> <li>I confirmed the issue on the current DSpace 6 Demo:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">"Accept: application/json"</span> <span style="color:#e6db74">"https://demo.dspace.org/rest/items?offset=0&limit=5"</span> | jq length </span></span><span style="display:flex;"><span>5 </span></span><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">"Accept: application/json"</span> <span 
style="color:#e6db74">"https://demo.dspace.org/rest/items?offset=0&limit=5"</span> | jq <span style="color:#e6db74">'.[].handle'</span> </span></span><span style="display:flex;"><span>"10673/4" </span></span><span style="display:flex;"><span>"10673/3" </span></span><span style="display:flex;"><span>"10673/6" </span></span><span style="display:flex;"><span>"10673/5" </span></span><span style="display:flex;"><span>"10673/7" </span></span><span style="display:flex;"><span># log into DSpace Demo XMLUI as admin and make one item private <span style="color:#f92672">(</span><span style="color:#66d9ef">for</span> example 10673/6<span style="color:#f92672">)</span> </span></span><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">"Accept: application/json"</span> <span style="color:#e6db74">"https://demo.dspace.org/rest/items?offset=0&limit=5"</span> | jq length </span></span><span style="display:flex;"><span>4 </span></span><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">"Accept: application/json"</span> <span style="color:#e6db74">"https://demo.dspace.org/rest/items?offset=0&limit=5"</span> | jq <span style="color:#e6db74">'.[].handle'</span> </span></span><span style="display:flex;"><span>"10673/4" </span></span><span style="display:flex;"><span>"10673/3" </span></span><span style="display:flex;"><span>"10673/5" </span></span><span style="display:flex;"><span>"10673/7" </span></span></code></pre></div><ul> <li>I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira</li> <li>Last week I noticed that the Gender Platform website is using “cgspace.cgiar.org” links for CGSpace, instead of handles <ul> <li>I emailed Fabio and Marianne to ask them to please use the Handle links</li> </ul> </li> <li>I tested the <a href="https://github.com/DSpace/DSpace/pull/2543">pull request for DS-4271</a> where Discovery filters of type “contains” don’t work as expected when the user’s search term has spaces <ul> <li>I tested with filter “farmer managed irrigation systems” on DSpace Test</li> <li>Before the patch I got 293 results, and the few I checked didn’t have the expected metadata value</li> <li>After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting</li> </ul> </li> <li>I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results: <ul> <li>90459 in final from last harvesting</li> <li>90307 in temp after new harvest</li> <li>90327 in temp after start plugins</li> </ul> </li> <li>The 90327 number seems closer to the “real” number of items on CGSpace… <ul> <li>Seems close, but not entirely correct yet:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:digit:]]+"'</span> openrxv-items_data-local-ds-4065.json | wc -l </span></span><span style="display:flex;"><span>90327 </span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'"handle":"[[:digit:]]+/[[:digit:]]+"'</span> openrxv-items_data-local-ds-4065.json | sort -u | wc -l </span></span><span style="display:flex;"><span>90317 </span></span></code></pre></div><h2 id="2021-06-22">2021-06-22</h2> <ul> <li>Make a <a href="https://github.com/atmire/COUNTER-Robots/pull/43">pull request</a> to the 
COUNTER-Robots project to add two new user agents: crusty and newspaper <ul> <li>These two bots have made ~3,000 requests on CGSpace</li> <li>Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p </span></span><span style="display:flex;"><span>Purging 1339 hits from RI\/1\.0 in statistics </span></span><span style="display:flex;"><span>Purging 447 hits from crusty in statistics </span></span><span style="display:flex;"><span>Purging 3736 hits from newspaper in statistics </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 5522 </span></span></code></pre></div><ul> <li>Surprised to see RI/1.0 in there because it’s been in the override file for a while</li> <li>Looking at the 2021 statistics in Solr I see a few more suspicious user agents: <ul> <li><code>PostmanRuntime/7.26.8</code></li> <li><code>node-fetch/1.0 (+https://github.com/bitinn/node-fetch)</code></li> <li><code>Photon/1.0</code></li> <li><code>StatusCake_Pagespeed_indev</code></li> <li><code>node-superagent/3.8.3</code></li> <li><code>cortex/1.0</code></li> </ul> </li> <li>These bots account for ~42,000 hits in our statistics… I will just purge them and add them to our local override, but I can’t be bothered to submit them to COUNTER-Robots since I’d have to look up the information for each one</li> <li>I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)</li> </ul> <h2 id="2021-06-23">2021-06-23</h2> <ul> <li>I woke up this morning to find CGSpace down <ul> <li>The logs show a high number of abandoned PostgreSQL connections and locks:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># journalctl --since<span style="color:#f92672">=</span>today -u tomcat7 | grep -c <span style="color:#e6db74">'Connection has been abandoned'</span> </span></span><span style="display:flex;"><span>978 </span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;'</span> | wc -l </span></span><span style="display:flex;"><span>10100 </span></span></code></pre></div><ul> <li>I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now</li> <li>In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)</li> <li>I also applied the following patches from the 6.4 milestone to our <code>6_x-prod</code> branch: <ul> <li>DS-4065: resource policy aware REST API hibernate queries</li> <li>DS-4271: Replaced brackets with double quotes in SolrServiceImpl</li> </ul> </li> <li>After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:</li> </ul> <div 
class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;'</span> | wc -l </span></span><span style="display:flex;"><span>63 </span></span></code></pre></div><ul> <li>Looking in the DSpace log, the first “pool empty” message I saw this morning was at 4AM:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000]. </span></span></code></pre></div><ul> <li>Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy) </span></span></code></pre></div><ul> <li>We can purge them, as this is not user traffic: <a href="https://about.flipboard.com/browserproxy/">https://about.flipboard.com/browserproxy/</a> <ul> <li>I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots</li> </ul> </li> <li>I merged <a href="https://github.com/ilri/OpenRXV/pull/96">Moayad’s health check pull request in AReS</a> and I will deploy it on the production server soon</li> </ul> <h2 id="2021-06-24">2021-06-24</h2> <ul> <li>I deployed the new OpenRXV code on CGSpace but I’m having problems with the indexing, something about missing the mappings on the <code>openrxv-items-temp</code> index <ul> <li>I extracted the mappings from my local instance using <code>elasticdump</code> and after putting them on CGSpace I was able to harvest…</li> <li>But still, there are way too many duplicates and I’m not sure what the actual number of items should be</li> <li>According to the OAI ListRecords for each of our repositories, we should have about: <ul> <li>MELSpace: 9537</li> <li>WorldFish: 4483</li> <li>CGSpace: 91305</li> <li>Total: 105325</li> </ul> </li> <li>Looking at the last backup I have from harvesting before these changes we have 104,000 total handles, but only 99186 unique:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'"handle":"([[:digit:]]|\.)+/[[:digit:]]+"'</span> cgspace-openrxv-items-temp-backup.json | wc -l </span></span><span style="display:flex;"><span>104797 </span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'"handle":"([[:digit:]]|\.)+/[[:digit:]]+"'</span> cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l </span></span><span style="display:flex;"><span>99186 </span></span></code></pre></div><ul> <li>This number is probably unique for that particular harvest, 
but I don’t think it represents the true number of items…</li> <li>The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">'"repo":"DSpace Test"'</span> 2021-06-23-openrxv-items-final-local.json | grep -oE <span style="color:#e6db74">'"handle":"([[:digit:]]|\.)+/[[:digit:]]+"'</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>90990 </span></span></code></pre></div><ul> <li>So the harvest on the live site is missing items, then why didn’t the add missing items plugin find them?! <ul> <li>I notice that we are missing the <code>type</code> in the metadata structure config for each repository on the production site, and we are using <code>type</code> for item type in the actual schema… so maybe there is a conflict there</li> <li>I will rename type to <code>item_type</code> and add it back to the metadata structure</li> <li>The add missing items definitely checks this field…</li> <li>I modified my local backup to add <code>type: item</code> and uploaded it to the temp index on production</li> <li>Oh! nginx is blocking OpenRXV’s attempt to read the sitemap:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV" </span></span></code></pre></div><ul> <li>I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins… now it’s checking 180,000+ handles to see if they are collections or items… <ul> <li>I see it fetched the sitemap three times, we need to make sure it’s only doing it once for each repository</li> </ul> </li> <li>According to the api logs we will be adding 5,697 items:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l </span></span><span style="display:flex;"><span>5697 </span></span></code></pre></div><ul> <li>Spent a few hours with Moayad troubleshooting and improving OpenRXV <ul> <li>We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs</li> </ul> </li> </ul> <h2 id="2021-06-25">2021-06-25</h2> <ul> <li>The new OpenRXV code creates almost 200,000 jobs when the plugins start <ul> <li>I figured out how to use <a href="https://github.com/bee-queue/arena/tree/master/example">bee-queue/arena</a> to view our Bull job queue</li> <li>Also, we can see the jobs directly using redis-cli:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ redis-cli </span></span><span style="display:flex;"><span>127.0.0.1:6379> SCAN 0 COUNT 5 </span></span><span 
style="display:flex;"><span>1) "49152" </span></span><span style="display:flex;"><span>2) 1) "bull:plugins:476595" </span></span><span style="display:flex;"><span> 2) "bull:plugins:367382" </span></span><span style="display:flex;"><span> 3) "bull:plugins:369228" </span></span><span style="display:flex;"><span> 4) "bull:plugins:438986" </span></span><span style="display:flex;"><span> 5) "bull:plugins:366215" </span></span></code></pre></div><ul> <li>We can apparently get the names of the jobs in each hash using <code>hget</code>:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>127.0.0.1:6379> TYPE bull:plugins:401827 </span></span><span style="display:flex;"><span>hash </span></span><span style="display:flex;"><span>127.0.0.1:6379> HGET bull:plugins:401827 name </span></span><span style="display:flex;"><span>"dspace_add_missing_items" </span></span></code></pre></div><ul> <li>I whipped up a one liner to get the keys for all plugin jobs, convert to redis <code>HGET</code> commands to extract the value of the name field, and then sort them by their counts:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ redis-cli KEYS <span style="color:#e6db74">"bull:plugins:*"</span> <span style="color:#ae81ff">\ </span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \ </span></span><span style="display:flex;"><span> | ncat -w 3 localhost 6379 \ </span></span><span style="display:flex;"><span> | grep -v -E '^\$' | sort | uniq -c | sort -h </span></span><span style="display:flex;"><span> 3 dspace_health_check </span></span><span style="display:flex;"><span> 4 -ERR wrong number of arguments for 'hget' command </span></span><span style="display:flex;"><span> 12 mel_downloads_and_views </span></span><span style="display:flex;"><span> 129 dspace_altmetrics </span></span><span style="display:flex;"><span> 932 dspace_downloads_and_views </span></span><span style="display:flex;"><span> 186428 dspace_add_missing_items </span></span></code></pre></div><ul> <li>Note that this uses <code>ncat</code> to send commands directly to redis all at once instead of one at a time (<code>netcat</code> didn’t work here, as it doesn’t know when our input is finished and never quits) <ul> <li>I thought of using <code>redis-cli --pipe</code> but then you have to construct the commands in the redis protocol format with the number of args and length of each command</li> </ul> </li> <li>There is clearly something wrong with the new DSpace health check plugin, as it creates WAY too many jobs every time we run the plugins</li> </ul> <h2 id="2021-06-27">2021-06-27</h2> <ul> <li>Looking into the spike in PostgreSQL connections last week <ul> <li>I see the same things that I always see (large number of connections waiting for lock, large number of threads, high CPU usage, etc), but I also see almost 10,000 DSpace sessions on 2021-06-25</li> </ul> </li> </ul> <p><img src="/cgspace-notes/2021/06/dspace-sessions-week.png" alt="DSpace sessions"></p> <ul> <li>Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal:</li> </ul> <div 
class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> file in dspace.log.2021-06-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">"</span>$file<span style="color:#e6db74">"</span>; grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}'</span> <span style="color:#e6db74">"</span>$file<span style="color:#e6db74">"</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span> </span></span><span style="display:flex;"><span>dspace.log.2021-06-10 </span></span><span style="display:flex;"><span>19072 </span></span><span style="display:flex;"><span>dspace.log.2021-06-11 </span></span><span style="display:flex;"><span>19224 </span></span><span style="display:flex;"><span>dspace.log.2021-06-12 </span></span><span style="display:flex;"><span>19215 </span></span><span style="display:flex;"><span>dspace.log.2021-06-13 </span></span><span style="display:flex;"><span>16721 </span></span><span style="display:flex;"><span>dspace.log.2021-06-14 </span></span><span style="display:flex;"><span>17880 </span></span><span style="display:flex;"><span>dspace.log.2021-06-15 </span></span><span style="display:flex;"><span>12103 </span></span><span style="display:flex;"><span>dspace.log.2021-06-16 </span></span><span style="display:flex;"><span>4651 </span></span><span style="display:flex;"><span>dspace.log.2021-06-17 </span></span><span style="display:flex;"><span>22785 </span></span><span style="display:flex;"><span>dspace.log.2021-06-18 </span></span><span style="display:flex;"><span>21406 </span></span><span style="display:flex;"><span>dspace.log.2021-06-19 </span></span><span style="display:flex;"><span>25967 </span></span><span style="display:flex;"><span>dspace.log.2021-06-20 </span></span><span style="display:flex;"><span>20850 </span></span><span style="display:flex;"><span>dspace.log.2021-06-21 </span></span><span style="display:flex;"><span>6388 </span></span><span style="display:flex;"><span>dspace.log.2021-06-22 </span></span><span style="display:flex;"><span>5945 </span></span><span style="display:flex;"><span>dspace.log.2021-06-23 </span></span><span style="display:flex;"><span>46371 </span></span><span style="display:flex;"><span>dspace.log.2021-06-24 </span></span><span style="display:flex;"><span>9024 </span></span><span style="display:flex;"><span>dspace.log.2021-06-25 </span></span><span style="display:flex;"><span>12521 </span></span><span style="display:flex;"><span>dspace.log.2021-06-26 </span></span><span style="display:flex;"><span>16163 </span></span><span style="display:flex;"><span>dspace.log.2021-06-27 </span></span><span style="display:flex;"><span>5886 </span></span></code></pre></div><ul> <li>I see 15,000 unique IPs in the XMLUI logs alone on that day:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep <span style="color:#e6db74">'23/Jun/2021'</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>15835 </span></span></code></pre></div><ul> <li>Annoyingly I 
found 37,000 more hits from Bing using <code>dns:*msnbot* AND dns:*.msn.com.</code> as a Solr filter <ul> <li>WTF, they are using a normal user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li> <li>I will purge the IPs and add this user agent to the nginx config so that we can rate limit it</li> </ul> </li> <li>I signed up for Bing Webmaster Tools and verified cgspace.cgiar.org with the BingSiteAuth.xml file <ul> <li>Also I adjusted the nginx config to explicitly allow access to <code>robots.txt</code> even when bots are rate limited</li> <li>Also I found that Bing was auto discovering all our RSS and Atom feeds as “sitemaps” so I deleted 750 of them and submitted the real sitemap</li> <li>I need to see if I can adjust the nginx config further to map the <code>bot</code> user agent to DNS like msnbot…</li> </ul> </li> <li>Review Abdullah’s filter on click pull request <ul> <li>I rebased his code on the latest master branch and tested adding filter on click to the map and list components, and it works fine</li> <li>There seems to be a bug that breaks scrolling on the page though…</li> <li>Abdullah fixed the bug in the filter on click branch</li> </ul> </li> </ul> <h2 id="2021-06-28">2021-06-28</h2> <ul> <li>Some work on OpenRXV <ul> <li>Integrate <code>prettier</code> into the frontend and backend and format everything on the <code>master</code> branch</li> <li>Re-work the GitHub Actions workflow for frontend and add one for backend</li> <li>The workflows run <code>npm install</code> to test dependencies, and <code>npm ci</code> with <code>prettier</code> to check formatting</li> <li>Also I merged Abdallah’s filter on click pull request</li> </ul> </li> </ul> <h2 id="2021-06-30">2021-06-30</h2> <ul> <li>CGSpace is showing a blank white page… <ul> <li>The status is HTTP 200, but it’s blank white… so UptimeRobot didn’t send a notification!</li> </ul> </li> <li>The DSpace log shows:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object </span></span></code></pre></div><ul> <li>The first one of these I see is from last night at 2021-06-29 at 10:47 PM</li> <li>I restarted Tomcat 7 and CGSpace came back up…</li> <li>I didn’t see that Atmire had responded last week (on 2021-06-23) about the issues we had <ul> <li>He said they had to do the same thing that they did last time: switch to the postgres user and kill all activity</li> <li>He said they found tons of connections to the REST API, like 3-4 per second, and asked if that was normal</li> <li>I pointed him to our Tomcat server.xml configuration, saying that we purposefully isolated the Tomcat connection pools between the API and XMLUI for this purpose…</li> </ul> </li> <li>Export a list of all CGSpace’s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND 
metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER; </span></span><span style="display:flex;"><span>COPY 20780 </span></span></code></pre></div><ul> <li>Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER; </span></span><span style="display:flex;"><span>COPY 1710 </span></span></code></pre></div><ul> <li>Fix an issue in the Ansible infrastructure playbooks for the DSpace role <ul> <li>It was causing the template module to fail when setting up the npm environment</li> <li>We needed to install <code>acl</code> so that Ansible can use <code>setfacl</code> on the target file before becoming an unprivileged user</li> </ul> </li> <li>I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds. 
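</span></span></code></pre></div><ul> <li>A minimal sketch of how to check whether slow SecureRandom seeding is the culprit, and what the classic workaround looks like, assuming the stock OpenJDK 8 paths on Ubuntu:</li> </ul> <pre tabindex="0"><code># how much entropy does the kernel's blocking pool have right now?
$ cat /proc/sys/kernel/random/entropy_avail
# which seed source is the JVM configured to use?
$ grep '^securerandom.source' /etc/java-8-openjdk/security/java.security
# the well-known workaround is to force the non-blocking pool via Tomcat's
# JAVA_OPTS, for example: -Djava.security.egd=file:/dev/./urandom
</code></pre>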
<ul> <li>What’s even crazier is that it is twice that on CGSpace (linode18)!</li> <li>Apparently OpenJDK defaults to using <code>/dev/random</code> (see <code>/etc/java-8-openjdk/security/java.security</code>):</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>securerandom.source=file:/dev/urandom </span></span></code></pre></div><ul> <li><code>/dev/random</code> blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator <ul> <li>Now Tomcat starts much faster and no warning is printed so I’m going to add this to our Ansible infrastructure playbooks</li> </ul> </li> <li>Interesting resource about the lore behind the <code>/dev/./urandom</code> workaround that is posted all over the Internet, apparently due to a bug in early JVMs: <a href="https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6202721">https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6202721</a></li> <li>I’m experimenting with using PgBouncer for pooling instead of Tomcat’s JDBC</li> </ul> </article> </body> </html>