Add notes for 2021-11-08

2021-11-09 06:29:52 +02:00
parent b3df4ff58f
commit 9afe5c13f9
110 changed files with 1827 additions and 1737 deletions

@@ -36,7 +36,7 @@ I simply started it and AReS was running again:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@@ -132,8 +132,8 @@ I simply started it and AReS was running again:
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre></div><ul>
<li>Margarita from CCAFS emailed me to say that workflow alerts haven&rsquo;t been working lately
<ul>
<li>I guess this is related to the SMTP issues last week</li>
@@ -162,14 +162,14 @@ I simply started it and AReS was running again:
<ul>
<li>The Elasticsearch indexes are messed up so I dumped and re-created them correctly:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">curl -XDELETE &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XDELETE &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -s -X POST &#39;http://localhost:9200/_aliases&#39; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
</code></pre></div><ul>
<li>Then I started a harvesting on AReS</li>
</ul>
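<ul>
<li>The dump step itself isn&rsquo;t shown above; presumably it was the same <code>elasticdump</code> tool run in the other direction before deleting the indexes, from whichever index or alias held the data, roughly:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre>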
<h2 id="2021-06-07">2021-06-07</h2>
@@ -208,8 +208,8 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre></div><ul>
<li>The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it&rsquo;s much faster
<ul>
<li>I harvested 90,000+ items from DSpace Test in ~3 hours</li>
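<ul>
<li>In practice the smaller pages just mean a smaller <code>limit</code> on the DSpace REST API calls, e.g. (reusing the demo endpoint from further down; the exact parameters OpenRXV sends aren&rsquo;t shown in these notes):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&amp;limit=10" | jq length
</code></pre>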
@@ -231,23 +231,23 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | wc -l
90459
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq | wc -l
90380
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq -c | sort -h
...
2 &quot;10568/99409&quot;
2 &quot;10568/99410&quot;
2 &quot;10568/99411&quot;
2 &quot;10568/99516&quot;
3 &quot;10568/102093&quot;
3 &quot;10568/103524&quot;
3 &quot;10568/106664&quot;
3 &quot;10568/106940&quot;
3 &quot;10568/107195&quot;
3 &quot;10568/96546&quot;
</code></pre><h2 id="2021-06-20">2021-06-20</h2>
2 &#34;10568/99409&#34;
2 &#34;10568/99410&#34;
2 &#34;10568/99411&#34;
2 &#34;10568/99516&#34;
3 &#34;10568/102093&#34;
3 &#34;10568/103524&#34;
3 &#34;10568/106664&#34;
3 &#34;10568/106940&#34;
3 &#34;10568/107195&#34;
3 &#34;10568/96546&#34;
</code></pre></div><h2 id="2021-06-20">2021-06-20</h2>
<ul>
<li>Udana asked me to update their IWMI subjects from <code>farmer managed irrigation systems</code> to <code>farmer-led irrigation</code>
<ul>
@@ -255,12 +255,12 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</code></pre></div><ul>
<li>Then I used <code>csvcut</code> to extract just the columns I needed and do the replacement into a new CSV:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,dcterms.subject[],dcterms.subject[en_US]&#39;</span> /tmp/2021-06-20-IWMI.csv | sed <span style="color:#e6db74">&#39;s/farmer managed irrigation systems/farmer-led irrigation/&#39;</span> &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</code></pre></div><ul>
<li>Then I uploaded the resulting CSV to CGSpace, updating 161 items</li>
<li>Start a harvest on AReS</li>
<li>I found <a href="https://jira.lyrasis.org/browse/DS-1977">a bug</a> and <a href="https://github.com/DSpace/DSpace/pull/2584">a patch</a> for the private items showing up in the DSpace sitemap bug
@@ -278,19 +278,19 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | wc -l
90937
$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort -u | wc -l
$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort -u | wc -l
85709
</code></pre><ul>
</code></pre></div><ul>
<li>So those could be duplicates from the way we harvest pages, but they could also be from mappings&hellip;
<ul>
<li>Manually inspecting the duplicates where handles appear more than once:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort | uniq -c | sort -h
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort | uniq -c | sort -h
</code></pre></div><ul>
<li>Unfortunately I found no pattern:
<ul>
<li>Some appear twice in the Elasticsearch index, but appear in only one collection</li>
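<ul>
<li>A quick way to eyeball one of the duplicated handles (e.g. one that appeared three times above) is to pull its documents out of the data dump; the field paths under elasticdump&rsquo;s <code>_source</code> wrapper are an assumption here, so adjust if the layout differs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep '"handle":"10568/102093"' openrxv-items_data.json | jq -c '._source | {handle, repo}'
</code></pre>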
@@ -312,23 +312,23 @@ $ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
5
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq '.[].handle'
&quot;10673/4&quot;
&quot;10673/3&quot;
&quot;10673/6&quot;
&quot;10673/5&quot;
&quot;10673/7&quot;
# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
&#34;10673/4&#34;
&#34;10673/3&#34;
&#34;10673/6&#34;
&#34;10673/5&#34;
&#34;10673/7&#34;
# log into DSpace Demo XMLUI as admin and make one item private <span style="color:#f92672">(</span><span style="color:#66d9ef">for</span> example 10673/6<span style="color:#f92672">)</span>
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
4
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq '.[].handle'
&quot;10673/4&quot;
&quot;10673/3&quot;
&quot;10673/5&quot;
&quot;10673/7&quot;
</code></pre><ul>
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
&#34;10673/4&#34;
&#34;10673/3&#34;
&#34;10673/5&#34;
&#34;10673/7&#34;
</code></pre></div><ul>
<li>I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira</li>
<li>Last week I noticed that the Gender Platform website is using &ldquo;cgspace.cgiar.org&rdquo; links for CGSpace, instead of handles
<ul>
@@ -355,11 +355,11 @@ $ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | wc -l
90327
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317
</code></pre><h2 id="2021-06-22">2021-06-22</h2>
</code></pre></div><h2 id="2021-06-22">2021-06-22</h2>
<ul>
<li>Make a <a href="https://github.com/atmire/COUNTER-Robots/pull/43">pull request</a> to the COUNTER-Robots project to add two new user agents: crusty and newspaper
<ul>
@@ -368,13 +368,13 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics
Total number of bot hits purged: 5522
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 5522
</code></pre></div><ul>
<li>Surprised to see RI/1.0 in there because it&rsquo;s been in the override file for a while</li>
<li>Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
<ul>
@@ -397,11 +397,11 @@ Total number of bot hits purged: 5522
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># journalctl --since<span style="color:#f92672">=</span>today -u tomcat7 | grep -c <span style="color:#e6db74">&#39;Connection has been abandoned&#39;</span>
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
10100
</code></pre><ul>
</code></pre></div><ul>
<li>I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now</li>
<li>In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)</li>
<li>I also applied the following patches from the 6.4 milestone to our <code>6_x-prod</code> branch:
@@ -412,17 +412,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
63
</code></pre><ul>
</code></pre></div><ul>
<li>Looking in the DSpace log, the first &ldquo;pool empty&rdquo; message I saw this morning was at 4AM:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre></div><ul>
<li>Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</code></pre></div><ul>
<li>We can purge them, as this is not user traffic: <a href="https://about.flipboard.com/browserproxy/">https://about.flipboard.com/browserproxy/</a>
<ul>
<li>I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots</li>
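<ul>
<li>Presumably purging them looks just like the earlier purge: append the pattern to the local override file and re-run the same helper script (a sketch, assuming the file takes one pattern per line):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ echo 'FlipboardProxy' &gt;&gt; dspace/config/spiders/agents/ilri
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
</code></pre>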
@@ -448,17 +448,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
</code></pre><ul>
</code></pre></div><ul>
<li>This number is probably unique for that particular harvest, but I don&rsquo;t think it represents the true number of items&hellip;</li>
<li>The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;DSpace Test&quot;' 2021-06-23-openrxv-items-final-local.json | grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;DSpace Test&#34;&#39;</span> 2021-06-23-openrxv-items-final-local.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> | sort | uniq | wc -l
90990
</code></pre><ul>
</code></pre></div><ul>
<li>So the harvest on the live site is missing items, then why didn&rsquo;t the add missing items plugin find them?!
<ul>
<li>I notice that we are missing the <code>type</code> in the metadata structure config for each repository on the production site, and we are using <code>type</code> for item type in the actual schema&hellip; so maybe there is a conflict there</li>
@@ -469,8 +469,8 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &quot;GET /sitemap HTTP/1.1&quot; 503 190 &quot;-&quot; &quot;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &#34;GET /sitemap HTTP/1.1&#34; 503 190 &#34;-&#34; &#34;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&#34;
</code></pre></div><ul>
<li>I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins&hellip; now it&rsquo;s checking 180,000+ handles to see if they are collections or items&hellip;
<ul>
<li>I see it fetched the sitemap three times, we need to make sure it&rsquo;s only doing it once for each repository</li>
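<ul>
<li>The nginx change itself isn&rsquo;t shown in these notes; conceptually it is just a more specific location for the sitemap that bypasses the bot rate limiting, something like this sketch (the upstream name is hypothetical):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># sketch only: always serve /sitemap, without the limit_req zone applied to bots
location /sitemap {
    proxy_pass http://tomcat_http;
}
</code></pre>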
@@ -478,9 +478,9 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</li>
<li>According to the api logs we will be adding 5,697 items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
</code></pre><ul>
</code></pre></div><ul>
<li>Spent a few hours with Moayad troubleshooting and improving OpenRXV
<ul>
<li>We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs</li>
@@ -496,35 +496,35 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ redis-cli
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ redis-cli
127.0.0.1:6379&gt; SCAN 0 COUNT 5
1) &quot;49152&quot;
2) 1) &quot;bull:plugins:476595&quot;
2) &quot;bull:plugins:367382&quot;
3) &quot;bull:plugins:369228&quot;
4) &quot;bull:plugins:438986&quot;
5) &quot;bull:plugins:366215&quot;
</code></pre><ul>
1) &#34;49152&#34;
2) 1) &#34;bull:plugins:476595&#34;
2) &#34;bull:plugins:367382&#34;
3) &#34;bull:plugins:369228&#34;
4) &#34;bull:plugins:438986&#34;
5) &#34;bull:plugins:366215&#34;
</code></pre></div><ul>
<li>We can apparently get the names of the jobs in each hash using <code>hget</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">127.0.0.1:6379&gt; TYPE bull:plugins:401827
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">127.0.0.1:6379&gt; TYPE bull:plugins:401827
hash
127.0.0.1:6379&gt; HGET bull:plugins:401827 name
&quot;dspace_add_missing_items&quot;
</code></pre><ul>
&#34;dspace_add_missing_items&#34;
</code></pre></div><ul>
<li>I whipped up a one liner to get the keys for all plugin jobs, convert to redis <code>HGET</code> commands to extract the value of the name field, and then sort them by their counts:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ redis-cli KEYS &quot;bull:plugins:*&quot; \
| sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ redis-cli KEYS <span style="color:#e6db74">&#34;bull:plugins:*&#34;</span> <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> | sed -e &#39;s/^bull/HGET bull/&#39; -e &#39;s/\([[:digit:]]\)$/\1 name/&#39; \
| ncat -w 3 localhost 6379 \
| grep -v -E '^\$' | sort | uniq -c | sort -h
| grep -v -E &#39;^\$&#39; | sort | uniq -c | sort -h
3 dspace_health_check
4 -ERR wrong number of arguments for 'hget' command
4 -ERR wrong number of arguments for &#39;hget&#39; command
12 mel_downloads_and_views
129 dspace_altmetrics
932 dspace_downloads_and_views
186428 dspace_add_missing_items
</code></pre><ul>
</code></pre></div><ul>
<li>Note that this uses <code>ncat</code> to send commands directly to redis all at once instead of one at a time (<code>netcat</code> didn&rsquo;t work here, as it doesn&rsquo;t know when our input is finished and never quits)
<ul>
<li>I thought of using <code>redis-cli --pipe</code> but then you have to construct the commands in the redis protocol format with the number of args and length of each command</li>
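<ul>
<li>For reference, a single <code>HGET</code> in the redis (RESP) format that <code>--pipe</code> expects looks like this, with every line terminated by \r\n, which is why constructing ~186,000 of these by hand was not appealing:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># RESP encoding of: HGET bull:plugins:401827 name
*3
$4
HGET
$19
bull:plugins:401827
$4
name
</code></pre>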
@@ -544,7 +544,7 @@ hash
<ul>
<li>Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ for file in dspace.log.2021-06-[12]*; do echo &quot;$file&quot;; grep -oE 'session_id=[A-Z0-9]{32}' &quot;$file&quot; | sort | uniq | wc -l; done
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ <span style="color:#66d9ef">for</span> file in dspace.log.2021-06-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span>; grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}&#39;</span> <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
dspace.log.2021-06-10
19072
dspace.log.2021-06-11
@@ -581,12 +581,12 @@ dspace.log.2021-06-26
16163
dspace.log.2021-06-27
5886
</code></pre><ul>
</code></pre></div><ul>
<li>I see 15,000 unique IPs in the XMLUI logs alone on that day:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep <span style="color:#e6db74">&#39;23/Jun/2021&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l
15835
</code></pre><ul>
</code></pre></div><ul>
<li>Annoyingly I found 37,000 more hits from Bing using <code>dns:*msnbot* AND dns:*.msn.com.</code> as a Solr filter
<ul>
<li>WTF, they are using a normal user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li>
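<ul>
<li>A query like the following against the statistics core should reproduce that count (the Solr port here is an assumption):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:8081/solr/statistics/select' --data-urlencode 'q=dns:*msnbot* AND dns:*.msn.com.' --data-urlencode 'rows=0'
</code></pre>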
@@ -628,8 +628,8 @@ dspace.log.2021-06-27
</li>
<li>The DSpace log shows:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre></div><ul>
<li>The first one of these I see is from last night at 2021-06-29 at 10:47 PM</li>
<li>I restarted Tomcat 7 and CGSpace came back up&hellip;</li>
<li>I didn&rsquo;t see that Atmire had responded last week (on 2021-06-23) about the issues we had
@@ -641,14 +641,14 @@ dspace.log.2021-06-27
</li>
<li>Export a list of all CGSpace&rsquo;s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &quot;dcterms.subject&quot;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &quot;dcterms.subject&quot; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &#34;dcterms.subject&#34;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &#34;dcterms.subject&#34; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
COPY 20780
</code></pre><ul>
</code></pre></div><ul>
<li>Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
COPY 1710
</code></pre><ul>
</code></pre></div><ul>
<li>Fix an issue in the Ansible infrastructure playbooks for the DSpace role
<ul>
<li>It was causing the template module to fail when setting up the npm environment</li>
@@ -657,13 +657,13 @@ COPY 1710
</li>
<li>I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
</code></pre></div><ul>
<li>What&rsquo;s even crazier is that it is twice that on CGSpace (linode18)!</li>
<li>Apparently OpenJDK defaults to using <code>/dev/random</code> (see <code>/etc/java-8-openjdk/security/java.security</code>):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">securerandom.source=file:/dev/urandom
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">securerandom.source=file:/dev/urandom
</code></pre></div><ul>
<li><code>/dev/random</code> blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
<ul>
<li>Now Tomcat starts much faster and no warning is printed so I&rsquo;m going to add this to our Ansible infrastructure playbooks</li>