<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="June, 2021" />
<meta property="og:description" content="2021-06-01
IWMI notified me that AReS was down with an HTTP 502 error
Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
I don&rsquo;t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn&rsquo;t running
I simply started it and AReS was running again:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-06/" />
<meta property="article:published_time" content="2021-06-01T10:51:07+03:00" />
<meta property="article:modified_time" content="2021-06-21T16:24:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2021"/>
<meta name="twitter:description" content="2021-06-01
IWMI notified me that AReS was down with an HTTP 502 error
Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
I don&rsquo;t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn&rsquo;t running
I simply started it and AReS was running again:
"/>
<meta name="generator" content="Hugo 0.84.0" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "June, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-06/",
"wordCount": "1665",
"datePublished": "2021-06-01T10:51:07+03:00",
"dateModified": "2021-06-21T16:24:40+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-06/">
<title>June, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-06/">June, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-06-01T10:51:07+03:00">Tue Jun 01, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-06-01">2021-06-01</h2>
<ul>
<li>IWMI notified me that AReS was down with an HTTP 502 error
<ul>
<li>Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification</li>
<li>I don&rsquo;t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the <code>angular_nginx</code> container isn&rsquo;t running</li>
<li>I simply started it and AReS was running again:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre><ul>
<li>Margarita from CCAFS emailed me to say that workflow alerts haven&rsquo;t been working lately
<ul>
<li>I guess this is related to the SMTP issues last week</li>
<li>I had fixed the config, but didn&rsquo;t restart Tomcat so DSpace didn&rsquo;t load the new variables</li>
<li>I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers</li>
</ul>
</li>
</ul>
<h2 id="2021-06-03">2021-06-03</h2>
<ul>
<li>Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI&rsquo;s content into the new AMCOW Knowledge Hub
<ul>
<li>At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time</li>
<li>That&rsquo;s when I thought they could perhaps use OpenSearch, but I couldn&rsquo;t remember whether it&rsquo;s possible to limit by community, or only by collection&hellip;</li>
<li>Looking now, I see there is a &ldquo;scope&rdquo; parameter that can be used for community or collection, for example:</li>
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&amp;scope=10568/16814&amp;order=DESC&amp;rpp=100&amp;sort_by=2&amp;start=1
</code></pre><ul>
<li>That will sort by date issued (see: <code>webui.itemlist.sort-option.2</code> in dspace.cfg), give 100 results per page, and start on item 1</li>
<li>Another alternative would be to use the IWMI CSV that we are already exporting every week</li>
<li>Fill out the <em>CGIAR-AGROVOC Task Group: Survey on the current CGIAR use of AGROVOC</em> survey on behalf of CGSpace</li>
</ul>
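<ul>
<li>A minimal sketch of how a client could build such a URL, assuming a hypothetical <code>build_opensearch_url</code> helper (not part of DSpace) that only percent-encodes spaces in the query; the parameter names are the ones from the example URL above:</li>
</ul>

```shell
#!/bin/sh
# Build a DSpace OpenSearch URL limited to one community or collection.
# build_opensearch_url is a hypothetical helper for illustration; the
# parameters (query, scope, order, rpp, sort_by, start) are the ones
# used in the example URL in these notes.
build_opensearch_url() (
    base="$1"
    query="$2"
    scope="$3"
    rpp="${4:-100}"
    # Percent-encode spaces only, which is enough for simple subject queries
    encoded=$(printf '%s' "$query" | sed 's/ /%20/g')
    printf '%s/open-search/discover?query=%s&scope=%s&order=DESC&rpp=%s&sort_by=2&start=1\n' \
        "$base" "$encoded" "$scope" "$rpp"
)

build_opensearch_url 'https://cgspace.cgiar.org' 'subject:water scarcity' '10568/16814'
```

<ul>
<li>The scope argument takes a handle, so the same helper should work for a community or a collection; queries with characters other than spaces would need proper URL encoding</li>
</ul>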
<h2 id="2021-06-06">2021-06-06</h2>
<ul>
<li>The Elasticsearch indexes are messed up so I dumped and re-created them correctly:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
<li>Then I started a harvest on AReS</li>
</ul>
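<ul>
<li>Since I keep repeating these steps, here is a rough sketch of the delete, create and alias steps as one script. The <code>ES_URL</code> and <code>DRY_RUN</code> variables and the <code>run</code> and <code>reset_indexes</code> helpers are assumptions for illustration; it defaults to only printing the commands, and the <code>elasticdump</code> restore of the mapping and data would still follow separately:</li>
</ul>

```shell
#!/bin/sh
# Sketch of the index reset as one script. ES_URL, DRY_RUN, run and
# reset_indexes are hypothetical names; the index and alias names are
# the ones used in these notes.
ES_URL="${ES_URL:-http://localhost:9200}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

reset_indexes() {
    # Drop and re-create both indexes
    for index in openrxv-items-final openrxv-items-temp; do
        run curl -s -XDELETE "$ES_URL/$index"
        run curl -s -XPUT "$ES_URL/$index"
    done
    # Point the openrxv-items alias at the final index again
    run curl -s -X POST "$ES_URL/_aliases" -H 'Content-Type: application/json' \
        -d '{"actions": [{"add": {"index": "openrxv-items-final", "alias": "openrxv-items"}}]}'
}

reset_indexes
```

<ul>
<li>Running it with <code>DRY_RUN=0</code> would actually delete the indexes, so the dry-run default is deliberate</li>
</ul>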
<h2 id="2021-06-07">2021-06-07</h2>
<ul>
<li>The harvesting on AReS completed successfully</li>
<li>Provide feedback to FAO on how we use AGROVOC for their &ldquo;AGROVOC call for use cases&rdquo;</li>
</ul>
<h2 id="2021-06-10">2021-06-10</h2>
<ul>
<li>Skype with Moayad to discuss AReS harvesting improvements
<ul>
<li>He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not</li>
</ul>
</li>
</ul>
<h2 id="2021-06-14">2021-06-14</h2>
<ul>
<li>Dump and re-create indexes on AReS (as above) so I can do a harvest</li>
</ul>
<h2 id="2021-06-16">2021-06-16</h2>
<ul>
<li>Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but with reverse DNS names belonging to the MSN bot
<ul>
<li>For example, user agent <code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)</code> with DNS <code>msnbot-131-253-25-91.search.msn.com.</code></li>
<li>I queried Solr for all hits using the MSN bot DNS (<code>dns:*msnbot* AND dns:*.msn.com.</code>) and found 457,706</li>
<li>I extracted their IPs using Solr&rsquo;s CSV format and ran them through my <code>resolve-addresses.py</code> script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)</li>
<li>Note that <a href="https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26">Microsoft&rsquo;s docs say that reverse lookups on Bingbot IPs will always have &ldquo;search.msn.com&rdquo;</a> so it is safe to purge these as non-human traffic</li>
<li>I purged the hits with <code>ilri/check-spider-ip-hits.sh</code> (though I had to do it in 3 batches because I forgot to increase the <code>facet.limit</code> so I was only getting them 100 at a time)</li>
</ul>
</li>
<li>Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
<ul>
<li>It will hopefully also fix the duplicate and missing items issues</li>
<li>I had a Skype with him to discuss</li>
<li>I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre><ul>
<li>The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it&rsquo;s much faster
<ul>
<li>I harvested 90,000+ items from DSpace Test in ~3 hours</li>
<li>There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times&hellip;</li>
</ul>
</li>
</ul>
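<ul>
<li>For the MSN bot hits above, a small sketch of the hostname test that the Solr query (<code>dns:*msnbot* AND dns:*.msn.com.</code>) approximates. The <code>is_msnbot_dns</code> helper is hypothetical, and it only checks the name; it does not do the forward lookup that a full verification would need:</li>
</ul>

```shell
#!/bin/sh
# Return success if a reverse DNS name looks like the MSN bot, i.e. it
# contains "msnbot" and ends in "search.msn.com", with or without the
# trailing dot that Solr stores. is_msnbot_dns is a hypothetical helper.
is_msnbot_dns() {
    case "$1" in
        *msnbot*.search.msn.com|*msnbot*.search.msn.com.) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample hostnames: the first is from these notes, the second is made up
for name in 'msnbot-131-253-25-91.search.msn.com.' 'dsl-10-0-0-1.example.net.'; do
    if is_msnbot_dns "$name"; then
        echo "$name bot"
    else
        echo "$name other"
    fi
done
```

<ul>
<li>Anchoring the match to the end of the name avoids treating something like <code>msnbot.search.msn.com.example.net.</code> as a bot</li>
</ul>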
<h2 id="2021-06-17">2021-06-17</h2>
<ul>
<li>I ported my <code>ilri/resolve-addresses.py</code> script that uses IPAPI.co to use the local GeoIP2 databases
<ul>
<li>The new script is <code>ilri/resolve-addresses-geoip2.py</code> and it is much faster and works offline with no API rate limits</li>
</ul>
</li>
<li>Teams meeting with the CGIAR Metadata Working Group to discuss CGSpace and open repositories and the way forward</li>
<li>More work with Moayad on OpenRXV harvesting issues
<ul>
<li>Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
...
2 &quot;10568/99409&quot;
2 &quot;10568/99410&quot;
2 &quot;10568/99411&quot;
2 &quot;10568/99516&quot;
3 &quot;10568/102093&quot;
3 &quot;10568/103524&quot;
3 &quot;10568/106664&quot;
3 &quot;10568/106940&quot;
3 &quot;10568/107195&quot;
3 &quot;10568/96546&quot;
</code></pre><h2 id="2021-06-20">2021-06-20</h2>
<ul>
<li>Udana asked me to update their IWMI subjects from <code>farmer managed irrigation systems</code> to <code>farmer-led irrigation</code>
<ul>
<li>First I extracted the IWMI community from CGSpace:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</code></pre><ul>
<li>Then I used <code>csvcut</code> to extract just the columns I needed and do the replacement into a new CSV:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</code></pre><ul>
<li>Then I uploaded the resulting CSV to CGSpace, updating 161 items</li>
<li>Start a harvest on AReS</li>
<li>I found <a href="https://jira.lyrasis.org/browse/DS-1977">a bug report</a> and <a href="https://github.com/DSpace/DSpace/pull/2584">a patch</a> for the issue of private items showing up in the DSpace sitemap
<ul>
<li>The fix is super simple, I should try to apply it</li>
</ul>
</li>
</ul>
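<ul>
<li>For the subject replacement above, a small sketch for checking the <code>sed</code> step on made-up sample rows before uploading. The <code>replace_subject</code> helper and the sample data are assumptions; unlike the command I ran, it uses the <code>g</code> flag so that a row with the old term in both subject columns is fully updated:</li>
</ul>

```shell
#!/bin/sh
# Replace one subject term with another in CSV text read from stdin.
# replace_subject is a hypothetical helper; the "g" flag replaces every
# occurrence on a line, not only the first.
replace_subject() {
    sed 's/farmer managed irrigation systems/farmer-led irrigation/g'
}

# Made-up sample rows in the same column layout as the csvcut output
sample='id,dcterms.subject[],dcterms.subject[en_US]
1,,farmer managed irrigation systems||water scarcity
2,farmer managed irrigation systems,farmer managed irrigation systems
3,,groundwater'

printf '%s\n' "$sample" | replace_subject
# Count the rows that still contain the old term after the replacement
printf '%s\n' "$sample" | replace_subject | grep -c 'farmer managed irrigation systems' || true
```

<ul>
<li>A count of zero on the last line means no row still has the old term</li>
</ul>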
<h2 id="2021-06-21">2021-06-21</h2>
<ul>
<li>The AReS harvesting finished, but the indexes got messed up again</li>
<li>I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
<ul>
<li>We have 90,000+ items, but only 85,000 unique:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | wc -l
90937
$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort -u | wc -l
85709
</code></pre><ul>
<li>So those could be duplicates from the way we harvest pages, but they could also be from mappings&hellip;
<ul>
<li>Manually inspecting the duplicates where handles appear more than once:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort | uniq -c | sort -h
</code></pre><ul>
<li>Unfortunately I found no pattern:
<ul>
<li>Some appear twice in the Elasticsearch index, but appear in only one collection</li>
<li>Some appear twice in the Elasticsearch index, and appear in <em>two</em> collections</li>
<li>Some appear twice in the Elasticsearch index, but appear in three collections (!)</li>
</ul>
</li>
<li>So really we need to just check whether a handle exists before we insert it</li>
<li>I tested the <a href="https://github.com/DSpace/DSpace/pull/2584">pull request for DS-1977</a> that adjusts the sitemap generation code to exclude private items
<ul>
<li>It applies cleanly and seems to work, but we don&rsquo;t actually have any private items</li>
<li>The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private</li>
</ul>
</li>
<li>Testing the <a href="https://github.com/DSpace/DSpace/pull/2275">pull request for DS-4065</a> where the REST API&rsquo;s <code>/rest/items</code> endpoint is not aware of private items and returns an incorrect number of items
<ul>
<li>This is most easily seen by setting a low limit in <code>/rest/items</code>, making one of the items private, and requesting items again with the same limit</li>
<li>I confirmed the issue on the current DSpace 6 Demo:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
5
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq '.[].handle'
&quot;10673/4&quot;
&quot;10673/3&quot;
&quot;10673/6&quot;
&quot;10673/5&quot;
&quot;10673/7&quot;
# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
4
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq '.[].handle'
&quot;10673/4&quot;
&quot;10673/3&quot;
&quot;10673/5&quot;
&quot;10673/7&quot;
</code></pre><ul>
<li>I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira</li>
<li>Last week I noticed that the Gender Platform website is using &ldquo;cgspace.cgiar.org&rdquo; links for CGSpace, instead of handles
<ul>
<li>I emailed Fabio and Marianne to ask them to please use the Handle links</li>
</ul>
</li>
<li>I tested the <a href="https://github.com/DSpace/DSpace/pull/2543">pull request for DS-4271</a> where Discovery filters of type &ldquo;contains&rdquo; don&rsquo;t work as expected when the user&rsquo;s search term has spaces
<ul>
<li>I tested with filter &ldquo;farmer managed irrigation systems&rdquo; on DSpace Test</li>
<li>Before the patch I got 293 results, and the few I checked didn&rsquo;t have the expected metadata value</li>
<li>After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting</li>
</ul>
</li>
<li>I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results:
<ul>
<li>90459 in final from last harvesting</li>
<li>90307 in temp after new harvest</li>
<li>90327 in temp after start plugins</li>
</ul>
</li>
<li>The 90327 number seems closer to the &ldquo;real&rdquo; number of items on CGSpace&hellip;
<ul>
<li>Seems close, but not entirely correct yet:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | wc -l
90327
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317
</code></pre><h2 id="2021-06-22">2021-06-22</h2>
<ul>
<li>Make a <a href="https://github.com/atmire/COUNTER-Robots/pull/43">pull request</a> to the COUNTER-Robots project to add two new user agents: crusty and newspaper
<ul>
<li>These two bots have made ~3,000 requests on CGSpace</li>
<li>Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics
Total number of bot hits purged: 5522
</code></pre><ul>
<li>Surprised to see RI/1.0 in there because it&rsquo;s been in the override file for a while</li>
<li>Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
<ul>
<li><code>PostmanRuntime/7.26.8</code></li>
<li><code>node-fetch/1.0 (+https://github.com/bitinn/node-fetch)</code></li>
<li><code>Photon/1.0</code></li>
<li><code>StatusCake_Pagespeed_indev</code></li>
<li><code>node-superagent/3.8.3</code></li>
<li><code>cortex/1.0</code></li>
</ul>
</li>
<li>These bots account for ~42,000 hits in our statistics&hellip; I will just purge them and add them to our local override, but I can&rsquo;t be bothered to submit them to COUNTER-Robots since I&rsquo;d have to look up the information for each one</li>
</ul>
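<ul>
<li>A rough sketch of testing new patterns against a list of user agents before adding them to the override file. This is not what <code>check-spider-hits.sh</code> does internally; the <code>count_bot_agents</code> helper, the pattern list and the sample agents are all illustrative:</li>
</ul>

```shell
#!/bin/sh
# Count sample user agents that match a list of bot patterns, one regular
# expression per line, in the style of the COUNTER-Robots agent files.
# The pattern list and sample agents below are illustrative only.
patterns='crusty
newspaper
PostmanRuntime
node-fetch'

agents='newspaper/0.2.8
Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/89.0
PostmanRuntime/7.26.8
crusty/0.12.0'

count_bot_agents() {
    # $1: newline-separated patterns, stdin: one user agent per line
    pattern_file=$(mktemp)
    printf '%s\n' "$1" > "$pattern_file"
    # grep exits non-zero when nothing matches, which is not an error here
    grep -c -i -E -f "$pattern_file" || true
    rm -f "$pattern_file"
}

printf '%s\n' "$agents" | count_bot_agents "$patterns"
```

<ul>
<li>Matching is case insensitive here, which I assume is close enough for a quick check of whether a pattern is too broad</li>
</ul>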
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2021-03/">March, 2021</a></li>
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>