
<!DOCTYPE html>
<html lang="en" >

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">


<meta property="og:title" content="June, 2021" />
<meta property="og:description" content="2021-06-01

IWMI notified me that AReS was down with an HTTP 502 error

Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
I don&rsquo;t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn&rsquo;t running
I simply started it and AReS was running again:


" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-06/" />
<meta property="article:published_time" content="2021-06-01T10:51:07+03:00" />
<meta property="article:modified_time" content="2021-06-16T18:31:15+03:00" />


<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2021"/>
<meta name="twitter:description" content="2021-06-01

IWMI notified me that AReS was down with an HTTP 502 error

Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
I don&rsquo;t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn&rsquo;t running
I simply started it and AReS was running again:


"/>
<meta name="generator" content="Hugo 0.83.1" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "June, 2021",
  "url": "https://alanorth.github.io/cgspace-notes/2021-06/",
  "wordCount": "817",
  "datePublished": "2021-06-01T10:51:07+03:00",
  "dateModified": "2021-06-16T18:31:15+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-06/">

    <title>June, 2021 | CGSpace Notes</title>

    
    <!-- combined, minified CSS -->
    
    <link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
    

    <!-- minified Font Awesome for SVG icons -->
    
    <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>

    <!-- RSS 2.0 feed -->
    

  </head>

  <body>

    
    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>
    

    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>
    
    
    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">

          
<article class="blog-post">
  <header>
    <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-06/">June, 2021</a></h2>
    <p class="blog-post-meta">
<time datetime="2021-06-01T10:51:07+03:00">Tue Jun 01, 2021</time>
 in 
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>


</p>
  </header>
  <h2 id="2021-06-01">2021-06-01</h2>
<ul>
<li>IWMI notified me that AReS was down with an HTTP 502 error
<ul>
<li>Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification</li>
<li>I don&rsquo;t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the <code>angular_nginx</code> container isn&rsquo;t running</li>
<li>I simply started it and AReS was running again:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre><ul>
<li>Margarita from CCAFS emailed me to say that workflow alerts haven&rsquo;t been working lately
<ul>
<li>I guess this is related to the SMTP issues last week</li>
<li>I had fixed the config, but didn&rsquo;t restart Tomcat so DSpace didn&rsquo;t load the new variables</li>
<li>I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers</li>
</ul>
</li>
</ul>
<h2 id="2021-06-03">2021-06-03</h2>
<ul>
<li>Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI&rsquo;s content into the new AMCOW Knowledge Hub
<ul>
<li>At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time</li>
<li>That&rsquo;s when I thought they could perhaps use OpenSearch, but I couldn&rsquo;t remember whether it&rsquo;s possible to limit by community, or only by collection&hellip;</li>
<li>Looking now, I see there is a &ldquo;scope&rdquo; parameter that can be used for community or collection, for example:</li>
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&amp;scope=10568/16814&amp;order=DESC&amp;rpp=100&amp;sort_by=2&amp;start=1
</code></pre><ul>
<li>That will sort by date issued (see: <code>webui.itemlist.sort-option.2</code> in dspace.cfg), give 100 results per page, and start on item 1</li>
<li>Another alternative would be to use the IWMI CSV that we are already exporting every week</li>
<li>Fill out the <em>CGIAR-AGROVOC Task Group: Survey on the current CGIAR use of AGROVOC</em> survey on behalf of CGSpace</li>
</ul>
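<p>A minimal sketch of the same OpenSearch request from the shell, letting <code>curl</code> do the URL encoding. The <code>opensearch_query</code> function name is hypothetical; the endpoint and parameters are the ones from the URL above:</p>
<pre><code class="language-console" data-lang="console"># Hypothetical helper: query CGSpace OpenSearch, limited to one community or collection
#   $1 = query string, $2 = handle used as the scope
opensearch_query() {
    # -G sends the --data-urlencode pairs as a URL-encoded query string on a GET request
    curl -s -G 'https://cgspace.cgiar.org/open-search/discover' \
        --data-urlencode "query=$1" \
        --data-urlencode "scope=$2" \
        --data-urlencode 'order=DESC' \
        --data-urlencode 'rpp=100' \
        --data-urlencode 'sort_by=2' \
        --data-urlencode 'start=1'
}
</code></pre>
<p>For example, <code>opensearch_query 'subject:water scarcity' '10568/16814'</code> should request the same results as the URL above, whether the handle is a community or a collection.</p>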
<h2 id="2021-06-06">2021-06-06</h2>
<ul>
<li>The Elasticsearch indexes are messed up so I dumped and re-created them correctly:</li>
</ul>
<pre><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
<li>Then I started a harvest on AReS</li>
</ul>
<h2 id="2021-06-07">2021-06-07</h2>
<ul>
<li>The harvesting on AReS completed successfully</li>
<li>Provide feedback to FAO on how we use AGROVOC for their &ldquo;AGROVOC call for use cases&rdquo;</li>
</ul>
<h2 id="2021-06-10">2021-06-10</h2>
<ul>
<li>Skype with Moayad to discuss AReS harvesting improvements
<ul>
<li>He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not</li>
</ul>
</li>
</ul>
<h2 id="2021-06-14">2021-06-14</h2>
<ul>
<li>Dump and re-create indexes on AReS (as above) so I can do a harvest</li>
</ul>
<h2 id="2021-06-16">2021-06-16</h2>
<ul>
<li>Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but with reverse DNS names belonging to the MSN bot
<ul>
<li>For example, user agent <code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0;  Trident/5.0)</code> with DNS <code>msnbot-131-253-25-91.search.msn.com.</code></li>
<li>I queried Solr for all hits using the MSN bot DNS (<code>dns:*msnbot* AND dns:*.msn.com.</code>) and found 457,706</li>
<li>I extracted their IPs using Solr&rsquo;s CSV format and ran them through my <code>resolve-addresses.py</code> script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)</li>
<li>Note that <a href="https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26">Microsoft&rsquo;s docs say that reverse lookups on Bingbot IPs will always have &ldquo;search.msn.com&rdquo;</a> so it is safe to purge these as non-human traffic</li>
<li>I purged the hits with <code>ilri/check-spider-ip-hits.sh</code> (though I had to do it in 3 batches because I forgot to increase the <code>facet.limit</code> so I was only getting them 100 at a time)</li>
</ul>
</li>
<li>Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
<ul>
<li>It will hopefully also fix the duplicate and missing items issues</li>
<li>I had a Skype with him to discuss</li>
<li>I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre><ul>
<li>The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it&rsquo;s much faster
<ul>
<li>I harvested 90,000+ items from DSpace Test in ~3 hours</li>
<li>There seem to be some issues with the health check step though</li>
</ul>
</li>
</ul>
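<p>The hostname test behind the MSN bot purge above can be sketched as a small shell function. This is only an illustration: <code>is_msn_bot_dns</code> is a hypothetical name, and it checks nothing more than the <code>search.msn.com</code> suffix, with or without the trailing dot seen in the example DNS value:</p>
<pre><code class="language-console" data-lang="console"># Hypothetical helper: succeed if a reverse DNS name ends in search.msn.com
is_msn_bot_dns() {
    case "$1" in
        *.search.msn.com|*.search.msn.com.) return 0 ;;
        *) return 1 ;;
    esac
}
</code></pre>
<p>For example, <code>is_msn_bot_dns 'msnbot-131-253-25-91.search.msn.com.'</code> succeeds, while a name that merely contains <code>search.msn.com</code> somewhere in the middle does not.</p>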
<h2 id="2021-06-17">2021-06-17</h2>
<ul>
<li>I ported my <code>ilri/resolve-addresses.py</code> script, which uses IPAPI.co, to use the local GeoIP2 databases instead
<ul>
<li>The new script is <code>ilri/resolve-addresses-geoip2.py</code>; it is much faster and works offline with no API rate limits</li>
</ul>
</li>
<li>Teams meeting with the CGIAR Metadata Working Group to discuss CGSpace, open repositories, and the way forward</li>
<li>More work with Moayad on OpenRXV harvesting issues
<ul>
<li>Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
...
      2 &quot;10568/99409&quot;
      2 &quot;10568/99410&quot;
      2 &quot;10568/99411&quot;
      2 &quot;10568/99516&quot;
      3 &quot;10568/102093&quot;
      3 &quot;10568/103524&quot;
      3 &quot;10568/106664&quot;
      3 &quot;10568/106940&quot;
      3 &quot;10568/107195&quot;
      3 &quot;10568/96546&quot;
</code></pre><!-- raw HTML omitted -->
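<p>The two counts above differ by 79 (90,459 minus 90,380), so 79 records are extra copies of a handle that is already in the index. A small sketch for listing just the duplicated handles, assuming the same elasticdump export format; the <code>duplicate_handles</code> function name is hypothetical:</p>
<pre><code class="language-console" data-lang="console"># Hypothetical helper: print each handle that occurs more than once in an elasticdump export
#   $1 = path to the JSON export
duplicate_handles() {
    grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' "$1" \
        | awk -F: '{print $2}' \
        | sort \
        | uniq -d    # -d prints one line per repeated value
}
</code></pre>
<p>Running <code>duplicate_handles openrxv-items_data.json | wc -l</code> would then count how many distinct handles are affected, which is smaller than 79 whenever a handle appears three or more times.</p>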

  
</article> 


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">
  

        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>

<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>

<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>

<li><a href="/cgspace-notes/2021-03/">March, 2021</a></li>

<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>

    </ol>
  </section>

  
  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">
      
      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
      
      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
      
      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
      
    </ol>
  </section>
  
</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->
    

    <footer class="blog-footer">
      <p dir="auto">
      
      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
      
      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>
    

  </body>

</html>