<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="November, 2021" />
<meta property="og:description" content="2021-11-02
I experimented with manually sharding the Solr statistics on DSpace Test
First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-11/" />
<meta property="article:published_time" content="2021-11-02T22:27:07+02:00" />
<meta property="article:modified_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2021"/>
<meta name="twitter:description" content="2021-11-02
I experimented with manually sharding the Solr statistics on DSpace Test
First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
"/>
<meta name="generator" content="Hugo 0.89.2" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-11/",
"wordCount": "722",
"datePublished": "2021-11-02T22:27:07+02:00",
"dateModified": "2021-11-07T11:26:32+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-11/">
<title>November, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-11/">November, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-11-02T22:27:07+02:00">Tue Nov 02, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-11-02">2021-11-02</h2>
<ul>
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre></div><ul>
<li>Then on DSpace Test I created a <code>statistics-2019</code> core with the same instance dir as the main <code>statistics</code> core (as <a href="https://wiki.lyrasis.org/display/DSDOC6x/Testing+Solr+Shards">illustrated in the DSpace docs</a>)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
# create core in Solr admin
$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;time:2019-*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
</code></pre></div><ul>
<li>The key thing above is that you create the core in the Solr admin UI, but its data directory must already exist, so you have to create that on the file system first</li>
<li>I restarted the server after the import was done to see if the cores would come back up OK
<ul>
<li>I remember last time I tried this the manually created statistics cores didn&rsquo;t come back up after I rebooted, but this time they did</li>
</ul>
</li>
</ul>
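<ul>
<li>The per-year steps above can be sketched as a dry run that only prints the commands. This is a minimal sketch, not a tested migration script: <code>print_shard_steps</code> is a hypothetical helper, the URL and directory are copied from the commands above, and it ignores that the export ran on CGSpace while the import ran on DSpace Test. Creating the core in the Solr admin UI and deleting the year from the main core are left as manual steps.</li>
</ul>

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the sharding steps for one year.
# SOLR_URL and SOLR_DIR are copied from the commands above and are assumptions
# for any other host; print_shard_steps is a hypothetical helper name.
SOLR_URL="http://localhost:8081/solr"
SOLR_DIR="/home/dspacetest.cgiar.org/solr"

print_shard_steps() {
    year="$1"
    core="statistics-${year}"
    # 1. export the records for this year from the main statistics core
    printf '%s\n' "./run.sh -s ${SOLR_URL}/statistics -f 'time:${year}-*' -a export -o ${core}.json -k uid"
    # 2. the data directory must exist before the core is created in the admin UI
    printf '%s\n' "mkdir -p ${SOLR_DIR}/${core}/data"
    # 3. import the exported records into the new yearly core
    #    (core creation and the delete from the main core are manual steps)
    printf '%s\n' "./run.sh -s ${SOLR_URL}/${core} -a import -o ${core}.json -k uid"
}

# Print the steps for two example years
for y in 2018 2017; do
    print_shard_steps "$y"
done
```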
<h2 id="2021-11-03">2021-11-03</h2>
<ul>
<li>While inspecting the stats for the new statistics-2019 shard on DSpace Test I noticed that I can&rsquo;t find any stats via the DSpace Statistics API for an item that <em>should</em> have some
<ul>
<li>I checked on CGSpace and I can&rsquo;t find them there either, but I see them in Solr when I query in the admin UI</li>
<li>I need to debug that, but it doesn&rsquo;t seem to be related to the sharding&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2021-11-04">2021-11-04</h2>
<ul>
<li>I spent a little bit of time debugging the Solr bug with the statistics-2019 shard but couldn&rsquo;t reproduce it for the few items I tested
<ul>
<li>So that&rsquo;s good, it seems the sharding worked</li>
</ul>
</li>
<li>Linode alerted me to high CPU usage on CGSpace (linode18) yesterday
<ul>
<li>Looking at the Solr hits from yesterday I see 91.213.50.11 making 2,300 requests</li>
<li>According to AbuseIPDB.com this is owned by Registrarus LLC (registrarus.ru) and it has been reported for malicious activity by several users</li>
<li>The ASN is 50340 (SELECTEL-MSK, RU)</li>
<li>They are attempting SQL injection:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] &#34;HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&amp;isAllowed=y HTTP/1.1&#34; 200 0 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf&#34; &#34;Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10&#34;
</code></pre></div><ul>
<li>Another IP, 222.129.53.160, is in China and grabbed nearly 1,200 PDFs from the REST API in under an hour:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
1178
</code></pre></div><ul>
<li>I will continue to split the Solr statistics back into year-shards on DSpace Test (linode26)
<ul>
<li>Today I did all 2018 stats&hellip;</li>
<li>I want to see if there is a noticeable change in JVM memory, Solr response time, etc</li>
</ul>
</li>
</ul>
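<ul>
<li>The kind of log triage described in this entry (finding the busiest client and spotting encoded SQL keywords) can be sketched with standard tools. This is a rough sketch under assumptions: <code>top_ips</code>, <code>count_sqli</code>, and <code>sample_log</code> are hypothetical helpers, the sample lines are made up and use documentation-range addresses, and the keyword pattern is a crude heuristic rather than a real detector.</li>
</ul>

```shell
#!/bin/sh
# Read access-log lines on stdin and print "count ip", busiest client first.
# Field 1 of the combined log format is the client address.
top_ips() {
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Count lines whose request contains percent-encoded SQL keywords such as
# "%20WHERE%20" or "%20AND%20". A crude heuristic, not a real detector.
count_sqli() {
    grep -Eic '%20(where|and|union|select)%20'
}

# Made-up sample lines using documentation-range addresses (192.0.2.0/24).
sample_log() {
    printf '%s\n' \
        '192.0.2.10 - - [03/Nov/2021:06:47:20 +0100] "HEAD /bitstream/handle/1/2/a.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741 HTTP/1.1" 200 0' \
        '192.0.2.10 - - [03/Nov/2021:06:47:21 +0100] "GET /rest/items HTTP/1.1" 200 512' \
        '192.0.2.20 - - [03/Nov/2021:06:47:22 +0100] "GET /handle/1/2 HTTP/1.1" 200 1024'
}

sample_log | top_ips
sample_log | count_sqli
```

<ul>
<li>On a real server the same functions could read from <code>zcat</code> or <code>cat</code> of the nginx logs instead of <code>sample_log</code>.</li>
</ul>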
<h2 id="2021-11-07">2021-11-07</h2>
<ul>
<li>Update all Docker images on AReS and rebuild OpenRXV:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre></div><ul>
<li>Then restart the server and start a fresh harvest</li>
<li>Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)</li>
<li>Several users wrote to me last week to say that workflow emails haven&rsquo;t been working since 2021-10-21 or so
<ul>
<li>I did a test on CGSpace and it&rsquo;s indeed broken:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace test-email
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>About to send test email:
- To: fuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Please see the DSpace documentation for assistance.
</code></pre></div><ul>
<li>I sent a message to ILRI ICT to ask them to check the account/password</li>
<li>I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># tar cJf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
</code></pre></div><ul>
<li>Then on my local server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components<span style="color:#f92672">=</span><span style="color:#ae81ff">4</span>
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod <span style="color:#ae81ff">660</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod <span style="color:#ae81ff">770</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
# on the AReS server, copy backend/data to /tmp first, then sync it here for the repository setup/layout
$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
</code></pre></div><ul>
<li>This seems to work: all items, stats, and repository setup/layout are OK</li>
<li>I merged my <a href="https://github.com/ilri/OpenRXV/pull/126">Elasticsearch pull request</a> from last month into OpenRXV</li>
</ul>
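<ul>
<li>The <code>--strip-components=4</code> in the restore above drops the four leading path components (<code>var/lib/docker/volumes</code>) so that the volume directory lands directly in the target. A minimal sketch of that behavior in a throwaway directory, assuming a tar that supports <code>--strip-components</code> (GNU tar and bsdtar do); <code>demo_strip</code> and the placeholder file are hypothetical:</li>
</ul>

```shell
#!/bin/sh
# Sketch: show what --strip-components=4 does to var/lib/docker/volumes/NAME.
# Everything happens in a temporary directory; no real volume is touched.
demo_strip() {
    work="$(mktemp -d)"
    # Build a fake volume tree with one placeholder file
    mkdir -p "$work/src/var/lib/docker/volumes/openrxv_esData_7/_data" "$work/dest"
    touch "$work/src/var/lib/docker/volumes/openrxv_esData_7/_data/sample"
    # Archive it with the same relative path layout tar stores for /var/lib/...
    tar czf "$work/vol.tar.gz" -C "$work/src" var/lib/docker/volumes/openrxv_esData_7
    # Extract, dropping the four leading components: var, lib, docker, volumes
    tar xf "$work/vol.tar.gz" -C "$work/dest" --strip-components=4
    # The volume directory is now at the top level of the destination
    ls "$work/dest"
    rm -rf "$work"
}

demo_strip
```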
<h2 id="2021-11-08">2021-11-08</h2>
<ul>
<li>File <a href="https://github.com/DSpace/dspace-angular/issues/1391">an issue for the Angular flash of unstyled content</a> on DSpace 7</li>
<li>Help Udana from IWMI with a question about CGSpace statistics
<ul>
<li>He found conflicting numbers when using the community and collection modes in Content and Usage Analysis</li>
<li>I sent him more numbers directly from the DSpace Statistics API</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-11/">November, 2021</a></li>
<li><a href="/cgspace-notes/2021-10/">October, 2021</a></li>
<li><a href="/cgspace-notes/2021-09/">September, 2021</a></li>
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>