<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2021" />
<meta property="og:description" content="2021-05-01
I looked at the top user agents and IPs in the Solr statistics for last month and I see these user agents:
&ldquo;RI/1.0&rdquo;, 1337
&ldquo;Microsoft Office Word 2014&rdquo;, 941
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-05/" />
<meta property="article:published_time" content="2021-05-02T09:50:54+03:00" />
<meta property="article:modified_time" content="2021-05-05T21:03:27+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2021"/>
<meta name="twitter:description" content="2021-05-01
I looked at the top user agents and IPs in the Solr statistics for last month and I see these user agents:
&ldquo;RI/1.0&rdquo;, 1337
&ldquo;Microsoft Office Word 2014&rdquo;, 941
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
"/>
<meta name="generator" content="Hugo 0.83.1" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-05/",
"wordCount": "964",
"datePublished": "2021-05-02T09:50:54+03:00",
"dateModified": "2021-05-05T21:03:27+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-05/">
<title>May, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-05/">May, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-05-02T09:50:54+03:00">Sun May 02, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-05-01">2021-05-01</h2>
<ul>
<li>I looked at the top user agents and IPs in the Solr statistics for last month and I see these user agents:
<ul>
<li>&ldquo;RI/1.0&rdquo;, 1337</li>
<li>&ldquo;Microsoft Office Word 2014&rdquo;, 941</li>
</ul>
</li>
<li>I will add the RI/1.0 pattern to our DSpace agents overload and purge its hits from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;</li>
</ul>
<ul>
<li>I should probably add the <code>RI/1.0</code> pattern to the COUNTER-Robots project</li>
<li>As well as these IPs:
<ul>
<li>193.169.254.178, 21648</li>
<li>181.62.166.177, 20323</li>
<li>45.146.166.180, 19376</li>
</ul>
</li>
<li>The first IP seems to be in Estonia and their requests to the REST API change user agents from curl to Mac OS X to Windows and more
<ul>
<li>Also, they seem to be probing for SQL injection in the <code>expand</code> parameter:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1&quot; 400 5 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata-21%2B21*01 HTTP/1.1&quot; 200 458201 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata'||lower('')||' HTTP/1.1&quot; 400 5 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata'%2Brtrim('')%2B' HTTP/1.1&quot; 200 458209 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
</code></pre><ul>
<li>I will report the IP on abuseipdb.com and purge their hits from Solr</li>
<li>The second IP is in Colombia and is making thousands of requests with a referrer that looks like some test site:</li>
</ul>
<pre><code class="language-console" data-lang="console">181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] &quot;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&quot; 200 123613 &quot;http://cassavalighthousetest.org/&quot; &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&quot;
181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] &quot;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&quot; 200 123613 &quot;http://cassavalighthousetest.org/&quot; &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&quot;
</code></pre><ul>
<li>But this site does not exist (yet?)
<ul>
<li>I will purge them from Solr</li>
</ul>
</li>
<li>The third IP is apparently in Russia, and its user agent has the <code>pl-PL</code> locale; it made thousands of requests like this:</li>
</ul>
<pre><code class="language-console" data-lang="console">45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] &quot;GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&amp;isAllowed=y HTTP/1.1&quot; 200 918998 &quot;http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf&quot; &quot;Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15&quot;
</code></pre><ul>
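<li>Requests like the ones above could be flagged with a rough heuristic before deciding what to purge. This is only a sketch and not part of the scripts mentioned here: the function name <code>looks_like_sqli</code> and the pattern list are hypothetical, and it assumes nginx writes a literal double quote as <code>\x22</code>, as in the log lines above:

```python
import re
from urllib.parse import unquote

# Hypothetical heuristic patterns, chosen only to match the probes seen in the logs
SUSPICIOUS = [
    re.compile(r"['\"]\s*(and|or)\s", re.IGNORECASE),  # quote followed by AND/OR
    re.compile(r"\|\|"),                               # SQL string concatenation
    re.compile(r"\b(select|utl_inaddr|rtrim|lower)\s*[.(]", re.IGNORECASE),
]


def looks_like_sqli(request_path: str) -> bool:
    """Return True if the request path matches any of the heuristic patterns."""
    # Undo the \x22 escaping from the access log, then percent-decode the path
    decoded = unquote(request_path.replace("\\x22", '"'))
    return any(pattern.search(decoded) for pattern in SUSPICIOUS)


if __name__ == "__main__":
    probe = "/rest/collections/1179/items?limit=812&expand=metadata'||lower('')||'"
    print(looks_like_sqli(probe))
```

This catches the quoted <code>and</code>, <code>||</code>, and function-call probes, but not the purely arithmetic one (<code>metadata-21%2B21*01</code>), so it would complement an IP-based purge rather than replace it.
</li>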
<li>I will purge these all with my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 21648 hits from 193.169.254.178 in statistics
Purging 20323 hits from 181.62.166.177 in statistics
Purging 19376 hits from 45.146.166.180 in statistics
Total number of bot hits purged: 61347
</code></pre><h2 id="2021-05-02">2021-05-02</h2>
<ul>
<li>Check the AReS Harvester indexes:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {}
},
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
}
},
</code></pre><ul>
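<li>The alias check above can also be scripted instead of read by eye. A minimal sketch, assuming the JSON returned by <code>_alias</code> has already been parsed into a dict; the helper name <code>indices_for_alias</code> is hypothetical and the sample data just mirrors the output shown above:

```python
from typing import Dict, List


def indices_for_alias(alias_response: Dict[str, dict], alias: str) -> List[str]:
    """Return the sorted names of all indices that carry the given alias."""
    return sorted(
        index
        for index, info in alias_response.items()
        if alias in info.get("aliases", {})
    )


# Sample shaped like the parsed `_alias` response shown above
sample = {
    "openrxv-items-temp": {"aliases": {}},
    "openrxv-items-final": {"aliases": {"openrxv-items": {}}},
}

if __name__ == "__main__":
    targets = indices_for_alias(sample, "openrxv-items")
    # The healthy state is a single target: openrxv-items-final
    print(targets == ["openrxv-items-final"])
```

</li>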
<li>I think they look OK (<code>openrxv-items</code> is an alias of <code>openrxv-items-final</code>), but I took a backup just in case:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre><ul>
<li>Then I started indexing in the AReS Explorer admin dashboard</li>
<li>The indexing finished, but it looks like the aliases are messed up again:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
</code></pre><h2 id="2021-05-05">2021-05-05</h2>
<ul>
<li>Peter noticed that we no longer display <code>cg.link.reference</code> on the item view
<ul>
<li>It seems that this got dropped accidentally when we migrated to <code>dcterms.relation</code> in CG Core v2</li>
<li>I fixed it in the <code>6_x-prod</code> branch and told him it will be live soon</li>
</ul>
</li>
</ul>
<h2 id="2021-05-09">2021-05-09</h2>
<ul>
<li>I set up a clean DSpace 6.4 instance locally to test some things against, for example to rule out whether some issues are due to Atmire modules or are already fixed in the as-yet-unreleased DSpace 6.4
<ul>
<li>I had to delete all the Atmire schemas, then it worked fine on Tomcat 8.5 with Mirage (I didn&rsquo;t want to bother with npm and ruby for Mirage 2)</li>
<li>Then I tried to see if I could reproduce the mapping issue that Marianne raised last month
<ul>
<li>I tried unmapping and remapping to the CGIAR Gender grants collection and the collection appears in the item view&rsquo;s list of mapped collections, but not on the collection browse itself</li>
<li>Then I tried mapping to a new collection and it was the same as above</li>
<li>So this issue is really just a DSpace bug: it has nothing to do with Atmire and is not fixed in the unreleased DSpace 6.4</li>
<li>I will try one more time after updating the Discovery index (I&rsquo;m also curious how fast it is on vanilla DSpace 6.4, though I think I tried that when I did the flame graphs in 2019 and it was miserable)</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time ~/dspace64/bin/dspace index-discovery -b
~/dspace64/bin/dspace index-discovery -b 4053.24s user 53.17s system 38% cpu 2:58:53.83 total
</code></pre><ul>
<li>Nope! Still slow, and still no mapped item&hellip;
<ul>
<li>I even tried unmapping it from all collections, and adding it to a single new owning collection&hellip;</li>
</ul>
</li>
<li>Ah hah! Actually, I was inspecting the item&rsquo;s authorization policies when I noticed that someone had made the item private!
<ul>
<li>After making it public again I was able to see it in the target collection</li>
</ul>
</li>
<li>The indexes on AReS Explorer are messed up after last week&rsquo;s harvesting:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
}
}
</code></pre><ul>
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>&hellip;</li>
<li>I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><!-- raw HTML omitted -->
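<ul>
<li>The three <code>curl</code> calls above follow a fixed order: block writes on the source index, clone it, then unblock writes. A minimal sketch of that sequence as data, so it can be reviewed or replayed for another index; the helper name <code>clone_plan</code> and the tuple layout are hypothetical, and nothing here contacts Elasticsearch:</li>
</ul>

```python
import json
from typing import List, Tuple

# (HTTP method, request path, JSON body or "" when there is no body)
Step = Tuple[str, str, str]


def clone_plan(source: str, target: str) -> List[Step]:
    """Build the block-writes, clone, unblock-writes sequence for an index."""

    def write_block(enabled: bool) -> str:
        # Same settings body as in the curl commands above
        return json.dumps({"settings": {"index.blocks.write": enabled}})

    return [
        ("PUT", f"/{source}/_settings", write_block(True)),
        ("POST", f"/{source}/_clone/{target}", ""),
        ("PUT", f"/{source}/_settings", write_block(False)),
    ]


if __name__ == "__main__":
    for method, path, body in clone_plan("openrxv-items-temp", "openrxv-items-temp-backup"):
        print(method, path, body)
```

<ul>
<li>Keeping the unblock step in the same plan makes it harder to forget, since the source index stays read-only for writes until that last request is sent</li>
</ul>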
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2021-03/">March, 2021</a></li>
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
<li><a href="/cgspace-notes/2021-02/">February, 2021</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>