Add notes for 2022-11-09

2022-11-10 15:45:04 +03:00
parent c63abf656d
commit 4e6a8ec51b
29 changed files with 203 additions and 35 deletions

@ -24,7 +24,7 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-11/" />
<meta property="article:published_time" content="2022-11-01T09:11:36+03:00" />
<meta property="article:modified_time" content="2022-11-01T22:12:24+03:00" />
<meta property="article:modified_time" content="2022-11-07T17:18:14+03:00" />
@ -54,9 +54,9 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
"@type": "BlogPosting",
"headline": "November, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-11/",
"wordCount": "863",
"wordCount": "1392",
"datePublished": "2022-11-01T09:11:36+03:00",
"dateModified": "2022-11-01T22:12:24+03:00",
"dateModified": "2022-11-07T17:18:14+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -263,7 +263,97 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -v -f -i 10568/78 &gt;&amp; /tmp/filter-media-cropbox.log
</span></span></code></pre></div><ul>
<li>But looking at the items it processed, I&rsquo;m not sure it&rsquo;s working as expected</li>
<li>But looking at the items it processed, I&rsquo;m not sure it&rsquo;s working as expected
<ul>
<li>I looked at a few dozen</li>
</ul>
</li>
<li>I found some links to the Bioversity website on CGSpace that are not redirecting properly:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
</span></span><span style="display:flex;"><span>GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
</span></span><span style="display:flex;"><span>Accept: */*
</span></span><span style="display:flex;"><span>Accept-Encoding: gzip, deflate
</span></span><span style="display:flex;"><span>Connection: keep-alive
</span></span><span style="display:flex;"><span>Host: www.bioversityinternational.org
</span></span><span style="display:flex;"><span>User-Agent: HTTPie/3.2.1
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>HTTP/1.1 302 Found
</span></span><span style="display:flex;"><span>Connection: Keep-Alive
</span></span><span style="display:flex;"><span>Content-Length: 275
</span></span><span style="display:flex;"><span>Content-Type: text/html; charset=iso-8859-1
</span></span><span style="display:flex;"><span>Date: Mon, 07 Nov 2022 16:35:21 GMT
</span></span><span style="display:flex;"><span>Keep-Alive: timeout=15, max=100
</span></span><span style="display:flex;"><span>Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
</span></span><span style="display:flex;"><span>Server: Apache
</span></span></code></pre></div><ul>
<li>The <code>Location</code> header is clearly wrong (the slash between the hostname and the path is missing), and if I try HTTPS directly I get an HTTP 500</li>
</ul>
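<ul>
<li>A minimal Python sketch of why that redirect cannot work, using only the <code>Location</code> value from the response above: parsing it shows the path prefix has been glued onto the hostname (the variable names are illustrative, not part of any existing script)</li>
</ul>

```python
from urllib.parse import urlsplit

# Location value copied verbatim from the HTTP 302 response above
location = "https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html"
# Host that the original request was sent to
requested_host = "www.bioversityinternational.org"

parts = urlsplit(location)
redirect_host = parts.hostname

# The missing slash makes "nc" part of the hostname instead of the path
print(redirect_host)
print(parts.path)
print(redirect_host == requested_host)
```

<ul>
<li>A client following this redirect would try to resolve a hostname ending in <code>.orgnc</code> instead of requesting <code>/nc/publications/...</code> on the original host</li>
</ul>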
<h2 id="2022-11-08">2022-11-08</h2>
<ul>
<li>Looking at the Solr statistics hits on CGSpace for 2022-11
<ul>
<li>I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent</li>
<li>I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent</li>
<li>I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent</li>
<li>I will purge all these hits and probably add China Unicom&rsquo;s subnet to my nginx <code>bot-network.conf</code> file to tag them as bots since there are SO many bad and malicious requests coming from there</li>
</ul>
</li>
</ul>
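<ul>
<li>As a rough sketch of how a candidate range could be derived, the snippet below computes the tightest network covering the two China Unicom addresses listed above. This is an assumption-laden illustration: the helper name is hypothetical, and the real allocation should be looked up in WHOIS, since it is almost certainly wider than what two observed addresses imply</li>
</ul>

```python
import ipaddress

def smallest_common_network(addresses):
    """Return the tightest IPv4 network containing every given address (hypothetical helper)."""
    ips = [ipaddress.ip_address(a) for a in addresses]
    # Start from a /32 around the first address and widen one bit at a time
    net = ipaddress.ip_network(f"{ips[0]}/32")
    while not all(ip in net for ip in ips):
        net = net.supernet()
    return net

# The two China Unicom addresses seen in the Solr statistics above
observed = ["221.219.100.42", "221.219.103.211"]
print(smallest_common_network(observed))
```

<ul>
<li>The result is only a lower bound on the range worth tagging, not the network's registered allocation</li>
</ul>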
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 8975 hits from 221.219.100.42 in statistics
</span></span><span style="display:flex;"><span>Purging 7577 hits from 122.10.101.60 in statistics
</span></span><span style="display:flex;"><span>Purging 6536 hits from 135.125.21.38 in statistics
</span></span><span style="display:flex;"><span>Purging 23950 hits from 163.237.216.11 in statistics
</span></span><span style="display:flex;"><span>Purging 4093 hits from 51.254.154.148 in statistics
</span></span><span style="display:flex;"><span>Purging 2797 hits from 221.219.103.211 in statistics
</span></span><span style="display:flex;"><span>Purging 2618 hits from 216.218.223.53 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 56546
</span></span></code></pre></div><ul>
<li>Also interesting to see a few new user agents:
<ul>
<li><code>RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
<li><code>rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)</code></li>
<li><code>MEL</code></li>
<li><code>Gov employment data scraper ([[your email]])</code></li>
<li><code>RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
</ul>
</li>
<li>I will purge all these:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
</span></span><span style="display:flex;"><span>Purging 6155 hits from RStudio in statistics
</span></span><span style="display:flex;"><span>Purging 1929 hits from rstudio in statistics
</span></span><span style="display:flex;"><span>Purging 1454 hits from MEL in statistics
</span></span><span style="display:flex;"><span>Purging 1094 hits from Gov employment data scraper in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10632
</span></span></code></pre></div><ul>
<li>Work on the CIAT Library items a bit again in OpenRefine
<ul>
<li>I flagged items with:
<ul>
<li>URL containing &ldquo;#page&rdquo; at the end (these are linking to book chapters, but we don&rsquo;t want to upload the PDF multiple times)</li>
<li>Same URL used by more than one item (&ldquo;Duplicates&rdquo; facet in OpenRefine; this is a corner case I don&rsquo;t want to handle right now)</li>
<li>URL containing &ldquo;:8080&rdquo; to CIAT&rsquo;s old DSpace (this server is no longer live)</li>
</ul>
</li>
<li>I want to try to handle the simple cases that should cover most of the items first</li>
</ul>
</li>
</ul>
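<ul>
<li>The flagging was done with facets in OpenRefine, but the same three rules can be sketched in Python. The function name, flag labels, and <code>example.org</code> URLs below are all hypothetical, and the &ldquo;#page&rdquo; rule is interpreted here as a URL fragment starting with <code>page</code></li>
</ul>

```python
from collections import Counter
from urllib.parse import urlsplit

def flag_items(urls):
    """Map each URL to the list of reasons it was flagged (empty list = simple case)."""
    counts = Counter(urls)
    flags = {}
    for url in urls:
        reasons = []
        # Rule 1: link points into a page of a book, e.g. "...pdf#page=12"
        if urlsplit(url).fragment.startswith("page"):
            reasons.append("page-fragment")
        # Rule 2: the same URL is used by more than one item
        if counts[url] > 1:
            reasons.append("duplicate-url")
        # Rule 3: link points at the old DSpace on port 8080
        if ":8080" in url:
            reasons.append("old-dspace-8080")
        flags[url] = reasons
    return flags

# Hypothetical URLs, for illustration only
sample = [
    "http://example.org/books/a.pdf#page=12",
    "http://example.org:8080/bitstream/1/b.pdf",
    "http://example.org/c.pdf",
    "http://example.org/c.pdf",
    "http://example.org/d.pdf",
]
flags = flag_items(sample)
simple_cases = [url for url, reasons in flags.items() if not reasons]
print(simple_cases)
```

<ul>
<li>Items with an empty reason list are the simple cases to handle first</li>
</ul>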
<h2 id="2022-11-09">2022-11-09</h2>
<ul>
<li>Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
<ul>
<li>I got the basic functionality working</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->