Add notes for 2022-11-09
@@ -24,7 +24,7 @@ I reverted the Cocoon autosave change because it was more of a nuisance that Pe
|
||||
<meta property="og:type" content="article" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-11/" />
|
||||
<meta property="article:published_time" content="2022-11-01T09:11:36+03:00" />
|
||||
<meta property="article:modified_time" content="2022-11-01T22:12:24+03:00" />
|
||||
<meta property="article:modified_time" content="2022-11-07T17:18:14+03:00" />
@@ -54,9 +54,9 @@ I reverted the Cocoon autosave change because it was more of a nuisance that Pe
|
||||
"@type": "BlogPosting",
|
||||
"headline": "November, 2022",
|
||||
"url": "https://alanorth.github.io/cgspace-notes/2022-11/",
|
||||
"wordCount": "863",
|
||||
"wordCount": "1392",
|
||||
"datePublished": "2022-11-01T09:11:36+03:00",
|
||||
"dateModified": "2022-11-01T22:12:24+03:00",
|
||||
"dateModified": "2022-11-07T17:18:14+03:00",
|
||||
"author": {
|
||||
"@type": "Person",
|
||||
"name": "Alan Orth"
|
||||
@@ -263,7 +263,97 @@ I reverted the Cocoon autosave change because it was more of a nuisance that Pe
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">"-Xmx1024m -Dfile.encoding=UTF-8"</span> dspace filter-media -p <span style="color:#e6db74">"ImageMagick PDF Thumbnail"</span> -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>But looking at the items it processed, I’m not sure it’s working as expected</li>
|
||||
<li>But looking at the items it processed, I’m not sure it’s working as expected
|
||||
<ul>
|
||||
<li>I looked at a few dozen</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>I found some links to the Bioversity website on CGSpace that are not redirecting properly:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
|
||||
</span></span><span style="display:flex;"><span>GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
|
||||
</span></span><span style="display:flex;"><span>Accept: */*
|
||||
</span></span><span style="display:flex;"><span>Accept-Encoding: gzip, deflate
|
||||
</span></span><span style="display:flex;"><span>Connection: keep-alive
|
||||
</span></span><span style="display:flex;"><span>Host: www.bioversityinternational.org
|
||||
</span></span><span style="display:flex;"><span>User-Agent: HTTPie/3.2.1
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>HTTP/1.1 302 Found
|
||||
</span></span><span style="display:flex;"><span>Connection: Keep-Alive
|
||||
</span></span><span style="display:flex;"><span>Content-Length: 275
|
||||
</span></span><span style="display:flex;"><span>Content-Type: text/html; charset=iso-8859-1
|
||||
</span></span><span style="display:flex;"><span>Date: Mon, 07 Nov 2022 16:35:21 GMT
|
||||
</span></span><span style="display:flex;"><span>Keep-Alive: timeout=15, max=100
|
||||
</span></span><span style="display:flex;"><span>Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
|
||||
</span></span><span style="display:flex;"><span>Server: Apache
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>The <code>Location</code> header is clearly wrong (the slash between the hostname and the path is missing), and if I try HTTPS directly I get an HTTP 500</li>
|
||||
</ul>
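<p>A minimal sketch of how a redirect like this could be flagged automatically, assuming it is enough to compare the hostname of the request with the hostname parsed from the <code>Location</code> header. The helper name <code>redirect_stays_on_host</code> is hypothetical, and the two URLs are copied from the request and response above.</p>

```python
from urllib.parse import urlsplit


def redirect_stays_on_host(request_url: str, location: str) -> bool:
    """Return True if the redirect target has the same hostname as the request."""
    return urlsplit(request_url).hostname == urlsplit(location).hostname


# URL that was requested in the HTTPie session above
request_url = "http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html"
# Location value copied from the 302 response above
location = "https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html"

# The missing slash makes the path's first segment part of the hostname
print(urlsplit(location).hostname)
print(redirect_stays_on_host(request_url, location))
```

<p>This only catches redirects that land on a different (here, nonexistent) host; it says nothing about whether the target itself responds correctly.</p>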
|
||||
<h2 id="2022-11-08">2022-11-08</h2>
|
||||
<ul>
|
||||
<li>Looking at the Solr statistics hits on CGSpace for 2022-11
|
||||
<ul>
|
||||
<li>I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
|
||||
<li>I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent</li>
|
||||
<li>I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection</li>
|
||||
<li>I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent</li>
|
||||
<li>I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection</li>
|
||||
<li>I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
|
||||
<li>I see 216.218.223.53 on Hurricane Electric is making thousands of requests to XMLUI in a few minutes, using a normal user agent</li>
|
||||
<li>I will purge all these hits and probably add China Unicom’s subnets to my nginx <code>bot-network.conf</code> file to tag them as bots since there are SO many bad and malicious requests coming from there</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
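<p>A rough sketch of how the IPs listed above could be grouped by network to see which ranges show up more than once. The /16 prefix length is an arbitrary assumption for illustration and does not reflect the real allocation of any provider; the helper name <code>count_by_network</code> is hypothetical.</p>

```python
import ipaddress
from collections import Counter

# IPs copied from the list above
ips = [
    "221.219.100.42",
    "122.10.101.60",
    "135.125.21.38",
    "163.237.216.11",
    "51.254.154.148",
    "221.219.103.211",
    "216.218.223.53",
]


def count_by_network(addresses, prefix_len=16):
    """Count how many addresses fall into each network of the given prefix length."""
    counts = Counter()
    for address in addresses:
        # strict=False lets us pass a host address and get its enclosing network
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        counts[network] += 1
    return counts


for network, count in count_by_network(ips).most_common():
    print(network, count)
```

<p>With the assumed /16 grouping, the two China Unicom addresses land in the same network while the others each stand alone. The actual ranges to tag would have to come from the provider's announced prefixes.</p>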
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
|
||||
</span></span><span style="display:flex;"><span>Purging 8975 hits from 221.219.100.42 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 7577 hits from 122.10.101.60 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 6536 hits from 135.125.21.38 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 23950 hits from 163.237.216.11 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 4093 hits from 51.254.154.148 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 2797 hits from 221.219.103.211 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 2618 hits from 216.218.223.53 in statistics
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 56546
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Also interesting to see a few new user agents:
|
||||
<ul>
|
||||
<li><code>RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
|
||||
<li><code>rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)</code></li>
|
||||
<li><code>MEL</code></li>
|
||||
<li><code>Gov employment data scraper ([[your email]])</code></li>
|
||||
<li><code>RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>I will purge all these:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
|
||||
</span></span><span style="display:flex;"><span>Purging 6155 hits from RStudio in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 1929 hits from rstudio in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 1454 hits from MEL in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 1094 hits from Gov employment data scraper in statistics
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10632
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Work on the CIAT Library items a bit again in OpenRefine
|
||||
<ul>
|
||||
<li>I flagged items with:
|
||||
<ul>
|
||||
<li>URL containing “#page” at the end (these are linking to book chapters, but we don’t want to upload the PDF multiple times)</li>
|
||||
<li>Same URL used by more than one item (“Duplicates” facet in OpenRefine; this is a corner case I don’t want to handle right now)</li>
|
||||
<li>URL containing “:8080” to CIAT’s old DSpace (this server is no longer live)</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>I want to try to handle the simple cases that should cover most of the items first</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
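<p>The three flagging rules above were applied in OpenRefine, but they can be sketched in Python as well. This is only an illustration under assumptions: the function name <code>flag_urls</code>, the reason labels, and the <code>example.org</code> URLs are all hypothetical, and "#page at the end" is interpreted as a URL fragment starting with <code>page</code>.</p>

```python
from collections import Counter
from urllib.parse import urlsplit


def flag_urls(urls):
    """Return (url, reasons) pairs, one per input URL, in the same order."""
    seen = Counter(urls)
    results = []
    for url in urls:
        reasons = []
        # Rule 1: link points at a page inside a PDF (book chapter)
        if urlsplit(url).fragment.startswith("page"):
            reasons.append("page-fragment")
        # Rule 2: exact same URL used by more than one item
        if seen[url] > 1:
            reasons.append("duplicate")
        # Rule 3: link to a server on port 8080 (the old DSpace)
        if ":8080" in url:
            reasons.append("old-dspace")
        results.append((url, reasons))
    return results


# Hypothetical sample data, not real CIAT Library URLs
urls = [
    "http://example.org/docs/book.pdf#page=12",
    "http://example.org/docs/report.pdf",
    "http://example.org/docs/report.pdf",
    "http://example.org:8080/bitstream/1/paper.pdf",
    "http://example.org/docs/simple.pdf",
]

for url, reasons in flag_urls(urls):
    print(url, reasons or "simple case")
```

<p>Items with an empty list of reasons correspond to the simple cases to handle first.</p>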
|
||||
<h2 id="2022-11-09">2022-11-09</h2>
|
||||
<ul>
|
||||
<li>Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
|
||||
<ul>
|
||||
<li>I got the basic functionality working</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<!-- raw HTML omitted -->