Add notes for 2022-09-05

2022-09-05 16:59:11 +03:00
parent 6ce43e6a95
commit ac66d6c1a9
119 changed files with 295 additions and 151 deletions

@ -25,7 +25,7 @@ I also fixed a few bugs and improved the region-matching logic
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-01/" />
<meta property="article:published_time" content="2022-01-01T09:41:36+03:00" />
<meta property="article:modified_time" content="2022-01-01T09:41:36+03:00" />
<meta property="article:modified_time" content="2022-09-02T16:41:19+03:00" />
@ -46,7 +46,7 @@ I also fixed a few bugs and improved the region-matching logic
"/>
<meta name="generator" content="Hugo 0.101.0" />
<meta name="generator" content="Hugo 0.102.3" />
@ -56,9 +56,9 @@ I also fixed a few bugs and improved the region-matching logic
"@type": "BlogPosting",
"headline": "September, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-01/",
"wordCount": "141",
"wordCount": "558",
"datePublished": "2022-01-01T09:41:36+03:00",
"dateModified": "2022-01-01T09:41:36+03:00",
"dateModified": "2022-09-02T16:41:19+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -164,6 +164,77 @@ I also fixed a few bugs and improved the region-matching logic
</ul>
</li>
</ul>
<h2 id="2022-09-05">2022-09-05</h2>
<ul>
<li>Started a harvest on AReS last night</li>
<li>Looking over the Solr statistics from last month I see many user agents that look suspicious:
<ul>
<li>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)</li>
<li>Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36</li>
<li>Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0</li>
<li>Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre</li>
<li>Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131</li>
<li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)</li>
<li>Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)</li>
<li>curb</li>
<li>bitdiscovery</li>
<li>omgili/0.5 +http://omgili.com</li>
<li>Mozilla/5.0 (compatible)</li>
<li>Vizzit</li>
<li>Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0</li>
<li>Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0</li>
<li>Java/17-ea</li>
<li>AdobeUxTechC4-Async/3.0.12 (win32)</li>
<li>ZaloPC-win32-24v473</li>
<li>Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com</li>
<li>Scoop.it</li>
<li>Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0</li>
<li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</li>
<li>ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0</li>
<li>WebAPIClient</li>
<li>Mozilla/5.0 Firefox/26.0</li>
<li>Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)</li>
</ul>
</li>
<li>For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (<code>Mozilla / 5.0</code>)</li>
<li>Tons of hosts making requests like this:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&amp;isAllowed=\x22&gt;&lt;script%20&gt;alert(String.fromCharCode(88,83,83))&lt;/script&gt; HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
</span></span></code></pre></div><ul>
<li>I got a list of hosts making requests like that so I can purge their hits:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.<span style="color:#f92672">[</span>123<span style="color:#f92672">]</span>*.gz | grep <span style="color:#e6db74">&#39;String.fromCharCode(&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort -u &gt; /tmp/ips.txt
</span></span></code></pre></div><ul>
<li>I purged 4,718 hits from those IPs</li>
<li>I see some new Hetzner ranges that I apparently hadn&rsquo;t blocked yet:
<ul>
<li>I got a <a href="https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh">list of Hetzner&rsquo;s IPs from IP Quality Score</a> then added them to the existing ones in my Ansible playbooks:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> /tmp/hetzner.txt | wc -l
</span></span><span style="display:flex;"><span>36
</span></span><span style="display:flex;"><span>$ sort -u /tmp/hetzner-combined.txt | wc -l
</span></span><span style="display:flex;"><span>49
</span></span></code></pre></div><ul>
<li>I will add this new list to nginx&rsquo;s <code>bot-networks.conf</code> so they get throttled on scraping XMLUI and get classified as bots in Solr statistics</li>
<li>Then I purged hits from the following user agents:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents
</span></span><span style="display:flex;"><span>Found 374 hits from curb in statistics
</span></span><span style="display:flex;"><span>Found 350 hits from bitdiscovery in statistics
</span></span><span style="display:flex;"><span>Found 564 hits from omgili in statistics
</span></span><span style="display:flex;"><span>Found 390 hits from Vizzit in statistics
</span></span><span style="display:flex;"><span>Found 9125 hits from AdobeUxTechC4-Async in statistics
</span></span><span style="display:flex;"><span>Found 97 hits from ZaloPC-win32-24v473 in statistics
</span></span><span style="display:flex;"><span>Found 518 hits from nbertaupete95 in statistics
</span></span><span style="display:flex;"><span>Found 218 hits from Scoop.it in statistics
</span></span><span style="display:flex;"><span>Found 584 hits from WebAPIClient in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 12220
</span></span></code></pre></div><ul>
<li>Then I will add these user agents to the ILRI spider override in DSpace</li>
</ul>
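<p>The two patterns called out in the user agent list above (stray spaces around the slash, as in <code>Mozilla / 5.0</code>, and very old Firefox versions) can be checked mechanically. The following is a minimal sketch, not the method used in these notes: <code>flag_suspicious_agents</code> is a hypothetical helper, and its default cutoff of Firefox 60 is an arbitrary assumption that would need tuning.</p>

```shell
# Hypothetical helper: read user agents on stdin, one per line, and print
# "<reason><TAB><agent>" for each one that matches a heuristic.
# $1 is the lowest Firefox major version to accept (assumed default: 60).
flag_suspicious_agents() {
    awk -v min="${1:-60}" '
        # Stray spaces around a slash, as in "Mozilla / 5.0"
        / \/ / { print "spacing\t" $0; next }
        # Firefox major version below the cutoff
        match($0, /Firefox\/[0-9]+/) {
            # "Firefox/" is 8 characters, so the digits start at RSTART + 8
            version = substr($0, RSTART + 8, RLENGTH - 8) + 0
            if (version < min) print "old-firefox\t" $0
        }
    '
}
```

<p>For example, <code>printf '%s\n' 'Mozilla/5.0 Firefox/26.0' | flag_suspicious_agents</code> prints that agent tagged <code>old-firefox</code>. A match is only a hint for manual review: an old version string on its own does not prove that a client is a bot.</p>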