Add notes for 2023-01-17

2023-01-17 22:38:55 +03:00
parent 3f4e42fe37
commit ddb1ce8f4e
35 changed files with 311 additions and 34 deletions


@@ -19,7 +19,7 @@ I see we have some new ones that aren’t in our list if I combine with this
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-01/" />
<meta property="article:published_time" content="2023-01-01T08:44:36+03:00" />
<meta property="article:modified_time" content="2023-01-12T23:11:42+03:00" />
<meta property="article:modified_time" content="2023-01-15T08:10:16+03:00" />
@@ -44,9 +44,9 @@ I see we have some new ones that aren&rsquo;t in our list if I combine with this
"@type": "BlogPosting",
"headline": "January, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-01/",
"wordCount": "1103",
"wordCount": "2250",
"datePublished": "2023-01-01T08:44:36+03:00",
"dateModified": "2023-01-12T23:11:42+03:00",
"dateModified": "2023-01-15T08:10:16+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -307,6 +307,151 @@ I see we have some new ones that aren&rsquo;t in our list if I combine with this
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2023-01-16">2023-01-16</h2>
<ul>
<li>Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems</li>
<li>Batch import another twenty-eight items for IFPRI across several Initiatives
<ul>
<li>On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc</li>
<li>I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts</li>
<li>Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values</li>
</ul>
</li>
</ul>
<h2 id="2023-01-17">2023-01-17</h2>
<ul>
<li>Batch import another twenty-three items for IFPRI across several Initiatives
<ul>
<li>I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc</li>
<li>I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts</li>
<li>Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669</li>
<li>Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values</li>
</ul>
</li>
<li>I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality</li>
<li>I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace</li>
<li>There is a high load on CGSpace pretty regularly
<ul>
<li>Munin shows a marked increase in DSpace sessions over the last few weeks:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2023/01/jmx_dspace_sessions-year.png" alt="DSpace sessions year"></p>
<ul>
<li>Is this attributable to all the PRMS harvesting?</li>
<li>I also see some PostgreSQL locks starting earlier today:</li>
</ul>
<p><img src="/cgspace-notes/2023/01/postgres_connections_ALL-day.png" alt="PostgreSQL locks day"></p>
<ul>
<li>I&rsquo;m curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/<span style="color:#f92672">{</span>rest,access,library-access,oai<span style="color:#f92672">}</span>.log /var/log/nginx/<span style="color:#f92672">{</span>rest,access,library-access,oai<span style="color:#f92672">}</span>.log.1 /var/log/nginx/<span style="color:#f92672">{</span>rest,access,library-access,oai<span style="color:#f92672">}</span>.log.<span style="color:#f92672">{</span>2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25<span style="color:#f92672">}</span>.gz | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/2023-01-17-cgspace-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/2023-01-17-cgspace-ips.txt
</span></span><span style="display:flex;"><span>129446 /tmp/2023-01-17-cgspace-ips.txt
</span></span></code></pre></div><ul>
<li>I ran the IPs through my <code>resolve-addresses-geoip2.py</code> script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$&#39;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> /tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
</span></span><span style="display:flex;"><span> sed 1d | sort | uniq &gt; /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>776 /tmp/networks-to-block.txt
</span></span></code></pre></div><ul>
<li>I added the list of networks to nginx&rsquo;s <code>bot-networks.conf</code> so they will all be heavily rate limited</li>
<li>Looking at the Munin stats again I see the load has been extra high since yesterday morning:</li>
</ul>
<p><img src="/cgspace-notes/2023/01/cpu-week.png" alt="CPU week"></p>
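<p>As a rough sketch of the network-list step above (not the tooling actually used here), the ASN filtering could also be done in Python, merging adjacent or overlapping prefixes before they go into nginx. The <code>asn</code> and <code>network</code> column names come from the csvgrep command; the function name, the <code>ip</code> column, the sample rows, and the idea of collapsing prefixes are assumptions for illustration only.</p>

```python
import csv
import io
import ipaddress

# A few of the data center ASNs from the csvgrep pattern above
DATACENTER_ASNS = {"16509", "14618", "14061"}


def networks_to_block(csv_file, asns=DATACENTER_ASNS):
    """Return sorted, collapsed networks whose ASN is in asns.

    Assumes a CSV with "asn" and "network" columns. IPv4 and IPv6 are
    collapsed separately because ipaddress.collapse_addresses() does
    not accept mixed IP versions.
    """
    v4, v6 = set(), set()
    for row in csv.DictReader(csv_file):
        if row["asn"] not in asns or not row["network"]:
            continue
        net = ipaddress.ip_network(row["network"], strict=False)
        (v4 if net.version == 4 else v6).add(net)
    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in collapsed]


# Made-up sample rows using documentation ranges, not real log data
sample = io.StringIO(
    "ip,asn,network\n"
    "192.0.2.10,16509,192.0.2.0/25\n"
    "192.0.2.200,16509,192.0.2.128/25\n"
    "198.51.100.7,64500,198.51.100.0/24\n"
    "2001:db8::1,14061,2001:db8::/32\n"
)
blocked = networks_to_block(sample)
print("\n".join(blocked))
```

<p>In this sample the two adjacent /25 networks merge into a single /24, and the row with the non-listed ASN is skipped. Whether collapsing is worth it depends on how many of the 776 networks are actually adjacent.</p>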
<ul>
<li>But still, it&rsquo;s suspicious that there are so many PostgreSQL locks</li>
<li>Looking at the Solr stats to check the hits over the last month (actually I skipped December because I was so busy)
<ul>
<li>I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)</li>
<li>I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamalytics, but their user agent is all lower case and it&rsquo;s a data center ISP so nope</li>
<li>I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, scammy on everything I checked</li>
<li>I see 79.110.73.54 is on ALFA TELECOM / Serverel and is using a different, weird user agent with each request</li>
<li>There are too many to count&hellip; so I will purge these and then move on to user agents</li>
</ul>
</li>
<li>I purged hits from those IPs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 439185 hits from 31.148.223.10 in statistics
</span></span><span style="display:flex;"><span>Purging 2151 hits from 18.203.245.60 in statistics
</span></span><span style="display:flex;"><span>Purging 1990 hits from 3.249.192.212 in statistics
</span></span><span style="display:flex;"><span>Purging 1975 hits from 34.244.160.145 in statistics
</span></span><span style="display:flex;"><span>Purging 1969 hits from 52.213.59.101 in statistics
</span></span><span style="display:flex;"><span>Purging 2540 hits from 91.209.8.29 in statistics
</span></span><span style="display:flex;"><span>Purging 1624 hits from 54.78.176.127 in statistics
</span></span><span style="display:flex;"><span>Purging 1236 hits from 54.74.197.53 in statistics
</span></span><span style="display:flex;"><span>Purging 1327 hits from 54.246.128.111 in statistics
</span></span><span style="display:flex;"><span>Purging 1108 hits from 52.16.103.133 in statistics
</span></span><span style="display:flex;"><span>Purging 1045 hits from 63.32.99.252 in statistics
</span></span><span style="display:flex;"><span>Purging 999 hits from 176.34.141.181 in statistics
</span></span><span style="display:flex;"><span>Purging 997 hits from 34.243.17.80 in statistics
</span></span><span style="display:flex;"><span>Purging 985 hits from 34.240.206.16 in statistics
</span></span><span style="display:flex;"><span>Purging 862 hits from 18.203.81.120 in statistics
</span></span><span style="display:flex;"><span>Purging 1654 hits from 176.97.210.106 in statistics
</span></span><span style="display:flex;"><span>Purging 1628 hits from 51.81.193.200 in statistics
</span></span><span style="display:flex;"><span>Purging 1020 hits from 79.110.73.54 in statistics
</span></span><span style="display:flex;"><span>Purging 842 hits from 35.153.105.213 in statistics
</span></span><span style="display:flex;"><span>Purging 1689 hits from 54.164.237.125 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 466826
</span></span></code></pre></div><ul>
<li>Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
<ul>
<li><code>azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0</code></li>
<li><code>Gov employment data scraper ([[your email]])</code></li>
<li><code>Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)</code></li>
<li><code>crownpeak</code></li>
<li><code>Mozilla/5.0 (compatible)</code></li>
</ul>
</li>
<li>Also, a ton of them are lower case, which I&rsquo;ve never seen before&hellip; it might be possible, but looks super fishy to me:
<ul>
<li><code>mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0</code></li>
<li><code>mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0</code></li>
<li><code>mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36</code></li>
<li><code>mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36</code></li>
<li><code>mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36</code></li>
</ul>
</li>
<li>I purged some of those:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
</span></span><span style="display:flex;"><span>Purging 1658 hits from azure-logic-apps\/1.0 in statistics
</span></span><span style="display:flex;"><span>Purging 948 hits from Gov employment data scraper in statistics
</span></span><span style="display:flex;"><span>Purging 786 hits from Microsoft\.Data\.Mashup in statistics
</span></span><span style="display:flex;"><span>Purging 303 hits from crownpeak in statistics
</span></span><span style="display:flex;"><span>Purging 332 hits from Mozilla\/5.0 (compatible) in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 4027
</span></span></code></pre></div><ul>
<li>Then I ran all system updates on the server and rebooted it
<ul>
<li>Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers</li>
<li>I need to re-work how I&rsquo;m doing this whitelisting and blacklisting&hellip; it&rsquo;s way too complicated now</li>
</ul>
</li>
<li>Export entire CGSpace to check Initiative mappings, and add nineteen&hellip;</li>
<li>Start a harvest on AReS</li>
</ul>
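<p>The all-lower-case user agents noted above could be flagged automatically with a simple heuristic. This is only a sketch under my own assumptions: the function name and token list are hypothetical, and this is not how <code>check-spider-hits.sh</code> works.</p>

```python
import re

# Product tokens that real browsers normally send with capital letters
_BROWSER_TOKENS = re.compile(r"mozilla|applewebkit|gecko|chrome|safari|firefox")


def looks_lowercased(user_agent):
    """Heuristic: a browser-like agent with no upper-case letters at all."""
    return (
        user_agent == user_agent.lower()
        and _BROWSER_TOKENS.search(user_agent) is not None
    )


# Agents taken from the lists above
agents = [
    "mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36",
    "Mozilla/5.0 (compatible)",
    "crownpeak",
]
suspicious = [agent for agent in agents if looks_lowercased(agent)]
print(suspicious)
```

<p>Only the first agent is flagged here: <code>Mozilla/5.0 (compatible)</code> still has its capital letter, and <code>crownpeak</code> is lower case but does not claim to be a browser, so those two would still need to be listed by name.</p>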
<!-- raw HTML omitted -->