Add notes for 2022-09-12

2022-09-12 11:35:57 +03:00
parent 69392070de
commit 147ad86375
120 changed files with 294 additions and 152 deletions


@@ -25,7 +25,7 @@ I also fixed a few bugs and improved the region-matching logic
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-01/" />
<meta property="article:published_time" content="2022-01-01T09:41:36+03:00" />
<meta property="article:modified_time" content="2022-09-08T17:47:25+03:00" />
<meta property="article:modified_time" content="2022-09-09T17:29:51+03:00" />
@@ -56,9 +56,9 @@ I also fixed a few bugs and improved the region-matching logic
"@type": "BlogPosting",
"headline": "September, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-01/",
"wordCount": "844",
"wordCount": "1259",
"datePublished": "2022-01-01T09:41:36+03:00",
"dateModified": "2022-09-08T17:47:25+03:00",
"dateModified": "2022-09-09T17:29:51+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -76,7 +76,7 @@ I also fixed a few bugs and improved the region-matching logic
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@@ -283,6 +283,80 @@ I also fixed a few bugs and improved the region-matching logic
</span></span></code></pre></div><ul>
<li>Start a full Discovery index on CGSpace so these changes show up in Discovery</li>
</ul>
<h2 id="2022-09-11">2022-09-11</h2>
<ul>
<li>Today is Sunday and I see the load on the server is high
<ul>
<li>Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it&rsquo;s not from them!</li>
<li>Looking at the top IPs this morning:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># cat /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | grep <span style="color:#e6db74">&#39;11/Sep/2022&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">40</span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> 165 64.233.172.79
</span></span><span style="display:flex;"><span> 166 87.250.224.34
</span></span><span style="display:flex;"><span> 200 69.162.124.231
</span></span><span style="display:flex;"><span> 202 216.244.66.198
</span></span><span style="display:flex;"><span> 385 207.46.13.149
</span></span><span style="display:flex;"><span> 398 207.46.13.147
</span></span><span style="display:flex;"><span> 421 66.249.64.185
</span></span><span style="display:flex;"><span> 422 157.55.39.81
</span></span><span style="display:flex;"><span> 442 2a01:4f8:1c17:5550::1
</span></span><span style="display:flex;"><span> 451 64.124.8.36
</span></span><span style="display:flex;"><span> 578 137.184.159.211
</span></span><span style="display:flex;"><span> 597 136.243.228.195
</span></span><span style="display:flex;"><span> 1185 66.249.64.183
</span></span><span style="display:flex;"><span> 1201 157.55.39.80
</span></span><span style="display:flex;"><span> 3135 80.248.237.167
</span></span><span style="display:flex;"><span> 4794 54.195.118.125
</span></span><span style="display:flex;"><span> 5486 45.5.186.2
</span></span><span style="display:flex;"><span> 6322 2a01:7e00::f03c:91ff:fe9a:3a37
</span></span><span style="display:flex;"><span> 9556 66.249.64.181
</span></span></code></pre></div><ul>
<li>The top IP is still Google, but all of its requests get HTTP 503 because I classified it as a bot, at least for XMLUI</li>
<li>Then there&rsquo;s 80.248.237.167, which is using a normal user agent and scraping Discovery
<ul>
<li>That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as &lsquo;bot&rsquo; for XMLUI so most of these requests are HTTP 503</li>
</ul>
</li>
<li>On another note, I&rsquo;m curious to explore enabling caching of certain REST API responses
<ul>
<li>For example, where the requests come from harvesters rather than actual clients fetching bitstreams or thumbnails, caching might speed up responses for subsequent requesters:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">&#39;{print $7}&#39;</span> /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span> 4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
</span></span><span style="display:flex;"><span> 4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
</span></span><span style="display:flex;"><span> 4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
</span></span><span style="display:flex;"><span> 4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
</span></span><span style="display:flex;"><span> 5 /rest/handle/10568/110310?expand=all
</span></span><span style="display:flex;"><span> 5 /rest/handle/10568/89980?expand=all
</span></span><span style="display:flex;"><span> 5 /rest/handle/10568/97614?expand=all
</span></span><span style="display:flex;"><span> 6 /rest/handle/10568/107086?expand=all
</span></span><span style="display:flex;"><span> 6 /rest/handle/10568/108503?expand=all
</span></span><span style="display:flex;"><span> 6 /rest/handle/10568/98424?expand=all
</span></span></code></pre></div><ul>
<li>I specifically must not cache requests for bitstreams because those come from actual users and each real request needs to reach DSpace so that it registers a statistics hit
<ul>
<li>Will be interesting to check the results above as the day goes on (now 10AM)</li>
<li>To estimate the potential savings from caching I will compare the number of distinct non-bitstream URLs requested with the number of those requested more than once (updated the next morning using yesterday&rsquo;s log):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">&#39;{print $7}&#39;</span> /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
</span></span><span style="display:flex;"><span>33733
</span></span><span style="display:flex;"><span># awk <span style="color:#e6db74">&#39;{print $7}&#39;</span> /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk <span style="color:#e6db74">&#39;$1 &gt; 1&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>5637
</span></span></code></pre></div><ul>
<li>In the afternoon I started a harvest on AReS (which should also affect the numbers above)</li>
<li>I enabled an nginx proxy cache on DSpace Test for this location regex: <code>location ~ /rest/(handle|items|collections|communities)/.+</code></li>
</ul>
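<ul>
<li>The two counts above (33733 distinct URLs, 5637 of them requested more than once, so roughly 16.7%) could also be produced in a single pass. This is only a sketch: the function name <code>repeat_share</code> is hypothetical, and it assumes the request path is field 7 of the log line, as in the awk commands above:</li>
</ul>

```shell
#!/usr/bin/env bash
# Sketch: count distinct non-bitstream REST URLs and how many of them
# were requested more than once. Assumes the request path is field 7.
repeat_share() {
    LC_ALL=C awk '
        # Skip bitstream downloads, like "grep -v retrieve" in the notes
        $7 !~ /retrieve/ { seen[$7]++ }
        END {
            for (url in seen) {
                distinct++
                if (seen[url] > 1) repeated++
            }
            if (distinct > 0)
                printf "%d distinct, %d repeated (%.1f%%)\n",
                    distinct, repeated, 100 * repeated / distinct
        }' "$@"
}
```

<ul>
<li>Usage would be something like <code>repeat_share /var/log/nginx/rest.log.1</code>. Note that this share only says how many URLs are candidates for caching, not what the actual cache hit rate would be</li>
</ul>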
<h2 id="2022-09-12">2022-09-12</h2>
<ul>
<li>I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled</li>
</ul>
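<ul>
<li>For reference, an nginx proxy cache for the REST location regex from the previous day might look roughly like the fragment below. This is a hypothetical sketch, not the configuration actually deployed on DSpace Test: the cache path, the zone name <code>rest_cache</code>, the sizes, the one-hour validity, the <code>X-Cache-Status</code> header and the backend address are all assumptions:</li>
</ul>

```nginx
# Hypothetical sketch; path, zone name, sizes, TTL and backend are assumptions.
# proxy_cache_path belongs in the http context.
proxy_cache_path /var/cache/nginx/rest levels=1:2 keys_zone=rest_cache:10m max_size=1g inactive=60m;

server {
    # ... existing server configuration ...

    # Location regex from the notes. It does not match /rest/bitstreams/...,
    # so bitstream retrieve requests still reach the backend.
    location ~ /rest/(handle|items|collections|communities)/.+ {
        proxy_cache rest_cache;
        proxy_cache_valid 200 1h;
        # Expose HIT/MISS so a harvest can be checked from the client side
        add_header X-Cache-Status $upstream_cache_status;
        # Assumed backend address
        proxy_pass http://127.0.0.1:8080;
    }
}
```

<ul>
<li>With a header like this in place, repeated requests during the AReS harvest should show <code>HIT</code> once the first response has been cached</li>
</ul>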