mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-09-12
@ -25,7 +25,7 @@ I also fixed a few bugs and improved the region-matching logic
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-01/" />
<meta property="article:published_time" content="2022-01-01T09:41:36+03:00" />
<meta property="article:modified_time" content="2022-09-08T17:47:25+03:00" />
<meta property="article:modified_time" content="2022-09-09T17:29:51+03:00" />
@ -56,9 +56,9 @@ I also fixed a few bugs and improved the region-matching logic
"@type": "BlogPosting",
"headline": "September, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-01/",
"wordCount": "844",
"wordCount": "1259",
"datePublished": "2022-01-01T09:41:36+03:00",
"dateModified": "2022-09-08T17:47:25+03:00",
"dateModified": "2022-09-09T17:29:51+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -76,7 +76,7 @@ I also fixed a few bugs and improved the region-matching logic
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
@ -283,6 +283,80 @@ I also fixed a few bugs and improved the region-matching logic
</span></span></code></pre></div><ul>
<li>Start a full Discovery index on CGSpace so that Discovery picks up these changes</li>
</ul>
<h2 id="2022-09-11">2022-09-11</h2>
<ul>
<li>Today is Sunday and I see the load on the server is high
<ul>
<li>Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it’s not from them!</li>
<li>Looking at the top IPs this morning:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># cat /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | grep <span style="color:#e6db74">'11/Sep/2022'</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">40</span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> 165 64.233.172.79
</span></span><span style="display:flex;"><span> 166 87.250.224.34
</span></span><span style="display:flex;"><span> 200 69.162.124.231
</span></span><span style="display:flex;"><span> 202 216.244.66.198
</span></span><span style="display:flex;"><span> 385 207.46.13.149
</span></span><span style="display:flex;"><span> 398 207.46.13.147
</span></span><span style="display:flex;"><span> 421 66.249.64.185
</span></span><span style="display:flex;"><span> 422 157.55.39.81
</span></span><span style="display:flex;"><span> 442 2a01:4f8:1c17:5550::1
</span></span><span style="display:flex;"><span> 451 64.124.8.36
</span></span><span style="display:flex;"><span> 578 137.184.159.211
</span></span><span style="display:flex;"><span> 597 136.243.228.195
</span></span><span style="display:flex;"><span> 1185 66.249.64.183
</span></span><span style="display:flex;"><span> 1201 157.55.39.80
</span></span><span style="display:flex;"><span> 3135 80.248.237.167
</span></span><span style="display:flex;"><span> 4794 54.195.118.125
</span></span><span style="display:flex;"><span> 5486 45.5.186.2
</span></span><span style="display:flex;"><span> 6322 2a01:7e00::f03c:91ff:fe9a:3a37
</span></span><span style="display:flex;"><span> 9556 66.249.64.181
</span></span></code></pre></div><ul>
<li>The top is still Google, but all the requests are HTTP 503 because I classified them as bots for XMLUI at least</li>
<li>Then there’s 80.248.237.167, which is using a normal user agent and scraping Discovery
<ul>
<li>That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as ‘bot’ for XMLUI so most of these requests are HTTP 503</li>
</ul>
</li>
<li>On another note, I’m curious to explore enabling caching of certain REST API responses
<ul>
<li>For example, where the use is for harvesting rather than actual clients getting bitstreams or thumbnails, it seems there might be a benefit to speeding these up for subsequent requestors:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $7}'</span> /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span> 4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
</span></span><span style="display:flex;"><span> 4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
</span></span><span style="display:flex;"><span> 4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
</span></span><span style="display:flex;"><span> 4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
</span></span><span style="display:flex;"><span> 5 /rest/handle/10568/110310?expand=all
</span></span><span style="display:flex;"><span> 5 /rest/handle/10568/89980?expand=all
</span></span><span style="display:flex;"><span> 5 /rest/handle/10568/97614?expand=all
</span></span><span style="display:flex;"><span> 6 /rest/handle/10568/107086?expand=all
</span></span><span style="display:flex;"><span> 6 /rest/handle/10568/108503?expand=all
</span></span><span style="display:flex;"><span> 6 /rest/handle/10568/98424?expand=all
</span></span></code></pre></div><ul>
<li>I specifically must not cache requests for bitstreams because those come from actual users, and we need the real requests to go through so we get the statistics hit
<ul>
<li>Will be interesting to check the results above as the day goes on (now 10AM)</li>
<li>To estimate the potential savings from caching I will check how many non-bitstream requests are made versus how many are made more than once (updated the next morning using yesterday’s log):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $7}'</span> /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
</span></span><span style="display:flex;"><span>33733
</span></span><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $7}'</span> /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk <span style="color:#e6db74">'$1 > 1'</span> | wc -l
</span></span><span style="display:flex;"><span>5637
</span></span></code></pre></div><ul>
<li>In the afternoon I started a harvest on AReS (which should affect the numbers above also)</li>
<li>I enabled an nginx proxy cache on DSpace Test for this location regex: <code>location ~ /rest/(handle|items|collections|communities)/.+</code></li>
</ul>
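<p>The two counts above (33733 distinct non-bitstream paths, of which 5637 were requested more than once, roughly 17%) can also be computed in one pass. The following is a minimal Python sketch of the same estimate, not the method used in these notes: the function name <code>summarize_rest_paths</code> is hypothetical, and it assumes the request path is the seventh whitespace-separated field of each log line, as in the <code>awk '{print $7}'</code> commands. It additionally reports how many requests were repeats of a path already seen, which is an upper bound on what a cache could have answered.</p>

```python
from collections import Counter


def summarize_rest_paths(log_lines):
    """Summarize non-bitstream request paths from nginx access log lines.

    Returns a tuple (unique, repeated, avoidable):
      unique    - number of distinct paths (like `sort -u | wc -l`)
      repeated  - distinct paths requested more than once
                  (like `uniq -c | awk '$1 > 1' | wc -l`)
      avoidable - requests that repeated an already-seen path, i.e. an
                  upper bound on cache hits if entries never expired
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # awk would print an empty $7; skipped here (assumption)
        path = fields[6]  # 7th field, same as awk's $7
        if "retrieve" in path:
            continue  # mirrors `grep -v retrieve` (bitstream downloads)
        counts[path] += 1

    unique = len(counts)
    repeated = sum(1 for n in counts.values() if n > 1)
    avoidable = sum(counts.values()) - unique
    return unique, repeated, avoidable
```

<p>It could be called as <code>summarize_rest_paths(open("/var/log/nginx/rest.log.1"))</code>, since iterating over an open file yields its lines.</p>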
<h2 id="2022-09-11">2022-09-11</h2>
<ul>
<li>I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled</li>
</ul>
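<p>To get a feel for which REST requests the cached location from 2022-09-11 would cover, the regex can be tried against sample paths. This is a rough Python sketch under stated assumptions, not the nginx configuration itself: the helper name <code>is_cached_location</code> is hypothetical, nginx regex locations are matched against the URI without its query string (emulated here with <code>urlsplit</code>), and the <code>retrieve</code> path in the usage note is an illustrative example rather than one taken from the logs.</p>

```python
import re
from urllib.parse import urlsplit

# The location regex from the notes; nginx's `~` modifier makes it a
# case-sensitive, unanchored regular expression match.
LOCATION_RE = re.compile(r"/rest/(handle|items|collections|communities)/.+")


def is_cached_location(request_uri):
    """Return True if the request would fall into the cached location.

    Only the path part is tested because nginx matches locations
    against the URI without the query string.
    """
    path = urlsplit(request_uri).path
    return LOCATION_RE.search(path) is not None
```

<p>With this sketch, <code>/rest/handle/10568/110310?expand=all</code> and <code>/rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata</code> match, while an illustrative download path such as <code>/rest/bitstreams/example-id/retrieve</code> does not. Note that item-level listings like <code>/rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams</code> do match the regex.</p>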
<!-- raw HTML omitted -->