Add notes for 2022-09-30

2022-09-30 17:29:50 +03:00
parent 96f47ec7b5
commit c7aec5606c
120 changed files with 305 additions and 153 deletions

@@ -25,7 +25,7 @@ I also fixed a few bugs and improved the region-matching logic
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-09/" />
<meta property="article:published_time" content="2022-09-01T09:41:36+03:00" />
<meta property="article:modified_time" content="2022-09-28T17:10:23+03:00" />
<meta property="article:modified_time" content="2022-09-28T21:22:59+03:00" />
@@ -46,7 +46,7 @@ I also fixed a few bugs and improved the region-matching logic
"/>
<meta name="generator" content="Hugo 0.104.1" />
<meta name="generator" content="Hugo 0.104.2" />
@@ -56,9 +56,9 @@ I also fixed a few bugs and improved the region-matching logic
"@type": "BlogPosting",
"headline": "September, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-09/",
"wordCount": "3112",
"wordCount": "3621",
"datePublished": "2022-09-01T09:41:36+03:00",
"dateModified": "2022-09-28T17:10:23+03:00",
"dateModified": "2022-09-28T21:22:59+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -685,7 +685,84 @@ harvesting of meat from wildlife and not from livestock.</p>
</span></span><span style="display:flex;"><span>Fixed 3 occurences of: Alessandra Galie: 0000-0001-9868-7733
</span></span><span style="display:flex;"><span>Fixed 1 occurences of: Amanda De Filippo: 0000-0002-1536-3221
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><!-- raw HTML omitted -->
</span></span></code></pre></div><h2 id="2022-09-29">2022-09-29</h2>
<ul>
<li>I&rsquo;ve been checking the size of the nginx proxy cache the last few days and it always seems to hover around 14,000 entries and 385MB:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># find /var/cache/nginx/rest_cache/ -type f | wc -l
</span></span><span style="display:flex;"><span>14202
</span></span><span style="display:flex;"><span># du -sh /var/cache/nginx/rest_cache
</span></span><span style="display:flex;"><span>384M /var/cache/nginx/rest_cache
</span></span></code></pre></div><ul>
<li>Related to that, I&rsquo;m trying to implement a workaround for a potential caching issue that prevents MEL from updating items on DSpace Test
<ul>
<li>I <em>think</em> we might need to allow requests with a JSESSIONID to bypass the cache, but I have to verify with Salem</li>
<li>We can do this with an nginx map:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># Check <span style="color:#66d9ef">if</span> the JSESSIONID cookie is present and contains a 32-character hex
</span></span><span style="display:flex;"><span># value, which would mean that a user is actively attempting to re-use their
</span></span><span style="display:flex;"><span># Tomcat session. Then we set the $active_user_session variable and use it
</span></span><span style="display:flex;"><span># to bypass the nginx proxy cache in REST requests.
</span></span><span style="display:flex;"><span>map $cookie_jsessionid $active_user_session {
</span></span><span style="display:flex;"><span> # an empty value does not trigger proxy_cache_bypass or proxy_no_cache
</span></span><span style="display:flex;"><span> # see: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_bypass
</span></span><span style="display:flex;"><span> default &#39;&#39;;
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> &#39;~[A-Z0-9]{32}&#39; 1;
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>Then in the location block where we do the proxy cache:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> # Don&#39;t cache when user Shift-refreshes (Cache-Control: no-cache) or
</span></span><span style="display:flex;"><span> # when a client has an active session (see the $cookie_jsessionid map).
</span></span><span style="display:flex;"><span> proxy_cache_bypass $http_cache_control $active_user_session;
</span></span><span style="display:flex;"><span> proxy_no_cache $http_cache_control $active_user_session;
</span></span></code></pre></div><ul>
<li>I found one client making 10,000 requests using a Windows 98 user agent:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/4.0 (compatible; MSIE 5.00; Windows 98)
</span></span></code></pre></div><ul>
<li>They all come from one IP address (129.227.149.43) in Hong Kong
<ul>
<li>The IP belongs to a hosting provider called Zenlayer</li>
<li>I will add this IP to the nginx bot networks and purge its hits</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip -p
</span></span><span style="display:flex;"><span>Purging 33027 hits from 129.227.149.43 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 33027
</span></span></code></pre></div><ul>
<li>So it seems we&rsquo;ve seen this bot before, because the total number of hits purged (33,027) is much higher than the 10,000 requests from this month</li>
<li>I had a call with Salem and we verified that the nginx cache bypass for clients who provide a JSESSIONID fixes their issue with updating items/bitstreams from MEL
<ul>
<li>The issue was that they delete all metadata and bitstreams, then add them again to make sure everything is up to date. During that process they also re-request the item with all expands to get the bitstreams, and that response ends up getting cached, so they then try to delete the old bitstream</li>
</ul>
</li>
<li>I also noticed that someone made a <a href="https://github.com/DSpace/DSpace/pull/8343">pull request to enable POSTing bitstreams to a particular bundle</a> and it works, so that&rsquo;s awesome!</li>
</ul>
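<ul>
<li>As a rough illustration of the JSESSIONID cache bypass described above, here is a minimal Python sketch of how the map and the two proxy directives are expected to interact. The function names and the sample cookie value are hypothetical and not part of the nginx configuration. It assumes the nginx behavior documented for <code>proxy_cache_bypass</code> and <code>proxy_no_cache</code>, where the cache is skipped if at least one value is non-empty and not equal to "0":</li>
</ul>

```python
import re

# Same pattern as the nginx map. The nginx "~" operator is a case-sensitive,
# unanchored regex match, which re.search mirrors here.
JSESSIONID_PATTERN = re.compile(r"[A-Z0-9]{32}")


def active_user_session(cookie_jsessionid: str) -> str:
    """Hypothetical stand-in for the map: "1" for a session-like value, "" otherwise."""
    return "1" if JSESSIONID_PATTERN.search(cookie_jsessionid) else ""


def bypasses_cache(*values: str) -> bool:
    """Hypothetical stand-in for proxy_cache_bypass / proxy_no_cache:
    skip the cache when at least one value is non-empty and not "0"."""
    return any(value not in ("", "0") for value in values)


# Made-up 32-character cookie value, no Cache-Control header
session = active_user_session("0123456789ABCDEF0123456789ABCDEF")
print(bypasses_cache("", session))                  # True
# No cookie and no Cache-Control header: response may come from the cache
print(bypasses_cache("", active_user_session("")))  # False
# Shift-refresh (Cache-Control: no-cache) without a session cookie
print(bypasses_cache("no-cache", ""))               # True
```

<ul>
<li>This only models the decision logic, so the real behavior still has to be verified against nginx itself, for example by checking whether repeated REST requests that send the cookie reach Tomcat</li>
</ul>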
<h2 id="2022-09-30">2022-09-30</h2>
<ul>
<li>I applied <a href="https://github.com/DSpace/DSpace/pull/8343">the patch for POSTing bitstreams to other bundles</a> on CGSpace</li>
<li>Testing a few other DSpace 6.4 patches on DSpace Test:
<ul>
<li><a href="https://github.com/DSpace/DSpace/pull/1901">DS-3791 Make sure the &ldquo;yearDifference&rdquo; takes into account that a gap of 10 year contains 11 years</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/2501">DS-3873 Limit the usage of PDFBoxThumbnail to PDFs</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/2161">Reduce itemCounter init</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/2201">ImageMagick: Only execute &ldquo;identify&rdquo; on first page</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/2371">DS-3881: Show no total results on search-filter</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/2699">pass value instead of qualifier to method</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/7993">dspace-api: check for null AND empty qualifier in findByElement()</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/7995">Avoid exporting mapped Item more than once</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/3162">[DS-4574] v. 6 - Upgrade DBCP2 dependency</a></li>
<li><a href="https://github.com/DSpace/DSpace/pull/2742">bump up pdfbox version on 6.x to match main branch</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->