Add notes

This commit is contained in:
2024-01-05 15:45:46 +03:00
parent 264cdcf1db
commit 7418dae4b9
146 changed files with 1598 additions and 1169 deletions

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-12-27T10:48:32+03:00" />
<meta property="og:updated_time" content="2024-01-02T10:08:00+03:00" />
@ -31,7 +31,7 @@
"@type": "Person",
"name": "Alan Orth"
},
"dateModified": "2023-12-01T08:48:36+03:00",
"dateModified": "2024-01-02T10:08:00+03:00",
"keywords": "notes, migration, notes",
"description":"Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."
}
@ -96,6 +96,48 @@
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-05/">May, 2022</a></h2>
<p class="blog-post-meta"><time datetime="2022-05-04T09:13:39+03:00">Wed May 04, 2022</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-05-04">2022-05-04</h2>
<ul>
<li>I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
<ul>
<li>18.207.136.176</li>
<li>185.189.36.248</li>
<li>50.118.223.78</li>
<li>52.70.76.123</li>
<li>3.236.10.11</li>
</ul>
</li>
<li>Looking at the Solr statistics for 2022-04
<ul>
<li>52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests</li>
<li>64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc</li>
<li>185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt</li>
<li>157.55.39.159 is owned by Microsoft and identifies as bingbot so I don&rsquo;t know why its requests were logged in Solr</li>
<li>52.233.67.176 is owned by Microsoft and uses a normal user agent, but making excessive automated HTTP requests</li>
<li>157.55.39.144 is owned by Microsoft and uses a normal user agent, but making excessive automated HTTP requests</li>
<li>207.46.13.177 is owned by Microsoft and identifies as bingbot so I don&rsquo;t know why its requests were logged in Solr</li>
<li>If I query Solr for <code>time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.</code> I see a handful of IPs that made 41,000 requests</li>
</ul>
</li>
<li>I purged 93,974 hits from these IPs using my <code>check-spider-ip-hits.sh</code> script</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2022-05/'>Read more →</a>
</article>
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-04/">April, 2022</a></h2>
@ -332,30 +374,6 @@
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-07/">July, 2021</a></h2>
<p class="blog-post-meta"><time datetime="2021-07-01T08:53:07+03:00">Thu Jul 01, 2021</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-07-01">2021-07-01</h2>
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>
<nav class="blog-pagination">
<a class="btn btn-outline-primary" href="/cgspace-notes/posts/page/2/" rel="prev" role="button">Previous page</a>
@ -380,6 +398,8 @@
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-01/">January, 2024</a></li>
<li><a href="/cgspace-notes/2023-12/">December, 2023</a></li>
<li><a href="/cgspace-notes/2023-11/">November, 2023</a></li>
@ -388,8 +408,6 @@
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
</ol>
</section>