mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes
@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-12-27T10:48:32+03:00" />
<meta property="og:updated_time" content="2024-01-02T10:08:00+03:00" />
@ -31,7 +31,7 @@
"@type": "Person",
"name": "Alan Orth"
},
"dateModified": "2023-12-01T08:48:36+03:00",
"dateModified": "2024-01-02T10:08:00+03:00",
"keywords": "notes, migration, notes",
"description":"Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."
}
@ -96,6 +96,48 @@
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-05/">May, 2022</a></h2>
<p class="blog-post-meta"><time datetime="2022-05-04T09:13:39+03:00">Wed May 04, 2022</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-05-04">2022-05-04</h2>
<ul>
<li>I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days, so I will add them to the block list too:
<ul>
<li>18.207.136.176</li>
<li>185.189.36.248</li>
<li>50.118.223.78</li>
<li>52.70.76.123</li>
<li>3.236.10.11</li>
</ul>
</li>
<li>Looking at the Solr statistics for 2022-04:
<ul>
<li>52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests</li>
<li>64.39.98.62 is owned by Qualys, and all their requests are probing for sensitive paths such as <code>/etc/passwd</code></li>
<li>185.192.69.15 is in the Netherlands and is using a normal user agent, but is making excessive automated HTTP requests to paths forbidden in robots.txt</li>
<li>157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr</li>
<li>52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests</li>
<li>157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests</li>
<li>207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr</li>
<li>If I query Solr for <code>time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.</code> I see a handful of IPs that made 41,000 requests</li>
</ul>
</li>
<li>I purged 93,974 hits from these IPs using my <code>check-spider-ip-hits.sh</code> script</li>
</ul>
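<p>As a rough illustration of the Solr query above, the following sketch builds a facet request that would list the busiest IPs matching a query. It only constructs the URL and does not contact Solr. The base URL and the <code>ip</code> facet field are assumptions, not taken from the notes; <code>time</code> and <code>dns</code> are the fields used in the query above.</p>

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr statistics core URL; adjust for the actual deployment.
SOLR_STATISTICS_URL = "http://localhost:8983/solr/statistics"


def build_facet_url(query, facet_field="ip", limit=10):
    """Build a Solr select URL that facets matching hits by one field.

    rows=0 suppresses the documents themselves, so a response would
    contain only the facet counts (for example, hits per IP).
    The default facet field "ip" is an assumption about the schema.
    """
    params = {
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": facet_field,
        "facet.limit": limit,
        "facet.mincount": 1,
        "wt": "json",
    }
    return f"{SOLR_STATISTICS_URL}/select?{urlencode(params)}"


# The query string is the one quoted in the notes above.
msnbot_query = "time:2022-04* AND dns:*msnbot* AND dns:*.msn.com."
url = build_facet_url(msnbot_query)
print(url)
```

<p>Faceting with <code>rows=0</code> keeps the response small, which may matter when a single month holds tens of thousands of hits.</p>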
<a href='https://alanorth.github.io/cgspace-notes/2022-05/'>Read more →</a>
</article>
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-04/">April, 2022</a></h2>
@ -332,30 +374,6 @@
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-07/">July, 2021</a></h2>
<p class="blog-post-meta"><time datetime="2021-07-01T08:53:07+03:00">Thu Jul 01, 2021</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-07-01">2021-07-01</h2>
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div>
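<p>To show what the aggregation in the export above does, here is a minimal sketch that runs a simplified version of the query against an in-memory SQLite table. The sample rows are invented for illustration, the <code>count_subjects</code> helper is hypothetical, and the <code>dspace_object_id</code> filter from the original query is omitted for brevity.</p>

```python
import sqlite3


def count_subjects(rows, field_ids):
    """Count case-insensitive subject values for the given metadata fields.

    rows is an iterable of (text_value, metadata_field_id) tuples.
    Returns (subject, hits) tuples, most frequent first.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadatavalue (text_value TEXT, metadata_field_id INTEGER)"
    )
    conn.executemany("INSERT INTO metadatavalue VALUES (?, ?)", rows)
    # One placeholder per field ID keeps the IN clause parameterized.
    placeholders = ", ".join("?" for _ in field_ids)
    query = (
        "SELECT LOWER(text_value) AS subject, COUNT(*) AS hits "
        "FROM metadatavalue "
        f"WHERE metadata_field_id IN ({placeholders}) "
        "GROUP BY subject ORDER BY hits DESC"
    )
    result = conn.execute(query, list(field_ids)).fetchall()
    conn.close()
    return result


# Invented sample data; field 999 stands in for a non-subject field.
sample_rows = [
    ("MAIZE", 119),
    ("maize", 120),
    ("Maize", 119),
    ("wheat", 119),
    ("Wheat", 127),
    ("rice", 999),
]
subjects = count_subjects(sample_rows, [119, 120, 127])
print(subjects)
```

<p>Lowercasing before grouping merges variants such as "MAIZE" and "Maize" into one row, which is why the export reports a single count per subject.</p>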
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>
<nav class="blog-pagination">
<a class="btn btn-outline-primary" href="/cgspace-notes/page/2/" rel="prev" role="button">Previous page</a>
@ -380,6 +398,8 @@
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2024-01/">January, 2024</a></li>
<li><a href="/cgspace-notes/2023-12/">December, 2023</a></li>
<li><a href="/cgspace-notes/2023-11/">November, 2023</a></li>
@ -388,8 +408,6 @@
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
</ol>
</section>