Update notes for 2020-09-10

Alan Orth 2020-09-10 15:00:40 +03:00
parent 9d0f0cbfde
commit 7b3aa58055
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
22 changed files with 66 additions and 27 deletions

View File

@ -190,5 +190,20 @@ Would fix 3 occurences of: SOUTHWEST ASIA
- I think we need to wait for the web team, though, as they need to update their mappings
- Not to mention that we'll need to give WLE and CCAFS time to update their harvesters as well... hmmm
- Looking at the top user agents active on CGSpace in 2020-08, I see:
- `Delphi 2009`: 235353 (this is the GARDIAN harvester, I guess, as the IP is in Greece)
- `Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)`: 57004 (IP is 18.196.100.94, and the requests seem to be for CTA's content)
- `RTB website BOT`: 12282
- `ILRI Livestock Website Publications importer BOT`: 9393
- Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn't commit the change
- HTTrack is in the agents list so I'm not sure why DSpace registers a hit from that request
- Also, I am surprised to see the RTB and ILRI bots here because they have "BOT" in the name and that should also be dropped
- I also see hits from `curl` and `Java/1.8.0_66` and `Apache-HttpClient` so WTF... those are supposed to be dropped by the default agents list
- Some IP `2607:f298:5:101d:f816:3eff:fed9:a484` made 9,000 requests with the `RI/1.0` user agent this year...
- That's on DreamHost...?
- I purged 448658 hits from these agents and added `Delphi` to our local agents overload for Solr as well as Tomcat's Crawler Session Manager Valve so that it forces them to re-use a single session
- I made a pull request on the COUNTER-Robots project for the Daum robot: https://github.com/atmire/COUNTER-Robots/pull/38
- This bot made 8,000 requests to CGSpace this year
- I purged about 20,000 total requests from this bot from our Solr stats for the last few years
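- A minimal sketch of the matching I would expect from the agents list, assuming the patterns are applied as case-insensitive regular expressions (the patterns below are illustrative stand-ins, not the actual contents of the DSpace or COUNTER-Robots lists, and `is_spider` and `count_spider_hits` are hypothetical helper names):

```python
import re

# Illustrative patterns only: NOT the real DSpace or COUNTER-Robots lists.
AGENT_PATTERNS = [
    r"bot",
    r"HTTrack",
    r"^curl",
    r"^Java/",
    r"Apache-HttpClient",
    r"Delphi",
]

# Assumption: patterns are matched case-insensitively anywhere in the string.
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in AGENT_PATTERNS]


def is_spider(user_agent: str) -> bool:
    """Return True if any pattern matches the user agent string."""
    return any(pattern.search(user_agent) for pattern in COMPILED)


def count_spider_hits(hits: dict) -> int:
    """Sum the hit counts of all user agents that match a pattern."""
    return sum(count for agent, count in hits.items() if is_spider(agent))


# Top user agents and hit counts from 2020-08, as listed above.
hits = {
    "Delphi 2009": 235353,
    "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)": 57004,
    "RTB website BOT": 12282,
    "ILRI Livestock Website Publications importer BOT": 9393,
}

print(count_spider_hits(hits))
```

- Under those assumptions all four of the top agents would match, which is why I'm surprised to see them in the stats at all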
<!-- vim: set sw=2 ts=2: -->

View File

@ -25,7 +25,7 @@ I filed an issue on OpenRXV to make some minor edits to the admin UI: https://gi
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-09/" />
<meta property="article:published_time" content="2020-09-02T15:35:54+03:00" />
<meta property="article:modified_time" content="2020-09-08T12:10:08+03:00" />
<meta property="article:modified_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2020"/>
@ -55,9 +55,9 @@ I filed an issue on OpenRXV to make some minor edits to the admin UI: https://gi
"@type": "BlogPosting",
"headline": "September, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-09/",
"wordCount": "1159",
"wordCount": "1398",
"datePublished": "2020-09-02T15:35:54+03:00",
"dateModified": "2020-09-08T12:10:08+03:00",
"dateModified": "2020-09-10T12:18:03+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -341,6 +341,30 @@ Would fix 3 occurences of: SOUTHWEST ASIA
<li>Not to mention that we&rsquo;ll need to give WLE and CCAFS time to update their harvesters as well&hellip; hmmm</li>
</ul>
</li>
<li>Looking at the top user agents active on CGSpace in 2020-08, I see:
<ul>
<li><code>Delphi 2009</code>: 235353 (this is the GARDIAN harvester, I guess, as the IP is in Greece)</li>
<li><code>Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)</code>: 57004 (IP is 18.196.100.94, and the requests seem to be for CTA&rsquo;s content)</li>
<li><code>RTB website BOT</code>: 12282</li>
<li><code>ILRI Livestock Website Publications importer BOT</code>: 9393</li>
</ul>
</li>
<li>Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn&rsquo;t commit the change</li>
<li>HTTrack is in the agents list so I&rsquo;m not sure why DSpace registers a hit from that request</li>
<li>Also, I am surprised to see the RTB and ILRI bots here because they have &ldquo;BOT&rdquo; in the name and that should also be dropped</li>
<li>I also see hits from <code>curl</code> and <code>Java/1.8.0_66</code> and <code>Apache-HttpClient</code> so WTF&hellip; those are supposed to be dropped by the default agents list</li>
<li>Some IP <code>2607:f298:5:101d:f816:3eff:fed9:a484</code> made 9,000 requests with the <code>RI/1.0</code> user agent this year&hellip;
<ul>
<li>That&rsquo;s on DreamHost&hellip;?</li>
</ul>
</li>
<li>I purged 448658 hits from these agents and added <code>Delphi</code> to our local agents overload for Solr as well as Tomcat&rsquo;s Crawler Session Manager Valve so that it forces them to re-use a single session</li>
<li>I made a pull request on the COUNTER-Robots project for the Daum robot: <a href="https://github.com/atmire/COUNTER-Robots/pull/38">https://github.com/atmire/COUNTER-Robots/pull/38</a>
<ul>
<li>This bot made 8,000 requests to CGSpace this year</li>
<li>I purged about 20,000 total requests from this bot from our Solr stats for the last few years</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-09-08T12:10:08+03:00" />
<meta property="og:updated_time" content="2020-09-10T12:18:03+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-09-08T12:10:08+03:00</lastmod>
<lastmod>2020-09-10T12:18:03+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-09-08T12:10:08+03:00</lastmod>
<lastmod>2020-09-10T12:18:03+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-09-08T12:10:08+03:00</lastmod>
<lastmod>2020-09-10T12:18:03+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-09-08T12:10:08+03:00</lastmod>
<lastmod>2020-09-10T12:18:03+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-09/</loc>
<lastmod>2020-09-08T12:10:08+03:00</lastmod>
<lastmod>2020-09-10T12:18:03+03:00</lastmod>
</url>
<url>