diff --git a/content/posts/2020-09.md b/content/posts/2020-09.md
index 269d15c93..177e1c8b5 100644
--- a/content/posts/2020-09.md
+++ b/content/posts/2020-09.md
@@ -190,5 +190,20 @@ Would fix 3 occurences of: SOUTHWEST ASIA
- I think we need to wait for the web team, though, as they need to update their mappings
- Not to mention that we'll need to give WLE and CCAFS time to update their harvesters as well... hmmm
+- Looking at the top user agents active on CGSpace in 2020-08, I see:
+ - `Delphi 2009`: 235353 (this is the GARDIAN harvester, I guess, as the IP is in Greece)
+ - `Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)`: 57004 (IP is 18.196.100.94, and the requests seem to be for CTA's content)
+ - `RTB website BOT`: 12282
+ - `ILRI Livestock Website Publications importer BOT`: 9393
+- Shit, I meant to add Delphi to the DSpace spider agents list last month, but I guess I didn't commit the change
+- HTTrack is in the agents list, so I'm not sure why DSpace registers hits from those requests
+- Also, I am surprised to see the RTB and ILRI bots here because they have "BOT" in their names, so they should be dropped as well
+- I also see hits from `curl`, `Java/1.8.0_66`, and `Apache-HttpClient`, so WTF... those are supposed to be dropped by the default agents list
+- The IP `2607:f298:5:101d:f816:3eff:fed9:a484` made 9,000 requests with the `RI/1.0` user agent this year...
+ - That's on DreamHost...?
+- I purged 448658 hits from these agents and added `Delphi` to our local agents overload for Solr as well as to Tomcat's Crawler Session Manager Valve, which forces matching clients to re-use a single session
+- I made a pull request on the COUNTER-Robots project for the Daum robot: https://github.com/atmire/COUNTER-Robots/pull/38
+ - This bot made 8,000 requests to CGSpace this year
+ - I purged about 20,000 total requests from this bot from our Solr stats for the last few years
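
The check described in the notes above (tally hits per user agent, then see which agents a spider list should have caught) can be sketched roughly as follows. This is only an illustration: the patterns are a hypothetical, simplified subset and not the actual DSpace or COUNTER-Robots lists, and case-insensitive matching is an assumption here, not a statement about how DSpace behaves.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns for illustration only; the real DSpace
# and COUNTER-Robots agent lists are much longer and may differ.
AGENT_PATTERNS = [
    r"bot",
    r"curl",
    r"Delphi",
    r"HTTrack",
    r"^Java/",
    r"Apache-HttpClient",
]


def is_spider(user_agent, patterns=AGENT_PATTERNS):
    """Return True if any pattern matches the user agent.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


def top_agents(user_agents, n=4):
    """Return the n most common user agents as (agent, count) pairs."""
    return Counter(user_agents).most_common(n)
```

Given a list of user agent strings pulled from the logs or Solr, `top_agents(agents)` gives the tally and `is_spider(agent)` flags the ones a pattern list like this would drop. Under these assumed patterns `RTB website BOT` and `Delphi 2009` both match, which is the behaviour the notes expected.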
diff --git a/docs/2020-09/index.html b/docs/2020-09/index.html
index 689ebc992..96a99d542 100644
--- a/docs/2020-09/index.html
+++ b/docs/2020-09/index.html
@@ -25,7 +25,7 @@ I filed an issue on OpenRXV to make some minor edits to the admin UI: https://gi
-
+
@@ -55,9 +55,9 @@ I filed an issue on OpenRXV to make some minor edits to the admin UI: https://gi
"@type": "BlogPosting",
"headline": "September, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-09/",
- "wordCount": "1159",
+ "wordCount": "1398",
"datePublished": "2020-09-02T15:35:54+03:00",
- "dateModified": "2020-09-08T12:10:08+03:00",
+ "dateModified": "2020-09-10T12:18:03+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -341,6 +341,30 @@ Would fix 3 occurences of: SOUTHWEST ASIA
Not to mention that we’ll need to give WLE and CCAFS time to update their harvesters as well… hmmm
+Looking at the top user agents active on CGSpace in 2020-08, I see:
+
+Delphi 2009: 235353 (this is the GARDIAN harvester, I guess, as the IP is in Greece)
+Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98): 57004 (IP is 18.196.100.94, and the requests seem to be for CTA’s content)
+RTB website BOT: 12282
+ILRI Livestock Website Publications importer BOT: 9393
+
+
+Shit, I meant to add Delphi to the DSpace spider agents list last month, but I guess I didn’t commit the change
+HTTrack is in the agents list, so I’m not sure why DSpace registers hits from those requests
+Also, I am surprised to see the RTB and ILRI bots here because they have “BOT” in their names, so they should be dropped as well
+I also see hits from curl, Java/1.8.0_66, and Apache-HttpClient, so WTF… those are supposed to be dropped by the default agents list
+The IP 2607:f298:5:101d:f816:3eff:fed9:a484 made 9,000 requests with the RI/1.0 user agent this year…
+
+
+I purged 448658 hits from these agents and added Delphi to our local agents overload for Solr as well as to Tomcat’s Crawler Session Manager Valve, which forces matching clients to re-use a single session
+I made a pull request on the COUNTER-Robots project for the Daum robot: https://github.com/atmire/COUNTER-Robots/pull/38
+
+- This bot made 8,000 requests to CGSpace this year
+- I purged about 20,000 total requests from this bot from our Solr stats for the last few years
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index aabfc2696..c114f07a6 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index e6e165943..5555e9460 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index cda11b758..830128db8 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index fb7d1e958..ed36d661b 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 0c35d83a0..b7a3b4ae1 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index d73b3b694..427f11400 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 997b1e26f..6d7252bd5 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 6cc3e17c6..a9a6e17fe 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index a04f5364c..ecf5de606 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index b1c4ef3d7..075bdaf2e 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 47f08cd4f..4f03912d9 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 63b3479ab..9cb065182 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 38c138346..8d260a9cf 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index ad1435bfa..581bc3575 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index fd95db1b4..87d5e09c5 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 5e0221cb4..41afea5fa 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 0be45e701..7a1f50450 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 479bb77a4..55840c155 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index c348ac379..2200ab602 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 3b90c9ffa..4f3c8624e 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
https://alanorth.github.io/cgspace-notes/categories/
- 2020-09-08T12:10:08+03:00
+ 2020-09-10T12:18:03+03:00
https://alanorth.github.io/cgspace-notes/
- 2020-09-08T12:10:08+03:00
+ 2020-09-10T12:18:03+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-09-08T12:10:08+03:00
+ 2020-09-10T12:18:03+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2020-09-08T12:10:08+03:00
+ 2020-09-10T12:18:03+03:00
https://alanorth.github.io/cgspace-notes/2020-09/
- 2020-09-08T12:10:08+03:00
+ 2020-09-10T12:18:03+03:00