From c6ac8b9afebc1e9d8b277367c1cec971d97af299 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Wed, 5 Aug 2020 16:58:31 +0300
Subject: [PATCH] Update notes for 2020-08-05

---
 content/posts/2020-08.md                | 60 +++++++++++++++++++++++++
 docs/categories/index.html              |  2 +-
 docs/categories/notes/index.html        |  2 +-
 docs/categories/notes/page/2/index.html |  2 +-
 docs/categories/notes/page/3/index.html |  2 +-
 docs/categories/notes/page/4/index.html |  2 +-
 docs/index.html                         |  2 +-
 docs/page/2/index.html                  |  2 +-
 docs/page/3/index.html                  |  2 +-
 docs/page/4/index.html                  |  2 +-
 docs/page/5/index.html                  |  2 +-
 docs/page/6/index.html                  |  2 +-
 docs/posts/index.html                   |  2 +-
 docs/posts/page/2/index.html            |  2 +-
 docs/posts/page/3/index.html            |  2 +-
 docs/posts/page/4/index.html            |  2 +-
 docs/posts/page/5/index.html            |  2 +-
 docs/posts/page/6/index.html            |  2 +-
 docs/sitemap.xml                        | 10 ++---
 19 files changed, 82 insertions(+), 22 deletions(-)

diff --git a/content/posts/2020-08.md b/content/posts/2020-08.md
index 6da0b0541..bb04d3d6e 100644
--- a/content/posts/2020-08.md
+++ b/content/posts/2020-08.md
@@ -120,5 +120,65 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
 - Seems that something happened yesterday afternoon at around 5PM...
 - For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue
 - I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly
+- I checked the nginx logs around 5PM yesterday to see who was accessing the server:
+
+```
+# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
+```
+
+- I see the Macaroni Bros are using their new user agent for harvesting: `RTB website BOT`
+  - But that pattern doesn't match in the nginx bot list or Tomcat's crawler session manager valve because we're only checking for `[Bb]ot`!
+  - So they have created thousands of Tomcat sessions:
+
+```
+$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+5693
+```
+
+- DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don't misuse the resources
+  - Perhaps `[Bb][Oo][Tt]`...
+- I see another IP 104.198.96.245, which is also using the "RTB website BOT" but there are 70,000 hits in Solr from earlier this year before they started using the user agent
+  - I purged all the hits from Solr, including a few thousand from 64.62.202.71
+- A few more IPs causing lots of Tomcat sessions yesterday:
+
+```
+$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+1585
+$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+5691
+```
+
+- 38.128.66.10 isn't creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:
+
+```
+Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
+```
+
+- 64.62.202.71 is using a user agent I've never seen before:
+
+```
+Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
+```
+
+- So now our "bot" regex can't even match that...
+  - Unless we change it to `[Bb]\.?[Oo]\.?[Tt]\.?`... which seems to match all variations of "bot" I can think of right now, according to [regexr.com](https://regexr.com/59lpt):
+
+```
+RTB website BOT
+Altmetribot
+Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
+Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
+Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
+```
+
+- And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):
+
+```
+$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'sessi
+on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+2777
+```
+
+- I will add `Turnitin` to the Tomcat Crawler Session Manager Valve regex as well...
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 4f8d39e3b..600183605 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index b30881833..ae303d571 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 8bd86d7a1..84bdb03d4 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index e2aa7f405..45204d66d 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index e7649cd42..9ff85015a 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 7225b99cd..b004790b8 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 8b583663b..a82515a14 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 4f1e34760..46996d5fd 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 08796aec3..6911371d2 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 08110ca7d..2927c2bcc 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 5b159049b..9608e7177 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 83b0a73a5..818d1e419 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index b519c5c0c..ab05ed905 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 01915cb62..39532ac5f 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 1e77f1a94..e7da2351c 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 5d62d246c..a1db89c20 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 4b2785e4e..2a06cb3b5 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 1119d7954..39aff2545 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
  https://alanorth.github.io/cgspace-notes/2020-07/
- 2020-08-03T16:27:51+03:00
+ 2020-08-05T15:00:06+03:00
  https://alanorth.github.io/cgspace-notes/categories/
- 2020-08-03T16:27:51+03:00
+ 2020-08-05T15:00:06+03:00
  https://alanorth.github.io/cgspace-notes/
- 2020-08-03T16:27:51+03:00
+ 2020-08-05T15:00:06+03:00
  https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-08-03T16:27:51+03:00
+ 2020-08-05T15:00:06+03:00
  https://alanorth.github.io/cgspace-notes/posts/
- 2020-08-03T16:27:51+03:00
+ 2020-08-05T15:00:06+03:00