diff --git a/content/posts/2018-12.md b/content/posts/2018-12.md index 2cce254d2..1ccc3043b 100644 --- a/content/posts/2018-12.md +++ b/content/posts/2018-12.md @@ -231,4 +231,77 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg - This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness...? +## 2018-12-04 + +- Last night Linode sent a message that the load on CGSpace (linode18) was too high, here's a list of the top users at the time and throughout the day: + +``` +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 + 225 40.77.167.142 + 226 66.249.64.63 + 232 46.101.86.248 + 285 45.5.186.2 + 333 54.70.40.11 + 411 193.29.13.85 + 476 34.218.226.147 + 962 66.249.70.27 + 1193 35.237.175.180 + 1450 2a01:4f8:140:3192::2 +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 + 1141 207.46.13.57 + 1299 197.210.168.174 + 1341 54.70.40.11 + 1429 40.77.167.142 + 1528 34.218.226.147 + 1973 66.249.70.27 + 2079 50.116.102.77 + 2494 78.46.79.71 + 3210 2a01:4f8:140:3192::2 + 4190 35.237.175.180 +``` + +- `35.237.175.180` is known to us (CCAFS?), and I've already added it to the list of bot IPs in nginx, which appears to be working: + +``` +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 +4772 +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l +630 +``` + +- I haven't seen `2a01:4f8:140:3192::2` before. 
Its user agent is some new bot: + +``` +Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/) +``` + +- At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions: + +``` +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 +5111 +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l +419 +``` + +- `78.46.79.71` is another host on Hetzner with the following user agent: + +``` +Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0 +``` + +- This is not the first time a host on Hetzner has used a "normal" user agent to make thousands of requests +- At least it is re-using its Tomcat sessions somehow: + +``` +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 +2044 +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l +1 +``` + +- In other news, it's good to see my re-work of the database connectivity in the [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) actually caused a reduction of persistent database connections (from 1 to 0, but still!): + +![PostgreSQL connections day](/cgspace-notes/2018/12/postgres_connections_db-month.png) + diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html index 205b25ae5..542ec5027 100644 --- a/docs/2018-11/index.html +++ b/docs/2018-11/index.html @@ -21,7 +21,7 @@ Today these are the top 10 IPs: " /> - + @@ -48,9 +48,9 @@ Today these are the top 10 IPs: "@type": "BlogPosting", "headline": "November, 2018", "url": "https://alanorth.github.io/cgspace-notes/2018-11/", - "wordCount": "2698", + "wordCount": "2823", "datePublished": "2018-11-01T16:41:30+02:00", - "dateModified": "2018-11-28T09:32:04+02:00", + "dateModified": "2018-12-04T09:50:36+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -156,7 +156,7 @@ Today these are the 
top 10 IPs:
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03 | sort | uniq
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
342
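As an aside, the reason these commands needed correcting is that `grep -c` prints a single count of matching *lines*, so piping it through `sort | uniq` is a no-op — it was really counting requests, not sessions. Counting distinct sessions needs `grep -o … | sort | uniq | wc -l`. A minimal sketch with an invented two-session log (session IDs and IP are made up; real dspace.log lines are much longer):

```shell
$ printf 'session_id=AAAA:ip_addr=1.2.3.4\nsession_id=AAAA:ip_addr=1.2.3.4\nsession_id=BBBB:ip_addr=1.2.3.4\n' > /tmp/demo.log
# grep -c prints ONE number (matching lines), so a trailing `| sort | uniq`
# changes nothing -- this counts requests:
$ grep -c 'session_id=[A-Z0-9]*:ip_addr=1.2.3.4' /tmp/demo.log
3
# grep -o emits each match on its own line, so sort | uniq | wc -l
# counts distinct session IDs:
$ grep -o 'session_id=[A-Z0-9]*:ip_addr=1.2.3.4' /tmp/demo.log | sort | uniq | wc -l
2
```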
@@ -173,7 +173,7 @@ Today these are the top 10 IPs:
And it doesn’t seem they are re-using their Tomcat sessions:
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03 | sort | uniq
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
1243
@@ -205,14 +205,17 @@ Today these are the top 10 IPs:
-- It’s making lots of requests and using quite a number of Tomcat sessions:
+- It’s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' /home/cgspace.cgiar.org/log/dspace.log.2018-11-03 | sort | uniq
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
8449
+$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
+1
+- Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions
- I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing
- Perhaps I should think about adding rate limits to dynamic pages like /discover and /browse
- I think it’s reasonable for a human to click one of those links five or ten times a minute…
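A rate limit along those lines could look roughly like this in nginx (a hypothetical sketch, not the actual CGSpace config: the zone name, zone size, and upstream name are all invented for illustration):

```nginx
# http {} context: track clients by IP, allow ~10 requests/minute
# (zone name "dynamicpages" and the 16m size are invented here)
limit_req_zone $binary_remote_addr zone=dynamicpages:16m rate=10r/m;

server {
    # ...
    # Apply it only to the expensive dynamic XMLUI pages, with a small
    # burst -- roughly the "five or ten clicks a minute" a human might do
    location ~ ^/(discover|browse) {
        limit_req zone=dynamicpages burst=5 nodelay;
        proxy_pass http://tomcat_http;  # upstream name is hypothetical
    }
}
```

Requests beyond the burst would get HTTP 503 (or 429 with `limit_req_status`), which a human is unlikely to ever see at those rates.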
@@ -270,14 +273,17 @@ Today these are the top 10 IPs:
-78.46.89.18 is back… and still making tons of Tomcat sessions:
+78.46.89.18 is back… and it is still actually re-using its Tomcat sessions:
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
8765
+$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
+1
+- Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly
- Also, now we have a ton of Facebook crawlers:
@@ -302,14 +308,15 @@ Today these are the top 10 IPs:
-- They are really making shit tons of Tomcat sessions:
+- They are really making shit tons of requests:
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
-14368
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
+37721
+- Updated on 2018-12-04 to correct the grep command to accurately show the number of requests
- Their user agent is:
@@ -342,16 +349,17 @@ Today these are the top 10 IPs:
-- And still making shit tons of Tomcat sessions:
+- Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
-28470
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
+37721
+$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
+15206
-- And that’s even using the Tomcat Crawler Session Manager valve!
-- Maybe we need to limit more dynamic pages, like the “most popular” country, item, and author pages
+- I think we still need to limit more of the dynamic pages, like the “most popular” country, item, and author pages
- It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!
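One way to check whether a client is actually hammering those dynamic pages is to tally request paths from the nginx access log — in the combined log format, field `$7` is the request path. A sketch using an invented three-line log (the IP, paths, and user agent are made up):

```shell
$ cat > /tmp/access-demo.log <<'EOF'
1.2.3.4 - - [04/Dec/2018:10:00:01 +0200] "GET /discover HTTP/1.1" 200 1234 "-" "bot"
1.2.3.4 - - [04/Dec/2018:10:00:02 +0200] "GET /discover HTTP/1.1" 200 1234 "-" "bot"
1.2.3.4 - - [04/Dec/2018:10:00:03 +0200] "GET /browse?type=author HTTP/1.1" 200 1234 "-" "bot"
EOF
# print the request path, keep only the dynamic pages, then count
$ awk '{print $7}' /tmp/access-demo.log | grep -E '^/(discover|browse)' | sort | uniq -c | sort -rn
```

On the real logs the `cat` heredoc would of course be replaced by `zcat --force /var/log/nginx/*.log …` as in the commands above.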
@@ -377,19 +385,22 @@ Today these are the top 10 IPs:
The file marlo.csv was cleaned up and formatted in Open Refine
165 of the items in their 2017 data are from CGSpace!
I will add the data to CGSpace this week (done!)
-Jesus, is Facebook trying to be annoying?
+Jesus, is Facebook trying to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
29889
-# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq
-29156
+# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
+29763
+# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
+1057
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep "2a03:2880:11ff:" | grep -c -E "(handle|bitstream)"
29896
-- 29,000 requests from Facebook, 29,000 Tomcat sessions, and none of the requests are to the dynamic pages I rate limited yesterday!
+- 29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!
+- At least the Tomcat Crawler Session Manager Valve is working now…
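For reference, the valve in question is configured in Tomcat's server.xml inside the `<Host>` element; a sketch (the `crawlerUserAgents` regex below is illustrative — Tomcat ships a default that already matches most common bots):

```xml
<!-- Share one session per crawler so bots don't pile up thousands of
     Tomcat sessions; sessionInactiveInterval is in seconds. -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*facebookexternalhit.*|.*BLEXBot.*"
       sessionInactiveInterval="60" />
```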
2018-11-06
diff --git a/docs/2018-12/index.html b/docs/2018-12/index.html
index d64977f5c..54ad0cc8a 100644
--- a/docs/2018-12/index.html
+++ b/docs/2018-12/index.html
@@ -21,7 +21,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
" />
-
+
@@ -48,9 +48,9 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
"@type": "BlogPosting",
"headline": "December, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-12/",
- "wordCount": "1503",
+ "wordCount": "1826",
"datePublished": "2018-12-02T02:09:30+02:00",
- "dateModified": "2018-12-03T13:16:42+02:00",
+ "dateModified": "2018-12-03T18:28:21+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -370,6 +370,87 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness…?
+2018-12-04
+
+
+- Last night Linode sent a message that the load on CGSpace (linode18) was too high, here’s a list of the top users at the time and throughout the day:
+
+
+# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 225 40.77.167.142
+ 226 66.249.64.63
+ 232 46.101.86.248
+ 285 45.5.186.2
+ 333 54.70.40.11
+ 411 193.29.13.85
+ 476 34.218.226.147
+ 962 66.249.70.27
+ 1193 35.237.175.180
+ 1450 2a01:4f8:140:3192::2
+# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
+ 1141 207.46.13.57
+ 1299 197.210.168.174
+ 1341 54.70.40.11
+ 1429 40.77.167.142
+ 1528 34.218.226.147
+ 1973 66.249.70.27
+ 2079 50.116.102.77
+ 2494 78.46.79.71
+ 3210 2a01:4f8:140:3192::2
+ 4190 35.237.175.180
+
+
+
+35.237.175.180 is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:
+
+
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
+4772
+$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
+630
+
+
+
+- I haven’t seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:
+
+
+Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
+
+
+
+- At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:
+
+
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
+5111
+$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
+419
+
+
+
+78.46.79.71 is another host on Hetzner with the following user agent:
+
+
+Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
+
+
+
+- This is not the first time a host on Hetzner has used a “normal” user agent to make thousands of requests
+- At least it is re-using its Tomcat sessions somehow:
+
+
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
+2044
+$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
+1
+
+
+
+- In other news, it’s good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):
+
+
+
+
diff --git a/docs/2018/12/postgres_connections_db-month.png b/docs/2018/12/postgres_connections_db-month.png
new file mode 100644
index 000000000..6a9bf25a6
Binary files /dev/null and b/docs/2018/12/postgres_connections_db-month.png differ
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 099d6d67a..6b230a371 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,12 +4,12 @@
https://alanorth.github.io/cgspace-notes/2018-12/
- 2018-12-03T13:16:42+02:00
+ 2018-12-03T18:28:21+02:00
https://alanorth.github.io/cgspace-notes/2018-11/
- 2018-11-28T09:32:04+02:00
+ 2018-12-04T09:50:36+02:00
@@ -199,7 +199,7 @@
https://alanorth.github.io/cgspace-notes/
- 2018-12-03T13:16:42+02:00
+ 2018-12-03T18:28:21+02:00
0
@@ -210,7 +210,7 @@
https://alanorth.github.io/cgspace-notes/tags/notes/
- 2018-12-03T13:16:42+02:00
+ 2018-12-03T18:28:21+02:00
0
@@ -222,13 +222,13 @@
https://alanorth.github.io/cgspace-notes/posts/
- 2018-12-03T13:16:42+02:00
+ 2018-12-03T18:28:21+02:00
0
https://alanorth.github.io/cgspace-notes/tags/
- 2018-12-03T13:16:42+02:00
+ 2018-12-03T18:28:21+02:00
0
diff --git a/static/2018/12/postgres_connections_db-month.png b/static/2018/12/postgres_connections_db-month.png
new file mode 100644
index 000000000..6a9bf25a6
Binary files /dev/null and b/static/2018/12/postgres_connections_db-month.png differ