diff --git a/content/posts/2018-12.md b/content/posts/2018-12.md index 2cce254d2..1ccc3043b 100644 --- a/content/posts/2018-12.md +++ b/content/posts/2018-12.md @@ -231,4 +231,77 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg - This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness...? +## 2018-12-04 + +- Last night Linode sent a message that the load on CGSpace (linode18) was too high, here's a list of the top users at the time and throughout the day: + +``` +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 + 225 40.77.167.142 + 226 66.249.64.63 + 232 46.101.86.248 + 285 45.5.186.2 + 333 54.70.40.11 + 411 193.29.13.85 + 476 34.218.226.147 + 962 66.249.70.27 + 1193 35.237.175.180 + 1450 2a01:4f8:140:3192::2 +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 + 1141 207.46.13.57 + 1299 197.210.168.174 + 1341 54.70.40.11 + 1429 40.77.167.142 + 1528 34.218.226.147 + 1973 66.249.70.27 + 2079 50.116.102.77 + 2494 78.46.79.71 + 3210 2a01:4f8:140:3192::2 + 4190 35.237.175.180 +``` + +- `35.237.175.180` is known to us (CCAFS?), and I've already added it to the list of bot IPs in nginx, which appears to be working: + +``` +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 +4772 +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l +630 +``` + +- I haven't seen `2a01:4f8:140:3192::2` before. 
Its user agent is some new bot: + +``` +Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/) +``` + +- At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions: + +``` +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 +5111 +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l +419 +``` + +- `78.46.79.71` is another host on Hetzner with the following user agent: + +``` +Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0 +``` + +- This is not the first time a host on Hetzner has used a "normal" user agent to make thousands of requests +- At least it is re-using its Tomcat sessions somehow: + +``` +$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 +2044 +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l +1 +``` + +- In other news, it's good to see my re-work of the database connectivity in the [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) actually caused a reduction of persistent database connections (from 1 to 0, but still!): + +![PostgreSQL connections day](/cgspace-notes/2018/12/postgres_connections_db-month.png) + diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html index 205b25ae5..542ec5027 100644 --- a/docs/2018-11/index.html +++ b/docs/2018-11/index.html @@ -21,7 +21,7 @@ Today these are the top 10 IPs: " /> - + @@ -48,9 +48,9 @@ Today these are the top 10 IPs: "@type": "BlogPosting", "headline": "November, 2018", "url": "https://alanorth.github.io/cgspace-notes/2018-11/", - "wordCount": "2698", + "wordCount": "2823", "datePublished": "2018-11-01T16:41:30+02:00", - "dateModified": "2018-11-28T09:32:04+02:00", + "dateModified": "2018-12-04T09:50:36+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -156,7 +156,7 @@ Today these are the 
top 10 IPs:
  • They at least seem to be re-using their Tomcat sessions:
  • -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
     342
     
    @@ -173,7 +173,7 @@ Today these are the top 10 IPs:
  • And it doesn’t seem they are re-using their Tomcat sessions:
  • -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
     1243
     
    @@ -205,14 +205,17 @@ Today these are the top 10 IPs:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' /home/cgspace.cgiar.org/log/dspace.log.2018-11-03 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     8449
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
    +1
     
      +
    • Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions
    • I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing
    • Perhaps I should think about adding rate limits to dynamic pages like /discover and /browse
    • I think it’s reasonable for a human to click one of those links five or ten times a minute…
    • @@ -270,14 +273,17 @@ Today these are the top 10 IPs:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     8765
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
    +1
     
      +
    • Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly
    • Also, now we have a ton of Facebook crawlers:
    @@ -302,14 +308,15 @@ Today these are the top 10 IPs:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
    -14368
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
    +37721
     
      +
    • Updated on 2018-12-04 to correct the grep command to accurately show the number of requests
    • Their user agent is:
    @@ -342,16 +349,17 @@ Today these are the top 10 IPs:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
    -28470
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
    +37721
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
    +15206
     
      -
    • And that’s even using the Tomcat Crawler Session Manager valve!
    • -
    • Maybe we need to limit more dynamic pages, like the “most popular” country, item, and author pages
    • +
    • I think we still need to limit more of the dynamic pages, like the “most popular” country, item, and author pages
    • It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!
    @@ -377,19 +385,22 @@ Today these are the top 10 IPs:
  • The file marlo.csv was cleaned up and formatted in Open Refine
  • 165 of the items in their 2017 data are from CGSpace!
  • I will add the data to CGSpace this week (done!)
  • -
  • Jesus, is Facebook trying to be annoying?
  • +
  • Jesus, is Facebook trying to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:
  • # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
     29889
    -# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq
    -29156
    +# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
    +29763
    +# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
    +1057
     # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep "2a03:2880:11ff:" | grep -c -E "(handle|bitstream)"
     29896
     
      -
    • 29,000 requests from Facebook, 29,000 Tomcat sessions, and none of the requests are to the dynamic pages I rate limited yesterday!
    • +
    • 29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!
    • +
    • At least the Tomcat Crawler Session Manager Valve is working now…

    2018-11-06

    diff --git a/docs/2018-12/index.html b/docs/2018-12/index.html index d64977f5c..54ad0cc8a 100644 --- a/docs/2018-12/index.html +++ b/docs/2018-12/index.html @@ -21,7 +21,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see " /> - + @@ -48,9 +48,9 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see "@type": "BlogPosting", "headline": "December, 2018", "url": "https://alanorth.github.io/cgspace-notes/2018-12/", - "wordCount": "1503", + "wordCount": "1826", "datePublished": "2018-12-02T02:09:30+02:00", - "dateModified": "2018-12-03T13:16:42+02:00", + "dateModified": "2018-12-03T18:28:21+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -370,6 +370,87 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
  • This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness…?
  • +

    2018-12-04

    + +
      +
    • Last night Linode sent a message that the load on CGSpace (linode18) was too high, here’s a list of the top users at the time and throughout the day:
    • +
    + +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +    225 40.77.167.142
    +    226 66.249.64.63
    +    232 46.101.86.248
    +    285 45.5.186.2
    +    333 54.70.40.11
    +    411 193.29.13.85
    +    476 34.218.226.147
    +    962 66.249.70.27
    +   1193 35.237.175.180
    +   1450 2a01:4f8:140:3192::2
    +# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +   1141 207.46.13.57
    +   1299 197.210.168.174
    +   1341 54.70.40.11
    +   1429 40.77.167.142
    +   1528 34.218.226.147
    +   1973 66.249.70.27
    +   2079 50.116.102.77
    +   2494 78.46.79.71
    +   3210 2a01:4f8:140:3192::2
    +   4190 35.237.175.180
    +
    + +
      +
    • 35.237.175.180 is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:
    • +
    + +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
    +4772
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
    +630
    +
    + +
      +
    • I haven’t seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:
    • +
    + +
    Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
    +
    + +
      +
    • At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:
    • +
    + +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
    +5111
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
    +419
    +
    + +
      +
    • 78.46.79.71 is another host on Hetzner with the following user agent:
    • +
    + +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    +
    + +
      +
    • This is not the first time a host on Hetzner has used a “normal” user agent to make thousands of requests
    • +
    • At least it is re-using its Tomcat sessions somehow:
    • +
    + +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
    +2044
    +$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
    +1
    +
    + +
      +
    • In other news, it’s good to see my re-work of the database connectivity in the dspace-statistics-api actually caused a reduction of persistent database connections (from 1 to 0, but still!):
    • +
    + +

    PostgreSQL connections day

    + diff --git a/docs/2018/12/postgres_connections_db-month.png b/docs/2018/12/postgres_connections_db-month.png new file mode 100644 index 000000000..6a9bf25a6 Binary files /dev/null and b/docs/2018/12/postgres_connections_db-month.png differ diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 099d6d67a..6b230a371 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,12 +4,12 @@ https://alanorth.github.io/cgspace-notes/2018-12/ - 2018-12-03T13:16:42+02:00 + 2018-12-03T18:28:21+02:00 https://alanorth.github.io/cgspace-notes/2018-11/ - 2018-11-28T09:32:04+02:00 + 2018-12-04T09:50:36+02:00 @@ -199,7 +199,7 @@ https://alanorth.github.io/cgspace-notes/ - 2018-12-03T13:16:42+02:00 + 2018-12-03T18:28:21+02:00 0 @@ -210,7 +210,7 @@ https://alanorth.github.io/cgspace-notes/tags/notes/ - 2018-12-03T13:16:42+02:00 + 2018-12-03T18:28:21+02:00 0 @@ -222,13 +222,13 @@ https://alanorth.github.io/cgspace-notes/posts/ - 2018-12-03T13:16:42+02:00 + 2018-12-03T18:28:21+02:00 0 https://alanorth.github.io/cgspace-notes/tags/ - 2018-12-03T13:16:42+02:00 + 2018-12-03T18:28:21+02:00 0 diff --git a/static/2018/12/postgres_connections_db-month.png b/static/2018/12/postgres_connections_db-month.png new file mode 100644 index 000000000..6a9bf25a6 Binary files /dev/null and b/static/2018/12/postgres_connections_db-month.png differ
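The grep corrections in this diff rest on one distinction: `grep -c` prints a single count of matching lines (requests), so piping it through `sort | uniq` changes nothing, while `grep -o ... | sort | uniq | wc -l` counts distinct session IDs. A minimal sketch of that difference, assuming a made-up five-line log (the file contents, session IDs, and the `10.0.0.1` address are hypothetical, and the dots in the IP are escaped here, which the original commands do not do):

```shell
#!/bin/sh
# Hypothetical sample log; real dspace.log lines carry far more detail.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO session_id=A1B2C3D4E5F6A7B8C9D0E1F2A3B4C5D6:ip_addr=78.46.79.71:view_item
INFO session_id=A1B2C3D4E5F6A7B8C9D0E1F2A3B4C5D6:ip_addr=78.46.79.71:view_item
INFO session_id=A1B2C3D4E5F6A7B8C9D0E1F2A3B4C5D6:ip_addr=78.46.79.71:browse
INFO session_id=0123456789ABCDEF0123456789ABCDEF:ip_addr=78.46.79.71:browse
INFO session_id=0123456789ABCDEF0123456789ABCDEF:ip_addr=10.0.0.1:view_item
EOF

# Same pattern shape as in the notes, with the dots escaped (an assumption).
pattern='session_id=[A-Z0-9]{32}:ip_addr=78\.46\.79\.71'

# Old form: grep -c already emits one number, so sort | uniq is a no-op.
old=$(grep -c -E "$pattern" "$log" | sort | uniq)

# Corrected form 1: total matching lines, i.e. requests from this IP.
total=$(grep -c -E "$pattern" "$log")

# Corrected form 2: distinct session IDs used by this IP.
# tr strips the padding that some wc implementations add.
unique=$(grep -o -E "$pattern" "$log" | sort | uniq | wc -l | tr -d ' ')

echo "old=$old requests=$total unique_sessions=$unique"

rm -f "$log"
```

A unique-session count far below the request count suggests the client is re-using its Tomcat sessions; the two numbers being close suggests it is not.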