diff --git a/content/post/2017-11.md b/content/post/2017-11.md
index c50d53fb1..785398ea3 100644
--- a/content/post/2017-11.md
+++ b/content/post/2017-11.md
@@ -453,3 +453,24 @@ proxy_set_header User-Agent $ua;
 - I will deploy this on CGSpace later this week
 - I am interested to see how this affects the number of sessions used by the CIAT and Chinese bots (see above on [2017-11-07](#2017-11-07) for example)
 - I merged the clickable thumbnails code to `5_x-prod` ([#347](https://github.com/ilri/DSpace/pull/347)) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible `nginx` and `tomcat` tags)
+- I was thinking about Baidu again and decided to see how many requests they make versus Google to URL paths that are explicitly forbidden in `robots.txt`:
+
+```
+# zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
+22229
+# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
+0
+```
+
+- It seems that Baidu rarely even bothers to check `robots.txt`, while Google checks it multiple times per day!
+
+```
+# zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
+14
+# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
+1134
+```
+
+- I have been looking for a reason to ban Baidu, and this is definitely a good one
+- Disallowing `Baiduspider` in `robots.txt` probably won't work, because this bot doesn't seem to respect the robot exclusion standard anyway (see the stanza below for reference)!
+- I will whip up something in nginx later (a rough sketch of the idea follows below)
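+- For reference, this is the standard robot exclusion stanza that a compliant crawler would honor in `robots.txt`:
+
+```
+User-agent: Baiduspider
+Disallow: /
+```
+
+- Since Baidu seems to ignore that, something like this nginx config might work instead (just a rough sketch, untested; `$is_baidu` is a placeholder variable name):
+
+```
+# Flag any request whose user agent claims to be Baiduspider
+map $http_user_agent $is_baidu {
+    default        0;
+    ~*Baiduspider  1;
+}
+
+server {
+    # ... existing server configuration ...
+
+    # Refuse flagged requests outright instead of relying on robots.txt
+    if ($is_baidu) {
+        return 403;
+    }
+}
+```
+
+- Using a `map` would fit alongside the existing user agent mapping above and keeps the check out of individual `location` blocks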
diff --git a/public/2017-11/index.html b/public/2017-11/index.html
index 74434c2fe..92d4ce565 100644
--- a/public/2017-11/index.html
+++ b/public/2017-11/index.html
@@ -86,9 +86,9 @@ COPY 54701
     "@type": "BlogPosting",
     "headline": "November, 2017",
     "url": "https://alanorth.github.io/cgspace-notes/2017-11/",
-    "wordCount": "2568",
+    "wordCount": "2694",
     "datePublished": "2017-11-02T09:37:54+02:00",
-    "dateModified": "2017-11-08T14:26:08+02:00",
+    "dateModified": "2017-11-08T15:20:49+02:00",
     "author": {
       "@type": "Person",
       "name": "Alan Orth"
diff --git a/public/sitemap.xml b/public/sitemap.xml
index 2543b8cb5..f3652377d 100644
--- a/public/sitemap.xml
+++ b/public/sitemap.xml
@@ -4,7 +4,7 @@
   <url>
     <loc>https://alanorth.github.io/cgspace-notes/2017-11/</loc>
-    <lastmod>2017-11-08T14:26:08+02:00</lastmod>
+    <lastmod>2017-11-08T15:20:49+02:00</lastmod>
   </url>
@@ -134,7 +134,7 @@
   <url>
     <loc>https://alanorth.github.io/cgspace-notes/</loc>
-    <lastmod>2017-11-08T14:26:08+02:00</lastmod>
+    <lastmod>2017-11-08T15:20:49+02:00</lastmod>
     <priority>0</priority>
   </url>
@@ -145,7 +145,7 @@
   <url>
     <loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
-    <lastmod>2017-11-08T14:26:08+02:00</lastmod>
+    <lastmod>2017-11-08T15:20:49+02:00</lastmod>
     <priority>0</priority>
   </url>
@@ -157,13 +157,13 @@
   <url>
     <loc>https://alanorth.github.io/cgspace-notes/post/</loc>
-    <lastmod>2017-11-08T14:26:08+02:00</lastmod>
+    <lastmod>2017-11-08T15:20:49+02:00</lastmod>
     <priority>0</priority>
   </url>
   <url>
     <loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
-    <lastmod>2017-11-08T14:26:08+02:00</lastmod>
+    <lastmod>2017-11-08T15:20:49+02:00</lastmod>
     <priority>0</priority>
   </url>