diff --git a/content/post/2017-11.md b/content/post/2017-11.md
index c50d53fb1..785398ea3 100644
--- a/content/post/2017-11.md
+++ b/content/post/2017-11.md
@@ -453,3 +453,24 @@ proxy_set_header User-Agent $ua;
 - I will deploy this on CGSpace later this week
 - I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on [2017-11-07](#2017-11-07) for example)
 - I merged the clickable thumbnails code to `5_x-prod` ([#347](https://github.com/ilri/DSpace/pull/347)) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible `nginx` and `tomcat` tags)
+- I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in `robots.txt`:
+
+```
+# zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
+22229
+# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
+0
+```
+
+- It seems that they rarely even bother checking `robots.txt`, but Google does multiple times per day!
+
+```
+# zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
+14
+# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
+1134
+```
+
+- I have been looking for a reason to ban Baidu and this is definitely a good one
+- Disallowing `Baiduspider` in `robots.txt` probably won't work because this bot doesn't seem to respect the robot exclusion standard anyways!
+- I will whip up something in nginx later
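The nginx rule hinted at above ("whip up something in nginx later") might look something like the following — a sketch only, not the rule that was actually deployed; the variable name `$bot_ua_block` and the choice of status code are placeholders:

```nginx
## Sketch: match Baiduspider (case-insensitively) in the User-Agent
## header and flag the request for blocking.
map $http_user_agent $bot_ua_block {
    default          0;
    "~*Baiduspider"  1;
}

server {
    # ... existing server configuration ...

    ## Return 403 Forbidden to flagged bots; nginx's non-standard
    ## 444 could be used instead to drop the connection outright.
    if ($bot_ua_block) {
        return 403;
    }
}
```

The `map` block must live at the `http` level, so in practice the two pieces would go in separate include files.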