Update notes for 2017-11-07

2017-11-07 18:09:29 +02:00
parent 7e18f5e5d2
commit 0ffe1f07b0
6 changed files with 61 additions and 14 deletions


@@ -376,3 +376,29 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
```
- I emailed CIAT about the session issue and the user agent issue, and told them they should not scrape the HTML contents of communities, but should use the REST API instead
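- Harvesting via the REST API would look something like this (a rough sketch assuming the DSpace 5 REST API is exposed under `/rest`; `{collectionId}` is a placeholder, not a real ID):
```
# list top-level communities as JSON instead of scraping the XMLUI pages
$ curl -s -H "Accept: application/json" "https://cgspace.cgiar.org/rest/communities"
# page through a collection's items; {collectionId} is a placeholder
$ curl -s -H "Accept: application/json" "https://cgspace.cgiar.org/rest/collections/{collectionId}/items?limit=100&offset=0"
```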
- About Baidu, I found a link to their [robots.txt tester tool](http://ziyuan.baidu.com/robots/)
- It seems like our robots.txt file is valid, and they claim to recognize that URLs like `/discover` should be forbidden:
![Baidu robots.txt tester](/cgspace-notes/2017/11/baidu-robotstxt.png)
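- The rules we actually serve can be double-checked from the command line with something like this (assuming the public CGSpace hostname):
```
$ curl -s https://cgspace.cgiar.org/robots.txt | grep -E '^(User-agent|Disallow)'
```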
- But they literally just made this request today:
```
180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
```
- Along with another thousand or so requests today alone to URLs that are forbidden in robots.txt:
```
# grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
1085
```
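- A rough breakdown of which of the disallowed paths they are hitting could be had with something like:
```
# grep Baiduspider /var/log/nginx/access.log | grep -o -E "GET /(browse|discover|search-filter)" | sort | uniq -c | sort -rn
```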
- I will think about blocking their IPs but they have 164 of them!
```
# grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
164
```
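- One option, rather than maintaining a list of 164 IPs, would be to match the Baiduspider user agent in nginx and return 403 for the dynamic URLs that robots.txt already forbids; this is only an untested sketch (the variable name and location regex are illustrative, not from our current config):
```
# in the http{} context: flag requests whose User-Agent looks like Baiduspider
map $http_user_agent $bot_baidu {
    default        0;
    ~*baiduspider  1;
}

# in the server{} block: refuse flagged bots on the disallowed dynamic URLs
location ~ ^/(browse|discover|search-filter) {
    if ($bot_baidu) {
        return 403;
    }
    # normal XMLUI handling continues here for everyone else
}
```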