Update notes for 2017-11-08
@@ -453,3 +453,24 @@ proxy_set_header User-Agent $ua;
- I will deploy this on CGSpace later this week
- I am interested to see how this affects the number of sessions used by the CIAT and Chinese bots (see above on [2017-11-07](#2017-11-07) for example)
- I merged the clickable thumbnails code to `5_x-prod` ([#347](https://github.com/ilri/DSpace/pull/347)) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible `nginx` and `tomcat` tags)
- I was thinking about Baidu again and decided to check how many requests they make, versus Google, to URL paths that are explicitly forbidden in `robots.txt`:

```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
0
```

- It seems that Baidu rarely even bothers to check `robots.txt`, but Google does multiple times per day!

```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
```

- I have been looking for a reason to ban Baidu and this is definitely a good one: over 22,000 requests to disallowed paths, but only fourteen checks of `robots.txt`
- Disallowing `Baiduspider` in `robots.txt` (ie, adding a `User-agent: Baiduspider` stanza with the same `Disallow` rules) probably won't work because this bot doesn't seem to respect the robots exclusion standard anyway!
- I will whip up something in nginx later; a rough sketch is below
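
Something like the following ought to work, though it's just a sketch for now and not tested (the `$baidu_ua` variable name and the 403 response are placeholders, not the final config):

```
# Match Baiduspider in the User-Agent header; the map goes in the
# http{} context and $http_user_agent is nginx's built-in variable
# for the request's User-Agent.
map $http_user_agent $baidu_ua {
    default        0;
    ~*baiduspider  1;
}

# Then, inside the server{} block, deny those requests outright:
if ($baidu_ua) {
    return 403;
}
```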