Update notes for 2017-11-08

- I will deploy this on CGSpace later this week
- I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on [2017-11-07](#2017-11-07) for example)
- I merged the clickable thumbnails code to `5_x-prod` ([#347](https://github.com/ilri/DSpace/pull/347)) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible `nginx` and `tomcat` tags)
- I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in `robots.txt`:
```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
0
```
- It seems that they rarely even bother checking `robots.txt`, but Google does multiple times per day!
```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
```
- I have been looking for a reason to ban Baidu and this is definitely a good one
- Disallowing `Baiduspider` in `robots.txt` probably won't work because this bot doesn't seem to respect the robots exclusion standard anyway!
- I will whip up something in nginx later
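- One option would be a user agent `map` in the `http` context plus a conditional return in the `server` block — a rough sketch of what I have in mind, not tested or deployed yet (the `$badbot` variable name and the 403 response are just illustrative):

```
# http context: flag requests whose user agent contains "baiduspider"
map $http_user_agent $badbot {
    default        0;
    ~*baiduspider  1;
}

server {
    # ... existing server config ...

    # refuse flagged requests outright
    if ($badbot) {
        return 403;
    }
}
```

- Something like `curl -A "Baiduspider" -I` against the server would be a quick way to verify the rule returns a 403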