Add notes for 2016-11-15

2016-11-15 11:29:24 +02:00
parent 006d0c8d6f
commit fa5e351452
9 changed files with 306 additions and 1 deletions


@@ -253,4 +253,61 @@ Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```
- The first one gets a session, and any after that, within 60 seconds, will be internally mapped to the same session by Tomcat
- This means that when Google or Baidu slam you with tens of concurrent connections, they will all map to ONE internal session, which saves RAM!
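- A rough sketch of that mapping, as a toy model rather than Tomcat's actual code. It assumes crawler requests are grouped by client IP (my understanding of the valve, not something confirmed in these notes); the `CrawlerSessionMap` class, the `S1`-style session IDs and the `192.0.2.10` address are made up for illustration:

```python
import itertools

SESSION_TTL = 60  # seconds of inactivity before a crawler gets a fresh session


class CrawlerSessionMap:
    """Toy model of mapping repeat crawler requests onto one session."""

    def __init__(self):
        self._by_ip = {}                 # client IP -> (session id, last seen)
        self._ids = itertools.count(1)   # hypothetical session id generator

    def session_for(self, client_ip, now):
        """Return the session id for a crawler request arriving at time `now`."""
        entry = self._by_ip.get(client_ip)
        if entry is not None and now - entry[1] <= SESSION_TTL:
            session_id = entry[0]                  # reuse the existing session
        else:
            session_id = f"S{next(self._ids)}"     # first request, or expired
        self._by_ip[client_ip] = (session_id, now)
        return session_id


sessions = CrawlerSessionMap()
first = sessions.session_for("192.0.2.10", now=0)
second = sessions.session_for("192.0.2.10", now=30)    # within 60 seconds
later = sessions.session_for("192.0.2.10", now=200)    # long after the last hit
print(first, second, later)
```

- In this model the first two requests share a session and the third, arriving well past the 60-second window, gets a new one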
## 2016-11-15
- The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:
![Tomcat JVM heap (day) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-day.png)
![Tomcat JVM heap (week) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-week.png)
- The default regex doesn't seem to catch Baidu, though, as two consecutive requests with the Baiduspider user agent each get a different `JSESSIONID`:
```
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:59 GMT
Server: nginx
Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```
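- The same comparison can be scripted instead of eyeballed. A minimal sketch using Python's standard `http.cookies` module on the two `Set-Cookie` values copied from the output above; the `jsessionid` helper is a hypothetical name:

```python
from http.cookies import SimpleCookie


def jsessionid(set_cookie_header):
    """Return the JSESSIONID value from a Set-Cookie header, or None."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get("JSESSIONID")
    return morsel.value if morsel is not None else None


# Set-Cookie values copied from the two Baiduspider responses
first = jsessionid("JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly")
second = jsessionid("JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly")

same_session = first == second
print(same_session)
```

- If the valve had recognized the crawler, the second request would be expected to come back with the first session's ID; here the two differ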
- Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:
```
<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
```
- Looking at the bots that were active yesterday, it seems the above regex should be sufficient (note that YandexImages only matches because its URL contains `bots`):
```
$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
```
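- A quick way to double-check this offline, sketched in Python. It assumes Tomcat matches the pattern against the whole `User-Agent` header (hence `fullmatch`) and that the pattern without the Baiduspider alternative is the default one mentioned earlier; the user agents are the ones from the log output above, and the `unmatched` helper is a hypothetical name:

```python
import re

# Assumed default pattern, and the updated one from the Valve configuration
DEFAULT = r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
UPDATED = DEFAULT + r"|.*Baiduspider.*"

USER_AGENTS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]


def unmatched(pattern, user_agents):
    """Return the user agents that the pattern does not fully match."""
    regex = re.compile(pattern)
    return [ua for ua in user_agents if not regex.fullmatch(ua)]


missed_by_default = unmatched(DEFAULT, USER_AGENTS)
missed_by_updated = unmatched(UPDATED, USER_AGENTS)
print(missed_by_default)
print(missed_by_updated)
```

- With these inputs only the Baiduspider user agent slips past the assumed default pattern, and nothing slips past the updated one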