mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2016-11-15
@ -253,4 +253,61 @@ Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```

- The first one gets a session, and any after that, within 60 seconds, will be internally mapped to the same session by Tomcat
- This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!

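- A toy model of the session sharing described above, as a rough sketch only: it assumes sessions are keyed by client IP and reused inside a 60-second window, which is my reading of the valve and not something verified here. `CrawlerSessions` and `session_for` are hypothetical names, not Tomcat APIs:

```python
import re
import uuid

# Assumed crawler pattern (the stock value without any local additions).
CRAWLER_UA = re.compile(r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*")
SESSION_TTL = 60  # seconds; the reuse window mentioned in the notes


class CrawlerSessions:
    """Hypothetical model: one shared session per crawler IP."""

    def __init__(self):
        self._by_ip = {}  # ip -> (session_id, last_seen)

    def session_for(self, ip, user_agent, now):
        if not CRAWLER_UA.fullmatch(user_agent):
            # Simplification: non-crawlers always get a fresh session here
            # (a real browser would send its cookie back instead).
            return uuid.uuid4().hex
        entry = self._by_ip.get(ip)
        if entry and now - entry[1] <= SESSION_TTL:
            session_id = entry[0]  # reuse the session from the earlier request
        else:
            session_id = uuid.uuid4().hex  # first request, or window expired
        self._by_ip[ip] = (session_id, now)
        return session_id


sessions = CrawlerSessions()
ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
first = sessions.session_for("203.0.113.5", ua, now=0)
again = sessions.session_for("203.0.113.5", ua, now=30)
later = sessions.session_for("203.0.113.5", ua, now=200)
print(first == again, first == later)  # True False
```

- In this model the RAM saving comes from the crawler branch never allocating a second session while the first is still fresh
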
## 2016-11-15

- The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:

- Seems the default regex doesn't catch Baidu, though; two requests five seconds apart with the Baiduspider user agent each get a different `JSESSIONID`:

```
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0

$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:59 GMT
Server: nginx
Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```
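
- The miss can be reproduced offline with a small check. This is only a sketch: it assumes the default pattern is `.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*` and that the valve requires the whole user agent to match, which `re.fullmatch` approximates; `is_crawler` is a made-up helper:

```python
import re

# Assumed default crawlerUserAgents value (no Baiduspider alternative).
DEFAULT_PATTERN = r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"

BAIDU = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
GOOGLE = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def is_crawler(user_agent, pattern=DEFAULT_PATTERN):
    """Return True when the entire user agent string matches the pattern."""
    return re.fullmatch(pattern, user_agent) is not None


print(is_crawler(BAIDU))   # False: no "bot" or "Bot" anywhere in the string
print(is_crawler(GOOGLE))  # True: "Googlebot" contains "bot"
```
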

- Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:

```
<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
```

- Looking at the bots that were active yesterday it seems the above regex should be sufficient:

```
$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
```
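
- To back up "should be sufficient", the five user agents from the log can be run against the final pattern. A hedged sketch, again assuming whole-string matching; `unmatched_agents` is a hypothetical helper, and the user agents are copied from the grep output with the trailing `" "-"` removed:

```python
import re

# Pattern from the Valve configuration in these notes.
FINAL_PATTERN = r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*"

OBSERVED = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]


def unmatched_agents(pattern, agents):
    """Return the user agents that the pattern would NOT treat as crawlers."""
    compiled = re.compile(pattern)
    return [ua for ua in agents if compiled.fullmatch(ua) is None]


print(unmatched_agents(FINAL_PATTERN, OBSERVED))  # []
```

- Note that YandexImages only matches because its URL ends in `/bots`, so the `[bB]ot` alternative catches it by accident
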