diff --git a/content/post/2016-11.md b/content/post/2016-11.md
index e65e8bd56..2f06e9c0a 100644
--- a/content/post/2016-11.md
+++ b/content/post/2016-11.md
@@ -253,4 +253,61 @@ Vary: Accept-Encoding
 X-Cocoon-Version: 2.2.0
 ```
 
+- The first one gets a session, and any after that — within 60 seconds — will be internally mapped to the same session by Tomcat
 - This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!
+
+## 2016-11-15
+
+- The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:
+
+![Tomcat JVM heap (day) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-day.png)
+![Tomcat JVM heap (week) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-week.png)
+
+- Seems the default regex doesn't catch Baidu, though:
+
+```
+$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Encoding: gzip
+Content-Language: en-US
+Content-Type: text/html;charset=utf-8
+Date: Tue, 15 Nov 2016 08:49:54 GMT
+Server: nginx
+Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
+Transfer-Encoding: chunked
+Vary: Accept-Encoding
+X-Cocoon-Version: 2.2.0
+
+$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Encoding: gzip
+Content-Language: en-US
+Content-Type: text/html;charset=utf-8
+Date: Tue, 15 Nov 2016 08:49:59 GMT
+Server: nginx
+Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
+Transfer-Encoding: chunked
+Vary: Accept-Encoding
+X-Cocoon-Version: 2.2.0
+```
+
+- Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:
+
+```
+<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
+<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
+       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
+```
+
+- Looking at the bots that were active yesterday it seems the above regex should be sufficient:
+
+```
+$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
+Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
+Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
+Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
+Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
+Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
+```
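+- A quick sanity check of the valve behavior is to hammer the site with a crawler user agent and count distinct session cookies; with the Crawler Session Manager Valve active they should collapse to one (just a sketch, using Googlebot as an example UA that matches the default regex):
+
+```
+# Send five requests with a crawler User-Agent and count unique JSESSIONIDs
+# in the response headers; with the valve active this should print 1, since
+# requests within the 60-second window all map to the same internal session.
+$ for i in $(seq 1 5); do curl -s -D - -o /dev/null -A 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' https://dspacetest.cgiar.org; done | grep -o 'JSESSIONID=[0-9A-F]*' | sort -u | wc -l
+```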

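+- The stock value can also be tested against the user agents directly, assuming the Tomcat default of `.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*`; Baiduspider falls through while Googlebot matches:
+
+```
+# Assumed Tomcat default for crawlerUserAgents (the valve does a full
+# java.util.regex match, while grep matches substrings, but the outcome
+# is the same for these strings)
+$ regex='.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*'
+# No output: "Baiduspider" contains neither "bot" nor "Bot"
+$ echo 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' | grep -E "$regex"
+# Matches via the [bB]ot alternative
+$ echo 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' | grep -E "$regex"
+Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
+```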
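+- After adding Baiduspider to the valve config in Tomcat's server.xml and restarting, re-running the httpie test should no longer show two different sessions; repeated requests should carry the same JSESSIONID (or set no new cookie at all after the first), something like:
+
+```
+# Repeat the earlier test; with Baiduspider in crawlerUserAgents the two
+# requests should no longer produce two different session IDs
+$ for i in 1 2; do http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' | grep -o 'JSESSIONID=[0-9A-F]*'; done
+```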
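+- The same extraction with counts would show how hard each bot is actually hitting the site, which is useful for deciding whether the regex needs more entries:
+
+```
+$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq -c | sort -rn
+```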
diff --git a/static/2016/11/dspacetest-tomcat-jvm-day.png b/static/2016/11/dspacetest-tomcat-jvm-day.png
new file mode 100644
index 000000000..422b70166
Binary files /dev/null and b/static/2016/11/dspacetest-tomcat-jvm-day.png differ
diff --git a/static/2016/11/dspacetest-tomcat-jvm-week.png b/static/2016/11/dspacetest-tomcat-jvm-week.png
new file mode 100644
index 000000000..6dcd2c253
Binary files /dev/null and b/static/2016/11/dspacetest-tomcat-jvm-week.png differ