Add notes for 2016-11-15

This commit is contained in:
Alan Orth 2016-11-15 11:29:24 +02:00
parent 006d0c8d6f
commit fa5e351452
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
9 changed files with 306 additions and 1 deletions

View File

@ -253,4 +253,61 @@ Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```
- The first one gets a session, and any after that (within 60 seconds) will be internally mapped to the same session by Tomcat
- This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!
## 2016-11-15
- The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:
![Tomcat JVM heap (day) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-day.png)
![Tomcat JVM heap (week) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-week.png)
- Seems the default regex doesn't catch Baidu, though:
```
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:59 GMT
Server: nginx
Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```
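The failure can be reproduced outside Tomcat with a quick regex check (a sketch assuming the stock Tomcat 7/8 default `crawlerUserAgents` value; the valve does a full match against the User-Agent header, so Python's `fullmatch` is the closest equivalent):

```python
import re

# Default crawlerUserAgents pattern shipped with Tomcat's
# CrawlerSessionManagerValve (check server.xml/docs for your exact version)
default_regex = re.compile(r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*")

baidu_ua = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
            "+http://www.baidu.com/search/spider.html)")

# No "bot" substring anywhere in Baidu's UA, so the default pattern misses it
print(default_regex.fullmatch(baidu_ua) is not None)  # False
```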
- Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:
```
<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
```
- Looking at the bots that were active yesterday, it seems the above regex should be sufficient:
```
$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
```
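As a sanity check, the amended pattern from the valve configuration above can be run against each of the observed User-Agent strings (a sketch; note YandexImages only matches via the "bots" in its info URL, not its product name):

```python
import re

# Proposed crawlerUserAgents value with Baiduspider added
regex = re.compile(
    r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*"
)

# User-Agent strings seen in yesterday's nginx access log
agents = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]

# Every crawler should collapse onto a shared session under this pattern
for ua in agents:
    print(regex.fullmatch(ua) is not None)  # True for all five
```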

Binary file not shown.

After

Width:  |  Height:  |  Size: 23 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 26 KiB



