mirror of https://github.com/alanorth/cgspace-notes.git, synced 2024-10-31 20:33:00 +01:00

Add notes for 2016-11-15 (commit fa5e351452, parent 006d0c8d6f)
- The first one gets a session, and any after that — within 60 seconds — will be internally mapped to the same session by Tomcat
- This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!
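That mapping behavior can be sketched roughly in Python (an illustrative model only, not the valve's actual code; it keys shared sessions by client IP and uses a 60-second window, which is assumed here to mirror the valve's `sessionInactiveInterval`):

```python
import re
import time
import uuid

# Crawler user agents share one session per client IP within the window;
# everyone else gets a fresh session per request (worst case for RAM).
CRAWLER_RE = re.compile(r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*")
SESSION_WINDOW = 60  # seconds, assumed to match sessionInactiveInterval

_crawler_sessions = {}  # client_ip -> (session_id, last_seen)

def get_session(client_ip, user_agent, now=None):
    """Return a session id; crawler requests from one IP reuse an id."""
    now = time.time() if now is None else now
    if CRAWLER_RE.match(user_agent):
        sid, last_seen = _crawler_sessions.get(client_ip, (None, 0.0))
        if sid is None or now - last_seen >= SESSION_WINDOW:
            sid = uuid.uuid4().hex  # window expired: new shared session
        _crawler_sessions[client_ip] = (sid, now)
        return sid
    return uuid.uuid4().hex  # normal clients: one session per request here
```

This is only a model of the idea; the real valve lives inside Tomcat's request pipeline and wraps the servlet session machinery.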
## 2016-11-15
- The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:

![Tomcat JVM heap (day) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-day.png)
![Tomcat JVM heap (week) after setting up the Crawler Session Manager](2016/11/dspacetest-tomcat-jvm-week.png)
- Seems the default regex doesn't catch Baidu, though:
```
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0

$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:59 GMT
Server: nginx
Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
```
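A quick check shows why those two requests got different JSESSIONIDs: the stock `crawlerUserAgents` pattern (assumed here to be the final configuration minus the Baiduspider alternative) has no branch that matches Baidu's user agent, so Baidu is treated as a normal client:

```python
import re

# Assumed default crawlerUserAgents pattern (no Baiduspider alternative).
DEFAULT = re.compile(r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*")
BAIDU = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

# The Baidu UA contains neither "bot" nor "Bot", so no branch matches
# and each request from Baiduspider creates a fresh session.
print(DEFAULT.match(BAIDU))  # None
```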
- Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:
```
<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
```
- Looking at the bots that were active yesterday it seems the above regex should be sufficient:
```
$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
```