From 006d0c8d6fc4c37b24e489c1e28594e43bd342b8 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Mon, 14 Nov 2016 21:48:55 +0200 Subject: [PATCH] Update notes for 2016-11-14 --- content/post/2016-11.md | 32 ++++++++++++++++++++++++++++++++ public/2016-11/index.html | 35 ++++++++++++++++++++++++++++++++++- public/index.xml | 33 +++++++++++++++++++++++++++++++++ public/post/index.xml | 33 +++++++++++++++++++++++++++++++++ public/tags/notes/index.xml | 33 +++++++++++++++++++++++++++++++++ 5 files changed, 165 insertions(+), 1 deletion(-) diff --git a/content/post/2016-11.md b/content/post/2016-11.md index b7d79d020..e65e8bd56 100644 --- a/content/post/2016-11.md +++ b/content/post/2016-11.md @@ -222,3 +222,35 @@ $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X P - I applied Atmire's suggestions to fix Listings and Reports for DSpace 5.5 and now it works - There were some issues with the `dspace/modules/jspui/pom.xml`, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire's installation procedure must have changed +- So there is apparently this Tomcat native way to limit web crawlers to one session: [Crawler Session Manager](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve) +- After adding that to `server.xml` bots matching the pattern in the configuration will all use ONE session, just like normal users: + +``` +$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:29 GMT +Server: nginx +Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +X-Robots-Tag: none + +$ http --print h https://dspacetest.cgiar.org 
'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:35 GMT +Server: nginx +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +``` + +- This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM! diff --git a/public/2016-11/index.html b/public/2016-11/index.html index 6b5a0c309..2e337da9f 100644 --- a/public/2016-11/index.html +++ b/public/2016-11/index.html @@ -30,7 +30,7 @@ - + @@ -356,6 +356,39 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica + +
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Encoding: gzip
+Content-Language: en-US
+Content-Type: text/html;charset=utf-8
+Date: Mon, 14 Nov 2016 19:47:29 GMT
+Server: nginx
+Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly
+Transfer-Encoding: chunked
+Vary: Accept-Encoding
+X-Cocoon-Version: 2.2.0
+X-Robots-Tag: none
+
+$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Encoding: gzip
+Content-Language: en-US
+Content-Type: text/html;charset=utf-8
+Date: Mon, 14 Nov 2016 19:47:35 GMT
+Server: nginx
+Transfer-Encoding: chunked
+Vary: Accept-Encoding
+X-Cocoon-Version: 2.2.0
+
+ + diff --git a/public/index.xml b/public/index.xml index 01339ea2c..32bde0f95 100644 --- a/public/index.xml +++ b/public/index.xml @@ -265,6 +265,39 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-T <ul> <li>I applied Atmire&rsquo;s suggestions to fix Listings and Reports for DSpace 5.5 and now it works</li> <li>There were some issues with the <code>dspace/modules/jspui/pom.xml</code>, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire&rsquo;s installation procedure must have changed</li> +<li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li> +<li>After adding that to <code>server.xml</code> bots matching the pattern in the configuration will all use ONE session, just like normal users:</li> +</ul> + +<pre><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:29 GMT +Server: nginx +Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +X-Robots-Tag: none + +$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:35 GMT +Server: nginx +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +</code></pre> + +<ul> +<li>This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, 
which saves RAM!</li> </ul> diff --git a/public/post/index.xml b/public/post/index.xml index fb635e481..30c9482a7 100644 --- a/public/post/index.xml +++ b/public/post/index.xml @@ -265,6 +265,39 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-T <ul> <li>I applied Atmire&rsquo;s suggestions to fix Listings and Reports for DSpace 5.5 and now it works</li> <li>There were some issues with the <code>dspace/modules/jspui/pom.xml</code>, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire&rsquo;s installation procedure must have changed</li> +<li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li> +<li>After adding that to <code>server.xml</code> bots matching the pattern in the configuration will all use ONE session, just like normal users:</li> +</ul> + +<pre><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:29 GMT +Server: nginx +Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +X-Robots-Tag: none + +$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:35 GMT +Server: nginx +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +</code></pre> + +<ul> +<li>This means that when Google or Baidu slam you with tens of concurrent connections 
they will all map to ONE internal session, which saves RAM!</li> </ul> diff --git a/public/tags/notes/index.xml b/public/tags/notes/index.xml index d855fe539..cd4b90253 100644 --- a/public/tags/notes/index.xml +++ b/public/tags/notes/index.xml @@ -264,6 +264,39 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-T <ul> <li>I applied Atmire&rsquo;s suggestions to fix Listings and Reports for DSpace 5.5 and now it works</li> <li>There were some issues with the <code>dspace/modules/jspui/pom.xml</code>, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire&rsquo;s installation procedure must have changed</li> +<li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li> +<li>After adding that to <code>server.xml</code> bots matching the pattern in the configuration will all use ONE session, just like normal users:</li> +</ul> + +<pre><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:29 GMT +Server: nginx +Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +X-Robots-Tag: none + +$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' +HTTP/1.1 200 OK +Connection: keep-alive +Content-Encoding: gzip +Content-Language: en-US +Content-Type: text/html;charset=utf-8 +Date: Mon, 14 Nov 2016 19:47:35 GMT +Server: nginx +Transfer-Encoding: chunked +Vary: Accept-Encoding +X-Cocoon-Version: 2.2.0 +</code></pre> + +<ul> +<li>This means that 
when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!</li> </ul>
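
For reference, a sketch of what the `server.xml` addition described above looks like. This is not copied from the actual deployment; it is the valve element as documented for Tomcat 7, with the documented default `sessionInactiveInterval` of 60 seconds shown explicitly:

```xml
<!-- Goes inside the <Engine> (or <Host>) element of Tomcat's server.xml.
     sessionInactiveInterval is in seconds; 60 is the documented default.
     Bots whose User-Agent matches the valve's crawlerUserAgents regex
     all share a single session instead of creating one per request. -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       sessionInactiveInterval="60" />
```

The `crawlerUserAgents` attribute can also be set on the valve to override the default pattern if a particular bot is not being caught.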
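
To see why the Googlebot request in the transcripts above gets funneled into the shared session, here is a small Python approximation of the valve's User-Agent check, using the default `crawlerUserAgents` pattern from the Tomcat 7 documentation (the real check runs Java regexes, but the pattern behaves the same here):

```python
import re

# Default crawlerUserAgents pattern of Tomcat's CrawlerSessionManagerValve,
# per the Tomcat 7 valve documentation.
CRAWLER_UA_PATTERN = re.compile(r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*")

def is_crawler(user_agent: str) -> bool:
    """Approximate the valve's check: matching UAs share ONE session."""
    return CRAWLER_UA_PATTERN.match(user_agent) is not None

# The Googlebot UA from the test requests matches via ".*[bB]ot.*"
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_crawler(googlebot))  # True

# A normal browser UA does not match, so it still gets its own session
firefox = "Mozilla/5.0 (X11; Linux x86_64) Firefox/49.0"
print(is_crawler(firefox))  # False
```

This also explains why the second `http` request in the transcript gets no `Set-Cookie` header: Tomcat reuses the crawler session created by the first request instead of minting a new `JSESSIONID`.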