Update notes for 2017-11-08

Alan Orth 2017-11-08 22:26:37 +02:00
parent 1ce78705aa
commit d4cafdf5fe
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
3 changed files with 52 additions and 8 deletions

@ -453,3 +453,24 @@ proxy_set_header User-Agent $ua;
- I will deploy this on CGSpace later this week
- I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on [2017-11-07](#2017-11-07) for example)
- I merged the clickable thumbnails code to `5_x-prod` ([#347](https://github.com/ilri/DSpace/pull/347)) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible `nginx` and `tomcat` tags)
- I was thinking about Baidu again and decided to compare how many requests Baidu and Google make to URL paths that are explicitly disallowed in `robots.txt`:
```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
0
```
- It seems that Baidu rarely even bothers to check `robots.txt`, whereas Google fetches it multiple times per day!
```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
```
- I have been looking for a reason to ban Baidu and this is definitely a good one
- Disallowing `Baiduspider` in `robots.txt` probably won't work because this bot doesn't seem to respect the Robots Exclusion Standard anyway!
- I will whip up something in nginx later
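- A minimal sketch of how the two checks above could be combined into one reusable function. `bot_report` is a hypothetical helper (an assumption for illustration, not something deployed on CGSpace), and unlike the commands above it counts only `GET /robots.txt` requests rather than any line containing `robots.txt`:

```shell
# Hypothetical helper: summarize one bot's behaviour in nginx access logs.
# Usage: bot_report <user-agent-substring> <log-file>...
bot_report() {
    bot="$1"
    shift
    # zgrep reads plain and gzipped logs alike; -h drops the filename prefix
    hits=$(zgrep -h "$bot" "$@")
    # Requests to paths that robots.txt disallows (same pattern as above)
    forbidden=$(printf '%s\n' "$hits" | grep -c -E 'GET /(browse|discover|search-filter)')
    # Fetches of robots.txt itself; -F treats the dot literally
    robots=$(printf '%s\n' "$hits" | grep -c -F 'GET /robots.txt')
    printf '%s forbidden=%s robots=%s\n' "$bot" "$forbidden" "$robots"
}
```

- For example, `bot_report Baiduspider /var/log/nginx/access.log*` would print a single summary line for that bot, which makes it easy to compare several user agents side by side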

@ -38,7 +38,7 @@ COPY 54701
<meta property="article:published_time" content="2017-11-02T09:37:54&#43;02:00"/>
<meta property="article:modified_time" content="2017-11-08T14:26:08&#43;02:00"/>
<meta property="article:modified_time" content="2017-11-08T15:20:49&#43;02:00"/>
@ -86,9 +86,9 @@ COPY 54701
"@type": "BlogPosting",
"headline": "November, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-11/",
"wordCount": "2568",
"wordCount": "2694",
"datePublished": "2017-11-02T09:37:54&#43;02:00",
"dateModified": "2017-11-08T14:26:08&#43;02:00",
"dateModified": "2017-11-08T15:20:49&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -651,6 +651,29 @@ proxy_set_header User-Agent $ua;
<li>I will deploy this on CGSpace later this week</li>
<li>I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on <a href="#2017-11-07">2017-11-07</a> for example)</li>
<li>I merged the clickable thumbnails code to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/347">#347</a>) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible <code>nginx</code> and <code>tomcat</code> tags)</li>
<li>I was thinking about Baidu again and decided to compare how many requests Baidu and Google make to URL paths that are explicitly disallowed in <code>robots.txt</code>:</li>
</ul>
<pre><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
0
</code></pre>
<ul>
<li>It seems that Baidu rarely even bothers to check <code>robots.txt</code>, whereas Google fetches it multiple times per day!</li>
</ul>
<pre><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
</code></pre>
<ul>
<li>I have been looking for a reason to ban Baidu and this is definitely a good one</li>
<li>Disallowing <code>Baiduspider</code> in <code>robots.txt</code> probably won&rsquo;t work because this bot doesn&rsquo;t seem to respect the Robots Exclusion Standard anyway!</li>
<li>I will whip up something in nginx later</li>
</ul>

@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-11/</loc>
<lastmod>2017-11-08T14:26:08+02:00</lastmod>
<lastmod>2017-11-08T15:20:49+02:00</lastmod>
</url>
<url>
@ -134,7 +134,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2017-11-08T14:26:08+02:00</lastmod>
<lastmod>2017-11-08T15:20:49+02:00</lastmod>
<priority>0</priority>
</url>
@ -145,7 +145,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2017-11-08T14:26:08+02:00</lastmod>
<lastmod>2017-11-08T15:20:49+02:00</lastmod>
<priority>0</priority>
</url>
@ -157,13 +157,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2017-11-08T14:26:08+02:00</lastmod>
<lastmod>2017-11-08T15:20:49+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2017-11-08T14:26:08+02:00</lastmod>
<lastmod>2017-11-08T15:20:49+02:00</lastmod>
<priority>0</priority>
</url>