Update notes for 2017-11-07

Alan Orth 2017-11-07 18:09:29 +02:00
parent 7e18f5e5d2
commit 0ffe1f07b0
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
6 changed files with 61 additions and 14 deletions

@@ -376,3 +376,29 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
``` ```
- I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API
- About Baidu, I found a link to their [robots.txt tester tool](http://ziyuan.baidu.com/robots/)
- It seems like our robots.txt file is valid, and they claim to recognize that URLs like `/discover` should be forbidden:
![Baidu robots.txt tester](/cgspace-notes/2017/11/baidu-robotstxt.png)
- But they literally just made this request today:
```
180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
```
- Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:
```
# grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
1085
```
- I will think about blocking their IPs but they have 164 of them!
```
# grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
164
```
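The three counts above can be gathered in one pass per crawler. The following is only a minimal sketch: `crawler_summary` is a hypothetical helper name, not something from these notes, and it assumes the default nginx combined log format, where the client IP is the first field.

```shell
# Hypothetical helper: summarize one crawler's traffic in an nginx access log.
# Assumes the combined log format (client IP is the first field).
# $1 = user agent substring (matched literally), $2 = path to the log file
crawler_summary() {
    local agent="$1" log="$2"
    local total forbidden ips

    # All requests whose line contains the user agent string
    total=$(grep -c -F -- "$agent" "$log")

    # Requests to paths that robots.txt forbids (same pattern as above)
    forbidden=$(grep -F -- "$agent" "$log" \
        | grep -c -E 'GET /(browse|discover|search-filter)')

    # Distinct client IPs; tr strips the padding some wc versions add
    ips=$(grep -F -- "$agent" "$log" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')

    printf 'total=%d forbidden=%d unique_ips=%d\n' "$total" "$forbidden" "$ips"
}

# Usage (as root, like the commands above):
#   crawler_summary Baiduspider /var/log/nginx/access.log
```

Matching the agent with `grep -F` treats it as a fixed string, so the `.` in something like `Baiduspider/2.0` is not interpreted as a regex wildcard.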

@@ -38,7 +38,7 @@ COPY 54701
<meta property="article:published_time" content="2017-11-02T09:37:54&#43;02:00"/>
<meta property="article:modified_time" content="2017-11-07T17:03:49&#43;02:00"/> <meta property="article:modified_time" content="2017-11-07T17:26:16&#43;02:00"/>
@@ -86,9 +86,9 @@ COPY 54701
"@type": "BlogPosting",
"headline": "November, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-11/",
"wordCount": "1997", "wordCount": "2084",
"datePublished": "2017-11-02T09:37:54&#43;02:00",
"dateModified": "2017-11-07T17:03:49&#43;02:00", "dateModified": "2017-11-07T17:26:16&#43;02:00",
"author": { "author": {
"@type": "Person", "@type": "Person",
"name": "Alan Orth" "name": "Alan Orth"
@@ -565,8 +565,29 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<ul>
<li>I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API</li>
<li>About Baidu, I found a link to their <a href="http://ziyuan.baidu.com/robots/">robots.txt tester tool</a></li>
<li>It seems like our robots.txt file is valid, and they claim to recognize that URLs like <code>/discover</code> should be forbidden:</li>
</ul>
<p><img src="/cgspace-notes/2017/11/baidu-robotstxt.png" alt="Baidu robots.txt tester" /></p>
<ul>
<li>But they literally just made this request today:</li>
</ul>
<pre><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &quot;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&quot; 200 82265 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot;
</code></pre>
<ul>
<li>Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:</li>
</ul>
<pre><code># grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
1085
</code></pre>

Binary file not shown. Size: 84 KiB

@@ -29,7 +29,7 @@ Disallow: /cgspace-notes/2015-12/
Disallow: /cgspace-notes/2015-11/
Disallow: /cgspace-notes/
Disallow: /cgspace-notes/categories/
Disallow: /cgspace-notes/categories/notes/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/categories/notes/
Disallow: /cgspace-notes/post/
Disallow: /cgspace-notes/tags/

@@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-11/</loc>
<lastmod>2017-11-07T17:03:49+02:00</lastmod> <lastmod>2017-11-07T17:26:16+02:00</lastmod>
</url>
<url> <url>
@@ -134,7 +134,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2017-11-07T17:03:49+02:00</lastmod> <lastmod>2017-11-07T17:26:16+02:00</lastmod>
<priority>0</priority>
</url>
@@ -143,27 +143,27 @@
<priority>0</priority>
</url> </url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2017-11-07T17:26:16+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2017-09-28T12:00:49+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2017-11-07T17:03:49+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2017-11-07T17:03:49+02:00</lastmod> <lastmod>2017-11-07T17:26:16+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2017-11-07T17:03:49+02:00</lastmod> <lastmod>2017-11-07T17:26:16+02:00</lastmod>
<priority>0</priority>
</url>

Binary file not shown. Size: 84 KiB