<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="August, 2017" />
<meta property="og:description" content="2017-08-01
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid the bots from accessing the per-handle ones too!
Relevant issue from DSpace Jira (partially resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-08/" />
<meta property="article:published_time" content="2017-08-01T11:51:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-08-01T11:57:37&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:text:title" content="August, 2017"/>
<meta name="twitter:title" content="August, 2017"/>
<meta name="twitter:description" content="2017-08-01
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid the bots from accessing the per-handle ones too!
Relevant issue from DSpace Jira (partially resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
"/>
<meta name="generator" content="Hugo 0.25.1" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "August, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
"wordCount": "166",
"datePublished": "2017-08-01T11:51:52&#43;03:00",
"dateModified": "2017-08-01T11:57:37&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-08/">
<title>August, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-j3n8sYdzztDYtVc80KiiuOXoCg5Bjz0zYyLGzDMW8RbfA0u5djbF0GO3bVOPoLyN" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2017-08/">August, 2017</a></h2>
<p class="blog-post-meta"><time datetime="2017-08-01T11:51:52&#43;03:00">Tue Aug 01, 2017</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-08-01">2017-08-01</h2>
<ul>
<li>Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours</li>
<li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)</li>
<li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li>
<li>This means our Tomcat Crawler Session Valve is working</li>
<li>But many of the bots are browsing dynamic URLs like:
<ul>
<li>/handle/10568/3353/discover</li>
<li>/handle/10568/16510/browse</li>
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid the bots from accessing the per-handle ones too!</li>
<li>Relevant issue from DSpace Jira (partially resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
</ul>
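<p>A minimal sketch of why the per-handle URLs slip past the current rules, using Python's standard <code>urllib.robotparser</code>. The rule set below is an assumption reconstructed from the notes above (only the top-level paths are disallowed), not a copy of the real CGSpace <code>robots.txt</code>, and <code>crawl_permissions</code> is a hypothetical helper name. Path rules are matched as prefixes, so a rule for <code>/discover</code> says nothing about <code>/handle/10568/3353/discover</code>.</p>

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: only the top-level paths named in the notes are disallowed.
# This is an illustration, not the actual CGSpace robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /browse
"""


def crawl_permissions(robots_txt, paths, user_agent="Baiduspider"):
    """Map each path to True if the rules allow user_agent to fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


results = crawl_permissions(
    ROBOTS_TXT,
    [
        "/discover",
        "/browse",
        "/handle/10568/3353/discover",
        "/handle/10568/16510/browse",
    ],
)

for path, allowed in results.items():
    # The top-level paths come out as blocked, the per-handle ones as allowed.
    print(("allowed" if allowed else "blocked") + "  " + path)
```

<p>Some crawlers honor <code>*</code> wildcards in rule paths, so a pattern such as <code>Disallow: /handle/*/discover</code> may be one option. Whether every bot listed above respects it is an assumption that would need checking, and the standard-library parser used here treats rule paths as plain prefixes, so it cannot be used to test such a pattern.</p>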
<p></p>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 offset-sm-1 blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2017-08/">August, 2017</a></li>
<li><a href="/cgspace-notes/2017-07/">July, 2017</a></li>
<li><a href="/cgspace-notes/2017-06/">June, 2017</a></li>
<li><a href="/cgspace-notes/2017-05/">May, 2017</a></li>
<li><a href="/cgspace-notes/2017-04/">April, 2017</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>