cgspace-notes/public/2017-08/index.html

234 lines
6.6 KiB
HTML

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="August, 2017" />
<meta property="og:description" content="2017-08-01
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-08/" />
<meta property="article:published_time" content="2017-08-01T11:51:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-08-01T11:51:52&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:text:title" content="August, 2017"/>
<meta name="twitter:title" content="August, 2017"/>
<meta name="twitter:description" content="2017-08-01
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
"/>
<meta name="generator" content="Hugo 0.25.1" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "August, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
"wordCount": "123",
"datePublished": "2017-08-01T11:51:52&#43;03:00",
"dateModified": "2017-08-01T11:51:52&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-08/">
<title>August, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-j3n8sYdzztDYtVc80KiiuOXoCg5Bjz0zYyLGzDMW8RbfA0u5djbF0GO3bVOPoLyN" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2017-08/">August, 2017</a></h2>
<p class="blog-post-meta"><time datetime="2017-08-01T11:51:52&#43;03:00">Tue Aug 01, 2017</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-08-01">2017-08-01</h2>
<ul>
<li>Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours</li>
<li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)</li>
<li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li>
<li>This means our Tomcat Crawler Session Valve is working</li>
<li>But many of the bots are browsing dynamic URLs like:
<ul>
<li>/handle/10568/3353/discover</li>
<li>/handle/10568/16510/browse</li>
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
</ul>
<p></p>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 offset-sm-1 blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2017-08/">August, 2017</a></li>
<li><a href="/cgspace-notes/2017-07/">July, 2017</a></li>
<li><a href="/cgspace-notes/2017-06/">June, 2017</a></li>
<li><a href="/cgspace-notes/2017-05/">May, 2017</a></li>
<li><a href="/cgspace-notes/2017-04/">April, 2017</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>