<!DOCTYPE html>
<html lang="en">

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="August, 2017" />
<meta property="og:description" content="2017-08-01


Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:


/handle/10568/3353/discover
/handle/10568/16510/browse

The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962


" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-08/" />


<meta property="article:published_time" content="2017-08-01T11:51:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-08-01T11:51:52&#43;03:00"/>


  <meta name="twitter:card" content="summary"/>


<meta name="twitter:text:title" content="August, 2017"/>
<meta name="twitter:title" content="August, 2017"/>
<meta name="twitter:description" content="2017-08-01


Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:


/handle/10568/3353/discover
/handle/10568/16510/browse

The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962


"/>

<meta name="generator" content="Hugo 0.25.1" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "August, 2017",
  "url": "https://alanorth.github.io/cgspace-notes/2017-08/",
  "wordCount": "123",
  "datePublished": "2017-08-01T11:51:52&#43;03:00",
  "dateModified": "2017-08-01T11:51:52&#43;03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-08/">

    <title>August, 2017 | CGSpace Notes</title>

    <!-- combined, minified CSS -->
    <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-j3n8sYdzztDYtVc80KiiuOXoCg5Bjz0zYyLGzDMW8RbfA0u5djbF0GO3bVOPoLyN" crossorigin="anonymous">

    
  </head>

  <body>

    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
          
          
        </nav>
      </div>
    </div>

    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>

    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">

          
<article class="blog-post">
  <header>
    <h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2017-08/">August, 2017</a></h2>
    <p class="blog-post-meta"><time datetime="2017-08-01T11:51:52&#43;03:00">Tue Aug 01, 2017</time> by Alan Orth in 

<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
  </header>
  <h2 id="2017-08-01">2017-08-01</h2>

<ul>
<li>Linode sent an alert that CGSpace (linode18) had been using 350% CPU for the past two hours</li>
<li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling concurrently with a large number of bots (~100 total, mostly Baidu and Google)</li>
<li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li>
<li>This means our Tomcat Crawler Session Valve is working</li>
<li>But many of the bots are browsing dynamic URLs like:

<ul>
<li><code>/handle/10568/3353/discover</code></li>
<li><code>/handle/10568/16510/browse</code></li>
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to keep the bots off the per-handle versions of these pages too!</li>
<li>Relevant issue from DSpace Jira (partially resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
</ul>
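
<p>One possible direction for the <code>robots.txt</code> gap noted above is wildcard rules. The fragment below is only a sketch, not the file CGSpace actually serves: it assumes the crawlers involved honor the <code>*</code> wildcard in <code>Disallow</code> paths (Google and Bing document support for it, but other bots may ignore it), and the <code>User-agent: *</code> grouping is likewise an assumption.</p>

<pre><code># Hypothetical robots.txt additions (an assumption, not the deployed file)
User-agent: *
# Top-level rules of the kind already in place
Disallow: /discover
Disallow: /browse
# Wildcard rules meant to cover per-handle dynamic pages such as
# /handle/10568/3353/discover and /handle/10568/16510/browse
Disallow: /handle/*/discover
Disallow: /handle/*/browse
</code></pre>

<p>Because <code>robots.txt</code> rules are prefix matches, <code>Disallow: /discover</code> alone does not cover paths that begin with <code>/handle/</code>; the wildcard lines are what would catch those. Bots that do not respect wildcards would still need to be handled another way.</p>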

<p></p>

  
</article> 


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 offset-sm-1 blog-sidebar">
  

        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2017-08/">August, 2017</a></li>

<li><a href="/cgspace-notes/2017-07/">July, 2017</a></li>

<li><a href="/cgspace-notes/2017-06/">June, 2017</a></li>

<li><a href="/cgspace-notes/2017-05/">May, 2017</a></li>

<li><a href="/cgspace-notes/2017-04/">April, 2017</a></li>

    </ol>
  </section>

  
  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">
      
      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
      
      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
      
      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
      
    </ol>
  </section>
  
</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->

    <footer class="blog-footer">
      <p>
      
      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
      
      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>

  </body>

</html>
Add notes for 2017-08-01 2017-08-01 11:57:37 +03:00