<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="August, 2017" />
<meta property="og:description" content="2017-08-01

Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:

/handle/10568/3353/discover
/handle/10568/16510/browse

The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
We might actually have to block these requests with HTTP 403 depending on the user agent
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet

" />
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-08/" />
|
|
|
|
|
|
|
|
<meta property="article:published_time" content="2017-08-01T11:51:52+03:00"/>
|
|
<meta property="article:modified_time" content="2017-08-01T16:31:58+03:00"/>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
|
|
|
|
|
|
<meta name="twitter:text:title" content="August, 2017"/>
|
|
<meta name="twitter:title" content="August, 2017"/>
|
|
<meta name="twitter:description" content="2017-08-01
|
|
|
|
|
|
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
|
|
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
|
|
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
|
|
This means our Tomcat Crawler Session Valve is working
|
|
But many of the bots are browsing dynamic URLs like:
|
|
|
|
|
|
/handle/10568/3353/discover
|
|
/handle/10568/16510/browse
|
|
|
|
The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
|
|
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
|
|
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
|
|
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
|
|
We might actually have to block these requests with HTTP 403 depending on the user agent
|
|
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
|
|
This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
|
|
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
|
|
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
|
|
|
|
|
|
"/>
|
|
|
|
<meta name="generator" content="Hugo 0.25.1" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "August, 2017",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
|
|
"wordCount": "390",
|
|
"datePublished": "2017-08-01T11:51:52+03:00",
|
|
"dateModified": "2017-08-01T16:31:58+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-08/">
|
|
|
|
<title>August, 2017 | CGSpace Notes</title>
|
|
|
|
<!-- combined, minified CSS -->
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-j3n8sYdzztDYtVc80KiiuOXoCg5Bjz0zYyLGzDMW8RbfA0u5djbF0GO3bVOPoLyN" crossorigin="anonymous">
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2017-08/">August, 2017</a></h2>
<p class="blog-post-meta"><time datetime="2017-08-01T11:51:52+03:00">Tue Aug 01, 2017</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-08-01">2017-08-01</h2>
|
|
|
|
<ul>
|
|
<li>Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours</li>
|
|
<li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)</li>
|
|
<li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li>
|
|
<li>This means our Tomcat Crawler Session Valve is working</li>
|
|
<li>But many of the bots are browsing dynamic URLs like:
|
|
|
|
<ul>
|
|
<li>/handle/10568/3353/discover</li>
|
|
<li>/handle/10568/16510/browse</li>
|
|
</ul></li>
|
|
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs… we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent (see the robots.txt and nginx sketches after this list)</li>
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code> (a GREL alternative is sketched after this list)</li>
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
</ul>
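<p>One possible approach would be to extend <code>robots.txt</code> with wildcard rules for the per-handle URLs. Wildcards are not part of the original robots.txt standard, but the major crawlers (Google, Bing, and reportedly Baidu) claim to honor them, so this is only a best-effort hint and the patterns below are an untested sketch, not something already in place:</p>

<pre><code># hypothetical additions to robots.txt: also block discover/browse under handles
User-agent: *
Disallow: /discover
Disallow: /browse
Disallow: /handle/*/discover
Disallow: /handle/*/browse
</code></pre>

<p>If the bots ignore that, the harder option is to return HTTP 403 from nginx for known crawler user agents on those dynamic URLs. This is a rough sketch assuming nginx sits in front of Tomcat; the user agent list, the location regex, and the <code>$is_bot</code> variable are illustrative only:</p>

<pre><code># map common crawler user agents to a flag (the list here is illustrative)
map $http_user_agent $is_bot {
    default                                     0;
    "~*(Googlebot|Baiduspider|bingbot|Slurp)"   1;
}

server {
    # ... existing configuration ...

    # forbid crawlers from the dynamic discover/browse pages under handles
    location ~ ^/handle/[0-9]+/[0-9]+/(discover|browse) {
        if ($is_bot) {
            return 403;
        }
        # otherwise fall through to the normal proxy to Tomcat (not shown)
    }
}
</code></pre>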
<p></p>
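<p>For future exports, the newlines could probably be stripped inside OpenRefine itself before writing the CSV, using a GREL transform on the <code>dc.description.abstract</code> column. A minimal, untested sketch using GREL's standard <code>replace</code> function:</p>

<pre><code>value.replace("\r", " ").replace("\n", " ")
</code></pre>

<p>That should keep each abstract on a single line so the exported CSV doesn't break.</p>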
<h2 id="2017-08-02">2017-08-02</h2>
|
|
|
|
<ul>
|
|
<li>Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)</li>
|
|
<li>I think Atmire’s Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can’t figure it out</li>
|
|
<li>I had a look at the moduel configuration and couldn’t figure out a way to do this, so I <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets">opened a ticket on the Atmire tracker</a></li>
|
|
<li>Atmire responded about the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=500">missing workflow statistics issue</a> a few weeks ago but I didn’t see it for some reason</li>
|
|
<li>They said they added a publication and saw the workflow stat for the user, so I should try again and let them know</li>
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
|
|
</article>
</div> <!-- /.blog-main -->

<aside class="col-sm-3 offset-sm-1 blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2017-08/">August, 2017</a></li>
<li><a href="/cgspace-notes/2017-07/">July, 2017</a></li>
<li><a href="/cgspace-notes/2017-06/">June, 2017</a></li>
<li><a href="/cgspace-notes/2017-05/">May, 2017</a></li>
<li><a href="/cgspace-notes/2017-04/">April, 2017</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>