<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2018" />
<meta property="og:description" content="2018-04-01
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-04/" />
<meta property="article:published_time" content="2018-04-01T16:13:54&#43;02:00"/>
<meta property="article:modified_time" content="2018-04-04T17:01:08&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2018"/>
<meta name="twitter:description" content="2018-04-01
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.38.2" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "April, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-04/",
"wordCount": "1005",
"datePublished": "2018-04-01T16:13:54&#43;02:00",
"dateModified": "2018-04-04T17:01:08&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-04/">
<title>April, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-CoMzlF7G4xk3ftqRr7leobnWP85AuISUJljMFjtTG/UHyP/&#43;bBwWAvBlXkB4VQQk" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-04/">April, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-04-01T16:13:54&#43;02:00">Sun Apr 01, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2018-04-01">2018-04-01</h2>
<ul>
<li>I tried to test something on DSpace Test but noticed that it has been down since god knows when</li>
<li>Catalina logs at least show some memory errors yesterday:</li>
</ul>
<pre><code>Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;ContainerBackgroundProcessor[StandardEngine[Catalina]]&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre>
<ul>
<li>So this is getting super annoying</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
<li>For some reason Listings and Reports is not giving any results for any queries now&hellip;</li>
<li>I posted a message on Yammer to ask if people are using the Duplicate Check step from the Metadata Quality Module</li>
<li>Help Lili Szilagyi with a question about statistics on some CCAFS items</li>
</ul>
<h2 id="2018-04-04">2018-04-04</h2>
<ul>
<li>Peter noticed that there were still some old CRP names on CGSpace, because I hadn&rsquo;t forced the Discovery index to be updated after I fixed the others last week</li>
<li>For completeness I re-ran the CRP corrections on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
</code></pre>
<ul>
<li>Then started a full Discovery index:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 76m13.841s
user 8m22.960s
sys 2m2.498s
</code></pre>
<ul>
<li>Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
</code></pre>
<ul>
<li>The CSV format of <code>jtohme-2018-04-04.csv</code> was:</li>
</ul>
<pre><code class="language-csv">dc.contributor.author,cg.creator.id
&quot;Tohme, Joseph M.&quot;,Joe Tohme: 0000-0003-2765-7101
</code></pre>
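<ul>
<li>The author name contains a comma, so it has to be quoted in the CSV. A minimal sketch of how Python&rsquo;s <code>csv</code> module reads the two lines above (the variable names are only for illustration, and this is not the code from <code>add-orcid-identifiers-csv.py</code>):</li>
</ul>
<pre><code class="language-python">import csv
import io

# The two lines from jtohme-2018-04-04.csv, inlined so the sketch is self-contained
csv_text = 'dc.contributor.author,cg.creator.id\n"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101\n'

# DictReader uses the header row as keys and honors the double quotes,
# so the comma inside the author name does not split the field
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row['dc.contributor.author'], '=', row['cg.creator.id'])
</code></pre>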
<ul>
<li>There was a quoting error in my CRP CSV and the replacements for <code>Forests, Trees and Agroforestry</code> got messed up</li>
<li>So I fixed them and had to re-index again!</li>
<li>I started preparing the git branch for the DSpace 5.5→5.8 upgrade:</li>
</ul>
<pre><code>$ git checkout -b 5_x-dspace-5.8 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.8
</code></pre>
<ul>
<li>I was prepared to skip some commits that I had cherry-picked from the upstream <code>dspace-5_x</code> branch when we did the DSpace 5.5 upgrade (see notes on 2016-10-19 and 2017-12-17):
<ul>
<li>[DS-3246] Improve cleanup in recyclable components (upstream commit on dspace-5_x: 9f0f5940e7921765c6a22e85337331656b18a403)</li>
<li>[DS-3250] applying patch provided by Atmire (upstream commit on dspace-5_x: c6fda557f731dbc200d7d58b8b61563f86fe6d06)</li>
<li>bump up to latest minor pdfbox version (upstream commit on dspace-5_x: b5330b78153b2052ed3dc2fd65917ccdbfcc0439)</li>
<li>DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)</li>
</ul></li>
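<li>&hellip; but somehow git knew, and didn&rsquo;t include them in my interactive rebase! (Presumably because <code>git rebase</code> leaves out commits whose changes are already present upstream)</li>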
<li>I need to send this branch to Atmire and also arrange payment (see <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">ticket #560</a> in their tracker)</li>
<li>Fix Sisay&rsquo;s SSH access to the new DSpace Test server (linode19)</li>
</ul>
<h2 id="2018-04-05">2018-04-05</h2>
<ul>
<li>Fix Sisay&rsquo;s sudo access on the new DSpace Test server (linode19)</li>
<li>The reindexing process on DSpace Test took <em>forever</em> yesterday:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 599m32.961s
user 9m3.947s
sys 2m52.585s
</code></pre>
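<ul>
<li>A rough way to read these numbers, assuming <code>user</code> plus <code>sys</code> approximates CPU time: the run spent very little of its wall-clock time on the CPU, which fits with waiting on slow storage. The sketch below uses the figures from this run and from the CGSpace run on 2018-04-04 (different servers, so the comparison is only indicative; the helper name is made up):</li>
</ul>
<pre><code class="language-python">import re

def to_seconds(value):
    # Convert a duration like '599m32.961s' (as printed by time) to seconds
    minutes, seconds = re.fullmatch(r'(\d+)m([\d.]+)s', value).groups()
    return int(minutes) * 60 + float(seconds)

# real, user, sys as recorded in these notes
runs = {
    'CGSpace 2018-04-04': ('76m13.841s', '8m22.960s', '2m2.498s'),
    'DSpace Test': ('599m32.961s', '9m3.947s', '2m52.585s'),
}

cpu_share = {}
for name, (real, user, sys_time) in runs.items():
    wall = to_seconds(real)
    cpu = to_seconds(user) + to_seconds(sys_time)
    cpu_share[name] = cpu / wall
    print(f'{name}: {wall / 60:.0f} min wall, {cpu_share[name]:.1%} on CPU')

slowdown = to_seconds('599m32.961s') / to_seconds('76m13.841s')
print(f'DSpace Test took about {slowdown:.1f}x as long')
</code></pre>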
<ul>
<li>So we really should not use this Linode block storage for Solr</li>
<li>Using it for the assetstore might be fine, but that would complicate configuration and deployment (ughhh)</li>
<li>Better to use Linode block storage only for backup</li>
<li>Help Peter with the GDPR compliance / reporting form for CGSpace</li>
<li>DSpace Test crashed due to memory issues again:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
16
</code></pre>
<ul>
<li>I ran all system updates on DSpace Test and rebooted it</li>
<li>Proof some records on DSpace Test for Udana from IWMI</li>
<li>He has improved on the small syntax and consistency issues, but there are still larger problems such as not linking to DOIs and copying titles incorrectly</li>
</ul>
<h2 id="2018-04-10">2018-04-10</h2>
<ul>
<li>I got a notice that CGSpace CPU usage was very high this morning</li>
<li>Looking at the nginx logs, here are the top users today so far:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
282 207.46.13.112
286 54.175.208.220
287 207.46.13.113
298 66.249.66.153
322 207.46.13.114
780 104.196.152.243
3994 178.154.200.38
4295 70.32.83.92
4388 95.108.181.88
7653 45.5.186.2
</code></pre>
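<ul>
<li>The same tally can be sketched in Python with <code>collections.Counter</code>. The log lines below are shortened stand-ins (not real entries) and the function name is hypothetical; like the awk command, it assumes the client IP is the first whitespace-separated field:</li>
</ul>
<pre><code class="language-python">from collections import Counter

def top_clients(lines, date, limit=10):
    # Count requests per client IP for lines mentioning the given date,
    # roughly: grep date | awk '{print $1}' | sort | uniq -c | sort -n | tail
    # (most_common lists the busiest client first, the shell pipeline lists it last)
    counts = Counter(line.split()[0] for line in lines if date in line)
    return counts.most_common(limit)

# Shortened stand-in lines, not real log entries
sample = [
    '45.5.186.2 - - [10/Apr/2018:05:00:01 +0000] ...',
    '95.108.181.88 - - [10/Apr/2018:05:00:02 +0000] ...',
    '45.5.186.2 - - [10/Apr/2018:05:00:03 +0000] ...',
    '45.5.186.2 - - [09/Apr/2018:23:59:59 +0000] ...',
]

for ip, hits in top_clients(sample, '10/Apr/2018'):
    print(hits, ip)
</code></pre>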
<ul>
<li>45.5.186.2 is of course CIAT</li>
<li>95.108.181.88 appears to be Yandex:</li>
</ul>
<pre><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] &quot;GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1&quot; 200 2638 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
</code></pre>
<ul>
<li>And for some reason Yandex created a lot of Tomcat sessions today:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
4363
</code></pre>
<ul>
<li>70.32.83.92 appears to be some harvester we&rsquo;ve seen before, but on a new IP</li>
<li>They are not creating new Tomcat sessions so there is no problem there</li>
<li>178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
3982
</code></pre>
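<ul>
<li>Note that <code>grep -c</code> counts matching log lines, which may overstate the number of distinct sessions. A minimal sketch for counting unique session IDs per IP, assuming only the <code>session_id=...:ip_addr=...</code> pattern used in the grep above (the sample fragments are made up and the function name is hypothetical):</li>
</ul>
<pre><code class="language-python">import re

def unique_sessions(lines, ip):
    # Same pattern as the grep above, with the session ID captured
    pattern = re.compile(r'session_id=([A-Z0-9]{32}):ip_addr=' + re.escape(ip))
    sessions = set()
    matching_lines = 0
    for line in lines:
        match = pattern.search(line)
        if match:
            matching_lines += 1
            sessions.add(match.group(1))
    return matching_lines, len(sessions)

# Made-up fragments containing only the part of a log line the pattern needs
sample = [
    'session_id=' + 'A' * 32 + ':ip_addr=95.108.181.88',
    'session_id=' + 'A' * 32 + ':ip_addr=95.108.181.88',
    'session_id=' + 'B' * 32 + ':ip_addr=95.108.181.88',
    'session_id=' + 'C' * 32 + ':ip_addr=178.154.200.38',
]

lines_matched, distinct = unique_sessions(sample, '95.108.181.88')
print(lines_matched, 'matching lines,', distinct, 'distinct sessions')
</code></pre>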
<ul>
<li>I&rsquo;m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve</li>
<li>Let&rsquo;s try a manual request with and without their user agent:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:18:37 GMT
Expires: Tue, 10 Apr 2018 06:18:37 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: HTTPie/0.9.9
HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:20:08 GMT
Expires: Tue, 10 Apr 2018 06:20:08 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
</code></pre>
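<ul>
<li>The difference between the two responses is the <code>Set-Cookie</code> header: only the request without the Yandex user agent was handed a new <code>JSESSIONID</code>. A small sketch of that check, with headers copied (and trimmed) from the responses above and a hypothetical helper name:</li>
</ul>
<pre><code class="language-python">def issues_new_session(headers):
    # A response that starts a new Tomcat session carries a JSESSIONID cookie
    cookie = headers.get('Set-Cookie', '')
    return cookie.startswith('JSESSIONID=')

# Trimmed copies of the two responses above
with_yandex_agent = {
    'Content-Type': 'image/jpeg;charset=ISO-8859-1',
    'Vary': 'User-Agent',
}
with_default_agent = {
    'Content-Type': 'image/jpeg;charset=ISO-8859-1',
    'Set-Cookie': 'JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly',
    'Vary': 'User-Agent',
}

print('YandexBot agent gets new session:', issues_new_session(with_yandex_agent))
print('HTTPie agent gets new session:', issues_new_session(with_default_agent))
</code></pre>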
<ul>
<li>So it definitely looks like Yandex requests are getting assigned a session from the Crawler Session Manager valve</li>
<li>And if I look at the DSpace log I see its IP sharing a session with other crawlers like Google (66.249.66.153)</li>
<li>Indeed the number of Tomcat sessions appears to be normal:</li>
</ul>
<p><img src="/cgspace-notes/2018/04/jmx_dspace_sessions-week.png" alt="Tomcat sessions week" /></p>
<ul>
<li>Looks like the total number of requests processed by nginx in March went down compared to the previous months:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2018&quot;
2266594
real 0m13.658s
user 0m16.533s
sys 0m1.087s
</code></pre>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-04/">April, 2018</a></li>
<li><a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li><a href="/cgspace-notes/2018-02/">February, 2018</a></li>
<li><a href="/cgspace-notes/2018-01/">January, 2018</a></li>
<li><a href="/cgspace-notes/2017-12/">December, 2017</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>