<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2020" />
<meta property="og:description" content="2020-05-02
Peter said that CTA is having problems submitting an item to CGSpace
Looking at the PostgreSQL stats it seems to be the same issue that Tezira was having last week, as I see the number of connections in &lsquo;idle in transaction&rsquo; and &lsquo;waiting for lock&rsquo; state are increasing again
I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2.11, and there were some bugs related to transactions fixed in 42.2.12 (which I had updated in the Ansible playbooks, but not deployed yet)
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-05/" />
<meta property="article:published_time" content="2020-05-02T09:52:04+03:00" />
<meta property="article:modified_time" content="2020-05-20T09:44:36+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2020"/>
<meta name="twitter:description" content="2020-05-02
Peter said that CTA is having problems submitting an item to CGSpace
Looking at the PostgreSQL stats it seems to be the same issue that Tezira was having last week, as I see the number of connections in &lsquo;idle in transaction&rsquo; and &lsquo;waiting for lock&rsquo; state are increasing again
I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2.11, and there were some bugs related to transactions fixed in 42.2.12 (which I had updated in the Ansible playbooks, but not deployed yet)
"/>
<meta name="generator" content="Hugo 0.71.0" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-05/",
"wordCount": "1288",
"datePublished": "2020-05-02T09:52:04+03:00",
"dateModified": "2020-05-20T09:44:36+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-05/">
<title>May, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-05/">May, 2020</a></h2>
<p class="blog-post-meta"><time datetime="2020-05-02T09:52:04+03:00">Sat May 02, 2020</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-05-02">2020-05-02</h2>
<ul>
<li>Peter said that CTA is having problems submitting an item to CGSpace
<ul>
<li>Looking at the PostgreSQL stats it seems to be the same issue that Tezira was having last week, as the number of connections in the &lsquo;idle in transaction&rsquo; and &lsquo;waiting for lock&rsquo; states is increasing again</li>
<li>I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2.11, and there were some bugs related to transactions fixed in 42.2.12 (which I had updated in the Ansible playbooks, but not deployed yet)</li>
</ul>
</li>
</ul>
<h2 id="2020-05-03">2020-05-03</h2>
<ul>
<li>Purge a few remaining bots from CGSpace Solr statistics that I had identified a few months ago
<ul>
<li><code>lua-resty-http/0.10 (Lua) ngx_lua/10000</code></li>
<li><code>omgili/0.5 +http://omgili.com</code></li>
<li><code>IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)</code></li>
<li><code>Twurly v1.1 (https://twurly.org)</code></li>
<li><code>Pattern/2.6 +http://www.clips.ua.ac.be/pattern</code></li>
<li><code>CyotekWebCopy/1.7 CyotekHTTP/2.0</code></li>
</ul>
</li>
<li>This is only about 2,500 hits total from the last ten years, and half of these bots no longer seem to exist, so I won&rsquo;t bother submitting them to the COUNTER-Robots project</li>
<li>I noticed that our custom themes were incorrectly linking to the OpenSearch XML file
<ul>
<li>The bug <a href="https://jira.lyrasis.org/browse/DS-2592">was fixed</a> for Mirage2 in 2015</li>
<li>Note that this did not prevent OpenSearch itself from working</li>
<li>I will patch this on our DSpace 5.x and 6.x branches</li>
</ul>
</li>
</ul>
<h2 id="2020-05-06">2020-05-06</h2>
<ul>
<li>Atmire responded asking for more information about the Solr statistics processing bug in CUA so I sent them some full logs
<ul>
<li>Also I asked again about the Maven variable interpolation issue for <code>cua.version.number</code>, and if they would be willing to upgrade CUA to use Font Awesome 5 instead of 4.</li>
</ul>
</li>
</ul>
<h2 id="2020-05-07">2020-05-07</h2>
<ul>
<li>Linode sent an alert that there was high CPU usage on CGSpace (linode18) early this morning
<ul>
<li>I looked at the nginx logs using goaccess and I found a few IPs making lots of requests around then:</li>
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;07/May/2020:(01|03|04)&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The two main IPs making requests around then are 188.134.31.88 and 212.34.8.188
<ul>
<li>The first is in Russia and it is hitting mostly XMLUI Discover links using <em>dozens</em> of different user agents, a total of 20,000 requests this week</li>
<li>The second IP is CodeObia testing AReS, a total of 171,000 hits this month</li>
<li>I will purge both of those IPs from the Solr stats using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
Purging 171641 hits from 212.34.8.188 in statistics
Purging 20691 hits from 188.134.31.88 in statistics
Total number of bot hits purged: 192332
</code></pre><ul>
<li>And then I will add 188.134.31.88 to the nginx bad bot list and tell CodeObia to please use a &ldquo;bot&rdquo; user agent</li>
<li>I also changed the nginx config to block requests with blank user agents</li>
</ul>
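<ul>
<li>goaccess is interactive, so a rough non-interactive way to answer the same question (which IPs made the most requests in a given time window) is sketched below
<ul>
<li>This is only an illustration: <code>top_ips</code> is a hypothetical helper that is not part of the original notes, and it assumes the logs use nginx&rsquo;s combined format with the client IP in the first field</li>
</ul>
</li>
</ul>
<pre><code># Hypothetical helper (an assumption, not from the original notes): read
# combined-format access log lines on stdin, keep those matching the
# extended regex given as the first argument, and print the ten busiest
# client IPs with their request counts, busiest first.
top_ips() {
    grep -E "$1" | awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n 10
}

# Example: cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | top_ips '07/May/2020:(01|03|04)'
</code></pre>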
<h2 id="2020-05-11">2020-05-11</h2>
<ul>
<li>Bizu said she was having issues submitting to CGSpace last week
<ul>
<li>The issue sounds like the one Tezira and CTA were having in the last few weeks</li>
<li>I looked at the PostgreSQL graphs and see there are a lot of connections in &ldquo;idle in transaction&rdquo; and &ldquo;waiting for lock&rdquo; state:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/05/postgres_connections_cgspace-week.png" alt="PostgreSQL connections"></p>
<ul>
<li>I think I&rsquo;ll downgrade the PostgreSQL JDBC driver from 42.2.12 to 42.2.10, which was the version we were using before these issues started happening</li>
<li>Atmire sent some feedback about my ongoing issues with their CUA module, but none of it was conclusive yet
<ul>
<li>Regarding Font Awesome 5 they will check how much work it will take and give me a quote</li>
</ul>
</li>
<li>Abenet said some users are questioning why the statistics dropped so much lately, so I made a <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=674923030216704">post to Yammer</a> to explain about the robots</li>
<li>Last week Peter had asked me to add a new ILRI author&rsquo;s ORCID iD
<ul>
<li>I added it to the controlled vocabulary and tagged the user&rsquo;s existing ~11 items in CGSpace using this CSV file with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-11-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Lutakome, P.&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
&quot;Lutakome, Pius&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>I had to restart Tomcat five times before all Solr statistics cores came up OK, ugh.</li>
</ul>
</li>
</ul>
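<ul>
<li>Before running the ORCID tagging script on a CSV like the one above, a quick sanity check of the identifiers can catch typos
<ul>
<li>This is a minimal sketch only: <code>check_orcid_csv</code> is a hypothetical helper, not part of <code>add-orcid-identifiers-csv.py</code>, and it assumes a single header row and that each value ends with the ORCID iD</li>
<li>It prints every data row that does not end in a well-formed iD and prints nothing when all rows look fine</li>
</ul>
</li>
</ul>
<pre><code># Hypothetical pre-flight check (an assumption, not part of the original
# script): skip the header row of the CSV given as the first argument and
# print each data row whose last value does not end in a well-formed ORCID
# iD (four groups of four characters, the final one a digit or X).
# Note that grep exits non-zero when there is nothing to print, which here
# means that every row passed the check.
check_orcid_csv() {
    tail -n +2 "$1" | grep -vE ': [0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]"?$'
}

# Example: check_orcid_csv 2020-05-11-add-orcids.csv
</code></pre>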
<h2 id="2020-05-12">2020-05-12</h2>
<ul>
<li>Peter noticed that CGSpace is no longer on AReS, because I blocked all requests that don&rsquo;t specify a user agent
<ul>
<li>I&rsquo;ve temporarily disabled that restriction and asked Moayad to look into how he can specify a user agent in the AReS harvester</li>
</ul>
</li>
</ul>
<h2 id="2020-05-13">2020-05-13</h2>
<ul>
<li>Atmire responded about Font Awesome and said they can switch to version 5 for 16 credits
<ul>
<li>I told them to go ahead</li>
</ul>
</li>
<li>Also, Atmire gave me a small workaround for the <code>cua.version.number</code> interpolation issue and said they would look into the crash that happens when processing our Solr stats</li>
<li>Run system updates and reboot AReS server (linode20) for the first time in almost 100 days
<ul>
<li>I notice that AReS now has some of CGSpace&rsquo;s data in it (but not all) since I dropped the user-agent restriction on the REST API yesterday</li>
</ul>
</li>
</ul>
<h2 id="2020-05-17">2020-05-17</h2>
<ul>
<li>Create an issue in the OpenRXV project for Moayad to change the default harvester user agent (<a href="https://github.com/ilri/OpenRXV/issues/36">#36</a>)</li>
</ul>
<h2 id="2020-05-18">2020-05-18</h2>
<ul>
<li>Atmire responded and said they still can&rsquo;t figure out the CUA statistics issue, though they seem to only be trying to understand what&rsquo;s going on using static analysis
<ul>
<li>I told them that they should try to run the code with the Solr statistics that I shared with them a few weeks ago</li>
</ul>
</li>
</ul>
<h2 id="2020-05-19">2020-05-19</h2>
<ul>
<li>Add ORCID identifier for Sirak Bahta
<ul>
<li>I added it to the controlled vocabulary and tagged the user&rsquo;s existing ~40 items in CGSpace using this CSV file with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-19-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Bahta, Sirak T.&quot;,&quot;Sirak Bahta: 0000-0002-5728-2489&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>An IITA user is having issues submitting to CGSpace and I see a rising number of PostgreSQL connections in the &ldquo;idle in transaction&rdquo; and &ldquo;waiting for lock&rdquo; states:</li>
</ul>
<p><img src="/cgspace-notes/2020/05/postgres_connections_cgspace-week2.png" alt="PostgreSQL connections"></p>
<ul>
<li>This is the same issue Tezira, Bizu, and CTA were having in the last few weeks, and I had already downgraded the PostgreSQL JDBC driver to 42.2.10, the last version I was using before this started
<ul>
<li>I will downgrade it to version 42.2.9 for now&hellip;</li>
<li>The only other thing I can think of is that I upgraded Tomcat to 7.0.103 in March</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode26) and reboot it</li>
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>After the system came back up I had to restart Tomcat 7 three times before all the Solr statistics cores came up OK</li>
</ul>
</li>
<li>Send Atmire a snapshot of the CGSpace database for them to possibly troubleshoot the CUA issue with DSpace 6</li>
</ul>
<h2 id="2020-05-20">2020-05-20</h2>
<ul>
<li>Send CodeObia some logos and footer text for the next phase of OpenRXV development (<a href="https://github.com/ilri/OpenRXV/issues/18">#18</a>)</li>
</ul>
<h2 id="2020-05-25">2020-05-25</h2>
<ul>
<li>Add ORCID identifier for CIAT author Manuel Francisco
<ul>
<li>I added it to the controlled vocabulary and tagged the user&rsquo;s existing ~27 items in CGSpace using this CSV file with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-25-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Díaz, Manuel F.&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
&quot;Díaz, Manuel Francisco&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>Last week Maria asked again about searching for items by accession or issue date
<ul>
<li>A few months ago I had told her to search for the ISO8601 date in Discovery search, which appears to work because it filters the results down quite a bit</li>
<li>She pointed out that the results include hits that don&rsquo;t exactly match, for example if part of the search string appears elsewhere like in the timestamp</li>
<li>I checked in Solr and the results are the same, so perhaps it&rsquo;s a limitation in Solr&hellip;?</li>
<li>So this effectively means that we don&rsquo;t have a way to create reports for items in an arbitrary date range shorter than a year:
<ul>
<li>DSpace advanced search is buggy or simply not designed to work like that</li>
<li>AReS Explorer currently only allows filtering by year, but will allow months soon</li>
<li>Atmire Listings and Reports only allows a &ldquo;Timespan&rdquo; of a year</li>
</ul>
</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2020-05/">May, 2020</a></li>
<li><a href="/cgspace-notes/2020-04/">April, 2020</a></li>
<li><a href="/cgspace-notes/2020-03/">March, 2020</a></li>
<li><a href="/cgspace-notes/2020-02/">February, 2020</a></li>
<li><a href="/cgspace-notes/2020-01/">January, 2020</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>