Update notes for 2019-03-18

This commit is contained in:
2019-03-18 15:32:22 +02:00
parent af2a71393c
commit 400a265bde
4 changed files with 320 additions and 14 deletions

View File

@ -42,7 +42,7 @@ sys 0m1.979s
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-02/" />
<meta property="article:published_time" content="2019-02-01T21:37:30&#43;02:00"/>
<meta property="article:modified_time" content="2019-03-07T11:37:53&#43;02:00"/>
<meta property="article:modified_time" content="2019-03-18T15:25:49&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2019"/>
@ -91,7 +91,7 @@ sys 0m1.979s
"url": "https://alanorth.github.io/cgspace-notes/2019-02/",
"wordCount": "7700",
"datePublished": "2019-02-01T21:37:30&#43;02:00",
"dateModified": "2019-03-07T11:37:53&#43;02:00",
"dateModified": "2019-03-18T15:25:49&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -712,7 +712,7 @@ Please see the DSpace documentation for assistance.
<li>Last week Hector Tobon from CCAFS asked me about the Creative Commons 3.0 Intergovernmental Organizations (IGO) license because it is not in the list of SPDX licenses
<ul>
<li>Today I made <a href="http://13.57.134.254/app/license_requests/15/">a request</a> to the <a href="https://github.com/spdx/license-list-XML/blob/master/CONTRIBUTING.md">SPDX using their web form</a> to include this class of Creative Commons licenses](<a href="https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations">https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations</a>)</li>
<li>Today I made <a href="http://13.57.134.254/app/license_requests/15/">a request</a> to the <a href="https://github.com/spdx/license-list-XML/blob/master/CONTRIBUTING.md">SPDX using their web form</a> to include this <a href="https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations">class of Creative Commons licenses</a></li>
</ul></li>
<li>Testing the <code>mail.server.disabled</code> property that I noticed in <code>dspace.cfg</code> recently

View File

@ -25,7 +25,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-03/" />
<meta property="article:published_time" content="2019-03-01T12:16:30&#43;01:00"/>
<meta property="article:modified_time" content="2019-03-17T21:50:55&#43;02:00"/>
<meta property="article:modified_time" content="2019-03-17T22:24:02&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2019"/>
@ -55,9 +55,9 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
"@type": "BlogPosting",
"headline": "March, 2019",
"url": "https://alanorth.github.io/cgspace-notes/2019-03/",
"wordCount": "2178",
"wordCount": "2959",
"datePublished": "2019-03-01T12:16:30&#43;01:00",
"dateModified": "2019-03-17T21:50:55&#43;02:00",
"dateModified": "2019-03-17T22:24:02&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -371,7 +371,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
</code></pre>
<ul>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, <del>but spikes of over 1,000 today</del>, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
</ul>
<pre><code># grep -c 'SQL QueryTable Error' dspace.log.2019-0*
@ -526,6 +526,166 @@ $ psql dspacetest -c 'update bundle set primary_bitstream_id=NULL where primary_
UPDATE 1
</code></pre>
<h2 id="2019-03-18">2019-03-18</h2>
<ul>
<li>I noticed that the regular expression for validating lines from input files in my <code>agrovoc-lookup.py</code> script was skipping characters with accents, etc, so I changed it to use the <code>\w</code> character class for words instead of trying to match <code>[A-Z]</code> etc&hellip;
<ul>
<li>We have a Spanish and French subjects so this is very important</li>
<li>Also there were some subjects with apostrophes, dashes, and periods&hellip; these are probably invalid AGROVOC subject terms, but we should save them to the rejects file instead of skipping them nevertheless</li>
</ul></li>
<li>Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my <code>agrovoc-lookup.py</code> script:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
dspace=# \q
$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv &gt; 2019-03-18-top-1500-subject.csv
$ ./agrovoc-lookup.py -l en -i 2019-03-18-top-1500-subject.csv -om /tmp/en-subjects-matched.txt -or /tmp/en-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l es -i 2019-03-18-top-1500-subject.csv -om /tmp/es-subjects-matched.txt -or /tmp/es-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l fr -i 2019-03-18-top-1500-subject.csv -om /tmp/fr-subjects-matched.txt -or /tmp/fr-subjects-unmatched.txt
$ cat /tmp/*-subjects-matched.txt | sort -u &gt; /tmp/subjects-matched-sorted.txt
$ wc -l /tmp/subjects-matched-sorted.txt
1317 /tmp/subjects-matched-sorted.txt
$ sort -u 2019-03-18-top-1500-subject.csv &gt; /tmp/1500-subjects-sorted.txt
$ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt &gt; 2019-03-18-subjects-unmatch
ed.txt
$ wc -l 2019-03-18-subjects-unmatched.txt
183 2019-03-18-subjects-unmatched.txt
</code></pre>
<ul>
<li>So the new total of matched terms with the updated regex is 1317 and unmatched is 183 (previous number of matched terms was 1187)</li>
<li>We are getting the blank page issue on CGSpace again today and I see a <del>large number</del> of the &ldquo;SQL QueryTable Error&rdquo; in the DSpace log again (last time was 2019-03-15):</li>
</ul>
<pre><code>$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
dspace.log.2019-03-15:929
dspace.log.2019-03-16:67
dspace.log.2019-03-17:72
dspace.log.2019-03-18:1038
</code></pre>
<ul>
<li>Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower:</li>
</ul>
<pre><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
8
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
9 dspace.log.2019-03-08
25 dspace.log.2019-03-14
12 dspace.log.2019-03-15
67 dspace.log.2019-03-16
72 dspace.log.2019-03-17
8 dspace.log.2019-03-18
</code></pre>
<ul>
<li>It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use <code>-I</code> to say binary files don&rsquo;t match</li>
<li>Anyways, the full error in DSpace&rsquo;s log is:</li>
</ul>
<pre><code>2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
</code></pre>
<ul>
<li>There is a low number of connections to PostgreSQL currently:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | wc -l
33
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
6 dspaceApi
7 dspaceCli
15 dspaceWeb
</code></pre>
<ul>
<li>I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:</li>
</ul>
<pre><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &quot;waiting&quot; does not exist at character 217
</code></pre>
<ul>
<li>This is unrelated and apparently due to <a href="https://github.com/munin-monitoring/munin/issues/746">Munin checking a column that was changed in PostgreSQL 9.6</a></li>
<li>I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it&rsquo;s a Cocoon thing?</li>
<li>Looking in the cocoon logs I see a large number of warnings about &ldquo;Can not load requested doc&rdquo; around 11AM and 12PM:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
2 2019-03-18 00:
6 2019-03-18 02:
3 2019-03-18 04:
1 2019-03-18 05:
1 2019-03-18 07:
2 2019-03-18 08:
4 2019-03-18 09:
5 2019-03-18 10:
863 2019-03-18 11:
203 2019-03-18 12:
14 2019-03-18 13:
1 2019-03-18 14:
</code></pre>
<ul>
<li>And a few days ago on 2019-03-15 when I happened last it was in the afternoon when it happened and the same pattern occurs then around 12PM:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
4 2019-03-15 01:
3 2019-03-15 02:
1 2019-03-15 03:
13 2019-03-15 04:
1 2019-03-15 05:
2 2019-03-15 06:
3 2019-03-15 07:
27 2019-03-15 09:
9 2019-03-15 10:
3 2019-03-15 11:
2 2019-03-15 12:
531 2019-03-15 13:
274 2019-03-15 14:
4 2019-03-15 15:
75 2019-03-15 16:
5 2019-03-15 17:
5 2019-03-15 18:
6 2019-03-15 19:
2 2019-03-15 20:
4 2019-03-15 21:
3 2019-03-15 22:
1 2019-03-15 23:
</code></pre>
<ul>
<li>And again on 2019-03-08, surprise surprise, it happened in the morning:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
11 2019-03-08 01:
3 2019-03-08 02:
1 2019-03-08 03:
2 2019-03-08 04:
1 2019-03-08 05:
1 2019-03-08 06:
1 2019-03-08 08:
425 2019-03-08 09:
432 2019-03-08 10:
717 2019-03-08 11:
59 2019-03-08 12:
</code></pre>
<ul>
<li>I&rsquo;m not sure if it&rsquo;s cocoon or that&rsquo;s just a symptom of something else</li>
</ul>
<!-- vim: set sw=2 ts=2: -->

View File

@ -4,12 +4,12 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-03/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-02/</loc>
<lastmod>2019-03-07T11:37:53+02:00</lastmod>
<lastmod>2019-03-18T15:25:49+02:00</lastmod>
</url>
<url>
@ -214,7 +214,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>
@ -225,7 +225,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>
@ -237,13 +237,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>