Update notes for 2019-03-18

This commit is contained in:
Alan Orth 2019-03-18 15:32:22 +02:00
parent af2a71393c
commit 400a265bde
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
4 changed files with 320 additions and 14 deletions

View File

@ -199,7 +199,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:300)
```
- Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently
- Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, ~~but spikes of over 1,000 today~~, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently
```
# grep -c 'SQL QueryTable Error' dspace.log.2019-0*
@ -339,4 +339,150 @@ $ psql dspacetest -c 'update bundle set primary_bitstream_id=NULL where primary_
UPDATE 1
```
## 2019-03-18
- I noticed that the regular expression for validating lines from input files in my `agrovoc-lookup.py` script was skipping characters with accents, etc., so I changed it to use the `\w` character class for words instead of trying to match `[A-Z]` etc... (see the sketch below)
- We have Spanish and French subjects, so this is very important
- Also, there were some subjects with apostrophes, dashes, and periods... these are probably invalid AGROVOC subject terms, but we should nevertheless save them to the rejects file instead of skipping them
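- As a rough sketch of the change (this is not the exact pattern from `agrovoc-lookup.py`, just an illustration of why `\w` helps):
```
import re

# Hypothetical before/after patterns, not the script's real ones:
ascii_only = re.compile(r'^[A-Z ]+$', re.IGNORECASE)  # skips accented terms
unicode_ok = re.compile(r'^[\w ]+$')  # \w is Unicode-aware in Python 3

for term in ('SOIL FERTILITY', 'AMÉLIORATION DES PLANTES'):
    print(term, bool(ascii_only.match(term)), bool(unicode_ok.match(term)))
```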
- Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my `agrovoc-lookup.py` script:
```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
dspace=# \q
$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv > 2019-03-18-top-1500-subject.csv
$ ./agrovoc-lookup.py -l en -i 2019-03-18-top-1500-subject.csv -om /tmp/en-subjects-matched.txt -or /tmp/en-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l es -i 2019-03-18-top-1500-subject.csv -om /tmp/es-subjects-matched.txt -or /tmp/es-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l fr -i 2019-03-18-top-1500-subject.csv -om /tmp/fr-subjects-matched.txt -or /tmp/fr-subjects-unmatched.txt
$ cat /tmp/*-subjects-matched.txt | sort -u > /tmp/subjects-matched-sorted.txt
$ wc -l /tmp/subjects-matched-sorted.txt
1317 /tmp/subjects-matched-sorted.txt
$ sort -u 2019-03-18-top-1500-subject.csv > /tmp/1500-subjects-sorted.txt
$ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt > 2019-03-18-subjects-unmatched.txt
$ wc -l 2019-03-18-subjects-unmatched.txt
183 2019-03-18-subjects-unmatched.txt
```
- So with the updated regex the totals are 1317 matched terms and 183 unmatched (previously 1187 terms matched)
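- As a sanity check: `comm -13` on sorted files is just a set difference (lines only in the second file), so a quick Python double-check of those numbers against the same temp files would be:
```
with open('/tmp/subjects-matched-sorted.txt') as f:
    matched = set(f)
with open('/tmp/1500-subjects-sorted.txt') as f:
    subjects = set(f)

# Subjects in the top 1500 that did not match any AGROVOC term
print(len(subjects - matched))  # should print 183, matching wc -l above
```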
- We are getting the blank page issue on CGSpace again today and I see a ~~large number~~ of the "SQL QueryTable Error" in the DSpace log again (last time was 2019-03-15):
```
$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
dspace.log.2019-03-15:929
dspace.log.2019-03-16:67
dspace.log.2019-03-17:72
dspace.log.2019-03-18:1038
```
- Though WTF, this grep actually seems to be giving weird, inaccurate results, and the real number of errors is much lower:
```
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
8
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
9 dspace.log.2019-03-08
25 dspace.log.2019-03-14
12 dspace.log.2019-03-15
67 dspace.log.2019-03-16
72 dspace.log.2019-03-17
8 dspace.log.2019-03-18
```
- It seems that grep is treating some of the log files as binary for some reason, so I guess I need to always use `-I` to make binary files not match
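- My guess (not verified) is that some logs contain stray NUL bytes that make grep treat them as binary; a quick Python check would be to count them per file:
```
from pathlib import Path

# Count NUL bytes per log; non-zero means grep will consider the file binary
for log in sorted(Path('.').glob('dspace.log.2019-03-*')):
    print(log, log.read_bytes().count(b'\x00'), 'NUL bytes')
```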
- Anyways, the full error in DSpace's log is:
```
2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
```
- There is a low number of connections to PostgreSQL currently:
```
$ psql -c 'select * from pg_stat_activity' | wc -l
33
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
6 dspaceApi
7 dspaceCli
15 dspaceWeb
```
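- For reference, a rough psycopg2 equivalent of that tally (assuming local peer authentication like the `psql` calls above, and that the `dspaceWeb`/`dspaceApi`/`dspaceCli` strings come from the `application_name` column):
```
import psycopg2

conn = psycopg2.connect('dbname=dspace')
with conn, conn.cursor() as cur:
    # Tally connections per application, like the grep | sort | uniq -c above
    cur.execute(
        'SELECT application_name, count(*) FROM pg_stat_activity '
        'GROUP BY application_name ORDER BY 2 DESC'
    )
    for app, count in cur.fetchall():
        print(f'{count:4d} {app or "(none)"}')
conn.close()
```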
- I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:
```
2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column "waiting" does not exist at character 217
```
- This is unrelated and apparently due to [Munin checking a column that was changed in PostgreSQL 9.6](https://github.com/munin-monitoring/munin/issues/746)
- I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it's a Cocoon thing?
- Looking in the cocoon logs I see a large number of warnings about "Can not load requested doc" around 11AM and 12PM:
```
$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
2 2019-03-18 00:
6 2019-03-18 02:
3 2019-03-18 04:
1 2019-03-18 05:
1 2019-03-18 07:
2 2019-03-18 08:
4 2019-03-18 09:
5 2019-03-18 10:
863 2019-03-18 11:
203 2019-03-18 12:
14 2019-03-18 13:
1 2019-03-18 14:
```
- And a few days ago on 2019-03-15, the last time this happened, the same pattern occurred, with the spike in the early afternoon:
```
$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
4 2019-03-15 01:
3 2019-03-15 02:
1 2019-03-15 03:
13 2019-03-15 04:
1 2019-03-15 05:
2 2019-03-15 06:
3 2019-03-15 07:
27 2019-03-15 09:
9 2019-03-15 10:
3 2019-03-15 11:
2 2019-03-15 12:
531 2019-03-15 13:
274 2019-03-15 14:
4 2019-03-15 15:
75 2019-03-15 16:
5 2019-03-15 17:
5 2019-03-15 18:
6 2019-03-15 19:
2 2019-03-15 20:
4 2019-03-15 21:
3 2019-03-15 22:
1 2019-03-15 23:
```
- And again on 2019-03-08, surprise surprise, it happened in the morning:
```
$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
11 2019-03-08 01:
3 2019-03-08 02:
1 2019-03-08 03:
2 2019-03-08 04:
1 2019-03-08 05:
1 2019-03-08 06:
1 2019-03-08 08:
425 2019-03-08 09:
432 2019-03-08 10:
717 2019-03-08 11:
59 2019-03-08 12:
```
- I'm not sure if it's Cocoon or if that's just a symptom of something else
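- Since I keep running this same per-hour tally, here is a small Python sketch (hypothetical script name, untested against the real logs) that handles both plain and xz-compressed cocoon logs:
```
import lzma
import re
import sys
from collections import Counter

# Usage: ./cocoon-hours.py cocoon.log.2019-03-15.xz cocoon.log.2019-03-18
for name in sys.argv[1:]:
    opener = lzma.open if name.endswith('.xz') else open
    hours = Counter()
    with opener(name, 'rt', errors='replace') as f:
        for line in f:
            if 'Can not load requested doc' in line:
                match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}', line)
                if match:
                    hours[match.group()] += 1
    for hour, count in sorted(hours.items()):
        print(f'{count:7d} {hour}:')
```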
<!-- vim: set sw=2 ts=2: -->

View File

@ -42,7 +42,7 @@ sys 0m1.979s
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-02/" />
<meta property="article:published_time" content="2019-02-01T21:37:30&#43;02:00"/>
<meta property="article:modified_time" content="2019-03-07T11:37:53&#43;02:00"/>
<meta property="article:modified_time" content="2019-03-18T15:25:49&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2019"/>
@ -91,7 +91,7 @@ sys 0m1.979s
"url": "https://alanorth.github.io/cgspace-notes/2019-02/",
"wordCount": "7700",
"datePublished": "2019-02-01T21:37:30&#43;02:00",
"dateModified": "2019-03-07T11:37:53&#43;02:00",
"dateModified": "2019-03-18T15:25:49&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -712,7 +712,7 @@ Please see the DSpace documentation for assistance.
<li>Last week Hector Tobon from CCAFS asked me about the Creative Commons 3.0 Intergovernmental Organizations (IGO) license because it is not in the list of SPDX licenses
<ul>
<li>Today I made <a href="http://13.57.134.254/app/license_requests/15/">a request</a> to the <a href="https://github.com/spdx/license-list-XML/blob/master/CONTRIBUTING.md">SPDX using their web form</a> to include this class of Creative Commons licenses](<a href="https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations">https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations</a>)</li>
<li>Today I made <a href="http://13.57.134.254/app/license_requests/15/">a request</a> to the <a href="https://github.com/spdx/license-list-XML/blob/master/CONTRIBUTING.md">SPDX using their web form</a> to include this <a href="https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations">class of Creative Commons licenses</a></li>
</ul></li>
<li>Testing the <code>mail.server.disabled</code> property that I noticed in <code>dspace.cfg</code> recently

View File

@ -25,7 +25,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-03/" />
<meta property="article:published_time" content="2019-03-01T12:16:30&#43;01:00"/>
<meta property="article:modified_time" content="2019-03-17T21:50:55&#43;02:00"/>
<meta property="article:modified_time" content="2019-03-17T22:24:02&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2019"/>
@ -55,9 +55,9 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
"@type": "BlogPosting",
"headline": "March, 2019",
"url": "https://alanorth.github.io/cgspace-notes/2019-03/",
"wordCount": "2178",
"wordCount": "2959",
"datePublished": "2019-03-01T12:16:30&#43;01:00",
"dateModified": "2019-03-17T21:50:55&#43;02:00",
"dateModified": "2019-03-17T22:24:02&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -371,7 +371,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
</code></pre>
<ul>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, <del>but spikes of over 1,000 today</del>, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
</ul>
<pre><code># grep -c 'SQL QueryTable Error' dspace.log.2019-0*
@ -526,6 +526,166 @@ $ psql dspacetest -c 'update bundle set primary_bitstream_id=NULL where primary_
UPDATE 1
</code></pre>
<h2 id="2019-03-18">2019-03-18</h2>
<ul>
<li>I noticed that the regular expression for validating lines from input files in my <code>agrovoc-lookup.py</code> script was skipping characters with accents, etc, so I changed it to use the <code>\w</code> character class for words instead of trying to match <code>[A-Z]</code> etc&hellip;
<ul>
<li>We have a Spanish and French subjects so this is very important</li>
<li>Also there were some subjects with apostrophes, dashes, and periods&hellip; these are probably invalid AGROVOC subject terms, but we should save them to the rejects file instead of skipping them nevertheless</li>
</ul></li>
<li>Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my <code>agrovoc-lookup.py</code> script:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
dspace=# \q
$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv &gt; 2019-03-18-top-1500-subject.csv
$ ./agrovoc-lookup.py -l en -i 2019-03-18-top-1500-subject.csv -om /tmp/en-subjects-matched.txt -or /tmp/en-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l es -i 2019-03-18-top-1500-subject.csv -om /tmp/es-subjects-matched.txt -or /tmp/es-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l fr -i 2019-03-18-top-1500-subject.csv -om /tmp/fr-subjects-matched.txt -or /tmp/fr-subjects-unmatched.txt
$ cat /tmp/*-subjects-matched.txt | sort -u &gt; /tmp/subjects-matched-sorted.txt
$ wc -l /tmp/subjects-matched-sorted.txt
1317 /tmp/subjects-matched-sorted.txt
$ sort -u 2019-03-18-top-1500-subject.csv &gt; /tmp/1500-subjects-sorted.txt
$ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt &gt; 2019-03-18-subjects-unmatched.txt
$ wc -l 2019-03-18-subjects-unmatched.txt
183 2019-03-18-subjects-unmatched.txt
</code></pre>
<ul>
<li>So the new total of matched terms with the updated regex is 1317 and unmatched is 183 (previous number of matched terms was 1187)</li>
<li>We are getting the blank page issue on CGSpace again today and I see a <del>large number</del> of the &ldquo;SQL QueryTable Error&rdquo; in the DSpace log again (last time was 2019-03-15):</li>
</ul>
<pre><code>$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
dspace.log.2019-03-15:929
dspace.log.2019-03-16:67
dspace.log.2019-03-17:72
dspace.log.2019-03-18:1038
</code></pre>
<ul>
<li>Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower:</li>
</ul>
<pre><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
8
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
9 dspace.log.2019-03-08
25 dspace.log.2019-03-14
12 dspace.log.2019-03-15
67 dspace.log.2019-03-16
72 dspace.log.2019-03-17
8 dspace.log.2019-03-18
</code></pre>
<ul>
<li>It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use <code>-I</code> to say binary files don&rsquo;t match</li>
<li>Anyways, the full error in DSpace&rsquo;s log is:</li>
</ul>
<pre><code>2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
</code></pre>
<ul>
<li>There is a low number of connections to PostgreSQL currently:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | wc -l
33
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
6 dspaceApi
7 dspaceCli
15 dspaceWeb
</code></pre>
<ul>
<li>I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:</li>
</ul>
<pre><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &quot;waiting&quot; does not exist at character 217
</code></pre>
<ul>
<li>This is unrelated and apparently due to <a href="https://github.com/munin-monitoring/munin/issues/746">Munin checking a column that was changed in PostgreSQL 9.6</a></li>
<li>I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it&rsquo;s a Cocoon thing?</li>
<li>Looking in the cocoon logs I see a large number of warnings about &ldquo;Can not load requested doc&rdquo; around 11AM and 12PM:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
2 2019-03-18 00:
6 2019-03-18 02:
3 2019-03-18 04:
1 2019-03-18 05:
1 2019-03-18 07:
2 2019-03-18 08:
4 2019-03-18 09:
5 2019-03-18 10:
863 2019-03-18 11:
203 2019-03-18 12:
14 2019-03-18 13:
1 2019-03-18 14:
</code></pre>
<ul>
<li>And a few days ago on 2019-03-15 when I happened last it was in the afternoon when it happened and the same pattern occurs then around 12PM:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
4 2019-03-15 01:
3 2019-03-15 02:
1 2019-03-15 03:
13 2019-03-15 04:
1 2019-03-15 05:
2 2019-03-15 06:
3 2019-03-15 07:
27 2019-03-15 09:
9 2019-03-15 10:
3 2019-03-15 11:
2 2019-03-15 12:
531 2019-03-15 13:
274 2019-03-15 14:
4 2019-03-15 15:
75 2019-03-15 16:
5 2019-03-15 17:
5 2019-03-15 18:
6 2019-03-15 19:
2 2019-03-15 20:
4 2019-03-15 21:
3 2019-03-15 22:
1 2019-03-15 23:
</code></pre>
<ul>
<li>And again on 2019-03-08, surprise surprise, it happened in the morning:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
11 2019-03-08 01:
3 2019-03-08 02:
1 2019-03-08 03:
2 2019-03-08 04:
1 2019-03-08 05:
1 2019-03-08 06:
1 2019-03-08 08:
425 2019-03-08 09:
432 2019-03-08 10:
717 2019-03-08 11:
59 2019-03-08 12:
</code></pre>
<ul>
<li>I&rsquo;m not sure if it&rsquo;s cocoon or that&rsquo;s just a symptom of something else</li>
</ul>
<!-- vim: set sw=2 ts=2: -->

View File

@ -4,12 +4,12 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-03/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-02/</loc>
<lastmod>2019-03-07T11:37:53+02:00</lastmod>
<lastmod>2019-03-18T15:25:49+02:00</lastmod>
</url>
<url>
@ -214,7 +214,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>
@ -225,7 +225,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>
@ -237,13 +237,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2019-03-17T21:50:55+02:00</lastmod>
<lastmod>2019-03-17T22:24:02+02:00</lastmod>
<priority>0</priority>
</url>