diff --git a/content/posts/2019-03.md b/content/posts/2019-03.md index 4a0199484..a49f780f2 100644 --- a/content/posts/2019-03.md +++ b/content/posts/2019-03.md @@ -199,7 +199,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:300) ``` -- Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently +- Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, ~~but spikes of over 1,000 today~~, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently ``` # grep -c 'SQL QueryTable Error' dspace.log.2019-0* @@ -339,4 +339,150 @@ $ psql dspacetest -c 'update bundle set primary_bitstream_id=NULL where primary_ UPDATE 1 ``` +## 2019-03-18 + +- I noticed that the regular expression for validating lines from input files in my `agrovoc-lookup.py` script was skipping characters with accents, etc, so I changed it to use the `\w` character class for words instead of trying to match `[A-Z]` etc... + - We have a Spanish and French subjects so this is very important + - Also there were some subjects with apostrophes, dashes, and periods... these are probably invalid AGROVOC subject terms, but we should save them to the rejects file instead of skipping them nevertheless +- Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my `agrovoc-lookup.py` script: + +``` +dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER; +COPY 1500 +dspace=# \q +$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv > 2019-03-18-top-1500-subject.csv +$ ./agrovoc-lookup.py -l en -i 2019-03-18-top-1500-subject.csv -om /tmp/en-subjects-matched.txt -or /tmp/en-subjects-unmatched.txt +$ ./agrovoc-lookup.py -l es -i 2019-03-18-top-1500-subject.csv -om /tmp/es-subjects-matched.txt -or /tmp/es-subjects-unmatched.txt +$ ./agrovoc-lookup.py -l fr -i 2019-03-18-top-1500-subject.csv -om /tmp/fr-subjects-matched.txt -or /tmp/fr-subjects-unmatched.txt +$ cat /tmp/*-subjects-matched.txt | sort -u > /tmp/subjects-matched-sorted.txt +$ wc -l /tmp/subjects-matched-sorted.txt +1317 /tmp/subjects-matched-sorted.txt +$ sort -u 2019-03-18-top-1500-subject.csv > /tmp/1500-subjects-sorted.txt +$ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt > 2019-03-18-subjects-unmatch +ed.txt +$ wc -l 2019-03-18-subjects-unmatched.txt +183 2019-03-18-subjects-unmatched.txt +``` + +- So the new total of matched terms with the updated regex is 1317 and unmatched is 183 (previous number of matched terms was 1187) +- We are getting the blank page issue on CGSpace again today and I see a ~~large number~~ of the "SQL QueryTable Error" in the DSpace log again (last time was 2019-03-15): + +``` +$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678] +dspace.log.2019-03-15:929 +dspace.log.2019-03-16:67 +dspace.log.2019-03-17:72 +dspace.log.2019-03-18:1038 +``` + +- Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower: + +``` +$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l +8 +$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c + 9 dspace.log.2019-03-08 + 25 dspace.log.2019-03-14 + 12 dspace.log.2019-03-15 + 67 dspace.log.2019-03-16 + 72 dspace.log.2019-03-17 + 8 dspace.log.2019-03-18 +``` + +- It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use `-I` to say binary files don't match +- Anyways, the full error in DSpace's log is: + +``` +2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - +java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed. + at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398) + at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279) + at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313) + at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220) +``` + +- There is a low number of connections to PostgreSQL currently: + +``` +$ psql -c 'select * from pg_stat_activity' | wc -l +33 +$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c + 6 dspaceApi + 7 dspaceCli + 15 dspaceWeb +``` + +- I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January: + +``` +2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column "waiting" does not exist at character 217 +``` + +- This is unrelated and apparently due to [Munin checking a column that was changed in PostgreSQL 9.6](https://github.com/munin-monitoring/munin/issues/746) +- I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it's a Cocoon thing? +- Looking in the cocoon logs I see a large number of warnings about "Can not load requested doc" around 11AM and 12PM: + +``` +$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c + 2 2019-03-18 00: + 6 2019-03-18 02: + 3 2019-03-18 04: + 1 2019-03-18 05: + 1 2019-03-18 07: + 2 2019-03-18 08: + 4 2019-03-18 09: + 5 2019-03-18 10: + 863 2019-03-18 11: + 203 2019-03-18 12: + 14 2019-03-18 13: + 1 2019-03-18 14: +``` + +- And a few days ago on 2019-03-15 when I happened last it was in the afternoon when it happened and the same pattern occurs then around 1–2PM: + +``` +$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c + 4 2019-03-15 01: + 3 2019-03-15 02: + 1 2019-03-15 03: + 13 2019-03-15 04: + 1 2019-03-15 05: + 2 2019-03-15 06: + 3 2019-03-15 07: + 27 2019-03-15 09: + 9 2019-03-15 10: + 3 2019-03-15 11: + 2 2019-03-15 12: + 531 2019-03-15 13: + 274 2019-03-15 14: + 4 2019-03-15 15: + 75 2019-03-15 16: + 5 2019-03-15 17: + 5 2019-03-15 18: + 6 2019-03-15 19: + 2 2019-03-15 20: + 4 2019-03-15 21: + 3 2019-03-15 22: + 1 2019-03-15 23: +``` + +- And again on 2019-03-08, surprise surprise, it happened in the morning: + +``` +$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c + 11 2019-03-08 01: + 3 2019-03-08 02: + 1 2019-03-08 03: + 2 2019-03-08 04: + 1 2019-03-08 05: + 1 2019-03-08 06: + 1 2019-03-08 08: + 425 2019-03-08 09: + 432 2019-03-08 10: + 717 2019-03-08 11: + 59 2019-03-08 12: +``` + +- I'm not sure if it's cocoon or that's just a symptom of something else + diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html index a9864120e..b28b09864 100644 --- a/docs/2019-02/index.html +++ b/docs/2019-02/index.html @@ -42,7 +42,7 @@ sys 0m1.979s - + @@ -91,7 +91,7 @@ sys 0m1.979s "url": "https://alanorth.github.io/cgspace-notes/2019-02/", "wordCount": "7700", "datePublished": "2019-02-01T21:37:30+02:00", - "dateModified": "2019-03-07T11:37:53+02:00", + "dateModified": "2019-03-18T15:25:49+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -712,7 +712,7 @@ Please see the DSpace documentation for assistance.
  • Last week Hector Tobon from CCAFS asked me about the Creative Commons 3.0 Intergovernmental Organizations (IGO) license because it is not in the list of SPDX licenses
  • Testing the mail.server.disabled property that I noticed in dspace.cfg recently diff --git a/docs/2019-03/index.html b/docs/2019-03/index.html index 60e9cb376..e002695f9 100644 --- a/docs/2019-03/index.html +++ b/docs/2019-03/index.html @@ -25,7 +25,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca - + @@ -55,9 +55,9 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca "@type": "BlogPosting", "headline": "March, 2019", "url": "https://alanorth.github.io/cgspace-notes/2019-03/", - "wordCount": "2178", + "wordCount": "2959", "datePublished": "2019-03-01T12:16:30+01:00", - "dateModified": "2019-03-17T21:50:55+02:00", + "dateModified": "2019-03-17T22:24:02+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -371,7 +371,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
    # grep -c 'SQL QueryTable Error' dspace.log.2019-0*
    @@ -526,6 +526,166 @@ $ psql dspacetest -c 'update bundle set primary_bitstream_id=NULL where primary_
     UPDATE 1
     
    +

    2019-03-18

    + + + +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
    +COPY 1500
    +dspace=# \q
    +$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv > 2019-03-18-top-1500-subject.csv
    +$ ./agrovoc-lookup.py -l en -i 2019-03-18-top-1500-subject.csv -om /tmp/en-subjects-matched.txt -or /tmp/en-subjects-unmatched.txt
    +$ ./agrovoc-lookup.py -l es -i 2019-03-18-top-1500-subject.csv -om /tmp/es-subjects-matched.txt -or /tmp/es-subjects-unmatched.txt
    +$ ./agrovoc-lookup.py -l fr -i 2019-03-18-top-1500-subject.csv -om /tmp/fr-subjects-matched.txt -or /tmp/fr-subjects-unmatched.txt
    +$ cat /tmp/*-subjects-matched.txt | sort -u > /tmp/subjects-matched-sorted.txt
    +$ wc -l /tmp/subjects-matched-sorted.txt                                                              
    +1317 /tmp/subjects-matched-sorted.txt
    +$ sort -u 2019-03-18-top-1500-subject.csv > /tmp/1500-subjects-sorted.txt
    +$ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt > 2019-03-18-subjects-unmatch
    +ed.txt
    +$ wc -l 2019-03-18-subjects-unmatched.txt
    +183 2019-03-18-subjects-unmatched.txt
    +
    + + + +
    $ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
    +dspace.log.2019-03-15:929
    +dspace.log.2019-03-16:67
    +dspace.log.2019-03-17:72
    +dspace.log.2019-03-18:1038
    +
    + + + +
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
    +8
    +$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
    +      9 dspace.log.2019-03-08
    +     25 dspace.log.2019-03-14
    +     12 dspace.log.2019-03-15
    +     67 dspace.log.2019-03-16
    +     72 dspace.log.2019-03-17
    +      8 dspace.log.2019-03-18
    +
    + + + +
    2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
    +java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
    +        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
    +        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
    +        at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
    +        at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
    +
    + + + +
    $ psql -c 'select * from pg_stat_activity' | wc -l
    +33
    +$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +      6 dspaceApi
    +      7 dspaceCli
    +     15 dspaceWeb
    +
    + + + +
    2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
    +
    + + + +
    $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
    +      2 2019-03-18 00:
    +      6 2019-03-18 02:
    +      3 2019-03-18 04:
    +      1 2019-03-18 05:
    +      1 2019-03-18 07:
    +      2 2019-03-18 08:
    +      4 2019-03-18 09:
    +      5 2019-03-18 10:
    +    863 2019-03-18 11:
    +    203 2019-03-18 12:
    +     14 2019-03-18 13:
    +      1 2019-03-18 14:
    +
    + + + +
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
    +      4 2019-03-15 01:
    +      3 2019-03-15 02:
    +      1 2019-03-15 03:
    +     13 2019-03-15 04:
    +      1 2019-03-15 05:
    +      2 2019-03-15 06:
    +      3 2019-03-15 07:
    +     27 2019-03-15 09:
    +      9 2019-03-15 10:
    +      3 2019-03-15 11:
    +      2 2019-03-15 12:
    +    531 2019-03-15 13:
    +    274 2019-03-15 14:
    +      4 2019-03-15 15:
    +     75 2019-03-15 16:
    +      5 2019-03-15 17:
    +      5 2019-03-15 18:
    +      6 2019-03-15 19:
    +      2 2019-03-15 20:
    +      4 2019-03-15 21:
    +      3 2019-03-15 22:
    +      1 2019-03-15 23:
    +
    + + + +
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
    +     11 2019-03-08 01:
    +      3 2019-03-08 02:
    +      1 2019-03-08 03:
    +      2 2019-03-08 04:
    +      1 2019-03-08 05:
    +      1 2019-03-08 06:
    +      1 2019-03-08 08:
    +    425 2019-03-08 09:
    +    432 2019-03-08 10:
    +    717 2019-03-08 11:
    +     59 2019-03-08 12:
    +
    + + + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index c485ed190..8d20443c6 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,12 +4,12 @@ https://alanorth.github.io/cgspace-notes/2019-03/ - 2019-03-17T21:50:55+02:00 + 2019-03-17T22:24:02+02:00 https://alanorth.github.io/cgspace-notes/2019-02/ - 2019-03-07T11:37:53+02:00 + 2019-03-18T15:25:49+02:00 @@ -214,7 +214,7 @@ https://alanorth.github.io/cgspace-notes/ - 2019-03-17T21:50:55+02:00 + 2019-03-17T22:24:02+02:00 0 @@ -225,7 +225,7 @@ https://alanorth.github.io/cgspace-notes/tags/notes/ - 2019-03-17T21:50:55+02:00 + 2019-03-17T22:24:02+02:00 0 @@ -237,13 +237,13 @@ https://alanorth.github.io/cgspace-notes/posts/ - 2019-03-17T21:50:55+02:00 + 2019-03-17T22:24:02+02:00 0 https://alanorth.github.io/cgspace-notes/tags/ - 2019-03-17T21:50:55+02:00 + 2019-03-17T22:24:02+02:00 0