cgspace-notes/2018-02.md at 389ab31e63766600c73980ffa0c9b9e5f5acddbb

alanorth/cgspace-notes

Fork 0

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-16 20:07:03 +01:00

Alan Orth 389ab31e63

Update notes for 2018-02-07

2018-02-07 18:33:26 +02:00

9.5 KiB

Raw Blame History

title

date

author

2018-02-01

Peter gave feedback on the dc.rights proof of concept that I had sent him last week
We don't need to distinguish between internal and external works, so that makes it just a simple list
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX [in 2018-01]({{< relref "2018-01.md" >}})

Run all system updates and reboot DSpace Test
Wow, I packaged up the jmx_dspace_sessions stuff in the Ansible infrastructure scripts and deployed it on CGSpace and it totally works:

# munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0

2018-02-03

Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
We will close that ticket for now and wait for the 5.8 stuff: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January
After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:

$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'

Then I started a full Discovery reindex:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    96m39.823s
user    14m10.975s
sys     2m29.088s

Generate a new list of affiliations for Peter to sort through:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723

Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in [December]({{< relref "2017-12.md" >}}):

# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
3126109

real    0m23.839s
user    0m27.225s
sys     0m1.905s

2018-02-05

Toying with correcting authors with trailing spaces via PostgreSQL:

dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20

I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn't go away
This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630

2018-02-06

UptimeRobot says CGSpace is down this morning around 9:15
I see 308 PostgreSQL connections in pg_stat_activity
The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:

# date
Tue Feb  6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 223.185.41.40
      2 66.249.64.14
      2 77.246.52.40
      4 157.55.39.82
      4 193.205.105.8
      5 207.46.13.63
      5 207.46.13.64
      6 154.68.16.34
      7 207.46.13.66
   1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     77 213.55.99.121
     86 66.249.64.14
    101 104.196.152.243
    103 207.46.13.64
    118 157.55.39.82
    133 207.46.13.66
    136 207.46.13.63
    156 68.180.228.157
    295 197.210.168.174
    752 144.76.64.79

I did notice in /var/log/tomcat7/catalina.out that Atmire's update thing was running though
So I restarted Tomcat and now everything is fine
Next time I see that many database connections I need to save the output so I can analyze it later
I'm going to re-schedule the taskUpdateSolrStatsMetadata task as Bram detailed in ticket 566 to see if it makes CGSpace stop crashing every morning
If I move the task from 3AM to 3PM, deally CGSpace will stop crashing in the morning, or start crashing ~12 hours later
Eventually Atmire has said that there will be a fix for this high load caused by their script, but it will come with the 5.8 compatability they are already working on
I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server
Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called 'dspaceCli' so that I can distinguish it in pg_stat_activity
I implemented some changes to the pooling in the Ansible infrastructure scripts so that each DSpace web application can use its own pool (web, api, and solr)
Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down
Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine
I'm not actually sure if the Solr web application uses the database though, so I'll have to check later and remove it if necessary
I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week

2018-02-07

Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace
I tried on DSpace Test and indeed the lookup just doesn't work!
The ORCiD code in DSpace appears to be using http://pub.orcid.org/, but when I go there in the browser it redirects me to https://pub.orcid.org/v2.0/
According to the announcement the v1 API was moved from http://pub.orcid.org/ to https://pub.orcid.org/v1.2 until March 1st when it will be discontinued for good
But the old URL is hard coded in DSpace and it doesn't work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:

$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
/tmp/pg_stat_activity4.txt:5
/tmp/pg_stat_activity5.txt:6

Interestingly, all of those 751 connections were idle!

$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
751

Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps
Looking the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):

Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:

$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828

CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)
What's interesting is that the DSpace log says the connections are all busy:

org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].

... but in PostgreSQL I see them idle or idle in transaction:

$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
187

What the fuck, does DSpace think all connections are busy?
I suspect these are issues with abandoned connections or maybe a leak, so I'm going to try adding the removeAbandoned='true' parameter which is apparently off by default
I will try testOnReturn='true' too, just to add more validation, because I'm fucking grasping at straws

9.5 KiB Raw Blame History

2018-02-01

2018-02-03

2018-02-05

2018-02-06

2018-02-07

9.5 KiB

Raw Blame History