--- title: "February, 2018" date: 2018-02-01T16:28:54+02:00 author: "Alan Orth" tags: ["Notes"] --- ## 2018-02-01 - Peter gave feedback on the `dc.rights` proof of concept that I had sent him last week - We don't need to distinguish between internal and external works, so that makes it just a simple list - Yesterday I figured out how to monitor DSpace sessions using JMX - I copied the logic in the `jmx_tomcat_dbpools` provided by Ubuntu's `munin-plugins-java` package and used the stuff I discovered about JMX [in 2018-01]({{< relref "2018-01.md" >}}) ![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace_sessions-day.png) - Run all system updates and reboot DSpace Test - Wow, I packaged up the `jmx_dspace_sessions` stuff in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) and deployed it on CGSpace and it totally works: ``` # munin-run jmx_dspace_sessions v_.value 223 v_jspui.value 1 v_oai.value 0 ``` ## 2018-02-03 - Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566 - We will close that ticket for now and wait for the 5.8 stuff: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560 - I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January - After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace: ``` $ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu' $ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu' ``` - Then I started a full Discovery reindex: ``` $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b real 96m39.823s user 14m10.975s sys 2m29.088s ``` - Generate a new list of affiliations for Peter to sort through: ``` dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv; COPY 3723 ``` - Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in [December]({{< relref "2017-12.md" >}}): ``` # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018" 3126109 real 0m23.839s user 0m27.225s sys 0m1.905s ``` ## 2018-02-05 - Toying with correcting authors with trailing spaces via PostgreSQL: ``` dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$'; UPDATE 20 ``` - I tried the `TRIM(TRAILING from text_value)` function and it said it changed 20 items but the spaces didn't go away - This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird. 
- Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:

```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
```

## 2018-02-06

- UptimeRobot says CGSpace is down this morning around 9:15
- I see 308 PostgreSQL connections in `pg_stat_activity`
- The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:

```
# date
Tue Feb 6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 223.185.41.40
      2 66.249.64.14
      2 77.246.52.40
      4 157.55.39.82
      4 193.205.105.8
      5 207.46.13.63
      5 207.46.13.64
      6 154.68.16.34
      7 207.46.13.66
   1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     77 213.55.99.121
     86 66.249.64.14
    101 104.196.152.243
    103 207.46.13.64
    118 157.55.39.82
    133 207.46.13.66
    136 207.46.13.63
    156 68.180.228.157
    295 197.210.168.174
    752 144.76.64.79
```

- I did notice in `/var/log/tomcat7/catalina.out` that Atmire's update thing was running, though
- So I restarted Tomcat and now everything is fine
- Next time I see that many database connections I need to save the output so I can analyze it later
- I'm going to re-schedule the `taskUpdateSolrStatsMetadata` task as [Bram detailed in ticket 566](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566) to see if it makes CGSpace stop crashing every morning
- If I move the task from 3AM to 3PM, ideally CGSpace will stop crashing in the morning, or start crashing ~12 hours later
- Atmire has said that there will eventually be a fix for the high load caused by their script, but it will come with the DSpace 5.8 compatibility work they are already doing
- I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server
- Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to 'dspaceCli' so that I can distinguish it in `pg_stat_activity`
- I implemented some changes to the pooling in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) so that each DSpace web application can use its own pool (web, api, and solr)
- Each pool uses its own name, which should hopefully help me figure out which one is using too many connections the next time CGSpace goes down
- Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine
- I'm not actually sure if the Solr web application uses the database though, so I'll have to check later and remove its pool if necessary
- I deployed the changes on DSpace Test only for now, so I will monitor it and make them on CGSpace later this week

## 2018-02-07

- Abenet wrote to ask about the ORCiD lookup not working for one CIAT user on CGSpace
- I tried it on DSpace Test and indeed the lookup just doesn't work!
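- A quick way to see what an API endpoint is doing without a browser is to check its redirect headers with curl; for the ORCID public API base URL that would be something like this (the output obviously depends on whatever ORCID is serving at the moment):

```
$ curl -sI http://pub.orcid.org/ | grep -i '^location'
```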
- The ORCiD code in DSpace appears to be using `http://pub.orcid.org/`, but when I go there in the browser it redirects me to `https://pub.orcid.org/v2.0/`
- According to [the announcement](https://groups.google.com/forum/#!topic/orcid-api-users/qfg-HwAB1bk) the v1 API was moved from `http://pub.orcid.org/` to `https://pub.orcid.org/v1.2` until March 1st, when it will be discontinued for good
- But the old URL is hard-coded in DSpace and it doesn't work anyways, because it currently redirects you to `https://pub.orcid.org/v2.0/v1.2`
- So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
- CGSpace crashed again, this time around `Wed Feb 7 11:20:28 UTC 2018`
- I took a few snapshots of the PostgreSQL activity at the time; the connections were very high at first, but reduced on their own as the minutes went on:

```
$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
/tmp/pg_stat_activity4.txt:5
/tmp/pg_stat_activity5.txt:6
```

- Interestingly, all of those 751 connections were idle!

```
$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
751
```

- Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for the web and API apps
- Looking at the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):

![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace-sessions-day.png)

- Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:

```
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828
```

- CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the separate pool scheme)
- What's interesting is that the DSpace log says the connections are all busy:

```
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
```

- ... but in PostgreSQL I see them `idle` or `idle in transaction`:

```
$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
187
```

- What the fuck, does DSpace think all connections are busy?
- I suspect these are issues with abandoned connections or maybe a leak, so I'm going to try adding the `removeAbandoned='true'` parameter, which is apparently off by default
- I will try `testOnReturn='true'` too, just to add more validation, because I'm fucking grasping at straws
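- For reference, on a Tomcat JDBC pool those settings are attributes on the `<Resource>` definition; a rough sketch only (the pool definition below is illustrative rather than our actual Ansible-templated config, and the `validationQuery`/`logAbandoned` bits are extras that would make `testOnReturn` and abandoned-connection handling actually do something useful):

```
<!-- Illustrative sketch: validation and abandoned-connection settings on the dspaceWeb pool -->
<Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          driverClassName="org.postgresql.Driver"
          url="jdbc:postgresql://localhost:5432/dspace"
          username="dspace" password="fuuu"
          maxActive="250"
          validationQuery="SELECT 1"
          testOnReturn="true"
          removeAbandoned="true"
          removeAbandonedTimeout="60"
          logAbandoned="true"/>
```

- If `logAbandoned` fires, the pool should log a stack trace for the code that borrowed each abandoned connection, which would at least point at whatever is holding connections open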