cgspace-notes/content/post/2018-02.md

---
title: "February, 2018"
date: 2018-02-01T16:28:54+02:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2018-02-01

- Peter gave feedback on the `dc.rights` proof of concept that I had sent him last week
- We don't need to distinguish between internal and external works, so that makes it just a simple list
- Yesterday I figured out how to monitor DSpace sessions using JMX
- I copied the logic in the `jmx_tomcat_dbpools` provided by Ubuntu's `munin-plugins-java` package and used the stuff I discovered about JMX [in 2018-01]({{< relref "2018-01.md" >}})

<!--more-->

![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace_sessions-day.png)

- Run all system updates and reboot DSpace Test
- Wow, I packaged up the `jmx_dspace_sessions` stuff in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) and deployed it on CGSpace and it totally works:

```
# munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0
```

## 2018-02-03

- Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
- We will close that ticket for now and wait for the 5.8 stuff: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
- I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January
- After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:

```
$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
```

- Then I started a full Discovery reindex:

```
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    96m39.823s
user    14m10.975s
sys     2m29.088s
```

- Generate a new list of affiliations for Peter to sort through:

```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
```

- Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in [December]({{< relref "2017-12.md" >}}):

```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
3126109

real    0m23.839s
user    0m27.225s
sys     0m1.905s
```

## 2018-02-05

- Toying with correcting authors with trailing spaces via PostgreSQL:

```
dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20
```

- I tried the `TRIM(TRAILING from text_value)` function and it said it changed 20 items but the spaces didn't go away
- This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
- Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:

```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
```

## 2018-02-06

- UptimeRobot says CGSpace is down this morning around 9:15
- I see 308 PostgreSQL connections in `pg_stat_activity`
- The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:

```
# date
Tue Feb  6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 223.185.41.40
      2 66.249.64.14
      2 77.246.52.40
      4 157.55.39.82
      4 193.205.105.8
      5 207.46.13.63
      5 207.46.13.64
      6 154.68.16.34
      7 207.46.13.66
   1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     77 213.55.99.121
     86 66.249.64.14
    101 104.196.152.243
    103 207.46.13.64
    118 157.55.39.82
    133 207.46.13.66
    136 207.46.13.63
    156 68.180.228.157
    295 197.210.168.174
    752 144.76.64.79
```

- I did notice in `/var/log/tomcat7/catalina.out` that Atmire's update thing was running though
- So I restarted Tomcat and now everything is fine
- Next time I see that many database connections I need to save the output so I can analyze it later
- I'm going to re-schedule the taskUpdateSolrStatsMetadata task as [Bram detailed in ticket 566](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566) to see if it makes CGSpace stop crashing every morning
- If I move the task from 3AM to 3PM, deally CGSpace will stop crashing in the morning, or start crashing ~12 hours later
- Eventually Atmire has said that there will be a fix for this high load caused by their script, but it will come with the 5.8 compatability they are already working on
- I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server
- Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called 'dspaceCli' so that I can distinguish it in `pg_stat_activity`
- I implemented some changes to the pooling in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) so that each DSpace web application can use its own pool (web, api, and solr)
- Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down
- Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine
- I'm not actually sure if the Solr web application uses the database though, so I'll have to check later and remove it if necessary
- I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week

## 2018-02-07

- Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace
- I tried on DSpace Test and indeed the lookup just doesn't work!
- The ORCiD code in DSpace appears to be using `http://pub.orcid.org/`, but when I go there in the browser it redirects me to `https://pub.orcid.org/v2.0/`
- According to [the announcement](https://groups.google.com/forum/#!topic/orcid-api-users/qfg-HwAB1bk) the v1 API was moved from `http://pub.orcid.org/` to `https://pub.orcid.org/v1.2` until March 1st when it will be discontinued for good
- But the old URL is hard coded in DSpace and it doesn't work anyways, because it currently redirects you to `https://pub.orcid.org/v2.0/v1.2`
- So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
- CGSpace crashed again, this time around `Wed Feb  7 11:20:28 UTC 2018`
- I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:

```
$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
/tmp/pg_stat_activity4.txt:5
/tmp/pg_stat_activity5.txt:6
```

- Interestingly, all of those 751 connections were idle!

```
$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
751
```

- Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps
- Looking the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):

![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace-sessions-day.png)

- Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:

```
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828
```

- CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)
- What's interesting is that the DSpace log says the connections are all busy:

```
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
```

- ... but in PostgreSQL I see them `idle` or `idle in transaction`:

```
$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
187
```

- What the fuck, does DSpace think all connections are busy?
- I suspect these are issues with abandoned connections or maybe a leak, so I'm going to try adding the `removeAbandoned='true'` parameter which is apparently off by default
- I will try `testOnReturn='true'` too, just to add more validation, because I'm fucking grasping at straws
- Also, WTF, there was a heap space error randomly in catalina.out:

```
Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
```

- I'm trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
- Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:

```
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
     34 ip_addr=46.229.168.67
     34 ip_addr=46.229.168.73
     37 ip_addr=46.229.168.76
     40 ip_addr=34.232.65.41
     41 ip_addr=46.229.168.71
     44 ip_addr=197.210.168.174
     55 ip_addr=181.137.2.214
     55 ip_addr=213.55.99.121
     58 ip_addr=46.229.168.65
     64 ip_addr=66.249.66.91
     67 ip_addr=66.249.66.90
     71 ip_addr=207.46.13.54
     78 ip_addr=130.82.1.40
    104 ip_addr=40.77.167.36
    151 ip_addr=68.180.228.157
    174 ip_addr=207.46.13.135
    194 ip_addr=54.83.138.123
    198 ip_addr=40.77.167.62
    210 ip_addr=207.46.13.71
    214 ip_addr=104.196.152.243
```

- These IPs made thousands of sessions today:

```
$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
859
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
610
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
8
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
826
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
727
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
181
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
24
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
166
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
992

```

- Let's investigate who these IPs belong to:
  - 104.196.152.243 is CIAT, which is already marked as a bot via nginx!
  - 207.46.13.71 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
  - 40.77.167.62 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
  - 207.46.13.135 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
  - 68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
  - 40.77.167.36 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
  - 207.46.13.54 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
  - 46.229.168.x is Semrush, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!
- Nice, so these are all known bots that are already crammed into one session by Tomcat's Crawler Session Manager Valve.
- What in the actual fuck, why is our load doing this? It's gotta be something fucked up with the database pool being "busy" but everything is fucking idle
- One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:

```
BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
```

- This one makes two thousand requests per day or so recently:

```
# grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029
```

- And they have 30 IPs, so fuck that shit I'm going to add them to the Tomcat Crawler Session Manager Valve nowwww
- Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace
- Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker
- This is how the connections looked when it crashed this afternoon:

```
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
    290 dspaceWeb
```

- This is how it is right now:

```
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      5 dspaceWeb
```

- So is this just some fucked up XMLUI database leaking?
- I notice there is an issue (that I've probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
- I seriously doubt this leaking shit is fixed for sure, but I'm gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I'm fed up with this shit
- I cherry-picked all the commits for DS-3551 but it won't build on our current DSpace 5.5!
- I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle

## 2018-02-11

- Magdalena from CCAFS emailed to ask why one of their items has such a weird thumbnail: [10568/90735](https://cgspace.cgiar.org/handle/10568/90735)

![Weird thumbnail](/cgspace-notes/2018/02/CCAFS_WP_223.pdf.jpg)

- I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:

```
$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
```

![Manual thumbnail](/cgspace-notes/2018/02/CCAFS_WP_223.jpg)
Add notes for 2018-02-01 2018-02-01 15:30:28 +01:00			`---`
			`title: "February, 2018"`
			`date: 2018-02-01T16:28:54+02:00`
			`author: "Alan Orth"`
			`tags: ["Notes"]`
			`---`

			`## 2018-02-01`

			- Peter gave feedback on the `dc.rights` proof of concept that I had sent him last week
			`- We don't need to distinguish between internal and external works, so that makes it just a simple list`
Update notes 2018-02-01 18:04:07 +01:00			`- Yesterday I figured out how to monitor DSpace sessions using JMX`
			- I copied the logic in the `jmx_tomcat_dbpools` provided by Ubuntu's `munin-plugins-java` package and used the stuff I discovered about JMX [in 2018-01]({{< relref "2018-01.md" >}})
Add notes for 2018-02-01 2018-02-01 15:30:28 +01:00
			`<!--more-->`
Update notes 2018-02-01 18:04:07 +01:00
			`![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace_sessions-day.png)`

			`- Run all system updates and reboot DSpace Test`
			- Wow, I packaged up the `jmx_dspace_sessions` stuff in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) and deployed it on CGSpace and it totally works:

			```
			`# munin-run jmx_dspace_sessions`
			`v_.value 223`
			`v_jspui.value 1`
			`v_oai.value 0`
			```
Updates 2018-02-03 08:44:31 +01:00
			`## 2018-02-03`

			`- Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566`
			`- We will close that ticket for now and wait for the 5.8 stuff: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560`
Update notes 2018-02-03 23:00:51 +01:00			`- I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January`
			`- After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:`

			```
			`$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'`
			`$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'`
			```

			`- Then I started a full Discovery reindex:`

			```
			`$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b`
Update notes 2018-02-04 10:26:07 +01:00
			`real 96m39.823s`
			`user 14m10.975s`
			`sys 2m29.088s`
Update notes 2018-02-03 23:00:51 +01:00			```

			`- Generate a new list of affiliations for Peter to sort through:`

			```
			`dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;`
			`COPY 3723`
			```

			`- Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in [December]({{< relref "2017-12.md" >}}):`

			```
			`# time zcat --force /var/log/nginx/* \| grep -cE "[0-9]{1,2}/Jan/2018"`
			`3126109`

			`real 0m23.839s`
			`user 0m27.225s`
			`sys 0m1.905s`
			```
Update notes 2018-02-05 18:08:05 +01:00
			`## 2018-02-05`

			`- Toying with correcting authors with trailing spaces via PostgreSQL:`

			```
			`dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';`
			`UPDATE 20`
			```

			- I tried the `TRIM(TRAILING from text_value)` function and it said it changed 20 items but the spaces didn't go away
			`- This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.`
			`- Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:`

			```
			`dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;`
			`COPY 55630`
			```
Add notes for 2018-02-06 2018-02-06 13:03:07 +01:00
			`## 2018-02-06`

			`- UptimeRobot says CGSpace is down this morning around 9:15`
			- I see 308 PostgreSQL connections in `pg_stat_activity`
			`- The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:`

			```
			`# date`
			`Tue Feb 6 09:30:32 UTC 2018`
			`# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 \| grep -E "6/Feb/2018:(08\|09)" \| awk '{print $1}' \| sort \| uniq -c \| sort -n \| tail -n 10`
			`2 223.185.41.40`
			`2 66.249.64.14`
			`2 77.246.52.40`
			`4 157.55.39.82`
			`4 193.205.105.8`
			`5 207.46.13.63`
			`5 207.46.13.64`
			`6 154.68.16.34`
			`7 207.46.13.66`
			`1548 50.116.102.77`
			`# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 \| grep -E "6/Feb/2018:(08\|09)" \| awk '{print $1}' \| sort \| uniq -c \| sort -n \| tail -n 10`
			`77 213.55.99.121`
			`86 66.249.64.14`
			`101 104.196.152.243`
			`103 207.46.13.64`
			`118 157.55.39.82`
			`133 207.46.13.66`
			`136 207.46.13.63`
			`156 68.180.228.157`
			`295 197.210.168.174`
			`752 144.76.64.79`
			```

			- I did notice in `/var/log/tomcat7/catalina.out` that Atmire's update thing was running though
			`- So I restarted Tomcat and now everything is fine`
			`- Next time I see that many database connections I need to save the output so I can analyze it later`
			`- I'm going to re-schedule the taskUpdateSolrStatsMetadata task as [Bram detailed in ticket 566](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566) to see if it makes CGSpace stop crashing every morning`
			`- If I move the task from 3AM to 3PM, deally CGSpace will stop crashing in the morning, or start crashing ~12 hours later`
			`- Eventually Atmire has said that there will be a fix for this high load caused by their script, but it will come with the 5.8 compatability they are already working on`
			`- I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server`
			- Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called 'dspaceCli' so that I can distinguish it in `pg_stat_activity`
			`- I implemented some changes to the pooling in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) so that each DSpace web application can use its own pool (web, api, and solr)`
			`- Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down`
			`- Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine`
			`- I'm not actually sure if the Solr web application uses the database though, so I'll have to check later and remove it if necessary`
Update notes for 2018-02-06 2018-02-06 13:03:46 +01:00			`- I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week`
Add notes for 2018-02-07 2018-02-07 11:34:36 +01:00
			`## 2018-02-07`

			`- Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace`
			`- I tried on DSpace Test and indeed the lookup just doesn't work!`
			- The ORCiD code in DSpace appears to be using `http://pub.orcid.org/`, but when I go there in the browser it redirects me to `https://pub.orcid.org/v2.0/`
			- According to [the announcement](https://groups.google.com/forum/#!topic/orcid-api-users/qfg-HwAB1bk) the v1 API was moved from `http://pub.orcid.org/` to `https://pub.orcid.org/v1.2` until March 1st when it will be discontinued for good
			- But the old URL is hard coded in DSpace and it doesn't work anyways, because it currently redirects you to `https://pub.orcid.org/v2.0/v1.2`
			`- So I guess we have to disable that shit once and for all and switch to a controlled vocabulary`
Update notes for 2018-02-07 2018-02-07 17:33:26 +01:00			- CGSpace crashed again, this time around `Wed Feb 7 11:20:28 UTC 2018`
			`- I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:`

			```
			`$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt`
			`$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*`
			`/tmp/pg_stat_activity1.txt:300`
			`/tmp/pg_stat_activity2.txt:272`
			`/tmp/pg_stat_activity3.txt:168`
			`/tmp/pg_stat_activity4.txt:5`
			`/tmp/pg_stat_activity5.txt:6`
			```

			`- Interestingly, all of those 751 connections were idle!`

			```
			`$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* \| grep -c idle`
			`751`
			```

			`- Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps`
			`- Looking the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):`

			`![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace-sessions-day.png)`

			`- Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:`

			```
			`$ grep -E '^2018-02-07 (10\|11)' dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`1828`
			```

			`- CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)`
			`- What's interesting is that the DSpace log says the connections are all busy:`

			```
			`org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].`
			```

			- ... but in PostgreSQL I see them `idle` or `idle in transaction`:

			```
			`$ psql -c 'select * from pg_stat_activity' \| grep -c dspaceWeb`
			`250`
			`$ psql -c 'select * from pg_stat_activity' \| grep dspaceWeb \| grep -c idle`
			`250`
			`$ psql -c 'select * from pg_stat_activity' \| grep dspaceWeb \| grep -c "idle in transaction"`
			`187`
			```

			`- What the fuck, does DSpace think all connections are busy?`
			- I suspect these are issues with abandoned connections or maybe a leak, so I'm going to try adding the `removeAbandoned='true'` parameter which is apparently off by default
			- I will try `testOnReturn='true'` too, just to add more validation, because I'm fucking grasping at straws
Update notes for 2018-02-07 2018-02-07 17:34:39 +01:00			`- Also, WTF, there was a heap space error randomly in catalina.out:`

			```
			`Wed Feb 07 15:01:54 UTC 2018 \| Query:containerItem:91917 AND type:2`
			`Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space`
			```
Update notes for 2018-02-07 2018-02-07 23:55:28 +01:00
			`- I'm trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!`
			`- Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:`

			```
			`$ grep -E '^2018-02-07 (10\|11)' dspace.log.2018-02-07 \| grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' \| sort -n \| uniq -c \| sort -n \| tail -n 20`
			`34 ip_addr=46.229.168.67`
			`34 ip_addr=46.229.168.73`
			`37 ip_addr=46.229.168.76`
			`40 ip_addr=34.232.65.41`
			`41 ip_addr=46.229.168.71`
			`44 ip_addr=197.210.168.174`
			`55 ip_addr=181.137.2.214`
			`55 ip_addr=213.55.99.121`
			`58 ip_addr=46.229.168.65`
			`64 ip_addr=66.249.66.91`
			`67 ip_addr=66.249.66.90`
			`71 ip_addr=207.46.13.54`
			`78 ip_addr=130.82.1.40`
			`104 ip_addr=40.77.167.36`
			`151 ip_addr=68.180.228.157`
			`174 ip_addr=207.46.13.135`
			`194 ip_addr=54.83.138.123`
			`198 ip_addr=40.77.167.62`
			`210 ip_addr=207.46.13.71`
			`214 ip_addr=104.196.152.243`
			```

			`- These IPs made thousands of sessions today:`

			```
			`$ grep 104.196.152.243 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`530`
			`$ grep 207.46.13.71 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`859`
			`$ grep 40.77.167.62 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`610`
			`$ grep 54.83.138.123 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`8`
			`$ grep 207.46.13.135 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`826`
			`$ grep 68.180.228.157 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`727`
			`$ grep 40.77.167.36 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`181`
			`$ grep 130.82.1.40 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`24`
			`$ grep 207.46.13.54 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`166`
			`$ grep 46.229.168 dspace.log.2018-02-07 \| grep -o -E 'session_id=[A-Z0-9]{32}' \| sort -n \| uniq \| wc -l`
			`992`

			```

			`- Let's investigate who these IPs belong to:`
			`- 104.196.152.243 is CIAT, which is already marked as a bot via nginx!`
			`- 207.46.13.71 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!`
			`- 40.77.167.62 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!`
			`- 207.46.13.135 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!`
			`- 68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!`
			`- 40.77.167.36 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!`
			`- 207.46.13.54 is Bing, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!`
			`- 46.229.168.x is Semrush, which is already marked as a bot in Tomcat's Crawler Session Manager Valve!`
			`- Nice, so these are all known bots that are already crammed into one session by Tomcat's Crawler Session Manager Valve.`
			`- What in the actual fuck, why is our load doing this? It's gotta be something fucked up with the database pool being "busy" but everything is fucking idle`
			`- One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:`

			```
			`BUbiNG (+http://law.di.unimi.it/BUbiNG.html)`
			```

			`- This one makes two thousand requests per day or so recently:`

			```
			`# grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1`
			`/var/log/nginx/access.log:1925`
			`/var/log/nginx/access.log.1:2029`
			```

			`- And they have 30 IPs, so fuck that shit I'm going to add them to the Tomcat Crawler Session Manager Valve nowwww`
			`- Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace`
			`- Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker`
			`- This is how the connections looked when it crashed this afternoon:`

			```
			`$ psql -c 'select * from pg_stat_activity' \| grep -o -E '(dspaceWeb\|dspaceApi\|dspaceCli)' \| sort \| uniq -c`
			`5 dspaceApi`
			`290 dspaceWeb`
			```

			`- This is how it is right now:`

			```
			`$ psql -c 'select * from pg_stat_activity' \| grep -o -E '(dspaceWeb\|dspaceApi\|dspaceCli)' \| sort \| uniq -c`
			`5 dspaceApi`
			`5 dspaceWeb`
			```

			`- So is this just some fucked up XMLUI database leaking?`
			`- I notice there is an issue (that I've probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551`
			`- I seriously doubt this leaking shit is fixed for sure, but I'm gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I'm fed up with this shit`
Update notes 2018-02-07 23:56:53 +01:00			`- I cherry-picked all the commits for DS-3551 but it won't build on our current DSpace 5.5!`
Update notes for 2018-02-07 2018-02-08 00:08:36 +01:00			`- I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle`
Add notes for 2018-02-11 2018-02-11 09:01:13 +01:00
			`## 2018-02-11`

			`- Magdalena from CCAFS emailed to ask why one of their items has such a weird thumbnail: [10568/90735](https://cgspace.cgiar.org/handle/10568/90735)`

			`![Weird thumbnail](/cgspace-notes/2018/02/CCAFS_WP_223.pdf.jpg)`

			`- I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:`

			```
			`$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg`
			```

			`![Manual thumbnail](/cgspace-notes/2018/02/CCAFS_WP_223.jpg)`