+++
date = "2017-09-07T16:54:52+07:00"
author = "Alan Orth"
title = "September, 2017"
tags = ["Notes"]

+++
## 2017-09-06

- Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours

## 2017-09-07

- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is in both the approvers step and the group

<!--more-->

## 2017-09-10

- Delete 58 blank metadata values from the CGSpace database:

```
dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 58
```

- I also ran it on DSpace Test because we'll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
- Run system updates and restart DSpace Test
- We only have 7.7GB of free space on DSpace Test so I need to copy some data off of it before doing the CGIAR Library migration (requires lots of exporting and creating temp files)
- I still have the original data from the CGIAR Library so I've zipped it up and sent it off to linode18 for now
- sha256sum of `original-cgiar-library-6.6GB.tar.gz` is: bcfabb52f51cbdf164b61b7e9b3a0e498479e4c1ed1d547d32d11f44c0d5eb8a
- Start doing a test run of the CGIAR Library migration locally
- Notes and todo checklist here for now: https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c
- Create pull request for Phase I and II changes to CCAFS Project Tags: [#336](https://github.com/ilri/DSpace/pull/336)
- We've been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized
- There will need to be some metadata updates for that as well, though if I recall correctly it is only about seven records; I made some notes about it in [2017-07](/cgspace-notes/2017-07), but I've asked Lili for more clarification just in case
- Looking at the DSpace logs to see if we've had a change in the "Cannot get a connection" errors since last month when we adjusted the `db.maxconnections` parameter on CGSpace:

```
# grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
dspace.log.2017-09-01:0
dspace.log.2017-09-02:0
dspace.log.2017-09-03:9
dspace.log.2017-09-04:17
dspace.log.2017-09-05:752
dspace.log.2017-09-06:0
dspace.log.2017-09-07:0
dspace.log.2017-09-08:10
dspace.log.2017-09-09:0
dspace.log.2017-09-10:0
```

- Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I'm sure that helped
- There are still some errors, though, so maybe I should bump the connection limit up a bit
- I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we're currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system's PostgreSQL `max_connections` (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)
- I updated both CGSpace and DSpace Test to use these new settings (60 connections per web app and 183 for system PostgreSQL limit)
- I'm expecting to see 0 connection errors for the next few months
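
The sizing rule above is simple arithmetic, sketched here in Python as a sanity check. The function name and the `extra` parameter are labels I made up for illustration; the values (three web apps, 60 connections each, plus three) come from the notes above.

```python
def postgres_max_connections(webapps: int, per_app: int, extra: int = 3) -> int:
    """Return webapps * per_app + extra, the sizing rule from the notes."""
    return webapps * per_app + extra


# Three web apps at 60 pooled connections each, plus three extra.
print(postgres_max_connections(webapps=3, per_app=60))  # 183
```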

## 2017-09-11

- Lots of work testing the CGIAR Library migration
- Many technical notes and TODOs here: https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c

## 2017-09-12

- I was testing the [METS XSD caching during AIP ingest](https://wiki.duraspace.org/display/DSDOC5x/AIP+Backup+and+Restore#AIPBackupandRestore-AIPConfigurationsToImproveIngestionSpeedwhileValidating) but it doesn't actually seem to help
- The import process takes the same amount of time with and without the caching
- Also, I captured TCP packets destined for port 80 during both imports, and each capture contained only ONE packet (an update check from some component in Java):

```
$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
```

- Great TCP dump guide here: https://danielmiessler.com/study/tcpdump
- The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation
- I sent a message to the mailing list to see if anyone knows more about this
- In looking at the tcpdump results I notice that there is an update check to the ehcache server on _every_ iteration of the ingest loop, for example:

```
09:39:36.008956 IP 192.168.8.124.50515 > 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_144&platform=x86_64&tc-version=UNKNOWN&tc-product=Ehcache+Core+1.7.2&source=Ehcache+Core&uptime-secs=0&patch=UNKNOWN HTTP/1.1
```

- Turns out this is a known issue and Ehcache has refused to make it opt-in: https://jira.terracotta.org/jira/browse/EHC-461
- But we can disable it by adding an `updateCheck="false"` attribute to the main `<ehcache >` tag in `dspace-services/src/main/resources/caching/ehcache-config.xml`
- After re-compiling and re-deploying DSpace I no longer see those update checks during item submission
- I had a Skype call with Bram Luyten from Atmire to discuss various issues related to ORCID in DSpace
  - First, ORCID is deprecating their version 1 API (which DSpace uses) and in version 2 API they have removed the ability to search for users by name
  - The logic is that searching by name actually isn't very useful because ORCID is essentially a global phonebook and there are tons of legitimately duplicate and ambiguous names
  - Atmire's proposed integration would work by having users look up and add authors to the authority core directly using the ORCID ID itself (this would happen during the item submission process or perhaps as a standalone / batch process, for example to populate the authority core with a list of known ORCIDs)
  - Once the association between name and ORCID is made in the authority then it can be autocompleted in the lookup field
  - Ideally there could also be a user interface for cleanup and merging of authorities
  - He will prepare a quote for us, keeping in mind that this could be useful to contribute back to the community for a 5.x release
  - As far as exposing ORCIDs as flat metadata alongside all other metadata, he says this should be possible and will work on a quote for us
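
As a quick check on the tcpdump filter used earlier in this entry, here is a minimal Python sketch showing that `0x47455420` is the ASCII for `GET ` and why offset 32 lines up with the start of the payload. It assumes a 20-byte base TCP header plus 12 bytes of options (two NOPs and a timestamp, as in the captured packet above); the function names are mine, for illustration only.

```python
import struct

GET_FILTER_VALUE = 0x47455420  # the constant from the tcpdump filter


def filter_value_to_ascii(value: int) -> str:
    """Decode a 32-bit big-endian integer into its four ASCII characters."""
    return struct.pack(">I", value).decode("ascii")


def payload_offset(base_header_len: int = 20, options_len: int = 12) -> int:
    """Offset of the first payload byte, counted from the start of the TCP header."""
    return base_header_len + options_len


print(repr(filter_value_to_ascii(GET_FILTER_VALUE)))  # 'GET '
print(payload_offset())  # 32
```

If a segment carried a different set of TCP options, the payload would start at a different offset and this filter would not match it.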

## 2017-09-13

- Last night Linode sent an alert about CGSpace (linode18) that it has exceeded the outbound traffic rate threshold of 10Mb/s for the last two hours
- I wonder what was going on, and looking into the nginx logs I think maybe it's OAI...
- Here are yesterday's top ten IP addresses making requests to `/oai`:

```
# awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
      1 213.136.89.78
      1 66.249.66.90
      1 66.249.66.92
      3 68.180.229.31
      4 35.187.22.255
  13745 54.70.175.86
  15814 34.211.17.113
  15825 35.161.215.53
  16704 54.70.51.7
```

- Compared to the previous day's logs it looks VERY high:

```
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
      1 207.46.13.39
      1 66.249.66.93
      2 66.249.66.91
      4 216.244.66.194
     14 66.249.66.90
```

- The user agents for those top IPs are:
  - 54.70.175.86: API scraper
  - 34.211.17.113: API scraper
  - 35.161.215.53: API scraper
  - 54.70.51.7: API scraper
- And this user agent has never been seen before today (or at least recently!):

```
# grep -c "API scraper" /var/log/nginx/oai.log
62088
# zgrep -c "API scraper" /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
/var/log/nginx/oai.log.11.gz:0
/var/log/nginx/oai.log.12.gz:0
/var/log/nginx/oai.log.13.gz:0
/var/log/nginx/oai.log.14.gz:0
/var/log/nginx/oai.log.15.gz:0
/var/log/nginx/oai.log.16.gz:0
/var/log/nginx/oai.log.17.gz:0
/var/log/nginx/oai.log.18.gz:0
/var/log/nginx/oai.log.19.gz:0
/var/log/nginx/oai.log.20.gz:0
/var/log/nginx/oai.log.21.gz:0
/var/log/nginx/oai.log.22.gz:0
/var/log/nginx/oai.log.23.gz:0
/var/log/nginx/oai.log.24.gz:0
/var/log/nginx/oai.log.25.gz:0
/var/log/nginx/oai.log.26.gz:0
/var/log/nginx/oai.log.27.gz:0
/var/log/nginx/oai.log.28.gz:0
/var/log/nginx/oai.log.29.gz:0
/var/log/nginx/oai.log.2.gz:0
/var/log/nginx/oai.log.30.gz:0
/var/log/nginx/oai.log.3.gz:0
/var/log/nginx/oai.log.4.gz:0
/var/log/nginx/oai.log.5.gz:0
/var/log/nginx/oai.log.6.gz:0
/var/log/nginx/oai.log.7.gz:0
/var/log/nginx/oai.log.8.gz:0
/var/log/nginx/oai.log.9.gz:0
```

- Some of these heavy users are also using XMLUI, and their user agent isn't matched by the [Tomcat Session Crawler valve](https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L158), so each request uses a different session
- Yesterday alone the IP addresses using the `API scraper` user agent were responsible for almost 16,000 sessions in XMLUI:

```
# grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
15924
```
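
The same count can be sketched in Python, which makes it easy to match the IP addresses literally (in the `grep -E` pattern above an unescaped `.` matches any character). This is only a rough sketch: the function name and the sample log lines are made up for illustration, and it assumes nothing about the DSpace log format beyond an IP address and a `session_id=` token appearing on the same line.

```python
import re

SCRAPER_IPS = ["54.70.51.7", "35.161.215.53", "34.211.17.113", "54.70.175.86"]

# Escape the dots and refuse matches that are part of a longer number.
IP_PATTERN = re.compile(
    r"(?<!\d)(?:" + "|".join(re.escape(ip) for ip in SCRAPER_IPS) + r")(?!\d)"
)
SESSION_PATTERN = re.compile(r"session_id=([A-Z0-9]{32})")


def count_unique_sessions(lines):
    """Count distinct session IDs on lines that mention one of the scraper IPs."""
    sessions = set()
    for line in lines:
        if IP_PATTERN.search(line):
            sessions.update(SESSION_PATTERN.findall(line))
    return len(sessions)


# Made-up sample lines, not real log entries.
sample = [
    f"... session_id={'A' * 32}:ip_addr=54.70.51.7:view_item",
    f"... session_id={'A' * 32}:ip_addr=54.70.51.7:view_item",
    f"... session_id={'B' * 32}:ip_addr=35.161.215.53:view_item",
    f"... session_id={'C' * 32}:ip_addr=154.70.51.77:view_item",
]
print(count_unique_sessions(sample))  # 2
```

For a real log file, passing `open(path, errors="replace")` as `lines` would be a rough analogue of `grep -a`, since it tolerates bytes that are not valid text.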

- If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex
- Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:

```
WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
```