---
title: "August, 2018"
date: 2018-08-01T11:52:54+03:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2018-08-01

- DSpace Test had crashed at some point yesterday morning and I see the following in `dmesg`:

```
[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
- From the DSpace log I see that eventually Solr stopped responding, so I guess the `java` process that was OOM killed above was Tomcat's
- I'm not sure why Tomcat didn't crash with an OutOfMemoryError...
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it

<!--more-->

- I started looking over the latest round of IITA batch records from Sisay on DSpace Test: [IITA July_30](https://dspacetest.cgiar.org/handle/10568/103250)
  - incorrect authorship types
  - dozens of inconsistencies, spelling mistakes, and white space in author affiliations
  - minor issues in countries (California is not a country)
  - minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects

## 2018-08-02

- DSpace Test crashed again and the only error I see is this in `dmesg`:

```
[Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
```
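
- To put the two OOM reports in perspective, here is a minimal Python sketch (not part of the original troubleshooting) that parses the `Killed process` lines and converts `anon-rss` to MiB and to a share of RAM; `parse_oom_kill` and `RAM_MIB` are hypothetical names, and 8192 MiB is an assumption based on the "8GB" figure above (the kernel usually reports slightly less usable memory):

```python
import re

# Matches the "Killed process" line that the kernel OOM killer writes to dmesg.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

RAM_MIB = 8192  # assumption: "8GB of RAM" taken as 8 GiB


def parse_oom_kill(line):
    """Return pid, name and memory figures (in MiB) from one dmesg line, or None."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_mib": int(match.group("total_vm")) / 1024,
        "anon_rss_mib": int(match.group("anon_rss")) / 1024,
    }


# The two "Killed process" lines copied from the dmesg output in these notes.
lines = [
    "[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB",
    "[Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB",
]

for line in lines:
    info = parse_oom_kill(line)
    share = info["anon_rss_mib"] / RAM_MIB * 100
    print(f"{info['name']} (pid {info['pid']}): anon-rss {info['anon_rss_mib']:.0f} MiB, {share:.0f}% of RAM")
```

- The resident memory of a JVM includes the heap plus non-heap areas, so it can exceed the configured heap; the kernel's OOM killer acts on that resident size, independently of the JVM's own `OutOfMemoryError`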

- I am still assuming that this is the Tomcat process that is dying, so maybe we actually need to reduce its memory instead of increasing it?
- The risk we run there is that we'll start getting OutOfMemory errors from Tomcat
- So basically we need a new test server with more RAM very soon...
- Abenet asked about the workflow statistics in the Atmire CUA module again
- Last year Atmire told me that it's disabled by default but you can enable it with `workflow.stats.enabled = true` in the CUA configuration file
- There was a bug with adding users so they sent a patch, but I didn't merge it because it was [very dirty](https://github.com/ilri/DSpace/pull/319) and I wasn't sure it actually fixed the problem
- I just tried to enable the stats again on DSpace Test now that we're on DSpace 5.8 with updated Atmire modules, but every user I search for shows "No data available"
- As a test I submitted a new item and I was able to see it in the workflow statistics "data" tab, but not in the graph

## 2018-08-15

- Run through Peter's list of author affiliations from earlier this month
- I did some quick sanity checks and small cleanups in OpenRefine, checking for spaces, weird accents, and encoding errors
- Finally I did a test run with the [`fix-metadata-values.py`](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897) script:

```
$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```
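
- The sanity checks mentioned above were done in OpenRefine; as a rough illustration only, the same kinds of problems (stray spaces, decomposed accents, encoding damage) could be flagged in Python before running the scripts. `check_value` is a hypothetical helper and the sample strings are made up:

```python
import re
import unicodedata


def check_value(value):
    """Return a list of human-readable problems found in one metadata value."""
    problems = []
    if value != value.strip():
        problems.append("leading or trailing whitespace")
    if re.search(r"\s{2,}", value):
        problems.append("repeated whitespace")
    if "\ufffd" in value:
        # U+FFFD is what decoders emit when bytes could not be decoded.
        problems.append("replacement character (likely encoding error)")
    if value != unicodedata.normalize("NFC", value):
        # Accents stored as separate combining characters instead of composed ones.
        problems.append("not NFC-normalized")
    return problems


samples = [
    "International Livestock Research Institute",
    "International  Livestock Research Institute ",
    "Universite\u0301 de Lome\u0301",
]

for sample in samples:
    print(repr(sample), check_value(sample))
```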

## 2018-08-16

- Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:

```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
```

- Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
- I might need to overhaul the [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
- After checking a few examples I see that checking only the `text_value` and `place` when adding ORCID fields is not enough anymore
- It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission
- Now it is better to check if there is _any_ existing ORCID identifier for a given author for the item...
- I will have to update my script to extract the ORCID identifier and search for that
- Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:

```
$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
```
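
- Regarding the ORCID check described above (look for _any_ existing ORCID identifier for the item rather than matching on `text_value` and `place`), a minimal sketch of the idea could look like this. It is not the actual `add-orcid-identifiers-csv.py` code: `extract_orcid` and `item_has_orcid` are hypothetical helpers, the "Name: identifier" value format is an assumption, and the identifier used is ORCID's published example iD, not a real CIAT author:

```python
import re

# An ORCID iD is four hyphen-separated groups of four characters;
# the very last character is a checksum and may be "X".
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def extract_orcid(text_value):
    """Return the first ORCID iD found in a metadata value, or None."""
    match = ORCID_RE.search(text_value)
    return match.group(0) if match else None


def item_has_orcid(existing_values, new_value):
    """True if any existing value on the item already carries the iD in new_value."""
    orcid = extract_orcid(new_value)
    if orcid is None:
        raise ValueError(f"no ORCID identifier found in {new_value!r}")
    return any(extract_orcid(value) == orcid for value in existing_values)


# Hypothetical values already on an item; the name differs from the one being
# added, so a plain text_value comparison would miss the duplicate.
existing = ["J. Doe: 0000-0002-1825-0097"]
candidate = "Jane Doe: 0000-0002-1825-0097"

if item_has_orcid(existing, candidate):
    print("Skipping, item already has this ORCID identifier")
else:
    print("Adding", candidate)
```

- Comparing the extracted identifier rather than the whole string means a changed name spelling or author order would no longer cause a duplicate to be added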

<!-- vim: set sw=2 ts=2: -->