[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
- From the DSpace log I see that eventually Solr stopped responding, so I guess the `java` process that was OOM killed above was Tomcat's
- I'm not sure why Tomcat didn't crash with an OutOfMemoryError...
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
- I started looking over the latest round of IITA batch records from Sisay on DSpace Test: [IITA July_30](https://dspacetest.cgiar.org/handle/10568/103250)
  - incorrect authorship types
  - dozens of inconsistencies, spelling mistakes, and white space in author affiliations
  - minor issues in countries (California is not a country)
  - minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects
- DSpace Test crashed again and the only error I see is this in `dmesg`:
```
[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
```
- I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?
- The risk we run there is that we'll start getting OutOfMemory errors from Tomcat
- So basically we need a new test server with more RAM very soon...
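- To put numbers on the kill above, here is a minimal Python sketch that parses the kernel's "Killed process" line and converts `anon-rss` to GiB; the `parse_oom_kill` helper and the 8 GiB RAM figure are my own assumptions for illustration, not something from the server's tooling:

```python
import re

# Matches the kernel's "Killed process" line as it appears in the dmesg output above
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, name, and anonymous RSS in GiB, or None if the line does not match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "anon_rss_gib": int(match.group("anon_rss")) / 1024 ** 2,  # kB -> GiB
    }


line = (
    "[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) "
    "total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB"
)
info = parse_oom_kill(line)
ram_gib = 8  # assumption: treating the server's "8GB" as 8 GiB
share = info["anon_rss_gib"] / ram_gib
print(f"{info['name']} (pid {info['pid']}) held {info['anon_rss_gib']:.2f} GiB, {share:.0%} of RAM")
```

- The resident size is larger than the 5120m heap, which would be consistent with the JVM using memory outside the heap (metaspace, thread stacks, and so on); that may also be why the kernel killed the process before Tomcat itself ever raised an OutOfMemoryError, though I have not confirmed it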
- Abenet asked about the workflow statistics in the Atmire CUA module again
- Last year Atmire told me that it's disabled by default but you can enable it with `workflow.stats.enabled = true` in the CUA configuration file
- There was a bug with adding users so they sent a patch, but I didn't merge it because it was [very dirty](https://github.com/ilri/DSpace/pull/319) and I wasn't sure it actually fixed the problem
- I just tried to enable the stats again on DSpace Test now that we're on DSpace 5.8 with updated Atmire modules, but every user I search for shows "No data available"
- As a test I submitted a new item and I was able to see it in the workflow statistics "data" tab, but not in the graph
- Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:
```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
```
- Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
- I might need to overhaul the [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
- After checking a few examples I see that checking only the `text_value` and `place` when adding ORCID fields is not enough anymore
- It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission
- Now it is better to check if there is _any_ existing ORCID identifier for a given author for the item...
- I will have to update my script to extract the ORCID identifier and search for that
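- A minimal sketch of that check in Python, assuming the item's existing `cg.creator.id` values have already been fetched into a list; `extract_orcid` and `needs_orcid` are hypothetical helper names and not what the real script uses:

```python
import re

# An ORCID iD is four hyphen-separated groups of four characters;
# the very last character is a checksum that may be "X"
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def extract_orcid(value):
    """Pull the bare identifier out of a value like 'Name: 0000-0002-0123-4859'."""
    match = ORCID_RE.search(value)
    return match.group(0) if match else None


def needs_orcid(new_value, existing_values):
    """True if the identifier in new_value is not yet on the item in any form."""
    orcid = extract_orcid(new_value)
    if orcid is None:
        raise ValueError(f"no ORCID identifier found in {new_value!r}")
    existing = {extract_orcid(value) for value in existing_values}
    return orcid not in existing


# Hypothetical item where an editor changed the name label after submission
existing = ["Bruce Campbell: 0000-0002-0123-4859"]
print(needs_orcid("Bruce M Campbell: 0000-0002-0123-4859", existing))
```

- Comparing only the extracted identifier means a label that was edited by hand, or an author whose `place` shifted, no longer results in a duplicate ORCID field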
- Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:
- Keep working on the CIAT ORCID identifiers from Elizabeth
- In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (e.g. "Schultze-Kraft, Rainer" and "Schultze-Kraft, R.") I will just tag them with ORCID identifiers too
- This is less obvious and more error prone with names like "Peters" where there are many more authors
- I see some errors in the variations of names as well, for example:
```
Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
```
- I'll just tag them all with Louis Verchot's ORCID identifier...
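- A rough way to surface candidate variants like these is to group names by surname and first initial; this is only a sketch with a hypothetical `name_key` helper, and the grouping is deliberately crude, so every group still needs a manual check:

```python
from collections import defaultdict


def name_key(name):
    """Reduce 'Surname, Given Names' to (surname, first initial), lowercased."""
    surname, _, given = name.partition(",")
    first_initial = next((ch for ch in given if ch.isalpha()), "")
    return (surname.strip().lower(), first_initial.lower())


def group_variants(names):
    """Group names that share a surname and first initial."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)


variants = [
    "Verchot, Louis",
    "Verchot, L",
    "Verchot, L. V.",
    "Verchot, L.V",
    "Verchot, L.V.",
    "Verchot, LV",
    "Verchot, Louis V.",
]
groups = group_variants(variants)
for key, names in groups.items():
    print(key, len(names))
```

- All seven Verchot variants fall into one group, but the same key would also merge different people who share a surname and initial, which is exactly the risk with a name like "Peters"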
- In the end, I'll run the following CSV with my [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:
```
dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
```
- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
- Looking at the list of author affiliations from Peter one last time
- I notice that I should add the Unicode character U+00B4 (´, the acute accent) to my list of invalid characters to look for in OpenRefine, which makes the latest version of the GREL expression:
```
or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b4.*/))
)
```
- This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
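- For checking values outside of OpenRefine, the same five code points can be tested with one character class; this is a Python sketch of the idea and not part of the OpenRefine workflow, and `has_suspect_char` is a name I made up for illustration:

```python
import re

# The same code points as the GREL expression:
# U+FFFD replacement character, U+00A0 no-break space, U+200A hair space,
# U+2019 right single quotation mark, U+00B4 acute accent
SUSPECT_CHARS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4]")


def has_suspect_char(value):
    """True if the value contains any of the characters flagged above."""
    return SUSPECT_CHARS.search(value) is not None


print(has_suspect_char("De\u00b4veloppement"))  # stray acute accent after the "e"
print(has_suspect_char("Développement"))  # properly encoded "é"
```

- A properly encoded "é" or "ó" is a single precomposed character, so it does not match; only the stand-alone accent left behind by the encoding problem does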
- I will run the following on DSpace Test and CGSpace:
- I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set `JAVA_OPTS` to 1024m and tried the `mvn package` again
- Still the `mvn package` takes forever and essentially hangs on processing the xmlui-mirage2 overlay (though after building all the themes)
- I will try to reduce Tomcat memory from 4608m to 4096m and then retry the `mvn package` with 1024m of `JAVA_OPTS` again
- After running the `mvn package` for the third time and waiting an hour, I attached `strace` to the Java process and saw that it was indeed reading XMLUI theme data... so I guess I just need to wait more
- After waiting two hours the Maven process completed and installation was successful
- I restarted Tomcat and it seems everything is working well, so I'll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
- I merged [Atmire's pull request](https://github.com/ilri/DSpace/pull/378) into our `5_x-dspace-5.8` temporary branch and then cherry-picked all the changes from `5_x-prod` since April 2018, when that temporary branch was created
- As the branch histories are very different I cannot merge the new 5.8 branch into the current `5_x-prod` branch
- Instead, I will archive the current `5_x-prod` DSpace 5.5 branch as `5_x-prod-dspace-5.5` and then hard reset `5_x-prod` based on `5_x-dspace-5.8`
- Unfortunately this will mess up the references in pull requests and issues on GitHub
- And I notice that Atmire changed something in the XMLUI module's `pom.xml` as part of the DSpace 5.8 changes, specifically to remove the exclude for `node_modules` in the `maven-war-plugin` step
- This exclude is *present* in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!
- It makes sense that it would take longer to complete this step because the `node_modules` folder has tens of thousands of files, and we have 27 themes!
- In other news, I see there was a pull request in DSpace 5.9 that fixes the issue with not being able to have blank lines in CSVs when importing via command line or webui ([DS-3245](https://jira.duraspace.org/browse/DS-3245))
- Discuss CTA items with Sisay, who was trying to figure out how to do the collection mapping in combination with SAFBuilder
- It appears that the web UI's upload interface *requires* you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the `collections` file inside each item in the bundle
- I imported the CTA items on CGSpace for Sisay:
```
$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map