---
title: "August, 2018"
date: 2018-08-01T11:52:54+03:00
author: "Alan Orth"
tags: ["Notes"]
---
## 2018-08-01
- DSpace Test had crashed at some point yesterday morning and I see the following in `dmesg`:
```
[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
- From the DSpace log I see that eventually Solr stopped responding, so I guess the `java` process that was OOM killed above was Tomcat's
- I'm not sure why Tomcat didn't crash with an OutOfMemoryError...
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core (a sketch of that change is below)
- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
<!--more-->
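- For reference, a minimal sketch of that heap bump (we pass the heap to Tomcat via `JAVA_OPTS`; the file name and the other flags here are assumptions, as the real values are templated in our infrastructure configuration):
```
# /etc/default/tomcat7 (path assumed)
# bump the JVM heap from 5120m to 6144m
JAVA_OPTS="-Djava.awt.headless=true -Xms6144m -Xmx6144m -Dfile.encoding=UTF-8"
```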
- I started looking over the latest round of IITA batch records from Sisay on DSpace Test: [IITA July_30](https://dspacetest.cgiar.org/handle/10568/103250)
  - incorrect authorship types
  - dozens of inconsistencies, spelling mistakes, and white space in author affiliations
  - minor issues in countries (California is not a country)
  - minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects
## 2018-08-02
- DSpace Test crashed again, and the only error I see is this in `dmesg`:
```
[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
```
- I am still assuming that this is the Tomcat process that is dying, so maybe we actually need to reduce its memory instead of increasing it?
- The risk we run there is that we'll start getting OutOfMemory errors from Tomcat
- So basically we need a new test server with more RAM very soon...
- Abenet asked about the workflow statistics in the Atmire CUA module again
- Last year Atmire told me that it's disabled by default but you can enable it with `workflow.stats.enabled = true` in the CUA configuration file
- There was a bug with adding users so they sent a patch, but I didn't merge it because it was [very dirty](https://github.com/ilri/DSpace/pull/319) and I wasn't sure it actually fixed the problem
- I just tried to enable the stats again on DSpace Test now that we're on DSpace 5.8 with updated Atmire modules, but every user I search for shows "No data available"
- As a test I submitted a new item and I was able to see it in the workflow statistics "data" tab, but not in the graph
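- For reference, enabling them should just be this one property in the Atmire CUA module configuration (the exact file name is an assumption on my part):
```
# in the CUA module configuration, e.g. config/modules/atmire-cua.cfg (file name assumed)
workflow.stats.enabled = true
```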
## 2018-08-15
- Run through Peter's list of author affiliations from earlier this month
- I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
- Finally I did a test run with the [`fix-metadata-values.py`](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897) script:
```
$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```
## 2018-08-16
- Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:
```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
```
- Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
- I might need to overhaul the [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
- After checking a few examples I see that checking only the `text_value` and `place` when adding ORCID fields is not enough anymore
- It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission
- Now it is better to check if there is _any_ existing ORCID identifier for a given author for the item...
- I will have to update my script to extract the ORCID identifier and search for that
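- A rough sketch of the kind of lookup the updated script would do, reusing the `metadatafieldregistry` subquery style from the author query above (the `resource_id` and ORCID identifier here are just placeholders, and I'm assuming `cg.creator.id` is the only field with element 'creator' and qualifier 'id'):
```
dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND resource_id=12345 AND metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'creator' AND qualifier = 'id') AND text_value LIKE '%0000-0002-1825-0097%';
```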
- Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:
```
$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
```
## 2018-08-19
- Keep working on the CIAT ORCID identifiers from Elizabeth
- In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (for example, "Schultze-Kraft, Rainer" and "Schultze-Kraft, R.") I will just tag them with ORCID identifiers too
- This is less obvious and more error prone with names like "Peters" where there are many more authors
- I see some errors in the variations of names as well, for example:
```
Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
```
- I'll just tag them all with Louis Verchot's ORCID identifier...
- In the end, I'll run the following CSV with my [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:
```
dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
"Peters, Michael",Michael Peters: 0000-0003-4237-3916
"Peters, M.",Michael Peters: 0000-0003-4237-3916
"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
"Verchot, L",Louis Verchot: 0000-0001-8309-6754
"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
```
- The invocation would be:
```
$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
```
- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
- Looking at the list of author affiliations from Peter one last time
- I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:
```
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
```
- This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
- I will run the following on DSpace Test and CGSpace:
```
$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```
- Then force an update of the Discovery index on DSpace Test:
```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 72m12.570s
user 6m45.305s
sys 2m2.461s
```
- And then on CGSpace:
```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 79m44.392s
user 8m50.730s
sys 2m20.248s
```
- Run system updates on DSpace Test and reboot the server
- In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
```
- I don't even know how it's possible for the bot to use MORE sessions than total requests...
- The user agent is:
```
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
```
- So I'm thinking we should add "crawl" to the Tomcat Crawler Session Manager valve, as we already have "bot" that catches Googlebot, Bingbot, etc.
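- That would be something like the following in Tomcat's `server.xml` (just a sketch that extends Tomcat's default `crawlerUserAgents` pattern; our actual value is managed in our configuration templates and may end up different):
```
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*[Cc]rawl.*" />
```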
## 2018-08-20
- Help Sisay with some UTF-8 encoding issues in a file Peter sent him
- Finish up reconciling Atmire's pull request for DSpace 5.8 changes with the latest status of our `5_x-prod` branch
- I had to do some `git rev-list --reverse --no-merges oldestcommit..newestcommit` and `git cherry-pick -S` hackery to get everything all in order
- After building I ran the Atmire schema migrations and forced old migrations, then did the `ant update`
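- In rough terms that looked something like this (the exact plumbing between the two commands is from memory, `oldestcommit..newestcommit` stands in for the real range, and conflicts were resolved by hand as they came up):
```
$ git checkout 5_x-dspace-5.8
$ git rev-list --reverse --no-merges oldestcommit..newestcommit | xargs -L1 git cherry-pick -S
```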
- I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set `JAVA_OPTS` to 1024m and tried the `mvn package` again
- Still the `mvn package` takes forever and essentially hangs on processing the xmlui-mirage2 overlay (though after building all the themes)
- I will try to reduce Tomcat memory from 4608m to 4096m and then retry the `mvn package` with 1024m of `JAVA_OPTS` again
- After running the `mvn package` for the third time and waiting an hour, I attached `strace` to the Java process and saw that it was indeed reading XMLUI theme data... so I guess I just need to wait more
- After waiting two hours the maven process completed and installation was successful
- I restarted Tomcat and it seems everything is working well, so I'll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
- I merged [Atmire's pull request](https://github.com/ilri/DSpace/pull/378) into our `5_x-dspace-5.8` temporary branch and then cherry-picked all the changes from `5_x-prod` since April, 2018 when that temporary branch was created
- As the branch histories are very different I cannot merge the new 5.8 branch into the current `5_x-prod` branch
- Instead, I will archive the current `5_x-prod` DSpace 5.5 branch as `5_x-prod-dspace-5.5` and then hard reset `5_x-prod` based on `5_x-dspace-5.8`
- Unfortunately this will mess up the references in pull requests and issues on GitHub
## 2018-08-21
- Something must have happened, as the `mvn package` *always* takes about two hours now, stopping for a very long time near the end at this step:
```
[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
```
- It's the same on DSpace Test, my local laptop, and CGSpace...
- It wasn't this way before when I was constantly building the previous 5.8 branch with Atmire patches...
- I will restore the previous `5_x-dspace-5.8` and `atmire-module-upgrades-5.8` branches to see if the build time is different there
- ... it seems that the `atmire-module-upgrades-5.8` branch still takes 1 hour and 23 minutes on my local machine...
- Let me try to build the old `5_x-prod-dspace-5.5` branch on my local machine and see how long it takes
- That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch, now I should try vanilla DSpace 5.8
- I notice that the step this pauses at is:
```
[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
```
- And I notice that Atmire changed something in the XMLUI module's `pom.xml` as part of the DSpace 5.8 changes, specifically to remove the exclude for `node_modules` in the `maven-war-plugin` step
- This exclude is *present* in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!
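- I don't have the exact diff in front of me, but the vanilla configuration keeps the theme `node_modules` directories out of the war packaging with something along these lines (a reconstruction, not the literal `pom.xml` contents):
```
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-war-plugin</artifactId>
  <configuration>
    <!-- reconstruction: keep npm build artifacts out of the war -->
    <packagingExcludes>themes/*/node_modules/**</packagingExcludes>
  </configuration>
</plugin>
```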
- It makes sense that it would take longer to complete this step because the `node_modules` folder has tens of thousands of files, and we have 27 themes!
- I need to test to see if this has any side effects when deployed...
- In other news, I see there was a pull request in DSpace 5.9 that fixes the issue with not being able to have blank lines in CSVs when importing via command line or webui ([DS-3245](https://jira.duraspace.org/browse/DS-3245))
## 2018-08-23
- Skype meeting with CKM people to meet new web dev guy Tariku
- They say they want to start working on the ContentDM harvester middleware again
- I sent a list of the top 1500 author affiliations on CGSpace to CodeObia so we can compare ours with the ones on MELSpace
- Discuss CTA items with Sisay; he was trying to figure out how to do the collection mapping in combination with SAFBuilder
- It appears that the web UI's upload interface *requires* you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the `collections` file inside each item in the bundle
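- In other words, each item directory in the SAF bundle can carry a `collections` file naming the target collection(s), which the importer uses when no collection flag is given; for example (handle and paths here are placeholders):
```
$ cat item_0001/collections
10568/12345
```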
- I imported the CTA items on CGSpace for Sisay:
```
$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
```
## 2018-08-26
- Doing the DSpace 5.8 upgrade on CGSpace (linode18)
- I already finished the Maven build, now I'll take a backup of the PostgreSQL database and do a database cleanup just in case:
```
$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
$ dspace cleanup -v
```
- Now I can stop Tomcat and do the install:
```
$ cd dspace/target/dspace-installer
$ ant update clean_backups update_geolite
```
- After the successful Ant update I can run the database migrations:
```
$ psql dspace dspace
dspace=> \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql
DELETE 0
UPDATE 1
DELETE 1
dspace=> \q
$ dspace database migrate ignored
```
- Then I'll run all system updates and reboot the server:
```
$ sudo su -
# apt update && apt full-upgrade
# apt clean && apt autoclean && apt autoremove
# reboot
```
- After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine
- Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now
- They want a CSV with *all* metadata, which the Atmire Listings and Reports module can't do
- I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject `GENDER` or `GENDER POVERTY AND INSTITUTIONS`, and CRP `Water, Land and Ecosystems`
- Then I extracted the Handle links from the report so I could export each item's metadata as CSV
```
$ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
```
- Then on the DSpace server I exported the metadata for each item one by one:
```
$ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
```
- But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them
- I'm not sure how to proceed without writing some script to parse and join the CSVs, and I don't think it's worth my time
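- If I do end up doing it, a minimal sketch would be something along these lines (untested, and the paths and handle prefix are assumptions based on the export commands above):
```
#!/usr/bin/env python3
# Sketch (untested): combine per-item metadata CSVs that have different
# columns by taking the union of all column names. Paths are assumptions.
import csv
import glob

files = glob.glob('/tmp/10568-*.csv')

# First pass: collect the union of all column names across the exports
fieldnames = []
for f in files:
    with open(f, newline='') as csvfile:
        for name in csv.DictReader(csvfile).fieldnames:
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: write every row into one CSV, leaving missing columns empty
with open('/tmp/iwmi-gender-combined.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for f in files:
        with open(f, newline='') as csvfile:
            for row in csv.DictReader(csvfile):
                writer.writerow(row)
```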
<!-- vim: set sw=2 ts=2: -->