---
title: "August, 2018"
date: 2018-08-01T11:52:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2018-08-01

- DSpace Test had crashed at some point yesterday morning and I see the following in `dmesg`:

```
[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
- From the DSpace log I see that eventually Solr stopped responding, so I guess the `java` process that was OOM killed above was Tomcat's
- I'm not sure why Tomcat didn't crash with an OutOfMemoryError...
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it

<!--more-->

- I started looking over the latest round of IITA batch records from Sisay on DSpace Test: [IITA July_30](https://dspacetest.cgiar.org/handle/10568/103250)
  - incorrect authorship types
  - dozens of inconsistencies, spelling mistakes, and white space in author affiliations
  - minor issues in countries (California is not a country)
  - minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects

## 2018-08-02

- DSpace Test crashed again and the only error I see is this in `dmesg`:

```
[Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
```
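As a rough aid for reading the two OOM reports, here is a minimal Python sketch that pulls the `anon-rss` figure out of the `Killed process` lines and converts it to GiB. The helper name `killed_rss_gib` is hypothetical, and the 5120m heap is the value mentioned in the 2018-08-01 notes, assumed here to still be in effect. A resident size above the heap limit would be consistent with the kernel killing the process before the JVM itself ran out of heap, since the JVM also uses memory outside the heap:

```python
import re

# Matches the kernel's "Killed process" line and captures PID, name and anon-rss in kB
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\).*?anon-rss:(?P<rss_kb>\d+)kB"
)

def killed_rss_gib(dmesg_text):
    """Return (pid, name, anon-rss in GiB) for each OOM-killed process."""
    results = []
    for match in KILLED_RE.finditer(dmesg_text):
        rss_gib = int(match.group("rss_kb")) / 1024 / 1024
        results.append((int(match.group("pid")), match.group("name"), round(rss_gib, 2)))
    return results

# The two "Killed process" lines from the dmesg output in these notes
DMESG = """\
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
"""

HEAP_GIB = 5120 / 1024  # assumed -Xmx of 5120m, not confirmed for the second crash

for pid, name, rss in killed_rss_gib(DMESG):
    print(f"{name} ({pid}): {rss} GiB resident, {rss - HEAP_GIB:+.2f} GiB relative to heap")
```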

- I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?
- The risk we run there is that we'll start getting OutOfMemory errors from Tomcat
- So basically we need a new test server with more RAM very soon...
- Abenet asked about the workflow statistics in the Atmire CUA module again
- Last year Atmire told me that it's disabled by default but you can enable it with `workflow.stats.enabled = true` in the CUA configuration file
- There was a bug with adding users so they sent a patch, but I didn't merge it because it was [very dirty](https://github.com/ilri/DSpace/pull/319) and I wasn't sure it actually fixed the problem
- I just tried to enable the stats again on DSpace Test now that we're on DSpace 5.8 with updated Atmire modules, but every user I search for shows "No data available"
- As a test I submitted a new item and I was able to see it in the workflow statistics "data" tab, but not in the graph

## 2018-08-15

- Run through Peter's list of author affiliations from earlier this month
- I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
- Finally I did a test run with the [`fix-metadata-values.py`](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897) script:

```
$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```

## 2018-08-16

- Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:

```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
```

- Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
- I might need to overhaul the [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
- After checking a few examples I see that checking only the `text_value` and `place` when adding ORCID fields is not enough anymore
- It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission
- Now it is better to check if there is _any_ existing ORCID identifier for a given author for the item...
- I will have to update my script to extract the ORCID identifier and search for that
- Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:

```
$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
```
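For the ORCID check described earlier in this entry, a minimal sketch of the idea might look like the following. It assumes the `cg.creator.id` values follow the `Name: 0000-0000-0000-0000` pattern used in these notes. The function names and the sample values are hypothetical and are not taken from the `add-orcid-identifiers-csv.py` script:

```python
import re

# ORCID iDs are four groups of four characters; the final character may be X
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")

def extract_orcid(value):
    """Return the bare ORCID identifier from a value like 'Name: 0000-...', or None."""
    match = ORCID_RE.search(value)
    return match.group(0) if match else None

def item_has_orcid(existing_values, new_value):
    """True if any existing cg.creator.id value on the item carries the same ORCID iD."""
    orcid = extract_orcid(new_value)
    if orcid is None:
        return False
    return any(extract_orcid(v) == orcid for v in existing_values)

# Hypothetical existing metadata where an editor changed the display name
existing = ["Kris A.G. Wyckhuys: 0000-0003-0922-488X"]
print(item_has_orcid(existing, "Kris Wyckhuys: 0000-0003-0922-488X"))
print(item_has_orcid(existing, "Mark Lundy: 0000-0002-5241-3777"))
```

Comparing on the identifier alone, rather than on the full `text_value` and `place`, means an author whose name or position was edited after submission would not be tagged a second time.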

## 2018-08-19

- Keep working on the CIAT ORCID identifiers from Elizabeth
- In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (i.e. "Schultze-Kraft, Rainer" and "Schultze-Kraft, R.") I will just tag them with ORCID identifiers too
- This is less obvious and more error prone with names like "Peters" where there are many more authors
- I see some errors in the variations of names as well, for example:

```
Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
```

- I'll just tag them all with Louis Verchot's ORCID identifier...
- In the end, I'll run the following CSV with my [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:

```
dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
"Peters, Michael",Michael Peters: 0000-0003-4237-3916
"Peters, M.",Michael Peters: 0000-0003-4237-3916
"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
"Verchot, L",Louis Verchot: 0000-0001-8309-6754
"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
```

- The invocation would be:

```
$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
```

- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
- Looking at the list of author affiliations from Peter one last time
- I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:

```
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
```
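Outside Open Refine, roughly the same check could be sketched in Python. This is only an approximation of the GREL expression above, and the `SUSPICIOUS` table and `find_suspicious` helper are hypothetical names:

```python
# Characters flagged by the GREL expression, with a short description of each
SUSPICIOUS = {
    "\uFFFD": "replacement character",
    "\u00A0": "no-break space",
    "\u200A": "hair space",
    "\u2019": "right single quotation mark",
    "\u00B4": "acute accent",
}

def find_suspicious(value):
    """Return descriptions of any flagged characters present in value."""
    return [name for char, name in SUSPICIOUS.items() if char in value]

for affiliation in ["Investigacio\u00B4n", "Wageningen University"]:
    print(repr(affiliation), find_suspicious(affiliation))
```

The last entry, U+00B4 (the acute accent), is the newly added one.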

- This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
- I will run the following on DSpace Test and CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```

- Then force an update of the Discovery index on DSpace Test:

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    72m12.570s
user    6m45.305s
sys     2m2.461s
```

- And then on CGSpace:

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    79m44.392s
user    8m50.730s
sys     2m20.248s
```

- Run system updates on DSpace Test and reboot the server
- In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:

```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
```

- I don't even know how it's possible for the bot to use MORE sessions than total requests...
- The user agent is:

```
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
```

- So I'm thinking we should add "crawl" to the Tomcat Crawler Session Manager valve, as we already have "bot" that catches Googlebot, Bingbot, etc.
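One thing worth noting about the counts above: `grep -c` counts matching lines, not distinct sessions, and DSpace may log more than one line per request. A minimal sketch for counting distinct session IDs for one IP follows, assuming the `session_id=...:ip_addr=...` log format that the `grep` pattern relies on. The sample log lines are made up for illustration and `count_sessions` is a hypothetical helper:

```python
import re

def count_sessions(log_lines, ip):
    """Return (matching line count, distinct session count) for one IP address."""
    # re.escape keeps the dots in the IP from matching arbitrary characters
    pattern = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip) + r"\b")
    sessions = []
    for line in log_lines:
        match = pattern.search(line)
        if match:
            sessions.append(match.group(1))
    return len(sessions), len(set(sessions))

# Made-up lines in the assumed format: the first two share one session,
# and the last belongs to a different IP that merely starts with the same digits
sample = [
    "INFO anonymous:session_id=" + "A" * 32 + ":ip_addr=5.9.6.51:view_item",
    "INFO anonymous:session_id=" + "A" * 32 + ":ip_addr=5.9.6.51:view_bitstream",
    "INFO anonymous:session_id=" + "B" * 32 + ":ip_addr=5.9.6.51:view_item",
    "INFO anonymous:session_id=" + "C" * 32 + ":ip_addr=5.9.6.510:view_item",
]
print(count_sessions(sample, "5.9.6.51"))
```

Passing an open file handle for `dspace.log.2018-08-19` as `log_lines` would give both numbers for the real log in one pass.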

## 2018-08-20

- Help Sisay with some UTF-8 encoding issues in a file Peter sent him
- Finish up reconciling Atmire's pull request for DSpace 5.8 changes with the latest status of our `5_x-prod` branch
- I had to do some `git rev-list --reverse --no-merges oldestcommit..newestcommit` and `git cherry-pick -S` hackery to get everything all in order
- After building I ran the Atmire schema migrations and forced old migrations, then did the `ant update`
- I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set `JAVA_OPTS` to 1024m and tried the `mvn package` again
- Still the `mvn package` takes forever and essentially hangs on processing the xmlui-mirage2 overlay (though after building all the themes)
- I will try to reduce Tomcat memory from 4608m to 4096m and then retry the `mvn package` with 1024m of `JAVA_OPTS` again
- After running the `mvn package` for the third time and waiting an hour, I attached `strace` to the Java process and saw that it was indeed reading XMLUI theme data... so I guess I just need to wait more
- After waiting two hours the maven process completed and installation was successful
- I restarted Tomcat and it seems everything is working well, so I'll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
- I merged [Atmire's pull request](https://github.com/ilri/DSpace/pull/378) into our `5_x-dspace-5.8` temporary branch and then cherry-picked all the changes from `5_x-prod` since April, 2018 when that temporary branch was created
- As the branch histories are very different I cannot merge the new 5.8 branch into the current `5_x-prod` branch
- Instead, I will archive the current `5_x-prod` DSpace 5.5 branch as `5_x-prod-dspace-5.5` and then hard reset `5_x-prod` based on `5_x-dspace-5.8`
- Unfortunately this will mess up the references in pull requests and issues on GitHub

## 2018-08-21

- Something must have happened, as the `mvn package` *always* takes about two hours now, stopping for a very long time near the end at this step:

```
[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
```

- It's the same on DSpace Test, my local laptop, and CGSpace...
- It wasn't this way before when I was constantly building the previous 5.8 branch with Atmire patches...
- I will restore the previous `5_x-dspace-5.8` and `atmire-module-upgrades-5.8` branches to see if the build time is different there
- ... it seems that the `atmire-module-upgrades-5.8` branch still takes 1 hour and 23 minutes on my local machine...
- Let me try to build the old `5_x-prod-dspace-5.5` branch on my local machine and see how long it takes
- That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch, and next I should try vanilla DSpace 5.8
- I notice that the step this pauses at is:

```
[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
```

- And I notice that Atmire changed something in the XMLUI module's `pom.xml` as part of the DSpace 5.8 changes, specifically to remove the exclude for `node_modules` in the `maven-war-plugin` step
- This exclude is *present* in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!
- It makes sense that it would take longer to complete this step because the `node_modules` folder has tens of thousands of files, and we have 27 themes!
- I need to test to see if this has any side effects when deployed...
- In other news, I see there was a pull request in DSpace 5.9 that fixes the issue with not being able to have blank lines in CSVs when importing via command line or webui ([DS-3245](https://jira.duraspace.org/browse/DS-3245))

## 2018-08-23

- Skype meeting with CKM people to meet new web dev guy Tariku
- They say they want to start working on the ContentDM harvester middleware again
- I sent a list of the top 1500 author affiliations on CGSpace to CodeObia so we can compare ours with the ones on MELSpace
- Discuss CTA items with Sisay, he was trying to figure out how to do the collection mapping in combination with SAFBuilder
- It appears that the web UI's upload interface *requires* you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the `collections` file inside each item in the bundle
- I imported the CTA items on CGSpace for Sisay:

```
$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
```

## 2018-08-26

- Doing the DSpace 5.8 upgrade on CGSpace (linode18)
- I already finished the Maven build, now I'll take a backup of the PostgreSQL database and do a database cleanup just in case:

```
$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
$ dspace cleanup -v
```

- Now I can stop Tomcat and do the install:

```
$ cd dspace/target/dspace-installer
$ ant update clean_backups update_geolite
```

- After the successful Ant update I can run the database migrations:

```
$ psql dspace dspace

dspace=> \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql 
DELETE 0
UPDATE 1
DELETE 1
dspace=> \q

$ dspace database migrate ignored
```

- Then I'll run all system updates and reboot the server:

```
$ sudo su -
# apt update && apt full-upgrade
# apt clean && apt autoclean && apt autoremove
# reboot
```

- After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine
- Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now
- They want a CSV with *all* metadata, which the Atmire Listings and Reports module can't do
- I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject `GENDER` or `GENDER POVERTY AND INSTITUTIONS`, and CRP `Water, Land and Ecosystems`
- Then I extracted the Handle links from the report so I could export each item's metadata as CSV:

```
$ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
```

- Then on the DSpace server I exported the metadata for each item one by one:

```
$ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
```

- But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them
- I'm not sure how to proceed without writing some script to parse and join the CSVs, and I don't think it's worth my time
- I tested DSpace 5.8 in Tomcat 8.5.32 and it seems to work now, so I'm not sure why I got those errors last time I tried
- It could have been a configuration issue, though, as I also reconciled the `server.xml` with the one in [our Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public)
- But now I can start testing and preparing to move DSpace Test to Ubuntu 18.04 + Tomcat 8.5 + OpenJDK + PostgreSQL 9.6...
- Actually, upon closer inspection, it seems that when you try to go to Listings and Reports under Tomcat 8.5.33 you are taken to the JSPUI login page despite having already logged in in XMLUI
- If I type my username and password again it *does* take me to Listings and Reports, though...
- I don't see anything interesting in the Catalina or DSpace logs, so I might have to file a bug with Atmire
- For what it's worth, the Content and Usage (CUA) module does load, though I can't seem to get any results in the graph
- I just checked to see if the Listings and Reports issue with using the CGSpace citation field was fixed as planned alongside the DSpace 5.8 upgrades ([#589](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589))
- I was able to create a new layout containing only the citation field, so I closed the ticket
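As for combining the per-item CSV exports mentioned earlier in this entry, a minimal sketch using only Python's standard `csv` module could take the union of all column names and leave missing cells empty. The `merge_csvs` name is hypothetical and this is untested against real DSpace exports:

```python
import csv

def merge_csvs(input_paths, output_path):
    """Merge CSV files with differing columns into one file with the union of columns."""
    fieldnames = []
    rows = []
    for path in input_paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Preserve first-seen column order across files
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        # restval fills in columns that a given item's CSV did not have
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

It would be called with the list of exported files from `/tmp` and a destination path, and it returns the combined header so the column set can be checked before sending the file on.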

## 2018-08-29

- Discuss [COPO](https://copo-project.org/copo/) with Martin Mueller
- He and the consortium's idea is to use this for metadata annotation (submission?) to all repositories
- It is somehow related to adding events as items in the repository, and then linking related papers, presentations, etc to the event item using `dc.relation`, etc.
- Discuss Linode server charges with Abenet, apparently we want to start charging these to Big Data

## 2018-08-30

- I fixed the graphical glitch in the cookieconsent popup (the dismiss bug is still there) by pinning the last known good version (3.0.6) in `bower.json` of each XMLUI theme
- I guess cookieconsent got updated without me realizing it and the previous expression `^3.0.6` made bower install version 3.1.0
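The caret behaviour can be illustrated with a small sketch. This is a simplified model of caret ranges for plain `X.Y.Z` versions, not bower's actual resolver, and `caret_allows` is a hypothetical helper:

```python
def parse(version):
    """Split a plain 'X.Y.Z' version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))

def caret_allows(base, candidate):
    """True if `candidate` satisfies the caret range `^base` (simplified semver rules)."""
    major, minor, patch = parse(base)
    # The upper bound bumps the left-most non-zero component
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return parse(base) <= parse(candidate) < upper

print(caret_allows("3.0.6", "3.1.0"))  # the upgrade that slipped through
print(caret_allows("3.0.6", "4.0.0"))
```

Pinning the exact version `3.0.6` removes the range entirely, so a new minor release can no longer be picked up silently.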

<!-- vim: set sw=2 ts=2: -->
-												Add notes for 2018-08-01

											
										
										
											2018-08-01 11:49:05 +02:00
+								---
 								title: "August, 2018"
 								date: 2018-08-01T11:52:54+03:00
 								author: "Alan Orth"
-												Use "Notes" category instead of tag for old posts

											
										
										
											2019-10-28 12:39:25 +01:00
+								categories: ["Notes"]
-												Add notes for 2018-08-01

											
										
										
											2018-08-01 11:49:05 +02:00
+								---
 								## 2018-08-01
 								- DSpace Test had crashed at some point yesterday morning and I see the following in `dmesg`:
 								```
 								[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
 								[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
 								[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
 								```
 								- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
 								- From the DSpace log I see that eventually Solr stopped responding, so I guess the `java` process that was OOM killed above was Tomcat's
 								- I'm not sure why Tomcat didn't crash with an OutOfMemoryError...
 								- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
 								- The server only has 8GB of RAM so we'll eventually need to upgrade to a larger one because we'll start starving the OS, PostgreSQL, and command line batch processes
 								- I ran all system updates on DSpace Test and rebooted it
 								<!--more-->
-												Update notes for 2018-08-01

											
										
										
											2018-08-01 16:24:49 +02:00
+								- I started looking over the latest round of IITA batch records from Sisay on DSpace Test: [IITA July_30](https://dspacetest.cgiar.org/handle/10568/103250)
 								  - incorrect authorship types
 								  - dozens of inconsistencies, spelling mistakes, and white space in author affiliations
 								  - minor issues in countries (California is not a country)
 								  - minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects
-												Add notes for 2018-08-02

											
										
										
											2018-08-02 13:29:59 +02:00
+								## 2018-08-02
 								- DSpace Test crashed again and I don't see the only error I see is this in `dmesg`:
 								```
 								[Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
 								[Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
 								```
 								- I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?
 								- The risk we run there is that we'll start getting OutOfMemory errors from Tomcat
 								- So basically we need a new test server with more RAM very soon...
 								- Abenet asked about the workflow statistics in the Atmire CUA module again
 								- Last year Atmire told me that it's disabled by default but you can enable it with `workflow.stats.enabled = true` in the CUA configuration file
 								- There was a bug with adding users so they sent a patch, but I didn't merge it because it was [very dirty](https://github.com/ilri/DSpace/pull/319) and I wasn't sure it actually fixed the problem
 								- I just tried to enable the stats again on DSpace Test now that we're on DSpace 5.8 with updated Atmire modules, but every user I search for shows "No data available"
 								- As a test I submitted a new item and I was able to see it in the workflow statistics "data" tab, but not in the graph
-												Add notes for 2018-08-15

											
										
										
											2018-08-15 11:56:38 +02:00
+								## 2018-08-15
 								- Run through Peter's list of author affiliations from earlier this month
 								- I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
-												Update notes for 2018-08-16

											
										
										
											2018-08-16 14:40:38 +02:00
+								- Finally I did a test run with the [`fix-metadata-value.py`](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897) script:
-												Add notes for 2018-08-15

											
										
										
											2018-08-15 11:56:38 +02:00
 								```
 								$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
 								$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
 								```
-												Update notes for 2018-08-16

											
										
										
											2018-08-16 14:40:38 +02:00
+								## 2018-08-16
 								- Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:
 								```
 								dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
 								```
 								- Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
 								- I might need to overhaul the [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
-												Update notes for 2018-08-16

											
										
										
											2018-08-16 17:59:45 +02:00
+								- After checking a few examples I see that checking only the `text_value` and `place` when adding ORCID fields is not enough anymore
 								- It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission
 								- Now it is better to check if there is _any_ existing ORCID identifier for a given author for the item...
 								- I will have to update my script to extract the ORCID identifier and search for that
 								- Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:
 								```
 								$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
 								$ createuser -h localhost -U postgres --pwprompt dspacetest
 								$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
 								$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
-												Fix command syntax

											
										
										
											2018-09-02 10:01:40 +02:00
+								$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
-												Update notes for 2018-08-16

											
										
										
											2018-08-16 17:59:45 +02:00
+								$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
 								$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
 								```
-												Update notes for 2018-08-16

											
										
										
											2018-08-16 14:40:38 +02:00
-												Add notes for 2018-08-19

											
										
										
											2018-08-19 17:42:55 +02:00
+								## 2018-08-19
 								- Keep working on the CIAT ORCID identifiers from Elizabeth
 								- In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie "Schultze-Kraft, Rainer" and "Schultze-Kraft, R.") I will just tag them with ORCID identifiers too
 								- This is less obvious and more error prone with names like "Peters" where there are many more authors
 								- I see some errors in the variations of names as well, for example:
 								```
 								Verchot, Louis
 								Verchot, L
 								Verchot, L. V.
 								Verchot, L.V
 								Verchot, L.V.
 								Verchot, LV
 								Verchot, Louis V.
 								```
 								- I'll just tag them all with Louis Verchot's ORCID identifier...
 								- In the end, I'll run the following CSV with my [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:
 								```
 								dc.contributor.author,cg.creator.id
 								"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
 								"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
 								"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
 								"Peters, Michael",Michael Peters: 0000-0003-4237-3916
 								"Peters, M.",Michael Peters: 0000-0003-4237-3916
 								"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
 								"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
 								"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
 								"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
 								"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
 								"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
 								"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
 								"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
 								"Verchot, L",Louis Verchot: 0000-0001-8309-6754
 								"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
 								"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
 								"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
 								"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
 								"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
 								"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
 								"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
 								"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
 								"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
 								"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
 								"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
 								"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
 								"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
 								```
 								- The invocation would be:
 								```
 								$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
 								```
 								- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
 								- Looking at the list of author affialitions from Peter one last time
 								- I notice that I should add the Unicode character 0x00b4 (\`) to my list of invalid characters to look for in Open Refine, making the latest version of the GREL expression being:
 								```
 								or(
 								  isNotNull(value.match(/.*\uFFFD.*/)),
 								  isNotNull(value.match(/.*\u00A0.*/)),
 								  isNotNull(value.match(/.*\u200A.*/)),
 								  isNotNull(value.match(/.*\u2019.*/)),
 								  isNotNull(value.match(/.*\u00b4.*/))
 								)
 								```
 								- This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
 								- I will run the following on DSpace Test and CGSpace:
 								```
 								$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
 								$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
 								```
 								- Then force an update of the Discovery index on DSpace Test:
 								```
 								$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
 								$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
 								real    72m12.570s
 								user    6m45.305s
 								sys     2m2.461s
 								```
 								- And then on CGSpace:
 								```
 								$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
 								$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
 								real    79m44.392s
 								user    8m50.730s
 								sys     2m20.248s
 								```
 								- Run system updates on DSpace Test and reboot the server
 								- In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:
 								```
 								# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
 
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
 								```
- I don't even know how it's possible for the bot to use MORE sessions than total requests...
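- One thing to keep in mind is that `grep -c` counts matching log lines rather than distinct session IDs, so the two numbers may not be directly comparable. A minimal sketch for counting distinct sessions, assuming the `session_id=...:ip_addr=...` layout from the grep pattern above (the function name `count_sessions` is hypothetical, and the dots in the IP are escaped here, unlike in the grep):

```python
import re

# Pattern taken from the grep above; the capture group is added for this sketch
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=5\.9\.6\.51")


def count_sessions(lines):
    """Return (matching_lines, distinct_session_ids) for an iterable of log lines."""
    matching = 0
    sessions = set()
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            matching += 1
            sessions.add(match.group(1))
    return matching, len(sessions)
```

- It could be run as `count_sessions(open("dspace.log.2018-08-19"))`; if the second number is much lower than the first, the bot is re-using sessions more than the raw line count suggests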
 								- The user agent is:
 								```
 								Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
 								```
 								- So I'm thinking we should add "crawl" to the Tomcat Crawler Session Manager valve, as we already have "bot" that catches Googlebot, Bingbot, etc.

## 2018-08-20

 								- Help Sisay with some UTF-8 encoding issues in a file Peter sent him
 								- Finish up reconciling Atmire's pull request for DSpace 5.8 changes with the latest status of our `5_x-prod` branch
 								- I had to do some `git rev-list --reverse --no-merges oldestcommit..newestcommit` and `git cherry-pick -S` hackery to get everything all in order
 								- After building I ran the Atmire schema migrations and forced old migrations, then did the `ant update`
- I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set `JAVA_OPTS` to 1024m and tried the `mvn package` again
 								- Still the `mvn package` takes forever and essentially hangs on processing the xmlui-mirage2 overlay (though after building all the themes)
 								- I will try to reduce Tomcat memory from 4608m to 4096m and then retry the `mvn package` with 1024m of `JAVA_OPTS` again
 								- After running the `mvn package` for the third time and waiting an hour, I attached `strace` to the Java process and saw that it was indeed reading XMLUI theme data... so I guess I just need to wait more
 								- After waiting two hours the maven process completed and installation was successful
 								- I restarted Tomcat and it seems everything is working well, so I'll merge the pull request and try to schedule the CGSpace upgrade for this coming Sunday, August 26th
- I merged [Atmire's pull request](https://github.com/ilri/DSpace/pull/378) into our `5_x-dspace-5.8` temporary branch and then cherry-picked all the changes from `5_x-prod` since April, 2018 when that temporary branch was created
 								- As the branch histories are very different I cannot merge the new 5.8 branch into the current `5_x-prod` branch
 								- Instead, I will archive the current `5_x-prod` DSpace 5.5 branch as `5_x-prod-dspace-5.5` and then hard reset `5_x-prod` based on `5_x-dspace-5.8`
 								- Unfortunately this will mess up the references in pull requests and issues on GitHub

## 2018-08-21

 								- Something must have happened, as the `mvn package` *always* takes about two hours now, stopping for a very long time near the end at this step:
 								```
 								[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
 								```
 								- It's the same on DSpace Test, my local laptop, and CGSpace...
 								- It wasn't this way before when I was constantly building the previous 5.8 branch with Atmire patches...
 								- I will restore the previous `5_x-dspace-5.8` and `atmire-module-upgrades-5.8` branches to see if the build time is different there
- ... it seems that the `atmire-module-upgrades-5.8` branch still takes 1 hour and 23 minutes on my local machine...
 								- Let me try to build the old `5_x-prod-dspace-5.5` branch on my local machine and see how long it takes
- That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch; next I should try vanilla DSpace 5.8
- I notice that the step this pauses at is:
 								```
 								[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
 								```
 								- And I notice that Atmire changed something in the XMLUI module's `pom.xml` as part of the DSpace 5.8 changes, specifically to remove the exclude for `node_modules` in the `maven-war-plugin` step
 								- This exclude is *present* in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!
- It makes sense that it would take longer to complete this step because the `node_modules` folder has tens of thousands of files, and we have 27 themes!
- I need to test to see if this has any side effects when deployed...
- In other news, I see there was a pull request in DSpace 5.9 that fixes the issue with not being able to have blank lines in CSVs when importing via command line or webui ([DS-3245](https://jira.duraspace.org/browse/DS-3245))

## 2018-08-23

 								- Skype meeting with CKM people to meet new web dev guy Tariku
 								- They say they want to start working on the ContentDM harvester middleware again
 								- I sent a list of the top 1500 author affiliations on CGSpace to CodeObia so we can compare ours with the ones on MELSpace
- Discuss CTA items with Sisay, he was trying to figure out how to do the collection mapping in combination with SAFBuilder
 								- It appears that the web UI's upload interface *requires* you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the `collections` file inside each item in the bundle
 								- I imported the CTA items on CGSpace for Sisay:
 								```
 								$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
 								```

## 2018-08-26

 								- Doing the DSpace 5.8 upgrade on CGSpace (linode18)
 								- I already finished the Maven build, now I'll take a backup of the PostgreSQL database and do a database cleanup just in case:
 								```
 								$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
 								$ dspace cleanup -v
 								```
 								- Now I can stop Tomcat and do the install:
 								```
 								$ cd dspace/target/dspace-installer
 								$ ant update clean_backups update_geolite
 								```
 								- After the successful Ant update I can run the database migrations:
 								```
 								$ psql dspace dspace
 								dspace=> \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql
 								DELETE 0
 								UPDATE 1
 								DELETE 1
 								dspace=> \q
 								$ dspace database migrate ignored
 								```
 								- Then I'll run all system updates and reboot the server:
 								```
 								$ sudo su -
 								# apt update && apt full-upgrade
 								# apt clean && apt autoclean && apt autoremove
 								# reboot
 								```
 								- After reboot I logged in and cleared all the XMLUI caches and everything looked to be working fine
- Adam from WLE had asked a few weeks ago about getting the metadata for a bunch of items related to gender from 2013 until now
 								- They want a CSV with *all* metadata, which the Atmire Listings and Reports module can't do
 								- I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject `GENDER` or `GENDER POVERTY AND INSTITUTIONS`, and CRP `Water, Land and Ecosystems`
 								- Then I extracted the Handle links from the report so I could export each item's metadata as CSV
 								```
 								$ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
 								```
 								- Then on the DSpace server I exported the metadata for each item one by one:
 								```
 								$ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
 								```
 								- But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them
 								- I'm not sure how to proceed without writing some script to parse and join the CSVs, and I don't think it's worth my time
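- For reference, the kind of script I mean could be fairly small. This is only a sketch, not something I ran against the real exports: it takes the union of all column names and leaves cells blank where an item lacks a column (the function name `merge_csvs` is made up, and column names are treated as opaque strings):

```python
import csv


def merge_csvs(sources, out):
    """Merge CSV streams with differing columns into one CSV.

    sources: iterable of open text streams, one per exported item
    out: writable text stream for the combined CSV
    Returns the combined list of column names, in first-seen order.
    """
    rows = []
    fieldnames = []
    for src in sources:
        reader = csv.DictReader(src)
        # Accessing fieldnames reads the header row of this file
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        rows.extend(reader)
    # restval fills in columns that a given item does not have
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames
```

- It would be called with the per-item files opened with `newline=""` and an output file opened for writing; all rows are held in memory, which should be fine for fifty-nine items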
- I tested DSpace 5.8 in Tomcat 8.5.32 and it seems to work now, so I'm not sure why I got those errors last time I tried
 								- It could have been a configuration issue, though, as I also reconciled the `server.xml` with the one in [our Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public)
 								- But now I can start testing and preparing to move DSpace Test to Ubuntu 18.04 + Tomcat 8.5 + OpenJDK + PostgreSQL 9.6...
 								- Actually, upon closer inspection, it seems that when you try to go to Listings and Reports under Tomcat 8.5.33 you are taken to the JSPUI login page despite having already logged in in XMLUI
 								- If I type my username and password again it *does* take me to Listings and Reports, though...
 								- I don't see anything interesting in the Catalina or DSpace logs, so I might have to file a bug with Atmire
 								- For what it's worth, the Content and Usage (CUA) module does load, though I can't seem to get any results in the graph
- I just checked to see if the Listings and Reports issue with using the CGSpace citation field was fixed as planned alongside the DSpace 5.8 upgrades ([#589](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589))
 								- I was able to create a new layout containing only the citation field, so I closed the ticket

## 2018-08-29

 								- Discuss [COPO](https://copo-project.org/copo/) with Martin Mueller
 								- He and the consortium's idea is to use this for metadata annotation (submission?) to all repositories
 								- It is somehow related to adding events as items in the repository, and then linking related papers, presentations, etc to the event item using `dc.relation`, etc.
 								- Discuss Linode server charges with Abenet, apparently we want to start charging these to Big Data

## 2018-08-30

 								- I fixed the graphical glitch in the cookieconsent popup (the dismiss bug is still there) by pinning the last known good version (3.0.6) in `bower.json` of each XMLUI theme
- I guess cookieconsent got updated without me realizing it and the previous expression `^3.0.6` made bower install version 3.1.0
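- To illustrate why the caret matters: under semver caret semantics `^3.0.6` allows anything from 3.0.6 up to (but not including) 4.0.0, while a bare `3.0.6` is an exact pin. A simplified sketch of that rule, which is not bower's actual implementation and only handles plain `x.y.z` versions with a major version of 1 or higher (`parse` and `satisfies` are hypothetical helper names):

```python
def parse(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Check a version against an exact pin ('3.0.6') or a caret range ('^3.0.6')."""
    candidate = parse(version)
    if spec.startswith("^"):
        base = parse(spec[1:])
        # Caret: at least the base version, but below the next major version
        return base <= candidate < (base[0] + 1, 0, 0)
    return candidate == parse(spec)


if __name__ == "__main__":
    print(satisfies("3.1.0", "^3.0.6"))  # caret range accepts 3.1.0
    print(satisfies("3.1.0", "3.0.6"))   # exact pin rejects it
```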

<!-- vim: set sw=2 ts=2: -->