2018-08-01
- DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:
[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
- Judging from the time of the crash it was probably related to the Discovery indexing that starts at midnight
- From the DSpace log I see that eventually Solr stopped responding, so I guess the java process that was OOM killed above was Tomcat’s
- I’m not sure why Tomcat didn’t crash with an OutOfMemoryError…
- Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did a few months ago when we tried to run the whole CGSpace Solr core
- The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
- I ran all system updates on DSpace Test and rebooted it
- I started looking over the latest round of IITA batch records from Sisay on DSpace Test: IITA July_30
- incorrect authorship types
- dozens of inconsistencies, spelling mistakes, and white space in author affiliations
- minor issues in countries (California is not a country)
- minor issues in IITA subjects, ISBNs, languages, and AGROVOC subjects
2018-08-02
- DSpace Test crashed again and the only error I see is this in dmesg:
[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
- I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?
- The risk we run there is that we’ll start getting OutOfMemory errors from Tomcat
- So basically we need a new test server with more RAM very soon…
- Abenet asked about the workflow statistics in the Atmire CUA module again
- Last year Atmire told me that it’s disabled by default but you can enable it with workflow.stats.enabled = true in the CUA configuration file
- There was a bug with adding users so they sent a patch, but I didn’t merge it because it was very dirty and I wasn’t sure it actually fixed the problem
- I just tried to enable the stats again on DSpace Test now that we’re on DSpace 5.8 with updated Atmire modules, but every user I search for shows “No data available”
- As a test I submitted a new item and I was able to see it in the workflow statistics “data” tab, but not in the graph
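To put the two OOM kills from dmesg in perspective, here is a minimal sketch that pulls the sizes out of the "Killed process" lines and converts them from kB to GiB. The function and variable names are my own for illustration, and the regular expression only covers the line format shown in these notes.

```python
import re

# Matches the "Killed process" lines that the kernel OOM killer writes to dmesg.
# Only the fields used below are captured; the format is the one seen in these notes.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

KIB_PER_GIB = 1024 * 1024


def parse_oom_kills(dmesg_text):
    """Return one dict per 'Killed process' line, with sizes converted to GiB."""
    kills = []
    for match in KILLED_RE.finditer(dmesg_text):
        kills.append(
            {
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "total_vm_gib": int(match.group("total_vm")) / KIB_PER_GIB,
                "anon_rss_gib": int(match.group("anon_rss")) / KIB_PER_GIB,
            }
        )
    return kills


# The two kill lines copied from the dmesg output on DSpace Test.
SAMPLE = """\
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
"""

if __name__ == "__main__":
    for kill in parse_oom_kills(SAMPLE):
        print(f"{kill['name']} (pid {kill['pid']}): resident {kill['anon_rss_gib']:.2f} GiB")
```

The resident sizes work out to roughly 5.1 GiB and 6.0 GiB, which is a large share of a server with only 8GB of RAM.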
2018-08-15
- Run through Peter’s list of author affiliations from earlier this month
- I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
- Finally I did a test run with the fix-metadata-values.py script:
$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
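The sanity checks for spaces and encoding errors could also be scripted before running the fix and delete scripts. This is only a rough sketch, not what Open Refine does internally: the function names are hypothetical, the sample rows are made up, and the list of suspicious code points is borrowed from the GREL expression in the 2018-08-19 notes. The column names match the -f and -t arguments used above.

```python
import csv
import io
import re

# Code points borrowed from the GREL expression in the 2018-08-19 notes:
# U+FFFD replacement character, U+00A0 no-break space, U+200A hair space,
# U+2019 right single quotation mark, U+00B4 acute accent.
SUSPICIOUS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4]")


def find_problems(value):
    """Return a list of human-readable problems found in one metadata value."""
    problems = []
    if value != value.strip():
        problems.append("leading/trailing whitespace")
    if "  " in value:
        problems.append("double space")
    if SUSPICIOUS.search(value):
        problems.append("suspicious character")
    return problems


def check_csv(handle, column):
    """Check one column of a CSV and return (line number, value, problems) tuples."""
    report = []
    reader = csv.DictReader(handle)
    # The header is line 1; this assumes no field spans multiple lines.
    for line_number, row in enumerate(reader, start=2):
        problems = find_problems(row[column])
        if problems:
            report.append((line_number, row[column], problems))
    return report


# Made-up sample rows in the same two-column layout as the corrections CSV.
SAMPLE_CSV = (
    "cg.contributor.affiliation,correct\n"
    '"Centro de Investigacio\u00b4n ","Centro de Investigación"\n'
    '"Example University","Example University"\n'
)

if __name__ == "__main__":
    for line_number, value, problems in check_csv(io.StringIO(SAMPLE_CSV), "cg.contributor.affiliation"):
        print(line_number, repr(value), ", ".join(problems))
```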
2018-08-16
- Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
- Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
- I might need to overhaul the add-orcid-identifiers-csv.py script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
- After checking a few examples I see that checking only the text_value and place when adding ORCID fields is not enough anymore
- It was a sane assumption when I was initially migrating ORCID records from Solr to regular metadata, but now it seems that some authors might have been added or changed after item submission
- Now it is better to check if there is any existing ORCID identifier for a given author for the item…
- I will have to update my script to extract the ORCID identifier and search for that
- Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:
$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
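Coming back to the ORCID checking above: a minimal sketch of what "extract the ORCID identifier and search for that" might look like. This is not the actual add-orcid-identifiers-csv.py code. The function names are hypothetical, and it assumes the existing ORCID metadata values for an item have already been fetched as plain strings in the "Name: 0000-0000-0000-0000" form.

```python
import re

# ORCID iDs are four hyphen-separated groups of four characters;
# the last character is a checksum that can be a digit or "X".
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def extract_orcid(text_value):
    """Return the first ORCID iD found in a metadata value, or None."""
    match = ORCID_RE.search(text_value)
    return match.group(0) if match else None


def needs_orcid(existing_values, new_value):
    """Return True if the iD in new_value is not yet present on the item.

    existing_values is the list of ORCID metadata strings already attached to
    the item (hypothetical input; the real script would query the database).
    """
    new_orcid = extract_orcid(new_value)
    if new_orcid is None:
        raise ValueError(f"no ORCID identifier found in {new_value!r}")
    existing = {extract_orcid(value) for value in existing_values}
    return new_orcid not in existing


if __name__ == "__main__":
    existing = ["Michael Peters: 0000-0003-4237-3916"]
    print(needs_orcid(existing, "Michael Peters: 0000-0003-4237-3916"))
    print(needs_orcid(existing, "Kris Wyckhuys: 0000-0003-0922-488X"))
```

Comparing on the identifier alone means the check no longer depends on the author's text_value or place, so it should still work if editors have reordered or renamed authors after submission.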
2018-08-19
- Keep working on the CIAT ORCID identifiers from Elizabeth
- In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same person (i.e. “Schultze-Kraft, Rainer” and “Schultze-Kraft, R.”) I will just tag them with ORCID identifiers too
- This is less obvious and more error prone with names like “Peters” where there are many more authors
- I see some errors in the variations of names as well, for example:
Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
- I’ll just tag them all with Louis Verchot’s ORCID identifier…
- In the end, I’ll run the following CSV with my add-orcid-identifiers-csv.py script:
dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
"Peters, Michael",Michael Peters: 0000-0003-4237-3916
"Peters, M.",Michael Peters: 0000-0003-4237-3916
"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
"Verchot, L",Louis Verchot: 0000-0001-8309-6754
"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
- Looking at the list of author affiliations from Peter one last time
- I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:
or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b4.*/))
)
- This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
- I will run the following on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
- Then force an update of the Discovery index on DSpace Test:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 72m12.570s
user 6m45.305s
sys 2m2.461s
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 79m44.392s
user 8m50.730s
sys 2m20.248s
- Run system updates on DSpace Test and reboot the server
- In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
- I don’t even know how it’s possible for the bot to use MORE sessions than total requests…
- The user agent is:
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
- So I’m thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.
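A quick way to check the idea before touching the Tomcat configuration is to test the patterns against the user agent. This is a sketch in Python rather than the valve itself: both patterns are my assumptions and not copied from our server.xml, and I use re.fullmatch on the assumption that the valve matches its regular expression against the whole User-Agent string.

```python
import re

# Assumption: stands in for the existing "bot" rule in the valve configuration.
CURRENT_PATTERN = r".*[bB]ot.*"
# Assumption: the same rule with a "crawl" alternative added.
PROPOSED_PATTERN = r".*[bB]ot.*|.*[cC]rawl.*"

# User agent copied from the nginx logs above.
MEGAINDEX = "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
# Well-known Googlebot user agent, used here as a control.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# An ordinary browser, which should never be treated as a crawler.
FIREFOX = "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"


def is_crawler(user_agent, pattern):
    """Return True if the whole user agent string matches the pattern."""
    return re.fullmatch(pattern, user_agent) is not None


if __name__ == "__main__":
    for name, agent in [("MegaIndex", MEGAINDEX), ("Googlebot", GOOGLEBOT), ("Firefox", FIREFOX)]:
        print(name, is_crawler(agent, CURRENT_PATTERN), is_crawler(agent, PROPOSED_PATTERN))
```

With these assumed patterns the MegaIndex user agent is only caught once "crawl" is added, while the ordinary browser is left alone in both cases.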
2018-08-20
- Help Sisay with some UTF-8 encoding issues in a file Peter sent him
- Finish up reconciling Atmire’s pull request for DSpace 5.8 changes with the latest status of our 5_x-prod branch
- I had to do some git rev-list --reverse --no-merges oldestcommit..newestcommit and git cherry-pick -S hackery to get everything all in order
- After building I ran the Atmire schema migrations and forced old migrations, then did the ant update
- I tried to build it on DSpace Test, but it seems to still need more RAM to complete (like I experienced last month), so I stopped Tomcat and set JAVA_OPTS to 1024m and tried the mvn package again