April, 2016

Posted on Apr 4, 2016

#notes

2016-04-04

Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
After running DSpace for over five years I’ve never needed to look in any other log file than dspace.log, leave alone one from last year!
This will save us a few gigs of backup space we’re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:

Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at edu.sdsc.grid.io.local.LocalFileInputStream.open(LocalFileInputStream.java:171)
        at edu.sdsc.grid.io.GeneralFileInputStream.<init>(GeneralFileInputStream.java:145)
        at edu.sdsc.grid.io.local.LocalFileInputStream.<init>(LocalFileInputStream.java:139)
        at edu.sdsc.grid.io.FileFactory.newFileInputStream(FileFactory.java:630)
        at org.dspace.storage.bitstore.BitstreamStorageManager.retrieve(BitstreamStorageManager.java:525)
        at org.dspace.checker.BitstreamDAO.getBitstream(BitstreamDAO.java:60)
        at org.dspace.checker.CheckerCommand.processBitstream(CheckerCommand.java:303)
        at org.dspace.checker.CheckerCommand.checkBitstream(CheckerCommand.java:171)
        at org.dspace.checker.CheckerCommand.process(CheckerCommand.java:120)
        at org.dspace.app.checker.ChecksumChecker.main(ChecksumChecker.java:236)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77)
******************************************************

So this would be the tomcat7 Unix user, who seems to have a default limit of 1024 files in its shell
For what it’s worth, we have been setting the actual Tomcat 7 process’ limit to 16384 for a few years (in /etc/default/tomcat7)
Looks like cron will read limits from /etc/security/limits.* so we can do something for the tomcat7 user there
Submit pull request for Tomcat 7 limits in Ansible dspace role (#30)

2016-04-05

Reduce Amazon S3 storage used for logs from 46 GB to 6GB by deleting a bunch of logs we don’t need!

# s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del

Also, adjust the cron jobs for backups so they only backup dspace.log and some stats files (.dat)
Try to do some metadata field migrations using the Atmire batch UI (dc.Species → cg.species) but it took several hours and even missed a few records

2016-04-06

A better way to move metadata on this scale is via SQL, for example dc.type.output → dc.type (their IDs in the metadatafieldregistry are 66 and 109, respectively):

dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852

After that an index-discovery -bf is required
Start working on metadata migrations, add 25 or so new metadata fields to CGSpace

2016-04-07

Write shell script to do the migration of fields: https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b
Testing with a few fields it seems to work well:

$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
UPDATE 21420
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51258

2016-04-08

Discuss metadata renaming with Abenet, we decided it’s better to start with the center-specific subjects like ILRI, CIFOR, CCAFS, IWMI, and CPWF
I’ve e-mailed CCAFS and CPWF people to ask them how much time it will take for them to update their systems to cope with this change

2016-04-10

Looking at the DOI issue reported by Leroy from CIAT a few weeks ago
It seems the dx.doi.org URLs are much more proper in our repository!

dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
 count
-------
  5638
(1 row)

dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
 count
-------
     3

I will manually edit the dc.identifier.doi in ¹⁰⁵⁶⁸⁄₇₂₅₀₉ and tweet the link, then check back in a week to see if the donut gets updated

2016-04-11

The donut is already updated and shows the correct number now
CCAFS people say it will only take them an hour to update their code for the metadata renames, so I proposed we’d do it tentatively on Monday the 18th.

2016-04-12

Looking at quality of WLE data (cg.subject.iwmi) in SQL:

dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;

Listings and Reports is still not returning reliable data for dc.type
I think we need to ask Atmire, as their documentation isn’t too clear on the format of the filter configs
Alternatively, I want to see if I move all the data from dc.type.output to dc.type and then re-index, if it behaves better