+++
date = "2016-04-04T11:06:00+03:00"
author = "Alan Orth"
title = "April, 2016"
tags = ["notes"]
image = "../images/bg.jpg"

+++
## 2016-04-04

- Looking at log file use on CGSpace and noticing that we need to work on our cron setup a bit
- We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
- After running DSpace for over five years I've never needed to look in any other log file than dspace.log, let alone one from last year!
- This will save us a few gigs of backup space we're paying for on S3
- Also, I noticed the `checker` log has some errors we should pay attention to:

```
Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at edu.sdsc.grid.io.local.LocalFileInputStream.open(LocalFileInputStream.java:171)
        at edu.sdsc.grid.io.GeneralFileInputStream.<init>(GeneralFileInputStream.java:145)
        at edu.sdsc.grid.io.local.LocalFileInputStream.<init>(LocalFileInputStream.java:139)
        at edu.sdsc.grid.io.FileFactory.newFileInputStream(FileFactory.java:630)
        at org.dspace.storage.bitstore.BitstreamStorageManager.retrieve(BitstreamStorageManager.java:525)
        at org.dspace.checker.BitstreamDAO.getBitstream(BitstreamDAO.java:60)
        at org.dspace.checker.CheckerCommand.processBitstream(CheckerCommand.java:303)
        at org.dspace.checker.CheckerCommand.checkBitstream(CheckerCommand.java:171)
        at org.dspace.checker.CheckerCommand.process(CheckerCommand.java:120)
        at org.dspace.app.checker.ChecksumChecker.main(ChecksumChecker.java:236)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77)
******************************************************
```

- So this would be the `tomcat7` Unix user, who seems to have a default limit of 1024 files in its shell
- For what it's worth, we have been setting the actual Tomcat 7 process' limit to 16384 for a few years (in `/etc/default/tomcat7`)
- Looks like cron will read limits from `/etc/security/limits.*` so we can do something for the tomcat7 user there
- Submit pull request for Tomcat 7 limits in Ansible dspace role ([#30](https://github.com/ilri/rmg-ansible-public/pull/30))

## 2016-04-05

- Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don't need!

```
# s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
```

- Also, adjust the cron jobs for backups so they only back up `dspace.log` and some stats files (.dat)
- Try to do some metadata field migrations using the Atmire batch UI (`dc.Species` → `cg.species`) but it took several hours and even missed a few records

## 2016-04-06

- A better way to move metadata on this scale is via SQL, for example `dc.type.output` → `dc.type` (their IDs in the metadatafieldregistry are 66 and 109, respectively):

```
dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852
```

- After that an `index-discovery -bf` is required
- Start working on metadata migrations, add 25 or so new metadata fields to CGSpace

## 2016-04-07

- Write shell script to do the migration of fields:
https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b
- Testing with a few fields it seems to work well:

```
$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
UPDATE 21420
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51258
```

## 2016-04-08

- Discuss metadata renaming with Abenet; we decided it's better to start with the center-specific subjects like ILRI, CIFOR, CCAFS, IWMI, and CPWF
- I've e-mailed CCAFS and CPWF people to ask them how much time it will take for them to update their systems to cope with this change

## 2016-04-10

- Looking at the DOI issue [reported by Leroy from CIAT a few weeks ago](https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860)
- It seems the `dx.doi.org` URLs are much more common in our repository!

```
dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
 count
-------
  5638
(1 row)

dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
 count
-------
     3
```

- I will manually edit the `dc.identifier.doi` in [10568/72509](https://cgspace.cgiar.org/handle/10568/72509?show=full) and tweet the link, then check back in a week to see if the donut gets updated

## 2016-04-11

- The donut is already updated and shows the correct number now
- CCAFS people say it will only take them an hour to update their code for the metadata renames, so I proposed we tentatively do it on Monday the 18th.
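
With the renames now tentatively scheduled, the field moves tested on 2016-04-07 could be generated from a list of ID pairs so the SQL can be reviewed before it touches the database. This is only a minimal sketch and not the script from the gist: the function names `build_update` and `build_migration` are hypothetical, and the ID pairs are the three shown in the test run above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the script from the gist: print one UPDATE per
# "old_id new_id" pair so the SQL can be reviewed before it is run.

# Emit the UPDATE statement that moves values from one field ID to another.
build_update() {
    local old_id="$1" new_id="$2"
    printf 'UPDATE metadatavalue SET metadata_field_id=%s WHERE metadata_field_id=%s;\n' "$new_id" "$old_id"
}

# Read "old_id new_id" pairs from stdin, skipping blank lines and # comments.
build_migration() {
    local old_id new_id
    while read -r old_id new_id; do
        case "$old_id" in ''|'#'*) continue ;; esac
        build_update "$old_id" "$new_id"
    done
}

# The three pairs tested on 2016-04-07.
migration_sql="$(build_migration <<'EOF'
66 109
72 202
76 203
EOF
)"
printf '%s\n' "$migration_sql"
```

Keeping generation separate from execution means the printed statements can be checked by eye, and only then piped into `psql` for the target database.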
## 2016-04-12

- Looking at quality of WLE data (`cg.subject.iwmi`) in SQL:

```
dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
```

- Listings and Reports is still not returning reliable data for `dc.type`
- I think we need to ask Atmire, as their documentation isn't too clear on the format of the filter configs
- Alternatively, I want to see whether it behaves better if I move all the data from `dc.type.output` to `dc.type` and then re-index
- Looking at our `input-forms.xml` I see we have two sets of ILRI subjects, but one has a few extra subjects
- Remove one set of ILRI subjects and remove duplicate `VALUE CHAINS` from existing list ([#216](https://github.com/ilri/DSpace/pull/216))
- I decided to keep the set of subjects that had `FMD` and `RANGELANDS` added, as those appear to have been added on request, so it might be the newer list
- I found 226 blank metadatavalues:

```
dspacetest=# select * from metadatavalue where resource_type_id=2 and text_value='';
```

- I think we should delete them and do a full re-index:

```
dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 226
```

- I deleted them on CGSpace but I'll wait to do the re-index as we're going to be doing one in a few days for the metadata changes anyways
- In other news, moving the `dc.type.output` to `dc.type` and re-indexing seems to have fixed the Listings and Reports issue from above
- Unfortunately this isn't a very good solution, because Listings and Reports config should allow us to filter on `dc.type.*` but the documentation isn't very clear and I couldn't reach Atmire today
- We want to do the `dc.type.output` move on CGSpace anyways, but we should wait as it might affect other external people!
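
The subject quality check at the top of this entry groups by the exact `text_value`, so variants that differ only in case or stray whitespace show up as separate rows. A rough sketch of the same tally done outside the database, assuming the values have been exported one per line, might look like this. The function name `tally_subjects` and the sample input are made up for illustration and are not real CGSpace data.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: tally subject values from stdin (one per line) after
# upper-casing and trimming surrounding whitespace, so variants of the same
# subject are counted together. Output is "count<TAB>value", most frequent first.
tally_subjects() {
    awk '
        {
            value = toupper($0)
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim leading/trailing blanks
            if (value != "") count[value]++      # ignore blank values
        }
        END { for (value in count) printf "%d\t%s\n", count[value], value }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Made-up sample input, not real CGSpace data.
printf '%s\n' 'VALUE CHAINS' 'Value chains ' 'RANGELANDS' '' 'FMD' 'VALUE CHAINS' | tally_subjects
```

Comparing this folded tally against the exact-match counts from the SQL query is one way to spot values that probably need merging.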
## 2016-04-14

- Communicate with Macaroni Bros again about `dc.type`
- Help Sisay with some rsync and Linux stuff
- Notify CIAT people of metadata changes (I had forgotten them last week)

## 2016-04-15

- DSpace Test had crashed, so I ran all system updates, rebooted, and re-deployed DSpace code

## 2016-04-18

- Talk to CIAT people about their portal again
- Start looking more at the fields we want to delete
- The following metadata fields have 0 items using them, so we can just remove them from the registry and any references in XMLUI, input forms, etc:
  - dc.description.abstractother
  - dc.whatwasknown
  - dc.whatisnew
  - dc.description.nationalpartners
  - dc.peerreviewprocess
  - cg.species.animal
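
Before removing fields like these from the registry, the "0 items" claim can be re-checked per field. The sketch below prints a count query for a dotted field name. The function name `build_usage_query` is hypothetical, and the registry table and column names it uses (`metadataschemaregistry.short_id`, `metadatafieldregistry.element` and `qualifier`) are my assumption about the DSpace 5 schema, so they should be verified against the actual database first.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print a query that counts how many values use a field,
# given its dotted name (schema.element[.qualifier]).
# ASSUMPTION: the registry table/column names below match the DSpace 5 schema.
# Field names are interpolated as-is, so only pass trusted input.
build_usage_query() {
    local field="$1" schema element qualifier qualifier_clause
    IFS=. read -r schema element qualifier <<<"$field"
    if [ -n "$qualifier" ]; then
        qualifier_clause="f.qualifier='$qualifier'"
    else
        # Unqualified fields such as dc.whatwasknown have no qualifier.
        qualifier_clause="f.qualifier IS NULL"
    fi
    cat <<EOF
SELECT count(*) FROM metadatavalue v
  JOIN metadatafieldregistry f ON v.metadata_field_id=f.metadata_field_id
  JOIN metadataschemaregistry s ON f.metadata_schema_id=s.metadata_schema_id
  WHERE s.short_id='$schema' AND f.element='$element' AND $qualifier_clause;
EOF
}

# Print the check for a few of the fields listed above.
for field in dc.description.abstractother dc.whatwasknown cg.species.animal; do
    build_usage_query "$field"
done
```

Each printed query should return 0 when run in `psql` if the field really is unused.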