+++
date = "2016-04-04T11:06:00+03:00"
author = "Alan Orth"
title = "April, 2016"
tags = ["notes"]
image = "../images/bg.jpg"

+++

2016-04-04

  • Looking at log file usage on CGSpace, I notice that we need to work on our cron setup a bit
  • We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
  • After running DSpace for over five years I've never needed to look in any log file other than dspace.log, let alone one from last year!
  • This will save us a few gigs of backup space we're paying for on S3
  • Also, I noticed the checker log has some errors we should pay attention to:
Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at edu.sdsc.grid.io.local.LocalFileInputStream.open(LocalFileInputStream.java:171)
        at edu.sdsc.grid.io.GeneralFileInputStream.<init>(GeneralFileInputStream.java:145)
        at edu.sdsc.grid.io.local.LocalFileInputStream.<init>(LocalFileInputStream.java:139)
        at edu.sdsc.grid.io.FileFactory.newFileInputStream(FileFactory.java:630)
        at org.dspace.storage.bitstore.BitstreamStorageManager.retrieve(BitstreamStorageManager.java:525)
        at org.dspace.checker.BitstreamDAO.getBitstream(BitstreamDAO.java:60)
        at org.dspace.checker.CheckerCommand.processBitstream(CheckerCommand.java:303)
        at org.dspace.checker.CheckerCommand.checkBitstream(CheckerCommand.java:171)
        at org.dspace.checker.CheckerCommand.process(CheckerCommand.java:120)
        at org.dspace.app.checker.ChecksumChecker.main(ChecksumChecker.java:236)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77)
******************************************************
  • So this would be the tomcat7 Unix user, who seems to have a default limit of 1024 files in its shell
  • For what it's worth, we have been setting the actual Tomcat 7 process' limit to 16384 for a few years (in /etc/default/tomcat7)
  • Looks like cron will read limits from /etc/security/limits.* so we can do something for the tomcat7 user there
  • Submit pull request for Tomcat 7 limits in Ansible dspace role (#30)
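  • A minimal sketch for checking the limit a job actually gets, for example when run from a cron job as the tomcat7 user. The check_nofile helper is hypothetical, and the limits file name and entries in the comments are assumptions (not necessarily what the pull request contains), written in the standard "domain type item value" format that pam_limits reads:

```shell
# Assumed entries for a file such as /etc/security/limits.d/tomcat7.conf
# (hypothetical file name):
#   tomcat7  soft  nofile  16384
#   tomcat7  hard  nofile  16384

# Hypothetical helper: report whether an open-file limit meets a wanted value.
# A limit of "unlimited" always counts as sufficient.
check_nofile() {
    current="$1"
    wanted="$2"
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$wanted" ]; then
        echo "ok: limit is $current (wanted at least $wanted)"
    else
        echo "too low: limit is $current (wanted at least $wanted)"
    fi
}

# Compare this shell's open-file limit with the 16384 used for the Tomcat process
check_nofile "$(ulimit -n)" 16384
```

  • With the default limit of 1024 this reports "too low", which matches the "Too many open files" error in the checker log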

2016-04-05

  • Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don't need!
# s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
  • Also, adjust the cron jobs for backups so they only back up dspace.log and some stats files (.dat)
  • Try to do some metadata field migrations using the Atmire batch UI (dc.Species → cg.species) but it took several hours and even missed a few records
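  • The S3 log cleanup above could be rehearsed as a dry run before anything is deleted. This is only a sketch: select_unneeded_logs is a hypothetical helper, and the listing lines fed to it are invented samples in the shape of s3cmd ls output (date, time, size, URL):

```shell
# Hypothetical helper: read `s3cmd ls` output on stdin and print the S3 URL
# (4th column) of every log type we no longer want to keep.
select_unneeded_logs() {
    grep -E '(checker|cocoon|handle-plugin|solr)\.log' | awk '{print $4}'
}

# Dry run against made-up listing lines (file names and sizes are invented)
printf '%s\n' \
    '2016-03-01 04:00   1024   s3://cgspace.cgiar.org/log/solr.log.2016-02-29.gz' \
    '2016-03-01 04:00   2048   s3://cgspace.cgiar.org/log/dspace.log.2016-02-29.gz' \
    | select_unneeded_logs
```

  • Only the solr log URL is printed here. Once the list looks right, the same output could be piped to xargs s3cmd del as in the commands above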

2016-04-06

  • A better way to move metadata on this scale is via SQL, for example dc.type.output → dc.type (their IDs in the metadatafieldregistry are 66 and 109, respectively):
dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852
  • After that, an index-discovery -bf is required
  • Start working on metadata migrations, add 25 or so new metadata fields to CGSpace

2016-04-07

$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
UPDATE 21420
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51258
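
  • For illustration, a script producing statements like the ones above might look roughly like this. It is a hypothetical sketch and not the actual migrate-fields.sh: the build_update helper and the "old:new" pair format are assumptions, the field IDs are the ones from the output above, and the psql call is left commented out:

```shell
# Old and new metadata_field_id values as "old:new" pairs
# (IDs taken from the output above)
fields="66:109 72:202 76:203"

# Hypothetical helper: build the UPDATE statement for one old:new pair
build_update() {
    old="${1%%:*}"
    new="${1##*:}"
    echo "UPDATE metadatavalue SET metadata_field_id=$new WHERE metadata_field_id=$old"
}

for pair in $fields; do
    sql="$(build_update "$pair")"
    echo "$sql"
    # Uncomment to apply; the database name is an assumption
    # psql dspacetest -c "$sql"
done
```

  • As written this only prints the statements, so the mapping can be reviewed before touching the database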

2016-04-08

  • Discuss metadata renaming with Abenet; we decided it's better to start with the center-specific subjects like ILRI, CIFOR, CCAFS, IWMI, and CPWF
  • I've e-mailed CCAFS and CPWF people to ask them how much time it will take for them to update their systems to cope with this change

2016-04-10

dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
 count
-------
  5638
(1 row)

dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
 count
-------
     3
  • I will manually edit the dc.identifier.doi in 10568/72509 and tweet the link, then check back in a week to see if the donut gets updated

2016-04-11

  • The donut is already updated and shows the correct number now
  • CCAFS people say it will only take them an hour to update their code for the metadata renames, so I proposed that we tentatively do it on Monday the 18th

2016-04-12

  • Looking at quality of WLE data (cg.subject.iwmi) in SQL:
dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
  • Listings and Reports is still not returning reliable data for dc.type
  • I think we need to ask Atmire, as their documentation isn't too clear on the format of the filter configs
  • Alternatively, I want to see whether it behaves better if I move all the data from dc.type.output to dc.type and then re-index
  • Looking at our input-forms.xml I see we have two sets of ILRI subjects, but one has a few extra subjects
  • Remove one set of ILRI subjects and remove duplicate VALUE CHAINS from existing list (#216)
  • I decided to keep the set of subjects that had FMD and RANGELANDS added, as those appear to have been added by request, so it might be the newer list
  • I found 226 blank metadatavalues:
dspacetest=# select * from metadatavalue where resource_type_id=2 and text_value='';
  • I think we should delete them and do a full re-index:
dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 226
  • I deleted them on CGSpace but I'll wait to do the re-index as we're going to be doing one in a few days for the metadata changes anyways
  • In other news, moving the dc.type.output to dc.type and re-indexing seems to have fixed the Listings and Reports issue from above
  • Unfortunately this isn't a very good solution, because Listings and Reports config should allow us to filter on dc.type.* but the documentation isn't very clear and I couldn't reach Atmire today
  • We want to do the dc.type.output move on CGSpace anyways, but we should wait as it might affect other external people!