CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

April, 2017

2017-04-02

  • Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
  • Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:

dc.rights in the submission form

  • Remove redundant/duplicate text in the DSpace submission license
  • Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt

2017-04-03

  • Continue testing the CMYK patch on more communities:
$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
  • So far there are almost 500 CMYK PDFs:
$ grep -c profile /tmp/filter-media-cmyk.txt
484
  • Looking at the CG Core document again, I’ll send some feedback to Peter and Abenet:
    • We use cg.contributor.crp to indicate the CRP(s) affiliated with the item
    • DSpace has dc.date.available, but this field isn’t particularly meaningful other than as an automatic timestamp at the time of item accession (and is identical to dc.date.accessioned)
    • dc.relation exists in CGSpace but isn’t used; instead we use dc.relation.ispartofseries, which is used ~5,000 times to record the series name and the number within that series
  • Also, I’m noticing some weird outliers in cg.coverage.region, need to remember to go correct these later:
dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
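
  • A rough sketch for spotting such outliers, assuming the values from the query above are exported one per line (for example with psql's unaligned, tuples-only output); the function name and sample region names are made up for illustration. Rare spellings sort to the top:

```shell
#!/bin/sh
# Tally values read from stdin (one per line) and list the rarest first,
# since one-off spellings are the likely outliers.
tally_values() {
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -n
}

# Demo with made-up sample values (not real CGSpace data).
printf '%s\n' 'East Africa' 'West Africa' 'East Africa' \
    'EAST AFRICA' 'West Africa' 'East Africa' | tally_values

# Possible real usage (assumes a local database named "dspace"):
#   psql -At -d dspace -c "select text_value from metadatavalue \
#     where resource_type_id=2 and metadata_field_id=227" | tally_values
```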

2017-04-04

  • The filter-media script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:
$ grep -c profile /tmp/filter-media-cmyk.txt
1584
  • Trying to find a way to get the number of items submitted by a certain user in 2016
  • It’s not possible in the DSpace search / module interfaces, but it might be derivable from dc.description.provenance, as that field contains the name and email of the submitter/approver, ie:
Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
No. of bitstreams: 1^M
ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
  • This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a “checksum” (ie, there was a bitstream in the submission):
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
  • Then this one does the same, but for fields that don’t contain checksums (ie, there was no bitstream in the submission):
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
  • For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.
  • It looks like a user can both submit AND approve the same item, so some records might be counted twice…
  • In that case it might just be better to see how many the user submitted (both with and without bitstreams):
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
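
  • To avoid double counting, one option is to count distinct items rather than matching fields. A minimal sketch, assuming the provenance rows are exported as tab-separated resource_id and text_value (the export format, function name, and sample rows are my assumptions, not something tested on CGSpace):

```shell
#!/bin/sh
# Count distinct items submitted by a user in a given year.
# stdin: "resource_id<TAB>text_value" rows (assumed export format).
# Provenance values span several lines; only the first line of each value
# carries the submitter and date, so continuation lines never match.
count_submitted() {
    user="$1"
    year="$2"
    awk -F '\t' -v user="$user" -v year="$year" '
        $2 ~ ("^Submitted by .*" user ".* on " year "-") { seen[$1] = 1 }
        END { n = 0; for (id in seen) n++; print n }
    '
}

# Demo with made-up rows: item 101 is submitted and approved by the same
# user, item 102 is submitted in 2016, item 103 was submitted in 2015.
printf '101\tSubmitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z\n' > /tmp/provenance-sample.tsv
printf 'No. of bitstreams: 1\n' >> /tmp/provenance-sample.tsv
printf '101\tApproved for entry into archive by Francesca Giampieri (fgiampieri) on 2016-01-19T14:00:00Z\n' >> /tmp/provenance-sample.tsv
printf '102\tSubmitted by Francesca Giampieri (fgiampieri) on 2016-02-01T08:00:00Z\n' >> /tmp/provenance-sample.tsv
printf '103\tSubmitted by Francesca Giampieri (fgiampieri) on 2015-12-30T08:00:00Z\n' >> /tmp/provenance-sample.tsv
count_submitted giampieri 2016 < /tmp/provenance-sample.tsv
```

  • The user argument is used as a regular expression, so a plain name fragment like giampieri is safest.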

2017-04-05

  • After doing a few more large communities it seems this is the final count of CMYK PDFs:
$ grep -c profile /tmp/filter-media-cmyk.txt
2505

2017-04-06

  • After reading the notes for DCAT April 2017 I am testing some new settings for PostgreSQL on DSpace Test:
    • db.maxconnections 30→70 (the default PostgreSQL config allows 100 connections, so DSpace’s default of 30 is quite low)
    • db.maxwait 5000→10000
    • db.maxidle 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)
  • I need to look at the Munin graphs after a few days to see if the load has changed
  • Run system updates on DSpace Test and reboot the server
  • Discussing harvesting CIFOR’s DSpace via OAI
  • Sisay added their OAI as a source to a new collection, but using the Simple Dublin Core method, so many fields are unqualified and duplicated
  • Looking at the documentation it seems that we probably want to be using DSpace Intermediate Metadata

2017-04-10

  • Adjust Linode CPU usage alerts on DSpace servers
    • CGSpace from 200 to 250%
    • DSpace Test from 100 to 150%
  • Remove James from Linode access
  • Look into having CIFOR use a sub prefix of 10568 like 10568.01
  • Handle.net calls this “derived prefixes” and it seems this would work with DSpace if we wanted to go that route
  • CIFOR is starting to test aligning their metadata more with CGSpace/CG core
  • They shared a test item which is using cg.coverage.country, cg.subject.cifor, dc.subject, and dc.date.issued
  • Looking at their OAI I’m not sure it has updated as I don’t see the new fields: https://data.cifor.org/dspace/oai/request?verb=ListRecords&resumptionToken=oai_dc///col_11463_6/900
  • Maybe they need to make sure they are running the OAI cache refresh cron job, or maybe OAI doesn’t export these?
  • I added cg.subject.cifor to the metadata registry and I’m waiting for the harvester to re-harvest to see if it picks up more data now
  • Another possibility is that we could use a crosswalk… but I’ve never done it.
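  • A quick way to check whether an OAI response actually carries a given field, sketched under the assumption that the response is in DIM format with mdschema, element, and qualifier attributes in that order; the function name and sample XML are hypothetical, and grep is only a rough stand-in for a real XML parser:

```shell
#!/bin/sh
# Count DIM <dim:field> elements matching a schema/element/qualifier in the
# XML read from stdin. Relies on the attribute order mdschema, element,
# qualifier, which is an assumption about the server's output.
count_dim_field() {
    grep -o "<dim:field[^>]*mdschema=\"$1\"[^>]*element=\"$2\"[^>]*qualifier=\"$3\"" \
        | wc -l | tr -d ' '
}

# Demo with a made-up fragment (not a real CIFOR record).
printf '%s\n' \
    '<dim:field mdschema="cg" element="subject" qualifier="cifor">FORESTS</dim:field>' \
    '<dim:field mdschema="dc" element="subject">forests</dim:field>' \
    | count_dim_field cg subject cifor

# Possible real usage, with OAI_URL set to a request that uses metadataPrefix=dim:
#   curl -s "$OAI_URL" | count_dim_field cg subject cifor
```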

2017-04-11

  • Looking at the item from CIFOR it hasn’t been updated yet, maybe they aren’t running the cron job
  • I emailed Usman from CIFOR to ask if he’s running the cron job

2017-04-12

stale metadata in OAI

$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
...
63900 items imported so far...
64000 items imported so far...
Total: 64056 items
Purging cached OAI responses.
OAI 2.0 manager action ended. It took 829 seconds.
  • After reading some threads on the DSpace mailing list, I see that clean-cache only affects cached responses, ie responses to client requests in the OAI web application
  • These are stored in [dspace]/var/oai/requests/
  • The import command should theoretically catch situations like this where an item’s metadata was updated, but in this case we changed the metadata schema and it doesn’t seem to catch it (could be a bug!)
  • Attempting a full rebuild of OAI on CGSpace:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
...
58700 items imported so far...
Total: 58789 items
Purging cached OAI responses.
OAI 2.0 manager action ended. It took 1032 seconds.

real    17m20.156s
user    4m35.293s
sys     1m29.310s

2017-04-13

  • Checking the CIFOR item on DSpace Test, it still doesn’t have the new metadata
  • The collection status shows this message from the harvester:

Last Harvest Result: OAI server did not contain any updates on 2017-04-13 02:19:47.964

  • I don’t know why there were no updates detected, so I will reset and reimport the collection
  • Usman has set up a custom crosswalk called dimcg that now shows CG and CIFOR metadata namespaces, but we can’t use it because DSpace can only harvest DIM by default (from the harvesting user interface)
  • Also worth noting that the REST interface exposes all fields in the item, including CG and CIFOR fields: https://data.cifor.org/dspace/rest/items/944?expand=metadata
  • After re-importing the CIFOR collection it looks very good!
  • It seems like they have done a full metadata migration with dc.date.issued and cg.coverage.country etc
  • Submit pull request to upstream DSpace for the PDF thumbnail bug (DS-3516): https://github.com/DSpace/DSpace/pull/1709

2017-04-14

2017-04-17

  • CIFOR has now implemented a new “cgiar” context in their OAI that exposes CG fields, so I am re-harvesting that to see how it looks in the Discovery sidebars and searches
  • See: https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&metadataPrefix=dim&identifier=oai:data.cifor.org:11463/947
  • One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see harvester.autoStart in dspace/config/modules/oai.cfg)
  • Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".
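  • As a starting point for looking this up, a small sketch that pulls the bitstream ID out of the error detail and prints (does not run) a query for the bundle rows that still reference it. The column name primary_bitstream_id is inferred from the constraint name, and the function names are mine:

```shell
#!/bin/sh
# Extract the bitstream id from the "Detail:" line of the foreign key error.
extract_bitstream_id() {
    sed -n 's/.*Key (bitstream_id)=(\([0-9][0-9]*\)).*/\1/p'
}

# Print a query listing the bundle rows that reference the bitstream.
# primary_bitstream_id is assumed from "bundle_primary_bitstream_id_fkey".
inspect_query() {
    printf 'SELECT * FROM bundle WHERE primary_bitstream_id = %s;\n' "$1"
}

id=$(printf '%s\n' \
    '  Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".' \
    | extract_bitstream_id)
inspect_query "$id"
```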

2017-04-18

$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
$ cd ckm-cgspace-rest-api/app
$ gem install bundler
$ bundle
$ cd ..
$ rails server
  • I used Ansible to create a PostgreSQL user that only has SELECT privileges on the tables it needs:
$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present'
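  • The priv string is long and easy to mistype, so here is a small sketch that builds it from a list of table names. The helper name build_priv is hypothetical; the table list is the one from the command above:

```shell
#!/bin/sh
# Build a postgresql_user "priv" string: CONNECT on the database plus
# SELECT on each table passed as an argument.
build_priv() {
    priv="CONNECT"
    for table in "$@"; do
        priv="$priv/$table:SELECT"
    done
    printf '%s\n' "$priv"
}

build_priv item metadatavalue metadatafieldregistry metadataschemaregistry \
    collection handle bundle2bitstream bitstream bundle item2bundle
```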