We don't need to print the Handle. Items that are still in the
workflow don't have one yet, so it would be null, and DSpace already
shows the Handle in the log before printing the result.
Results are a single-line status that shows the result of the task,
but reports are like a running log of changes to the item and have
more complicated use cases and configuration requirements.
For now I will disable reports since I'm not using them.
I was wondering why the same bitstreams appeared to be getting
deleted on every single run. It turns out that we were only
committing the context in single item mode. If the argument was a
site, community, or collection we were updating the item but not
actually committing the changes!
Instead of checking whether they exist and skipping them only at the
moment we want to swap the bitstreams, let's bail early as soon as
we know an item is an Infographic or a Map.
This adds another script to detect and remove more low-quality
thumbnails. For example:
- If an item has an "IM Thumbnail" and a "Generated Thumbnail" in the
  THUMBNAIL bundle, remove the "Generated Thumbnail"
- If an item has a PDF bitstream and a JPEG bitstream with a name or
  description "thumbnail" in the ORIGINAL bundle, remove the
  "thumbnail" bitstream in the ORIGINAL bundle and try to remove the
  "thumbnail.jpg" bitstream in the THUMBNAIL bundle
The idea is that we should *always* prefer thumbnails generated by
ImageMagick from PDFs in the ORIGINAL bundle and should remove any
other manually uploaded thumbnails.
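The two rules above can be sketched in plain Java. This is only an
illustration, not the script itself: the Bitstream class is a
hypothetical stand-in for DSpace's own bitstream object, the MIME type
checks are assumed, and matching "thumbnail" as a case-insensitive
substring is an assumption as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ThumbnailCleanupSketch {
    // Hypothetical stand-in for a DSpace bitstream, holding only what the rules need.
    static final class Bitstream {
        final String bundle, name, description, mimeType;

        Bitstream(String bundle, String name, String description, String mimeType) {
            this.bundle = bundle;
            this.name = name;
            this.description = description;
            this.mimeType = mimeType;
        }
    }

    // Assumption: "thumbnail" is matched as a case-insensitive substring.
    static boolean mentionsThumbnail(String s) {
        return s != null && s.toLowerCase(Locale.ROOT).contains("thumbnail");
    }

    // Returns the bitstreams the two rules would remove, without duplicates.
    static List<Bitstream> findRemovable(List<Bitstream> bitstreams) {
        boolean hasImThumbnail = false;
        boolean hasPdf = false;
        List<Bitstream> uploadedThumbnails = new ArrayList<>();

        // First pass: work out which rules apply to this item.
        for (Bitstream b : bitstreams) {
            if ("THUMBNAIL".equals(b.bundle) && "IM Thumbnail".equals(b.description)) {
                hasImThumbnail = true;
            }
            if ("ORIGINAL".equals(b.bundle)) {
                if ("application/pdf".equals(b.mimeType)) {
                    hasPdf = true;
                } else if ("image/jpeg".equals(b.mimeType)
                        && (mentionsThumbnail(b.name) || mentionsThumbnail(b.description))) {
                    uploadedThumbnails.add(b);
                }
            }
        }

        boolean replaceUploaded = hasPdf && !uploadedThumbnails.isEmpty();
        Set<Bitstream> removable = new LinkedHashSet<>();

        // Second pass: collect redundant bitstreams in the THUMBNAIL bundle.
        for (Bitstream b : bitstreams) {
            if (!"THUMBNAIL".equals(b.bundle)) {
                continue;
            }
            // Rule 1: an "IM Thumbnail" makes a "Generated Thumbnail" redundant.
            if (hasImThumbnail && "Generated Thumbnail".equals(b.description)) {
                removable.add(b);
            }
            // Rule 2: also drop the derivative of the manually uploaded thumbnail.
            if (replaceUploaded && "thumbnail.jpg".equals(b.name)) {
                removable.add(b);
            }
        }

        // Rule 2: drop the manually uploaded thumbnail from ORIGINAL.
        if (replaceUploaded) {
            removable.addAll(uploadedThumbnails);
        }
        return new ArrayList<>(removable);
    }

    public static void main(String[] args) {
        List<Bitstream> item = Arrays.asList(
                new Bitstream("ORIGINAL", "report.pdf", null, "application/pdf"),
                new Bitstream("ORIGINAL", "thumbnail.jpg", null, "image/jpeg"),
                new Bitstream("THUMBNAIL", "report.pdf.jpg", "IM Thumbnail", "image/jpeg"),
                new Bitstream("THUMBNAIL", "thumbnail.jpg", "Generated Thumbnail", "image/jpeg"));
        for (Bitstream b : findRemovable(item)) {
            System.out.println(b.bundle + "/" + b.name);
        }
    }
}
```

The sketch only decides what is redundant. Actually deleting the
bitstreams and committing the context is left to the caller.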
It's much easier to get your package verified on Maven Central if it
uses a GitHub-based groupId. Otherwise you need to use DNS
verification! This changes the groupId:
- from: org.cgiar.cgspace.ctask
- to: io.github.ilri.cgspace
The package changed as well.
See: https://central.sonatype.org/pages/producers.html
We can append the codes we will add to a List of Strings and then
actually apply them later in one addMetadata call, and update the
item with one item.update() call. This reduces identical code and
is more efficient.
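A minimal sketch of the collect-then-apply pattern described above.
FakeItem is a hypothetical stand-in that only counts writes (it is not
the DSpace Item API), and the name-to-code map is illustrative data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CountryCodeBatchSketch {
    // Hypothetical stand-in for an item; it records how often it is written to.
    static final class FakeItem {
        final List<String> countryCodes = new ArrayList<>();
        int addMetadataCalls = 0;
        int updateCalls = 0;

        void addMetadata(List<String> values) {
            countryCodes.addAll(values);
            addMetadataCalls++;
        }

        void update() {
            updateCalls++;
        }
    }

    // Collect every code first, then write to the item exactly once.
    static void tag(FakeItem item, List<String> countryNames, Map<String, String> nameToCode) {
        List<String> codesToAdd = new ArrayList<>();
        for (String name : countryNames) {
            String code = nameToCode.get(name);
            if (code != null) {
                codesToAdd.add(code); // unknown names are skipped
            }
        }
        if (!codesToAdd.isEmpty()) {
            item.addMetadata(codesToAdd); // one call for all codes
            item.update();                // one update for the item
        }
    }

    public static void main(String[] args) {
        Map<String, String> nameToCode = new HashMap<>();
        nameToCode.put("KENYA", "KE");
        nameToCode.put("ETHIOPIA", "ET");

        List<String> names = new ArrayList<>();
        names.add("KENYA");
        names.add("ETHIOPIA");

        FakeItem item = new FakeItem();
        tag(item, names, nameToCode);
        System.out.println(item.countryCodes + " written with "
                + item.addMetadataCalls + " addMetadata call(s)");
    }
}
```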
Note that when testing this on a collection with thousands of items
I realized that it is really important to both limit the cache size
(-l) and set the database transaction scope to per object/item (-s),
or else the task will crash due to Java heap issues. For example:
$ ~/dspace/bin/dspace curate -t countrycodetagger -i 10568/3 -r - -l 500 -s object
See: https://wiki.lyrasis.org/display/DSPACE/Curation+Task+Cookbook
Originally I wasn't sure if I was going to try to parse each code,
check them against the mapping, and possibly correct them, but it's
easier to just skip items that already have codes unless we're in
"force" mode.
The DSpace curation system has task properties that can be used to
create "profiles" of sorts. For example, if you set a custom task
name in curate.cfg:
plugin.named.org.dspace.curate.CurationTask = \
    org.cgiar.cgspace.ctasks.CountryCodeTagger = countrycodetagger, \
    org.cgiar.cgspace.ctasks.CountryCodeTagger = countrycodetagger.force
... then DSpace will look for countrycodetagger.cfg by default, and
countrycodetagger.force.cfg for the second task. We can set different
properties in each one, for example "force=true", and then operate
accordingly in the task when we check the value using taskProperty().
I will use this to force all country tags to be cleared and updated,
whereas by default we only tag if there are no existing country tags.
See: https://wiki.lyrasis.org/display/DSDOC5x/Curation+System
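The force/default decision can be sketched as follows. This is a
standalone illustration under stated assumptions: parse() is a
hypothetical helper that stands in for DSpace loading the task's .cfg
file, and shouldTag() is an illustrative name. In the real task the
value would come from taskProperty() rather than java.util.Properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class ForceModeSketch {
    // Hypothetical helper: parses the text of a task .cfg file.
    // In DSpace the curation framework does this and the task calls taskProperty().
    static Properties parse(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Tag when forced, or when the item has no country codes yet.
    static boolean shouldTag(Properties taskProps, int existingCodeCount) {
        boolean force = Boolean.parseBoolean(taskProps.getProperty("force", "false"));
        return force || existingCodeCount == 0;
    }

    public static void main(String[] args) {
        Properties defaultProfile = parse("");         // e.g. countrycodetagger.cfg
        Properties forceProfile = parse("force=true"); // e.g. countrycodetagger.force.cfg

        System.out.println("default, item has codes: " + shouldTag(defaultProfile, 2));
        System.out.println("force, item has codes:   " + shouldTag(forceProfile, 2));
    }
}
```

Keeping the decision in one small method means both task names can
share the same class and differ only in their .cfg file.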