Instead of checking whether they exist and skipping them only at
the moment we want to swap the bitstreams, bail out early as soon
as we know an item is an Infographic or a Map.
This adds another script to detect and remove more low-quality
thumbnails. For example:
- If an item has an "IM Thumbnail" and a "Generated Thumbnail" in the
THUMBNAIL bundle, remove the "Generated Thumbnail"
- If an item has a PDF bitstream and a JPEG bitstream with a name or
description "thumbnail" in the ORIGINAL bundle, remove the
"thumbnail" bitstream in the ORIGINAL bundle and try to remove the
"thumbnail.jpg" bitstream in the THUMBNAIL bundle
The idea is that we should *always* prefer thumbnails generated by
ImageMagick from PDFs in the ORIGINAL bundle and should remove any
other manually uploaded thumbnails.
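The first rule above could be sketched roughly as follows. This is only
an illustration: ThumbnailRuleSketch and toRemove are hypothetical names,
and the THUMBNAIL bundle is modelled as a plain list of bitstream
descriptions rather than with the real DSpace API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the THUMBNAIL bundle rule; not the DSpace API.
class ThumbnailRuleSketch {
    /**
     * Given the descriptions of the bitstreams in an item's THUMBNAIL
     * bundle, return the descriptions of the bitstreams to remove.
     */
    static List<String> toRemove(List<String> descriptions) {
        List<String> removals = new ArrayList<>();
        // Only drop "Generated Thumbnail" when an ImageMagick one also exists,
        // so an item is never left without any thumbnail.
        if (descriptions.contains("IM Thumbnail")) {
            for (String description : descriptions) {
                if ("Generated Thumbnail".equals(description)) {
                    removals.add(description);
                }
            }
        }
        return removals;
    }
}
```

An item with only a "Generated Thumbnail" is left untouched, which matches
the stated preference for ImageMagick thumbnails without deleting the
only thumbnail an item has.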
It's much easier to get your package verified on Central if it uses
a GitHub groupId. Otherwise you need to use DNS verification! This
changes the groupId:
- from: org.cgiar.cgspace.ctask
- to: io.github.ilri.cgspace
The package has changed as well.
See: https://central.sonatype.org/pages/producers.html
We can append the codes we want to add to a List of Strings, apply
them later in a single addMetadata call, and then update the item
with a single item.update() call. This removes duplicated code and
is more efficient.
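A minimal sketch of that pattern, assuming a stand-in FakeItem class in
place of the real DSpace Item (FakeItem, tag, and the name-to-code map
are all hypothetical and exist only to make the batching visible):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class BatchedMetadataSketch {
    // Hypothetical stand-in for a DSpace item; it counts calls so we can
    // see that metadata is written and the item updated exactly once.
    static class FakeItem {
        final List<String> codes = new ArrayList<>();
        int addCalls = 0;
        int updateCalls = 0;

        void addMetadata(List<String> values) {
            codes.addAll(values);
            addCalls++;
        }

        void update() {
            updateCalls++;
        }
    }

    static void tag(FakeItem item, List<String> countries, Map<String, String> nameToCode) {
        // Collect every matching code first...
        List<String> toAdd = new ArrayList<>();
        for (String country : countries) {
            String code = nameToCode.get(country);
            if (code != null) {
                toAdd.add(code);
            }
        }
        // ...then apply them with one addMetadata call and one update.
        if (!toAdd.isEmpty()) {
            item.addMetadata(toAdd);
            item.update();
        }
    }
}
```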
Note that when testing this on a collection with thousands of items
I realized that it is really important both to limit the cache size
and to set the database transaction scope to per object/item, or
else the task will crash due to Java heap issues. For example:
$ ~/dspace/bin/dspace curate -t countrycodetagger -i 10568/3 -r - -l 500 -s object
See: https://wiki.lyrasis.org/display/DSPACE/Curation+Task+Cookbook
Originally I wasn't sure if I was going to try to parse each code,
check them against the mapping, and possibly correct them, but it's
easier to just skip items with codes unless we're in "force" mode.
The DSpace curation system has task properties that can be used to
create "profiles" of sorts. For example, if you set a custom task
name in curate.cfg:
plugin.named.org.dspace.curate.CurationTask = \
    org.cgiar.cgspace.ctasks.CountryCodeTagger = countrycodetagger, \
    org.cgiar.cgspace.ctasks.CountryCodeTagger = countrycodetagger.force
... then DSpace will look for countrycodetagger.cfg by default, and
countrycodetagger.force.cfg for the second task. We can set different
properties in each one, for example "force=true", and then operate
accordingly in the task when we check the value using taskProperty().
I will use this to force all country tags to be cleared and updated,
whereas by default we only tag if there are no existing country tags.
See: https://wiki.lyrasis.org/display/DSDOC5x/Curation+System
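The resulting decision could look roughly like the sketch below. It uses
java.util.Properties as a stand-in for the per-task configuration, since
the real task would read the value with taskProperty(); ForceModeSketch,
isForce, and shouldTag are hypothetical names.

```java
import java.util.Properties;

class ForceModeSketch {
    // Stand-in for reading "force" from countrycodetagger.cfg or
    // countrycodetagger.force.cfg; a missing property means "not forced".
    static boolean isForce(Properties taskProps) {
        return Boolean.parseBoolean(taskProps.getProperty("force", "false"));
    }

    // By default we only tag items without existing country codes;
    // in force mode we always (clear and) re-tag.
    static boolean shouldTag(boolean hasExistingCodes, boolean force) {
        return force || !hasExistingCodes;
    }
}
```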
We can't use the same class to map ISO 3166-1 and CGSpace country
vocabularies because our Gson is old and lacks support for the
"alternate" value in its annotations (added in Gson 2.4). So it's
better to create multiple classes that extend the base one instead
of creating a custom deserializer. Each extended class then uses
its own @SerializedName.
Based on Peter's preferred display values for these countries. We
will still use their ISO 3166-1 country codes so we include their
appropriate data from the iso-codes iso_3166-1.json list.
I will use the same format as the ISO 3166-1 JSON to make parsing
easier. I will add a new "cgspace_name" key to indicate our custom
name, though the codes will map to the standard ISO 3166-1 codes.
If an item has country metadata (cg.coverage.country) and no alpha
codes, we check for name matches in ISO 3166-1 and add alpha_2 codes.
The name matching checks for a case-insensitive match on either an
ISO 3166-1 name, official name, or common name.
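The matching described above might be sketched like this. The Country
class mirrors the name, official_name, common_name, and alpha_2 keys of
the iso-codes JSON, but CountryNameMatchSketch, findAlpha2, and sample
are hypothetical names, and only two entries are included for
illustration.

```java
import java.util.Arrays;
import java.util.List;

class CountryNameMatchSketch {
    // Mirrors the fields of one iso_3166-1.json entry; officialName and
    // commonName may be null because not every entry has them.
    static class Country {
        final String alpha2;
        final String name;
        final String officialName;
        final String commonName;

        Country(String alpha2, String name, String officialName, String commonName) {
            this.alpha2 = alpha2;
            this.name = name;
            this.officialName = officialName;
            this.commonName = commonName;
        }
    }

    // Two illustrative entries; a real task would load the full list.
    static List<Country> sample() {
        return Arrays.asList(
            new Country("BO", "Bolivia, Plurinational State of",
                        "Plurinational State of Bolivia", "Bolivia"),
            new Country("KE", "Kenya", "Republic of Kenya", null));
    }

    // Returns the alpha_2 code for the first case-insensitive match on
    // name, official name, or common name, or null when nothing matches.
    static String findAlpha2(String value, List<Country> countries) {
        for (Country c : countries) {
            // equalsIgnoreCase(null) is false, so missing names are safe.
            if (value.equalsIgnoreCase(c.name)
                    || value.equalsIgnoreCase(c.officialName)
                    || value.equalsIgnoreCase(c.commonName)) {
                return c.alpha2;
            }
        }
        return null;
    }
}
```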
Our Java class needs to match the input JSON structure exactly, but
we can't use "3166-1" as a variable name, so we tell Gson to map the
JSON key "3166-1" to the countries field when deserializing.