37 Commits

Author SHA1 Message Date
3a805f9bf2
README.md: Add more documentation and notes 2020-08-02 22:55:23 +03:00
ca7deaac8f
CountryCodeTagger.java: Remove unused variable
Some of the other curation tasks use an array of results.
2020-08-02 22:03:10 +03:00
e158e4bc98
CountryCodeTagger.java: Refactor adding of alpha2 codes
We can append the codes we will add to a List of Strings and then
actually apply them later in one addMetadata call, and update the
item with one item.update() call. This reduces identical code and
is more efficient.

Note that when testing this on a collection with thousands of items
I realized that it is really important to both limit the cache size
and set the database transaction scope to per object/item, or else
the task will crash due to Java heap exhaustion. For example:

    $ ~/dspace/bin/dspace curate -t countrycodetagger -i 10568/3 -r - -l 500 -s object

See: https://wiki.lyrasis.org/display/DSPACE/Curation+Task+Cookbook
2020-08-02 18:33:32 +03:00
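
The refactor described in e158e4bc98 (collect the codes first, then write once) can be sketched roughly as follows. This is a minimal illustration, not the task's actual code: `FakeItem`, `BatchedTagging`, and the lower-cased name-to-code map are hypothetical stand-ins, and `FakeItem` is not the DSpace `Item` class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a repository item; NOT the DSpace Item class.
final class FakeItem {
    final List<String> alpha2 = new ArrayList<>();
    int updateCalls = 0;

    void addMetadata(List<String> values) {
        alpha2.addAll(values);
    }

    void update() {
        updateCalls++;
    }
}

final class BatchedTagging {
    // Collect every matching code in a list, then apply them with a single
    // addMetadata call and a single update call.
    static int tag(FakeItem item, List<String> countryNames,
                   Map<String, String> nameToAlpha2) {
        List<String> codesToAdd = new ArrayList<>();
        for (String name : countryNames) {
            // Assumption: the map is keyed by lower-cased country name.
            String code = nameToAlpha2.get(name.toLowerCase(Locale.ROOT));
            if (code != null) {
                codesToAdd.add(code);
            }
        }
        if (!codesToAdd.isEmpty()) {
            item.addMetadata(codesToAdd);
            item.update();
        }
        return codesToAdd.size();
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("kenya", "KE");
        mapping.put("uganda", "UG");

        FakeItem item = new FakeItem();
        int added = tag(item, Arrays.asList("Kenya", "Uganda", "Atlantis"), mapping);
        System.out.println(added + " codes, " + item.updateCalls + " update call(s)");
    }
}
```

Items with no matching names are never updated at all, which avoids needless writes on large collections.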
1c866bdf64
src/main/java: Remove unnecessary comments and prints 2020-08-02 18:32:04 +03:00
28b4707426
README.md: Add TODOs 2020-08-02 15:53:37 +03:00
cc35c45a05
Remove tests
They were automatically generated by Maven and I haven't created
proper ones yet.
2020-08-02 15:52:43 +03:00
e5d45e62be
src/main/java: Refactor CountryCodeTagger.java
It is now much more modular and can be cleanly extended to handle
ISO 3166-1 Alpha3, numeric, etc...
2020-08-02 15:51:18 +03:00
a6d3653c9e
README.md: Remove profile todo 2020-08-01 23:39:09 +03:00
6228f337e9
src/main/java: Skip items that have country codes
Originally I wasn't sure if I was going to try to parse each code,
check them against the mapping, and possibly correct them, but it's
easier to just skip items with codes unless we're in "force" mode.
2020-08-01 23:14:19 +03:00
4b553676dd
src/main/java: Implement task "profiles"
The DSpace curation system has task properties that can be used to
create "profiles" of sorts. For example, if you set a custom task
name in curate.cfg:

    plugin.named.org.dspace.curate.CurationTask = \
        org.cgiar.cgspace.ctasks.CountryCodeTagger = countrycodetagger \
        org.cgiar.cgspace.ctasks.CountryCodeTagger = countrycodetagger.force

... then DSpace will look for countrycodetagger.cfg by default, and
countrycodetagger.force.cfg for the second task. We can set different
properties in each one, for example "force=true", and then operate
accordingly in the task when we check the value using taskProperty().

I will use this to force all country tags to be cleared and updated,
where by default we only tag if there are no existing country tags.

See: https://wiki.lyrasis.org/display/DSDOC5x/Curation+System
2020-08-01 23:04:35 +03:00
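
The "force" behavior described in 4b553676dd and 6228f337e9 can be sketched as below. This is only an illustration under stated assumptions: a plain `java.util.Properties` object stands in for the per-task .cfg file that DSpace would read, and `ForceProfile`, `isForced`, and `shouldTag` are hypothetical names, not part of DSpace or of this project.

```java
import java.util.Properties;

final class ForceProfile {
    // Hypothetical helper: read the "force" property, defaulting to false
    // when the profile's configuration does not set it.
    static boolean isForced(Properties taskProps) {
        return Boolean.parseBoolean(taskProps.getProperty("force", "false"));
    }

    // Tag an item when it has no country codes yet, or always when forced.
    static boolean shouldTag(int existingCodeCount, boolean force) {
        return force || existingCodeCount == 0;
    }

    public static void main(String[] args) {
        Properties defaultProfile = new Properties(); // like countrycodetagger.cfg
        Properties forceProfile = new Properties();   // like countrycodetagger.force.cfg
        forceProfile.setProperty("force", "true");

        System.out.println(shouldTag(2, isForced(defaultProfile)));
        System.out.println(shouldTag(2, isForced(forceProfile)));
    }
}
```

Keeping the decision in one small method makes the default profile and the force profile differ only in configuration, not in code paths.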
d4cd5bfd61
src/main/java: Optimize imports 2020-08-01 23:03:51 +03:00
4c5eb9c1e7
README.md: Add TODO about workflow 2020-08-01 21:56:36 +03:00
9f68834f87
README.md: Add new TODOs 2020-08-01 21:34:02 +03:00
cf73935ea9
src/main/java: Use tokenized alpha2 field parts 2020-08-01 21:02:58 +03:00
409eb3bd02
src/main/java: Refactor vocabularies classes
We can't use the same class to map ISO 3166-1 and CGSpace country
vocabularies because our Gson is old and lacks support for the
"alternate" value in its annotations (added in Gson 2.5). So it's
better to create multiple classes that extend the base one instead
of creating a custom deserializer. Each extended class then uses
its own @SerializedName.
2020-08-01 20:53:59 +03:00
6891c93eeb
README.md: Add TODO about Gson for DSpace 6 2020-08-01 20:50:11 +03:00
98d3d56d78
src/main/java: Fix comment 2020-08-01 20:31:31 +03:00
c2c5baaf7a
Use gson 2.2.1
That's the same version that DSpace 5.8 is using, so we should use
it here as well so that we build against the same API. Unfortunately
this means that we can't use alternate names in @SerializedName. We
will need to create different classes to map to our different JSON
files instead of simply matching different elements on the fly.
2020-08-01 20:21:25 +03:00
fdcd1811a2
src/main/resources: Adjust CGSpace country list
Based on Peter's preferred display values for these countries. We
will still use their ISO 3166-1 country codes so we include their
appropriate data from the iso-codes iso_3166-1.json list.
2020-08-01 11:50:55 +03:00
4a6edba467
src/main/java: Add cgspace_name to Countries class
We will eventually use this to read CGSpace-specific mappings to
ISO 3166-1 values.
2020-08-01 11:49:22 +03:00
b3a993d5bd
src/main/java: Fix comment alignment 2020-08-01 11:46:13 +03:00
0f2081db51
src/main/java: Correctly map common_name and official_name
I forgot to fix these so that they map exactly to the ISO 3166-1
JSON so that GSON can deserialize them automatically.
2020-08-01 11:44:54 +03:00
91a4367f38
src/main/java: Add comment 2020-08-01 11:01:27 +03:00
8c23277382
src/main/resources: Start collecting CGSpace countries
I will use the same format as the ISO 3166-1 JSON to make parsing
easier. I will add a new "cgspace_name" key to indicate our custom
name, though the codes will map to the standard ISO 3166-1 codes.
2020-08-01 09:31:26 +03:00
6477b923b6
Add working tagging of ISO 3166-1 countries
If an item has country metadata (cg.coverage.country) and no alpha
codes, we check for name matches in ISO 3166-1 and add alpha_2 codes.
The name matching checks for a case-insensitive match on either an
ISO 3166-1 name, official name, or common name.
2020-08-01 00:05:21 +03:00
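
The name matching described in 6477b923b6 can be illustrated with the following sketch. It assumes a simple in-memory list of entries. `CountryEntry` and `CountryMatcher` are hypothetical names used only for this example, and the two sample entries follow the shape of the iso-codes data, where `official_name` and `common_name` are optional.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value class mirroring the fields of one ISO 3166-1 entry.
final class CountryEntry {
    final String name;
    final String officialName; // may be null
    final String commonName;   // may be null
    final String alpha2;

    CountryEntry(String name, String officialName, String commonName, String alpha2) {
        this.name = name;
        this.officialName = officialName;
        this.commonName = commonName;
        this.alpha2 = alpha2;
    }
}

final class CountryMatcher {
    // Return the alpha_2 code of the first entry whose name, official name,
    // or common name equals the value ignoring case, or null if none match.
    static String findAlpha2(String value, List<CountryEntry> countries) {
        for (CountryEntry c : countries) {
            // equalsIgnoreCase(null) is false, so missing names are safe.
            if (value.equalsIgnoreCase(c.name)
                    || value.equalsIgnoreCase(c.officialName)
                    || value.equalsIgnoreCase(c.commonName)) {
                return c.alpha2;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<CountryEntry> countries = Arrays.asList(
                new CountryEntry("Kenya", "Republic of Kenya", null, "KE"),
                new CountryEntry("Bolivia, Plurinational State of",
                        "Plurinational State of Bolivia", "Bolivia", "BO"));

        System.out.println(findAlpha2("bolivia", countries));
        System.out.println(findAlpha2("REPUBLIC OF KENYA", countries));
    }
}
```

A linear scan is adequate for a list of a few hundred countries. A lower-cased lookup map would be an alternative if the matching ever became a bottleneck.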
6995d7a864
Match alpha_2 and alpha_3 JSON elements with class
For GSON to automatically map these to our class we need to make
sure the fields use the same names as the JSON elements.
2020-08-01 00:02:27 +03:00
edd08c859a
CountryCodeTagger.java: Remove FileReader import
We are using an InputStream now.
2020-07-31 23:37:06 +03:00
94ceabb732
Close BufferedReader after we use it 2020-07-31 22:26:50 +03:00
9089ffb66f
Add TODO about using try-with-resources
This would automatically close the BufferedReader after we are done
with it, but it also means that the JSON object we create is lost
when we exit the try() scope...

See: https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
2020-07-31 22:26:33 +03:00
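
One common way around the scoping concern raised in 9089ffb66f is to declare the result variable before the try block and assign it inside. The sketch below uses only the standard library and builds a String where the real task would build its JSON object. `ReadThenClose` and `readAll` are hypothetical names, and this is not necessarily how the project resolved the TODO.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ReadThenClose {
    static String readAll(Reader source) throws IOException {
        // Declared outside the try block, so it is still in scope afterwards.
        String parsed;
        try (BufferedReader reader = new BufferedReader(source)) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
            // A JSON parser call that takes a Reader could go here instead.
            parsed = sb.toString();
        } // the reader is closed automatically here
        return parsed;
    }

    public static void main(String[] args) throws IOException {
        String text = readAll(new StringReader("{\"3166-1\": []}"));
        System.out.print(text);
    }
}
```

Only the reader is scoped to the try block. Anything assigned to a variable declared before it survives after the resource is closed.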
af708933b2
Use BufferedReader for iso-codes JSON 2020-07-31 22:25:09 +03:00
bb9e53b220
README.md: Add note about iso-codes license
Debian's iso-codes project uses the LGPL v2.1 license.

See: https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/COPYING
2020-07-31 22:23:35 +03:00
d11bd00fa9
Use country vocabs from package resources
Import a local copy of iso_3166-1.json from iso-codes version 4.5.0
so we don't need to load it from the system.

See: https://salsa.debian.org/iso-codes-team/iso-codes
2020-07-31 22:18:32 +03:00
01be5c69ba
Add .gitignore 2020-07-31 22:01:05 +03:00
4cf0626385
Update comments 2020-07-31 22:00:41 +03:00
f62b50f5a1
Use the @SerializedName annotation for ISO 3166-1
Our Java class needs to match the input JSON structure exactly, but
we can't use "3166-1" as a variable name so we tell GSON to use the
name "3166-1" when deserializing to countries.
2020-07-31 21:52:48 +03:00
968bd354fe
Optimize imports 2020-07-31 21:42:41 +03:00
89f1734a9a
Initial commit 2020-07-31 21:40:15 +03:00