15 Commits

Author SHA1 Message Date
d5cf51c464 README.md: Use correct version
I can't figure out how to publish releases on Maven central so let's
stick to SNAPSHOT releases for now.
2020-08-10 15:58:22 +03:00
98c7cfb3a5 README.md: Make README links shorter 2020-08-10 15:38:04 +03:00
58365cdfda Adjust README.md files
We need to try to keep the main README.md clean and move specific
configuration instructions to each separate component.
2020-08-10 15:30:32 +03:00
7190b751e1 Minor edits to FixJpgJpgThumbnails.java
Use primitive types instead of Java generics when we don't need to
do anything special, and break from the loop once our condition is
set.
2020-08-07 22:18:32 +03:00
34acc351a5 src/main/java: Add Javadoc stuff to CountryCodeTagger.java 2020-08-07 12:27:44 +03:00
ec293b3b28 Add CHANGELOG.md 2020-08-07 12:25:48 +03:00
31cd979b61 pom.xml: Move to next development snapshot
Version 5.4-SNAPSHOT
2020-08-07 09:58:00 +03:00
fce81c6003 Version 5.3 2020-08-07 09:57:19 +03:00
26d3cbd778 src/main/java: Tune FixJpgJpgThumbnails a bit
Make sure we don't modify thumbnails if the item is an Infographic
because the JPG in the ORIGINAL bundle might actually be the "real"
file, in which case the THUMBNAIL bundle would have a legitimate
".jpg.jpg" file.

Also, limit the criteria for replacement to original bitstreams
that are less than 100KiB. In my tests I found that we had 4,022
items with ".jpg.jpg" thumbnails, and the average file size of the
originals in those items was 98KiB. Without considering the large
infographics, which are several megabytes apiece, the average of
the remaining 3,765 originals was ~20KiB so 100KiB should be very
safe.
2020-08-07 09:50:03 +03:00
fdc910f93b README.md: Update versions 2020-08-06 16:23:17 +03:00
e0d514e797 pom.xml: Move version to 5.3-SNAPSHOT 2020-08-06 16:17:05 +03:00
fd893d8c4e pom.xml: Release version 5.2 2020-08-06 16:16:13 +03:00
2263ac27e8 src/main/java: Handle more corner cases in FixJpgJpgThumbnails.java
We should make sure we are catching .JPG and .jpg. Also, we should
check for Generated Thumbnails as well as IM Thumbnail.
2020-08-06 16:13:51 +03:00
cf7012d698 pom.xml: Change version to 5.2-SNAPSHOT 2020-08-06 16:13:27 +03:00
7edc60e6ca README.md: Use badge for dspace5 branch 2020-08-06 15:47:33 +03:00
7 changed files with 177 additions and 41 deletions

CHANGELOG.md Normal file

@ -0,0 +1,19 @@
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [5.3] - 2020-08-07
### Changed
- Make sure `FixJpgJpgThumbnails` only replaces thumbnails where the original is less than ~100KiB
- Make sure `FixJpgJpgThumbnails` only replaces thumbnails if the item type is not `Infographic` (because the JPG in the ORIGINAL bundle is the "real" file and it's OK that the thumbnail is ".jpg.jpg")
## [5.2] - 2020-08-06
### Changed
- Make `FixJpgJpgThumbnails` helper check for files named "JPG" as well as "jpg" (case insensitive)
- Make `FixJpgJpgThumbnails` helper replace thumbnails with description `IM Thumbnail` as well as `Generated Thumbnail`
## [5.1] - 2020-08-06
### Added
- Add `FixJpgJpgThumbnails` helper to replace ".jpg.jpg" thumbnails with their originals
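
The rules accumulated across 5.1 to 5.3 can be read as a single predicate. The following is a minimal sketch of that predicate using plain strings instead of DSpace objects, so it can be run on its own; the class name `ThumbnailReplacementRule` and the method `shouldReplace` are hypothetical and are not part of the project, and the 100,000 byte limit is the value used in the `FixJpgJpgThumbnails.java` diff.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical, DSpace-free illustration of the replacement criteria.
final class ThumbnailReplacementRule {
    // Same numeric limit as the diff (originalBitstreamBytes < 100000).
    static final long MAX_ORIGINAL_BYTES = 100_000L;

    static boolean shouldReplace(String thumbnailName, String thumbnailDescription,
                                 String originalName, long originalBytes,
                                 List<String> itemTypes) {
        String lowerThumbnail = thumbnailName.toLowerCase(Locale.ROOT);
        // Only ".jpg.jpg" thumbnails are candidates (case insensitive).
        if (!lowerThumbnail.contains(".jpg.jpg")) {
            return false;
        }
        // Strip one trailing ".jpg" (any case) to get the expected original name.
        String expectedOriginal = lowerThumbnail.endsWith(".jpg")
                ? thumbnailName.substring(0, thumbnailName.length() - ".jpg".length())
                : thumbnailName;
        boolean generated = "Generated Thumbnail".equals(thumbnailDescription)
                || "IM Thumbnail".equals(thumbnailDescription);
        boolean infographic = itemTypes.contains("Infographic");
        return originalName.equalsIgnoreCase(expectedOriginal)
                && generated
                && !infographic
                && originalBytes < MAX_ORIGINAL_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(shouldReplace("cover.jpg.jpg", "Generated Thumbnail",
                "cover.jpg", 20_000L, Arrays.asList("Report")));
        System.out.println(shouldReplace("poster.jpg.jpg", "IM Thumbnail",
                "poster.jpg", 3_000_000L, Arrays.asList("Infographic")));
        System.out.println(shouldReplace("scan.JPG.jpg", "IM Thumbnail",
                "scan.JPG", 20_000L, Collections.<String>emptyList()));
    }
}
```

Keeping the four conditions in one side-effect-free method makes each changelog entry easy to check in isolation, which the real helper cannot do as easily because it works on live `Item` and `Bitstream` objects.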

View File

@ -1,4 +1,4 @@
# CGSpace Java Helpers [![Build Status](https://travis-ci.org/ilri/cgspace-java-helpers.svg?branch=master)](https://travis-ci.org/ilri/dspace-curation-tasks)
# CGSpace Java Helpers [![Build Status](https://travis-ci.org/ilri/cgspace-java-helpers.svg?branch=dspace5)](https://travis-ci.org/ilri/dspace-curation-tasks)
DSpace curation tasks and other Java-based helpers used on the [CGSpace](https://cgspace.cgiar.org) institutional repository:
- **CountryCodeTagger**: add ISO 3166-1 Alpha2 country codes to items based on their existing country metadata
@ -15,7 +15,7 @@ To use these curation tasks in a DSpace project add the following dependency to
<dependency>
<groupId>io.github.ilri.cgspace</groupId>
<artifactId>cgspace-java-helpers</artifactId>
<version>5.1</version>
<version>5.4-SNAPSHOT</version>
</dependency>
```
@ -31,42 +31,14 @@ $ mvn package
Copy the resulting jar to the DSpace `lib` directory:
```
$ cp target/cgspace-java-helpers-5.1.jar ~/dspace/lib
$ cp target/cgspace-java-helpers-5.4-SNAPSHOT.jar ~/dspace/lib
```
## Configuration
Add the curation task to DSpace's `config/modules/curate.cfg`:
Please refer to the appropriate README.md file:
```
plugin.named.org.dspace.curate.CurationTask = \
...
io.github.ilri.cgspace.ctasks.CountryCodeTagger = countrycodetagger \
io.github.ilri.cgspace.ctasks.CountryCodeTagger = countrycodetagger.force
```
And then add a configuration file for the task in `config/modules/countrycodetagger.cfg`:
```
# name of the field containing ISO 3166-1 country names
iso3166.field = cg.coverage.country
# name of the field containing ISO 3166-1 Alpha2 country codes
iso3166-alpha2.field = cg.coverage.iso3166-alpha2
# only add country codes if an item doesn't have any (default false)
#forceupdate = false
```
*Note*: DSpace's curation system supports "profiles" where you can use the same task with different options, for example above I have a normal country code tagger and a "force" variant. To use the "force" variant you create a new configuration file with the overridden options in `config/modules/countrycodetagger.force.cfg`. The "force" profile clears all existing country codes and updates everything.
## Invocation
Once the jar is installed and you have added appropriate configuration in `~/dspace/config/modules`:
```
$ ~/dspace/bin/dspace curate -t countrycodetagger -i 10568/3 -r - -l 500 -s object
```
*Note*: it is very important to set the cache limit (`-l`) and the database transaction scope to something sensible (`object`) if you're curating a community or collection with more than a few hundred items.
- Curation Tasks: [src/main/java/io/github/ilri/cgspace/ctasks/README.md](https://github.com/ilri/cgspace-java-helpers/blob/dspace5/src/main/java/io/github/ilri/cgspace/ctasks/README.md)
- Scripts: [src/main/java/io/github/ilri/cgspace/scripts/README.md](https://github.com/ilri/cgspace-java-helpers/blob/dspace5/src/main/java/io/github/ilri/cgspace/scripts/README.md)
## Notes
This project was initially created according to the [Maven Getting Started Guide](https://maven.apache.org/guides/getting-started/):

View File

@ -6,7 +6,7 @@
<groupId>io.github.ilri.cgspace</groupId>
<artifactId>cgspace-java-helpers</artifactId>
<version>5.1</version>
<version>5.4-SNAPSHOT</version>
<name>cgspace-java-helpers</name>
<url>https://github.com/ilri/cgspace-java-helpers</url>

View File

@ -35,6 +35,11 @@ import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
/**
* @author Alan Orth for the International Livestock Research Institute
* @version 5.1
* @since 1.0
*/
public class CountryCodeTagger extends AbstractCurationTask
{
public class CountryCodeTaggerConfig {

View File

@ -0,0 +1,74 @@
# Curation Tasks
DSpace curation tasks used on the [CGSpace](https://cgspace.cgiar.org) institutional repository:
- **CountryCodeTagger**: add ISO 3166-1 Alpha2 country codes to items based on their existing country metadata
Tested on DSpace 5.8. Read more about the [DSpace curation system](https://wiki.lyrasis.org/display/DSDOC5x/Curation+System).
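
To illustrate the idea behind CountryCodeTagger, here is a minimal sketch of a country name to Alpha2 lookup built only from the JDK's own locale data. It is not the task's implementation, which is not shown here; the class `CountryCodeLookup` and its methods are hypothetical, and the English names known to the JDK may differ from the names stored in a repository's country field.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: map English country names to ISO 3166-1 Alpha2 codes.
final class CountryCodeLookup {
    // Build a lower-cased name -> Alpha2 index from the JDK's ISO country list.
    static Map<String, String> buildNameToAlpha2() {
        Map<String, String> index = new HashMap<>();
        for (String code : Locale.getISOCountries()) {
            Locale country = new Locale.Builder().setRegion(code).build();
            String englishName = country.getDisplayCountry(Locale.ENGLISH);
            index.put(englishName.toLowerCase(Locale.ROOT), code);
        }
        return index;
    }

    // Returns the Alpha2 code, or null when the name is not recognised.
    static String alpha2For(Map<String, String> index, String countryName) {
        return index.get(countryName.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> index = buildNameToAlpha2();
        System.out.println("Kenya -> " + alpha2For(index, "Kenya"));
        System.out.println("Atlantis -> " + alpha2For(index, "Atlantis"));
    }
}
```

A real task would read the names from the field configured as `iso3166.field` and write the codes to the field configured as `iso3166-alpha2.field`; names that return `null` here are the ones that would need manual cleanup.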
## Build and Install
### Integrate into DSpace Build
To use these curation tasks in a DSpace project add the following dependency to `dspace/modules/additions/pom.xml`:
```
<dependency>
<groupId>io.github.ilri.cgspace</groupId>
<artifactId>cgspace-java-helpers</artifactId>
<version>5.3</version>
</dependency>
```
The jar will be copied to all DSpace applications.
### Manual Build and Install
To build the standalone jar:
```
$ mvn package
```
Copy the resulting jar to the DSpace `lib` directory:
```
$ cp target/cgspace-java-helpers-5.3.jar ~/dspace/lib
```
## Configuration
Add the curation task to DSpace's `config/modules/curate.cfg`:
```
plugin.named.org.dspace.curate.CurationTask = \
...
io.github.ilri.cgspace.ctasks.CountryCodeTagger = countrycodetagger \
io.github.ilri.cgspace.ctasks.CountryCodeTagger = countrycodetagger.force
```
And then add a configuration file for the task in `config/modules/countrycodetagger.cfg`:
```
# name of the field containing ISO 3166-1 country names
iso3166.field = cg.coverage.country
# name of the field containing ISO 3166-1 Alpha2 country codes
iso3166-alpha2.field = cg.coverage.iso3166-alpha2
# only add country codes if an item doesn't have any (default false)
#forceupdate = false
```
*Note*: DSpace's curation system supports "profiles", which let you use the same task with different options. For example, the configuration above defines a normal country code tagger and a "force" variant. To use the "force" variant, create a new configuration file with the overridden options in `config/modules/countrycodetagger.force.cfg`. The "force" profile clears all existing country codes and updates everything.
## Invocation
Once the jar is installed and you have added appropriate configuration in `~/dspace/config/modules`:
```
$ ~/dspace/bin/dspace curate -t countrycodetagger -i 10568/3 -r - -l 500 -s object
```
*Note*: it is very important to set the cache limit (`-l`) and the database transaction scope to something sensible (`object`) if you're curating a community or collection with more than a few hundred items.
## TODO
- Make sure the task does not run on items that are still in the workflow
- Check for existence of metadata field before trying to add metadata
- Add tests

View File

@ -13,8 +13,8 @@ import java.sql.SQLException;
/**
* @author Andrea Schweer schweer@waikato.ac.nz for the LCoNZ Institutional Research Repositories
* @author Alan Orth for the International Livestock Research Institute
* @version 5.1-SNAPSHOT
* @since 5.1-SNAPSHOT
* @version 5.4
* @since 5.1
*/
public class FixJpgJpgThumbnails {
@ -73,22 +73,47 @@ public class FixJpgJpgThumbnails {
}
private static void processItem(Item item) throws SQLException, AuthorizeException, IOException {
// Some bitstreams like Infographics are large JPGs and put in the ORIGINAL bundle on purpose so we shouldn't
// swap them.
Metadatum[] itemTypes = item.getMetadataByMetadataString("dc.type");
boolean itemHasInfographic = false;
for (Metadatum itemType: itemTypes) {
if (itemType.value.equals("Infographic")) {
itemHasInfographic = true;
break;
}
}
Bundle[] thumbnailBundles = item.getBundles("THUMBNAIL");
for (Bundle thumbnailBundle : thumbnailBundles) {
Bitstream[] thumbnailBundleBitstreams = thumbnailBundle.getBitstreams();
for (Bitstream thumbnailBitstream : thumbnailBundleBitstreams) {
String thumbnailName = thumbnailBitstream.getName();
if (thumbnailName.contains(".jpg.jpg")) {
if (thumbnailName.toLowerCase().contains(".jpg.jpg")) {
Bundle[] originalBundles = item.getBundles("ORIGINAL");
for (Bundle originalBundle : originalBundles) {
Bitstream[] originalBundleBitstreams = originalBundle.getBitstreams();
for(Bitstream originalBitstream : originalBundleBitstreams) {
for (Bitstream originalBitstream : originalBundleBitstreams) {
String originalName = originalBitstream.getName();
//check if the original file name is the same as the thumbnail name minus the extra ".jpg"
if (originalName.equals(StringUtils.removeEndIgnoreCase(thumbnailName, ".jpg")) && "Generated Thumbnail".equals(thumbnailBitstream.getDescription())) {
long originalBitstreamBytes = originalBitstream.getSize();
/*
- check if the original file name is the same as the thumbnail name minus the extra ".jpg" (ignoring case)
- check if the thumbnail description indicates it was automatically generated
- check that the item does not have dc.type Infographic (JPG could be the "real" item!)
- check if the original bitstream is smaller than 100000 bytes (~100KiB)
- Note: in my tests there were 4022 items with ".jpg.jpg" thumbnails totaling 394549249
bytes for an average of about 98KiB so ~100KiB seems like a good cut off
*/
if (
originalName.equalsIgnoreCase(StringUtils.removeEndIgnoreCase(thumbnailName, ".jpg"))
&& ("Generated Thumbnail".equals(thumbnailBitstream.getDescription()) || "IM Thumbnail".equals(thumbnailBitstream.getDescription()))
&& !itemHasInfographic
&& originalBitstreamBytes < 100000
) {
System.out.println(item.getHandle() + ": replacing " + thumbnailName + " with " + originalName);
//add the original bitstream to the THUMBNAIL bundle
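
The figures quoted in the comment above (4022 items, 394549249 bytes) can be checked with a few lines of arithmetic. This is only a sanity check, with a hypothetical class name `ThresholdArithmetic`; it suggests the "98KiB" average is closer to 98,000 bytes, that is about 98 kB in decimal units or a little under 96 KiB, which is still below the 100000 byte cut-off used in the condition.

```java
// Hypothetical sanity check of the average quoted in the code comment.
final class ThresholdArithmetic {
    static double averageBytes(long totalBytes, long itemCount) {
        return (double) totalBytes / itemCount;
    }

    public static void main(String[] args) {
        long totalBytes = 394_549_249L; // total size quoted in the comment
        long itemCount = 4_022L;        // number of ".jpg.jpg" items quoted in the comment
        double average = averageBytes(totalBytes, itemCount);

        System.out.printf("average: %.1f bytes%n", average);
        System.out.printf("average: %.1f kB (1 kB = 1000 bytes)%n", average / 1000.0);
        System.out.printf("average: %.1f KiB (1 KiB = 1024 bytes)%n", average / 1024.0);
        System.out.println("below 100000-byte cut-off: " + (average < 100_000));
    }
}
```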

View File

@ -0,0 +1,41 @@
# Scripts
Java-based helpers used on the [CGSpace](https://cgspace.cgiar.org) institutional repository:
- **FixJpgJpgThumbnails**: Fix low-quality ".jpg.jpg" thumbnails by replacing them with their originals
Tested on DSpace 5.8. Read more about the [DSpace curation system](https://wiki.lyrasis.org/display/DSDOC5x/Curation+System).
## Build and Install
### Integrate into DSpace Build
To use these scripts in a DSpace project add the following dependency to `dspace/modules/additions/pom.xml`:
```
<dependency>
<groupId>io.github.ilri.cgspace</groupId>
<artifactId>cgspace-java-helpers</artifactId>
<version>5.3</version>
</dependency>
```
The jar will be copied to all DSpace applications.
### Manual Build and Install
To build the standalone jar:
```
$ mvn package
```
Copy the resulting jar to the DSpace `lib` directory:
```
$ cp target/cgspace-java-helpers-5.3.jar ~/dspace/lib
```
### Invocation
The script takes a single argument, which is the handle of a community, collection, or item:
```
$ dspace dsrun io.github.ilri.cgspace.scripts.FixJpgJpgThumbnails 10568/83389
```
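
As a rough illustration of the single-argument contract, a command-line entry point could validate its input before touching the repository. The sketch below is not taken from `FixJpgJpgThumbnails`; the class `HandleArgument`, the method `parse`, and the handle pattern (a numeric or dotted prefix, a slash, then a suffix without whitespace or slashes) are all assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical argument check for a script that expects exactly one handle.
final class HandleArgument {
    // Assumed shape of a handle such as 10568/83389: prefix "/" suffix.
    private static final Pattern HANDLE = Pattern.compile("^[0-9.]+/[^\\s/]+$");

    static String parse(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("expected exactly one argument: a handle");
        }
        String handle = args[0].trim();
        if (!HANDLE.matcher(handle).matches()) {
            throw new IllegalArgumentException("not a handle: " + handle);
        }
        return handle;
    }

    public static void main(String[] args) {
        try {
            System.out.println("would process " + parse(args));
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```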