From 1008acf35e2753d9194755dd186c5428a0c520e6 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Sat, 13 Mar 2021 12:59:45 +0200
Subject: [PATCH] Always fix invalid multi-value separators

This is no longer classified as "unsafe" as I have yet to see a case
where this was intentional, and it always causes issues when you import
the data in a DSpace repository.
---
 CHANGELOG.md                |  5 +++++
 README.md                   | 12 ++++++------
 csv_metadata_quality/app.py |  7 +++----
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index bd50e1e..9fa7727 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Unreleased
+### Changed
+- Fixing invalid multi-value separators like `|` and `|||` is no longer
+classified as "unsafe" as I have yet to see a case where this was intentional
+
 ## [0.4.6] - 2021-03-11
 ### Added
 - Validation of dcterms.license field against SPDX license identifiers
diff --git a/README.md b/README.md
index e3fe053..a39daa4 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ If you use the DSpace CSV metadata quality checker please cite:
 - Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
 - Validation of licenses against the list of [SPDX license identifiers](https://spdx.org/licenses)
 - Fix leading, trailing, and excessive (ie, more than one) whitespace
-- Fix invalid and unnecessary multi-value separators (`|`) using `--unsafe-fixes`
+- Fix invalid and unnecessary multi-value separators (`|`)
 - Fix problematic newlines (line feeds) using `--unsafe-fixes`
 - Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
@@ -55,14 +55,14 @@ To validate and clean a CSV file you must specify input and output files using t
 $ csv-metadata-quality -i data/test.csv -o /tmp/test.csv
 ```
 
-## Unsafe Fixes
-You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will attempt to fix invalid multi-value separators, remove newlines, and perform Unicode normalization.
-
-### Invalid Multi-Value Separators
-This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
+## Invalid Multi-Value Separators
+While it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. This utility will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
 
 This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
 
+## Unsafe Fixes
+You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines and perform Unicode normalization.
+
 ### Newlines
 This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
 
diff --git a/csv_metadata_quality/app.py b/csv_metadata_quality/app.py
index bdea4ec..c09ec7d 100644
--- a/csv_metadata_quality/app.py
+++ b/csv_metadata_quality/app.py
@@ -111,10 +111,9 @@ def run(argv):
         df[column] = df[column].apply(check.suspicious_characters, field_name=column)
 
         # Fix: invalid and unnecessary multi-value separators
-        if args.unsafe_fixes:
-            df[column] = df[column].apply(fix.separators, field_name=column)
-            # Run whitespace fix again after fixing invalid separators
-            df[column] = df[column].apply(fix.whitespace, field_name=column)
+        df[column] = df[column].apply(fix.separators, field_name=column)
+        # Run whitespace fix again after fixing invalid separators
+        df[column] = df[column].apply(fix.whitespace, field_name=column)
 
         # Fix: duplicate metadata values
         df[column] = df[column].apply(fix.duplicates, field_name=column)
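
For reviewers who want to see the behavior that is now always applied, here is a minimal sketch of the separator fix as the README describes it. The body of `fix.separators` is not part of this patch, so the function below (`fix_separators_sketch`, a hypothetical name) is an assumption about the behavior, not the project's implementation. It only mirrors the `field_name` keyword seen in the `app.py` call.

```python
import re


def fix_separators_sketch(value, field_name=""):
    """Hypothetical stand-in for fix.separators; NOT the project's code.

    field_name is accepted only to mirror the keyword used in app.py;
    this sketch does not use it.
    """
    # Empty cells (None/NaN) are not strings, so pass them through unchanged
    if not isinstance(value, str):
        return value

    # Collapse any run of pipes ("|", "|||", ...) into the valid separator "||"
    fixed = re.sub(r"\|+", "||", value)

    # Remove unnecessary trailing separators, e.g. "Kenya||Tanzania||"
    fixed = re.sub(r"(\|\|)+$", "", fixed)

    return fixed


if __name__ == "__main__":
    for raw in ["Kenya|Tanzania", "Kenya|||Tanzania", "Kenya||Tanzania||"]:
        print(repr(raw), "->", repr(fix_separators_sketch(raw)))
```

Because a lone `|` is always rewritten, a value that legitimately contains a single pipe would be split in two. That is the trade-off this patch accepts by moving the fix out of `--unsafe-fixes`.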