diff --git a/README.md b/README.md
index 37a3186..7f88221 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
 A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
 
-Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
+Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
 
 ## Functionality
 
@@ -15,6 +15,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
 - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
 - Remove duplicate metadata values
+- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
 
 ## Installation
 The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
@@ -58,6 +59,14 @@ This is considered "unsafe" because it is *theoretically* possible for a single
 ### Newlines
 This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
 
+## Unicode Normalization
+[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same but are encoded differently. Technically, they refer to different code points in the Unicode standard:
+
+- `é` is the Unicode code point `U+00E9`
+- `é` is the sequence of Unicode code points `U+0065` + `U+0301`
+
+Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
+
 ## AGROVOC Validation
 You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present AGROVOC. Enable this validation by specifying a comma-separated list of fields:
 
diff --git a/csv_metadata_quality/app.py b/csv_metadata_quality/app.py
index 9d8a2ce..5ece6c6 100644
--- a/csv_metadata_quality/app.py
+++ b/csv_metadata_quality/app.py
@@ -94,6 +94,11 @@ def run(argv):
             if match is not None:
                 df[column] = df[column].apply(fix.comma_space, field_name=column)
 
+        # Fix: perform Unicode normalization (NFC) to convert decomposed
+        # characters into their canonical forms.
+        if args.unsafe_fixes:
+            df[column] = df[column].apply(fix.normalize_unicode, field_name=column)
+
         # Fix: unnecessary Unicode
         df[column] = df[column].apply(fix.unnecessary_unicode)
 
diff --git a/csv_metadata_quality/fix.py b/csv_metadata_quality/fix.py
index 88115a3..85e5077 100755
--- a/csv_metadata_quality/fix.py
+++ b/csv_metadata_quality/fix.py
@@ -201,3 +201,27 @@ def comma_space(field, field_name):
         field = re.sub(r",(\w)", r", \1", field)
 
     return field
+
+
+def normalize_unicode(field, field_name):
+    """Fix occurrences of decomposed Unicode characters by normalizing them
+    with NFC to their canonical forms, for example:
+
+    Ouédraogo, Mathieu → Ouédraogo, Mathieu
+
+    Return normalized string.
+    """
+
+    from unicodedata import is_normalized
+    from unicodedata import normalize
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check if the current string is using normalized Unicode (NFC)
+    if not is_normalized("NFC", field):
+        print(f"Normalizing Unicode ({field_name}): {field}")
+        field = normalize("NFC", field)
+
+    return field
diff --git a/data/test.csv b/data/test.csv
index ea98f79..54f13dd 100644
--- a/data/test.csv
+++ b/data/test.csv
@@ -26,3 +26,5 @@ Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,
 "Missing space,after comma",2019-08-27,,,,,,
 Incorrect ISO 639-1 language,2019-09-26,,,es,,,
 Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
+Composéd Unicode,2020-01-14,,,,,,
+Decomposéd Unicode,2020-01-14,,,,,,
diff --git a/tests/test_fix.py b/tests/test_fix.py
index e6df205..9b56796 100644
--- a/tests/test_fix.py
+++ b/tests/test_fix.py
@@ -66,3 +66,25 @@ def test_fix_comma_space():
     field_name = "dc.contributor.author"
 
     assert fix.comma_space(value, field_name) == "Orth, Alan S."
+
+
+def test_fix_normalized_unicode():
+    """Test fixing a string that is already in its normalized (NFC) Unicode form."""
+
+    # string using the normalized canonical form of é
+    value = "Ouédraogo, Mathieu"
+
+    field_name = "dc.contributor.author"
+
+    assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
+
+
+def test_fix_decomposed_unicode():
+    """Test fixing a string that contains decomposed Unicode characters."""
+
+    # string using the decomposed form of é
+    value = "Ouédraogo, Mathieu"
+
+    field_name = "dc.contributor.author"
+
+    assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
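
Because the composed and decomposed forms of `é` render identically, the difference this patch targets is easy to lose when reading the diff. The following is a minimal standalone sketch, not part of the patch, that spells both forms with explicit escapes and normalizes them with the standard library. The helper name `to_nfc` and the two constants are hypothetical, and the sketch assumes Python 3.8 or greater, since `unicodedata.is_normalized` was added in that release.

```python
import unicodedata

# é as the single code point U+00E9 (composed form)
COMPOSED = "Ou\u00e9draogo, Mathieu"
# e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301) (decomposed form)
DECOMPOSED = "Oue\u0301draogo, Mathieu"


def to_nfc(value: str) -> str:
    """Return value in NFC form (hypothetical helper, not part of the patch)."""
    # Already-normalized strings are returned unchanged
    if unicodedata.is_normalized("NFC", value):
        return value
    return unicodedata.normalize("NFC", value)


if __name__ == "__main__":
    print(COMPOSED == DECOMPOSED)          # False
    print(len(COMPOSED), len(DECOMPOSED))  # 18 19
    print(to_nfc(DECOMPOSED) == COMPOSED)  # True
```

Unlike `fix.normalize_unicode` in the patch, this sketch does not handle missing values or print a message. It only isolates the NFC step.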