Add Unicode normalization

This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2025-08-31 00:52:37 +02:00 · 2020-01-15 11:37:54 +02:00
parent 403b253762
commit 49e3543878
5 changed files with 63 additions and 1 deletions
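
The problem described in the commit message can be reproduced with nothing but the standard library. The following is a minimal sketch, not part of the commit; the variable names are illustrative, and the characters are written as escape sequences so the two forms stay distinguishable in source.

```python
import unicodedata

composed = "\u00e9"      # é as the single code point U+00E9
decomposed = "e\u0301"   # e (U+0065) followed by a combining acute accent (U+0301)

# The two strings render identically but do not compare equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# NFC composes the base letter and the combining mark into one code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```
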
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
 A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.

-Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
+Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.

 ## Functionality

@@ -15,6 +15,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
 - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
 - Remove duplicate metadata values
+- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`

 ## Installation
 The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
@@ -58,6 +59,14 @@ This is considered "unsafe" because it is *theoretically* possible for a single
 ### Newlines
 This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).

+## Unicode Normalization
+[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same, but are not — technically they refer to different code points in the Unicode standard:
+
+- `é` is the Unicode code point `U+00E9`
+- `é` is the Unicode code points `U+0065` + `U+0301`
+
+Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
+
 ## AGROVOC Validation
 You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present in AGROVOC. Enable this validation by specifying a comma-separated list of fields:

--- a/csv_metadata_quality/app.py
+++ b/csv_metadata_quality/app.py
@@ -94,6 +94,11 @@ def run(argv):
            if match is not None:
                df[column] = df[column].apply(fix.comma_space, field_name=column)

+        # Fix: perform Unicode normalization (NFC) to convert decomposed
+        # characters into their canonical composed forms.
+        if args.unsafe_fixes:
+            df[column] = df[column].apply(fix.normalize_unicode, field_name=column)
+
        # Fix: unnecessary Unicode
        df[column] = df[column].apply(fix.unnecessary_unicode)

--- a/csv_metadata_quality/fix.py
+++ b/csv_metadata_quality/fix.py
@@ -201,3 +201,27 @@ def comma_space(field, field_name):
        field = re.sub(r",(\w)", r", \1", field)

    return field
+
+
+def normalize_unicode(field, field_name):
+    """Fix occurrences of decomposed Unicode characters by normalizing them
+    with NFC to their canonical composed forms, for example:
+
+    Ouédraogo, Mathieu → Ouédraogo, Mathieu
+
+    Return normalized string.
+    """
+
+    from unicodedata import is_normalized
+    from unicodedata import normalize
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check if the current string is using normalized Unicode (NFC)
+    if not is_normalized("NFC", field):
+        print(f"Normalizing Unicode ({field_name}): {field}")
+        field = normalize("NFC", field)
+
+    return field
--- a/data/test.csv
+++ b/data/test.csv
@@ -26,3 +26,5 @@ Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-92-9043-823-6,,,,
 "Missing space,after comma",2019-08-27,,,,,,
 Incorrect ISO 639-1 language,2019-09-26,,,es,,,
 Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
+Composéd Unicode,2020-01-14,,,,,,
+Decomposéd Unicode,2020-01-14,,,,,,
--- a/tests/test_fix.py
+++ b/tests/test_fix.py
@@ -66,3 +66,25 @@ def test_fix_comma_space():
    field_name = "dc.contributor.author"

    assert fix.comma_space(value, field_name) == "Orth, Alan S."
+
+
+def test_fix_normalized_unicode():
+    """Test fixing a string that is already in its normalized (NFC) Unicode form."""
+
+    # string using the normalized canonical form of é
+    value = "Ouédraogo, Mathieu"
+
+    field_name = "dc.contributor.author"
+
+    assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
+
+
+def test_fix_decomposed_unicode():
+    """Test fixing a string that contains decomposed Unicode characters."""
+
+    # string using the decomposed form of é
+    value = "Ouédraogo, Mathieu"
+
+    field_name = "dc.contributor.author"
+
+    assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
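
Because the composed and decomposed strings in the tests above look identical when rendered, it can help to check the behavior with explicit escape sequences. The sketch below is an assumption-laden stand-in, not the project's code: `normalize_nfc` is a hypothetical helper that mirrors the logic of `fix.normalize_unicode` without pandas, and it passes non-string values through unchanged rather than returning nothing.

```python
from unicodedata import is_normalized, normalize  # is_normalized needs Python 3.8+


def normalize_nfc(value):
    """Hypothetical stand-in for fix.normalize_unicode, without pandas."""
    # Pass through anything that is not a string (e.g. None for a missing value).
    if not isinstance(value, str):
        return value

    # Only rewrite strings that are not already in NFC.
    if is_normalized("NFC", value):
        return value

    return normalize("NFC", value)


decomposed = "Oue\u0301draogo, Mathieu"  # e + combining acute accent
composed = "Ou\u00e9draogo, Mathieu"     # precomposed é

fixed = normalize_nfc(decomposed)
print(fixed == composed)              # True
print(normalize_nfc(fixed) == fixed)  # True: normalizing again changes nothing
print(normalize_nfc(None))            # None
```

Writing the inputs as `\u` escapes makes it obvious which form each test value uses, even in editors or diff viewers that render both forms the same way.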