Add Unicode normalization

This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
This commit is contained in:
Alan Orth 2020-01-15 11:37:54 +02:00
parent 403b253762
commit 49e3543878
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
5 changed files with 63 additions and 1 deletions

View File

@ -1,7 +1,7 @@
# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
## Functionality
@ -15,6 +15,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Remove duplicate metadata values
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
@ -58,6 +59,14 @@ This is considered "unsafe" because it is *theoretically* possible for a single
### Newlines
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
## Unicode Normalization
[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same, but are nottechnically they refer to different code points in the Unicode standard:
- `é` is the Unicode code point `U+00E9`
- `é` is the Unicode code points `U+0065` + `U+0301`
Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
## AGROVOC Validation
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present AGROVOC. Enable this validation by specifying a comma-separated list of fields:

View File

@ -94,6 +94,11 @@ def run(argv):
if match is not None:
df[column] = df[column].apply(fix.comma_space, field_name=column)
# Fix: perform Unicode normalization (NFC) to convert decomposed
# characters into their canonical forms.
if args.unsafe_fixes:
df[column] = df[column].apply(fix.normalize_unicode, field_name=column)
# Fix: unnecessary Unicode
df[column] = df[column].apply(fix.unnecessary_unicode)

View File

@ -201,3 +201,27 @@ def comma_space(field, field_name):
field = re.sub(r",(\w)", r", \1", field)
return field
def normalize_unicode(field, field_name):
"""Fix occurrences of decomposed Unicode characters by normalizing them
with NFC to their canonical forms, for example:
Ouédraogo, Mathieu Ouédraogo, Mathieu
Return normalized string.
"""
from unicodedata import is_normalized
from unicodedata import normalize
# Skip fields with missing values
if pd.isna(field):
return
# Check if the current string is using normalized Unicode (NFC)
if not is_normalized("NFC", field):
print(f"Normalizing Unicode ({field_name}): {field}")
field = normalize("NFC", field)
return field

View File

@ -26,3 +26,5 @@ Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,
"Missing space,after comma",2019-08-27,,,,,,
Incorrect ISO 639-1 language,2019-09-26,,,es,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
Composéd Unicode,2020-01-14,,,,,,
Decomposéd Unicode,2020-01-14,,,,,,

1 dc.title dc.date.issued dc.identifier.issn dc.identifier.isbn dc.language.iso dc.subject cg.coverage.country filename
26 Incorrect ISO 639-1 language 2019-09-26 es
27 Incorrect ISO 639-3 language 2019-09-26 spa
28 Composéd Unicode 2020-01-14
29 Decomposéd Unicode 2020-01-14
30

View File

@ -66,3 +66,25 @@ def test_fix_comma_space():
field_name = "dc.contributor.author"
assert fix.comma_space(value, field_name) == "Orth, Alan S."
def test_fix_normalized_unicode():
"""Test fixing a string that is already in its normalized (NFC) Unicode form."""
# string using the normalized canonical form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
def test_fix_decomposed_unicode():
"""Test fixing a string that contains Unicode string."""
# string using the decomposed form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"