mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-06-29 09:33:46 +02:00
csv-metadata-quality/csv_metadata_quality/util.py
Alan Orth 898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT PublicaÃ§ao
    - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación
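
The round trip described above can be reproduced with the standard library alone. This is only an illustrative sketch: `garble` is a hypothetical helper name, and it uses Python's strict built-in `cp1252` codec rather than anything from ftfy.

```python
def garble(text: str) -> str:
    """Simulate UTF-8 bytes being misread as CP-1252 (hypothetical helper)."""
    return text.encode("utf-8").decode("cp1252")


for original in ("CIAT Publicaçao", "CIAT Publicación"):
    once = garble(original)
    # A second pass mimics the "perhaps multiple times" case.
    twice = garble(once)
    print(f"{original!r} -> {once!r} -> {twice!r}")
```

Each non-ASCII character becomes two characters after one pass, because its two UTF-8 bytes are each read as a separate CP-1252 character.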

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent "fixes
text for you" (ftfy) Python library, then check if a weird string
encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
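
As a rough illustration of the "try to fix it" step, the sketch below reverses a single round of UTF-8 read as CP-1252. It is not the fix this commit implements: `try_fix_mojibake` is a hypothetical name, and it relies on Python's strict built-in `cp1252` codec instead of ftfy, so it assumes the damage happened exactly once.

```python
def try_fix_mojibake(field: str) -> str:
    """Undo one round of UTF-8-decoded-as-CP-1252, else return the input.

    Hypothetical helper for illustration only, not part of this repository.
    """
    try:
        # Recover the bytes that were misread, then decode them as UTF-8.
        return field.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not CP-1252 encodable, or the bytes are not valid UTF-8:
        # the string is probably not this kind of mojibake, so leave it alone.
        return field


for value in ("CIAT PublicaciÃ³n", "CIAT Publicación", "CIAT"):
    print(f"{value!r} -> {try_fix_mojibake(value)!r}")
```

A correctly encoded string such as "CIAT Publicación" passes through unchanged, because its CP-1252 bytes do not form valid UTF-8 and the decode step fails.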
2021-03-19 10:22:21 +02:00


from ftfy.badness import sequence_weirdness


def is_nfc(field):
    """Utility function to check whether a string is using normalized Unicode.
    Python's built-in unicodedata library has the is_normalized() function, but
    it was only introduced in Python 3.8. By using a simple utility function we
    are able to run on Python >= 3.6 again.

    See: https://docs.python.org/3/library/unicodedata.html

    Return boolean.
    """

    from unicodedata import normalize

    return field == normalize("NFC", field)


def is_mojibake(field):
    """Determines whether a string contains mojibake.

    We commonly deal with CSV files that were *encoded* in UTF-8, but decoded
    as something else like CP-1252 (Windows Latin). This manifests in the form
    of "mojibake", for example:

        - CIAT PublicaÃ§ao
        - CIAT PublicaciÃ³n

    This uses the excellent "fixes text for you" (ftfy) library to determine
    whether a string contains characters that have been encoded in one encoding
    and decoded in another.

    Inspired by this code snippet from Martijn Pieters on StackOverflow:
    https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python

    Return boolean.
    """
    if not sequence_weirdness(field):
        # Nothing weird, should be okay
        return False

    try:
        # ftfy registers the "sloppy-windows-1252" codec when it is imported
        field.encode("sloppy-windows-1252")
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return False
    else:
        # Encodable as CP-1252, Mojibake alert level high
        return True