This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.
For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:
- CIAT Publicaçao
- CIAT Publicación
The correct version of these in UTF-8 would be:
- CIAT Publicaçao
- CIAT Publicación
I use a code snippet from Martijn Pieters on StackOverflow to de-
tect whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.
See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.
See: https://docs.python.org/3/library/unicodedata.html