mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-12-22 20:22:18 +01:00
csv-metadata-quality/csv_metadata_quality
Alan Orth 898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT PublicaÃ§ao
    - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación
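
The round trip described above can be reproduced with the standard library alone. This is only an illustrative sketch: the helper name `garble` is hypothetical and is not part of csv-metadata-quality.

```python
# Reproduce the mojibake from the example using only the standard library.
correct = ["CIAT Publicaçao", "CIAT Publicación"]


def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the same bytes as CP-1252."""
    return text.encode("utf-8").decode("cp1252")


garbled = [garble(s) for s in correct]

for before, after in zip(correct, garbled):
    print(f"{before!r} -> {after!r}")
```

Each accented character occupies two bytes in UTF-8, and CP-1252 maps each of those bytes to its own character, which is why one letter turns into two.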

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent "fixes
text for you" (ftfy) Python library, then check if a weird string
encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
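
A rough sketch of the "does it encode as CP-1252, and if so try to fix it" step is shown below. It deliberately does not use ftfy: the weirdness test is replaced by a cruder assumption, namely that a string is suspect if it encodes as CP-1252 and the resulting bytes decode as UTF-8 to something different. The function name `try_fix_mojibake` is hypothetical and this is not the code in check.py or fix.py.

```python
from typing import Optional


def try_fix_mojibake(text: str) -> Optional[str]:
    """Return a repaired string, or None if the round trip does not apply.

    Assumption: this stands in for the ftfy-based "weirdness" test named in
    the commit message. It is a simpler heuristic, not the project's code.
    """
    try:
        # Undo the mistake: recover the original bytes, then decode as UTF-8.
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in CP-1252, or the bytes are not valid UTF-8.
        return None
    # Plain ASCII survives the round trip unchanged, so nothing to fix.
    return repaired if repaired != text else None


for candidate in ["CIAT PublicaciÃ³n", "CIAT Publicación", "CIAT"]:
    print(f"{candidate!r} -> {try_fix_mojibake(candidate)!r}")
```

Like any such fix, this can misfire on legitimate text that happens to survive the round trip, which is presumably why the commit labels the fixes "unsafe".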
2021-03-19 10:22:21 +02:00
__init__.py Refactor as package with subpackages 2019-07-26 22:11:10 +03:00
__main__.py Sort imports with isort 2019-08-29 01:15:04 +03:00
app.py Add checks and unsafe fixes for mojibake 2021-03-19 10:22:21 +02:00
check.py Add checks and unsafe fixes for mojibake 2021-03-19 10:22:21 +02:00
experimental.py csv_metadata_quality/experimental.py: Move all imports to top of file 2021-03-16 16:13:34 +02:00
fix.py Add checks and unsafe fixes for mojibake 2021-03-19 10:22:21 +02:00
util.py Add checks and unsafe fixes for mojibake 2021-03-19 10:22:21 +02:00
version.py Version 0.4.7 2021-03-17 10:00:34 +02:00