Add checks and unsafe fixes for mojibake

This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters. For example, a file encoded in
UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft
Excel, and saved again as UTF-8. You will see strings like this in
the resulting file:

  - CIAT PublicaÃ§ao
  - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

  - CIAT Publicaçao
  - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent "fixes
text for you" (ftfy) Python library, then check if a weird string
encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
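
A minimal sketch of the round trip described above, using only the Python standard library. It is not the code added to util.py in this commit (that relies on ftfy to decide whether a string is "weird"); the helper name `try_fix` is hypothetical and only reverses a single round of UTF-8 text being read as CP-1252.

```python
# Illustration only: reproduce the mojibake from the commit message and undo it.
original = "CIAT Publicación"

# A UTF-8 file opened as CP-1252: the two bytes of "ó" become two characters.
garbled = original.encode("utf-8").decode("cp1252")


def try_fix(text):
    """Hypothetical helper: reverse one round of UTF-8-decoded-as-CP-1252.

    If the text does not encode as CP-1252, or the resulting bytes are not
    valid UTF-8, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(garbled)
print(try_fix(garbled))
```

Without a separate "is this string weird?" test, a round trip like this can silently corrupt text that was already correct in rare cases, which is presumably why the fix is described as unsafe and gated behind the ftfy-based check.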
2025-08-23 05:11:49 +02:00 · 2021-03-19 10:22:21 +02:00
parent e92ec5d371
commit 898bb412c3
5 changed files with 85 additions and 1 deletions
--- a/csv_metadata_quality/check.py
+++ b/csv_metadata_quality/check.py
@@ -11,6 +11,8 @@ from pycountry import languages
 from stdnum import isbn as stdnum_isbn
 from stdnum import issn as stdnum_issn

+from csv_metadata_quality.util import is_mojibake
+

 def issn(field):
    """Check if an ISSN is valid.
@@ -345,3 +347,22 @@ def duplicate_items(df):
                )
            else:
                items.append(item_title_type_date)
+
+
+def mojibake(field, field_name):
+    """Check for mojibake (text that was encoded in one encoding and decoded in
+    another, perhaps multiple times). See util.py.
+
+    Prints the string if it contains suspected mojibake.
+    """
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    if is_mojibake(field):
+        print(
+            f"{Fore.YELLOW}Possible encoding issue ({field_name}): {Fore.RESET}{field}"
+        )
+
+    return