csv-metadata-quality/csv_metadata_quality/util.py

# SPDX-License-Identifier: GPL-3.0-only


import json
from importlib.resources import files

from ftfy.badness import is_bad


def is_nfc(field):
    """Check whether a string is in Unicode Normalization Form C (NFC).

    Python's built-in unicodedata library has the is_normalized() function, but
    it was only introduced in Python 3.8. This utility function was added so
    that the check also works on older Python versions. Note that other parts
    of this module (importlib.resources.files) now require Python >= 3.9.

    See: https://docs.python.org/3/library/unicodedata.html

    Return boolean.
    """

    from unicodedata import normalize

    return field == normalize("NFC", field)


def is_mojibake(field):
    """Determines whether a string contains mojibake.

    We commonly deal with CSV files that were *encoded* in UTF-8, but decoded
    as something else like CP-1252 (Windows Latin). This manifests in the form
    of "mojibake", for example:

        - CIAT PublicaÃ§ao
        - CIAT PublicaciÃ³n

    This uses the excellent "fixes text for you" (ftfy) library to determine
    whether a string contains characters that have been encoded in one encoding
    and decoded in another.

    Inspired by this code snippet from Martijn Pieters on StackOverflow:
    https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python

    Return boolean.
    """
    if not is_bad(field):
        # Nothing weird, should be okay
        return False
    try:
        field.encode("sloppy-windows-1252")
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return False
    else:
        # Encodable as CP-1252, Mojibake alert level high
        return True


def load_spdx_licenses():
    """Returns a Python list of SPDX short license identifiers."""

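    # Open the resource through the object returned by files() rather than the
    # built-in open(), and read it explicitly as UTF-8.
    data_file = files("csv_metadata_quality").joinpath("data/licenses.json")
    with data_file.open(encoding="utf-8") as f:
        licenses = json.load(f)

    # Extract the license ID for each license (avoid shadowing the built-in
    # name "license")
    return [spdx_license["licenseId"] for spdx_license in licenses["licenses"]]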
Add SPDX short license identifier to all Python files

See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/

2021-03-19 15:04:13 +01:00

Use licenses.json from SPDX instead of spdx-license-list

spdx-license-list has been deprecated[1] and already has outdated information compared to recent SPDX data releases. Now I use the JSON license data directly from SPDX[2] (currently version 3.19).

The JSON file is loaded from the package's data directory using Python 3's stdlib functions from importlib[3], though we now need Python 3.9 as a minimum for importlib.resources.files[4].

Also note that the data directory is not properly packaged via setuptools, so this only works for local installs, and not via versions published to pypi, for example (I'm currently not doing this anyways). If I want to publish this in the future I will need to modify setup.py/pyproject.toml to include the data files.

[1] https://gitlab.com/uniqx/spdx-license-list
[2] https://github.com/spdx/license-list-data/blob/main/json/licenses.json
[3] https://copdips.com/2022/09/adding-data-files-to-python-package-with-setup-py.html
[4] https://docs.python.org/3/library/importlib.resources.html#importlib.resources.files

2022-12-13 08:31:21 +01:00
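As a rough illustration of the loading step described above, the sketch below parses a tiny inline sample shaped like SPDX's licenses.json and extracts the short identifiers. The sample data and the helper name `extract_license_ids` are assumptions made for illustration only; the real file contains many more fields and entries.

```python
import json

# Hypothetical, heavily trimmed sample in the shape of SPDX's licenses.json
SAMPLE_LICENSES_JSON = """
{
  "licenseListVersion": "3.19",
  "licenses": [
    {"licenseId": "GPL-3.0-only", "name": "GNU General Public License v3.0 only"},
    {"licenseId": "MIT", "name": "MIT License"}
  ]
}
"""


def extract_license_ids(raw_json):
    """Return the SPDX short identifiers found in a licenses.json document."""
    data = json.loads(raw_json)
    return [entry["licenseId"] for entry in data["licenses"]]


if __name__ == "__main__":
    print(extract_license_ids(SAMPLE_LICENSES_JSON))
```

In the package itself the JSON text comes from the data directory via `files("csv_metadata_quality").joinpath("data/licenses.json")` rather than from an inline string.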

csv_metadata_quality/util.py: update for ftfy 6.0

The sequence_weirdness() heuristic is deprecated. Now we should use is_bad().

See: https://ftfy.readthedocs.io/en/v6.0/heuristic.html
See: https://github.com/rspeer/python-ftfy/blob/master/CHANGELOG.md#version-60-april-2-2021

2021-12-15 20:58:07 +01:00

Add checks and unsafe fixes for mojibake

This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file:

- CIAT PublicaÃ§ao
- CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

- CIAT Publicaçao
- CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python

2021-03-19 09:22:21 +01:00
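The encode/decode mismatch described above can be reproduced with the standard library alone. This is a minimal sketch, not the code used in util.py: the helper names are hypothetical, and it uses Python's strict `cp1252` codec in place of ftfy's `sloppy-windows-1252`, so it will raise on the few byte values that strict CP-1252 leaves undefined.

```python
def make_mojibake(text):
    """Simulate the accident: encode as UTF-8, then wrongly decode as CP-1252."""
    return text.encode("utf-8").decode("cp1252")


def encodes_as_cp1252(text):
    """Return True if every character in text exists in CP-1252."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


def undo_mojibake(text):
    """Reverse the accident; only meaningful for strings that really are mojibake."""
    return text.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    broken = make_mojibake("CIAT Publicaci\u00f3n")
    print(broken)
    print(encodes_as_cp1252(broken))
    print(undo_mojibake(broken))
```

Note that a perfectly good string such as "CIAT Publicación" also encodes as CP-1252, which is why the CP-1252 test is only applied to strings that ftfy has already flagged as weird.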

Add utility function to check normalization

Python's built-in unicodedata library includes the is_normalized() function starting with Python 3.8. This utility function allows us to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html

2020-01-15 11:17:52 +01:00
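To make the normalization check concrete, here is a small self-contained sketch comparing a precomposed string with its decomposed equivalent. The example strings are chosen for illustration; the function body mirrors the comparison described above.

```python
import unicodedata


def is_nfc(field):
    """Portable check: compare the string with its NFC-normalized form."""
    return field == unicodedata.normalize("NFC", field)


if __name__ == "__main__":
    composed = "caf\u00e9"     # "é" as a single code point
    decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

    print(is_nfc(composed))
    print(is_nfc(decomposed))
    # On Python >= 3.8 the standard library offers the same check directly
    print(unicodedata.is_normalized("NFC", decomposed))
```

Both strings render identically, but only the first is already in NFC, so a check like this can flag values that would otherwise fail exact string comparisons.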