csv-metadata-quality/csv_metadata_quality/experimental.py

# SPDX-License-Identifier: GPL-3.0-only

import re

import pandas as pd
import py3langid as langid
from colorama import Fore
from pycountry import languages


def correct_language(row, exclude):
    """Analyze the text used in the title, abstract, and citation fields to
    predict the language being used and compare it with the item's
    dc.language.iso field.

    Function prints a warning if the language field does not match the
    detected language. It always returns None.
    """

    # Initialize some variables at function scope so that we can set them in
    # the loop below and still be able to access them afterwards.
    language = ""
    sample_strings = []
    title = None

    # Iterate over the labels of the current row's values. Before we transposed
    # the DataFrame these were the columns in the CSV, e.g. dc.title and dc.type.
    for label in row.axes[0]:
        # Skip fields with missing values
        if pd.isna(row[label]):
            continue

        # Check if the current label is a language field
        match = re.match(r"^.*?language.*$", label)
        if match is not None:
            # Skip fields with multiple language values
            if "||" in row[label]:
                return

            language = row[label]

        # Extract title if it is present (note that we don't allow excluding
        # the title here because it complicates things).
        match = re.match(r"^.*?title.*$", label)
        if match is not None:
            title = row[label]
            # Append title to sample strings
            sample_strings.append(row[label])

        # Extract abstract if it is present
        match = re.match(r"^.*?abstract.*$", label)
        if match is not None and label not in exclude:
            sample_strings.append(row[label])

        # Extract citation if it is present
        match = re.match(r"^.*?[cC]itation.*$", label)
        if match is not None and label not in exclude:
            sample_strings.append(row[label])

    # Make sure language is not blank and is valid ISO 639-1/639-3 before
    # proceeding with language prediction
    if language != "":
        # Check language value like "es"
        if len(language) == 2:
            if not languages.get(alpha_2=language):
                return
        # Check language value like "spa"
        elif len(language) == 3:
            if not languages.get(alpha_3=language):
                return
        # Language value is something else like "Span", do not proceed
        else:
            return
    # Language is blank, do not proceed
    else:
        return

    # Concatenate all sample strings into one string
    sample_text = " ".join(sample_strings)

    # Restrict the langid detection space to reduce false positives
    langid.set_languages(
        ["ar", "de", "en", "es", "fr", "hi", "it", "ja", "ko", "pt", "ru", "vi", "zh"]
    )
    langid_classification = langid.classify(sample_text)

    # langid returns an ISO 639-1 (alpha 2) representation of the detected
    # language, but the current item's language field might be ISO 639-3
    # (alpha 3), so we use a pycountry Language object to compare both
    # representations and give appropriate error messages that match the
    # format used in the input file.
    detected_language = languages.get(alpha_2=langid_classification[0])
    if len(language) == 2 and language != detected_language.alpha_2:
        print(
            f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_2}): {Fore.RESET}{title}"
        )

    elif len(language) == 3 and language != detected_language.alpha_3:
        print(
            f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_3}): {Fore.RESET}{title}"
        )

    else:
        return
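
The label matching in `correct_language` can be tried in isolation. The following is a minimal sketch, not part of the module: the regular expressions are copied from the function above, while the helper name `classify_label` and the sample labels are hypothetical. Because each pattern is wrapped in `^.*?` and `.*$`, it behaves like a substring test on the label.

```python
import re

# Patterns copied from correct_language(). The helper name and the sample
# labels are hypothetical and only serve the illustration.
FIELD_PATTERNS = {
    "language": re.compile(r"^.*?language.*$"),
    "title": re.compile(r"^.*?title.*$"),
    "abstract": re.compile(r"^.*?abstract.*$"),
    "citation": re.compile(r"^.*?[cC]itation.*$"),
}


def classify_label(label):
    """Return the names of all field patterns that match a column label."""
    return [
        name
        for name, pattern in FIELD_PATTERNS.items()
        if pattern.match(label) is not None
    ]


if __name__ == "__main__":
    for label in (
        "dc.title",
        "dc.language.iso",
        "dc.identifier.citation",
        "dcterms.bibliographicCitation",
        "dc.type",
    ):
        print(label, classify_label(label))
```

Both citation labels are matched because of the `[cC]` character class, and a label such as `dc.type` matches nothing and is ignored.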
Add SPDX short license identifier to all Python files (2021-03-19 15:04:13 +01:00)
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
Line: `# SPDX-License-Identifier: GPL-3.0-only`

csv_metadata_quality/experimental.py: Move all imports to top of file (2021-03-16 15:13:34 +01:00)
PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports
Lines: `import re`, `from pycountry import languages`

Experimental language detection using langid (2019-09-24 17:55:05 +02:00)
Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".
Lines: `import pandas as pd` and every line of `correct_language` that is not attributed to another commit in this list

Use py3langid instead of langid (2023-12-28 12:11:21 +01:00)
Faster and more modern code for Python 3 as a drop-in replacement. See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html
Line: `import py3langid as langid`

Colorize output (2021-02-21 12:01:25 +01:00)
Messages will be colorized: red for errors, yellow for warnings or information, green for fixes.
Lines: `from colorama import Fore` and the two `f"{Fore.YELLOW}Possibly incorrect language ..."` strings

Improve exclude function (2022-09-02 14:59:22 +02:00)
When a user explicitly requests that a field be excluded with -x we skip that field in most checks. Up until now that did not include the item-based checks using a transposed dataframe because we don't know the metadata field names (labels) until we iterate over them. Now the excludes are respected for item-based checks.
Lines: `def correct_language(row, exclude):`, the comment explaining that the title cannot be excluded, and both `if match is not None and label not in exclude:` checks

Apply fixes from fixit (2023-11-22 19:54:50 +01:00)
Apply recommended fix from fixit: RewriteToLiteral: It's slower to call list() than using the empty literal, because the name list must be looked up in the global scope in case it has been rebound.
Line: `sample_strings = []`

csv_metadata_quality/experimental.py: Adjust citation match (2021-10-06 15:13:10 +02:00)
We need to match both of these citation fields: dc.identifier.citation and dcterms.bibliographicCitation.
Line: `match = re.match(r"^.*?[cC]itation.*$", label)`

Don't unnecessarily rewrite DataFrames for checks (2021-03-16 15:04:19 +01:00)
By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.
Line: the final `return`
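
The final comparison in `correct_language` converts the detected alpha-2 code into whichever format the item's language field uses before comparing. The sketch below illustrates only that step, under loud assumptions: the three-entry table `ALPHA_2_TO_3` is a hypothetical stand-in for the pycountry lookup, and `language_mismatch` is an illustrative helper that does not exist in the module.

```python
# Hypothetical stand-in for pycountry: a tiny ISO 639-1 -> ISO 639-3 table.
# The real module looks codes up with pycountry's `languages.get(...)`.
ALPHA_2_TO_3 = {"en": "eng", "es": "spa", "fr": "fra"}


def language_mismatch(declared, detected_alpha_2):
    """Compare a declared language code with a detected alpha-2 code.

    Returns the detected code in the same format as `declared` when the two
    disagree, and None when they agree or when `declared` is neither two nor
    three characters long.
    """
    if len(declared) == 2:
        detected = detected_alpha_2
    elif len(declared) == 3:
        detected = ALPHA_2_TO_3[detected_alpha_2]
    else:
        return None

    return detected if declared != detected else None


if __name__ == "__main__":
    for declared in ("es", "spa", "en", "eng"):
        print(declared, language_mismatch(declared, "es"))
```

Reporting the detected code in the declared format is what lets the warning read `eng` versus `spa` for an ISO 639-3 input file rather than mixing `eng` with `es`.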