csv-metadata-quality/csv_metadata_quality/app.py

from csv_metadata_quality.version import VERSION
import argparse
import csv_metadata_quality.check as check
import csv_metadata_quality.fix as fix
import pandas as pd
import re
import signal
import sys


def parse_args(argv):
    parser = argparse.ArgumentParser(description='Metadata quality checker and fixer.')
    parser.add_argument('--agrovoc-fields', '-a', help='Comma-separated list of fields to validate against AGROVOC, for example: dc.subject,cg.coverage.country')
    parser.add_argument('--input-file', '-i', help='Path to input file. Can be UTF-8 CSV or Excel XLSX.', required=True, type=argparse.FileType('r', encoding='UTF-8'))
    parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))
    parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')
    parser.add_argument('--version', '-V', action='version', version=f'CSV Metadata Quality v{VERSION}')
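    # NOTE: argv is not used here; when parse_args() is called without
    # arguments, argparse falls back to sys.argv[1:].
    args = parser.parse_args()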

    return args


def signal_handler(signal, frame):
    sys.exit(1)


def run(argv):
    args = parse_args(argv)

    # set the signal handler for SIGINT (^C)
    signal.signal(signal.SIGINT, signal_handler)

    # Read all fields as strings so dates don't get converted from 1998 to 1998.0
    df = pd.read_csv(args.input_file, dtype=str)

    for column in df.columns.values.tolist():
        # Fix: whitespace
        df[column] = df[column].apply(fix.whitespace)

        # Fix: newlines
        if args.unsafe_fixes:
            df[column] = df[column].apply(fix.newlines)

        # Fix: unnecessary Unicode
        df[column] = df[column].apply(fix.unnecessary_unicode)

        # Check: invalid multi-value separator
        df[column] = df[column].apply(check.separators)

        # Check: suspicious characters
        df[column] = df[column].apply(check.suspicious_characters, field_name=column)

        # Fix: invalid multi-value separator
        if args.unsafe_fixes:
            df[column] = df[column].apply(fix.separators)
            # Run whitespace fix again after fixing invalid separators
            df[column] = df[column].apply(fix.whitespace)

        # Fix: duplicate metadata values
        df[column] = df[column].apply(fix.duplicates)

        # Check: invalid AGROVOC subject
        if args.agrovoc_fields:
            # Identify fields the user wants to validate against AGROVOC
            for field in args.agrovoc_fields.split(','):
                if column == field:
                    df[column] = df[column].apply(check.agrovoc, field_name=column)

        # Check: invalid language
        match = re.match(r'^.*?language.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.language)

        # Check: invalid ISSN
        match = re.match(r'^.*?issn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.issn)

        # Check: invalid ISBN
        match = re.match(r'^.*?isbn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.isbn)

        # Check: invalid date
        match = re.match(r'^.*?date.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.date, field_name=column)

        # Check: filename extension
        if column == 'filename':
            df[column] = df[column].apply(check.filename_extension)

    # Write
    df.to_csv(args.output_file, index=False)

    # Close the input and output files before exiting
    args.input_file.close()
    args.output_file.close()

    sys.exit(0)
Add option to print version with `--version` or `-V` I guess `-v` is more commonly used for "verbose" so I will use the short option of `-V` for version. 2019-08-01 23:09:08 +02:00			`from csv_metadata_quality.version import VERSION`
Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes. 2019-07-28 19:31:57 +02:00			`import argparse`
Add ISSN and ISBN checks using python-stdnum 2019-07-26 22:14:10 +02:00			`import csv_metadata_quality.check as check`
Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6 2019-07-26 21:11:10 +02:00			`import csv_metadata_quality.fix as fix`
			`import pandas as pd`
Add date validation I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system. 2019-07-28 15:11:36 +02:00			`import re`
csv_metadata_quality/app.py: Handle Ctrl-C Instead of printing an ugly two-page stack trace. 2019-08-03 20:11:57 +02:00			`import signal`
			`import sys`
Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6 2019-07-26 21:11:10 +02:00
Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes. 2019-07-28 19:31:57 +02:00
			`def parse_args(argv):`
			`parser = argparse.ArgumentParser(description='Metadata quality checker and fixer.')`
Rework AGROVOC validation AGROVOC validation is now disabled by default, but can be enabled on a field-by-field basis. For example, countries and regions are also present in AGROVOC. Fields with these values can be enabled using the new `--agrovoc-fields` option. I reworked the script output to show the field name when printing an invalid term so that the user knows in which field the term is. 2019-08-01 22:51:58 +02:00			`parser.add_argument('--agrovoc-fields', '-a', help='Comma-separated list of fields to validate against AGROVOC, for example: dc.subject,cg.coverage.country')`
Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes. 2019-07-28 19:31:57 +02:00			`parser.add_argument('--input-file', '-i', help='Path to input file. Can be UTF-8 CSV or Excel XLSX.', required=True, type=argparse.FileType('r', encoding='UTF-8'))`
			`parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))`
Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose. 2019-07-28 21:53:39 +02:00			`parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')`
Add option to print version with `--version` or `-V` I guess `-v` is more commonly used for "verbose" so I will use the short option of `-V` for version. 2019-08-01 23:09:08 +02:00			`parser.add_argument('--version', '-V', action='version', version=f'CSV Metadata Quality v{VERSION}')`
Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes. 2019-07-28 19:31:57 +02:00			`args = parser.parse_args()`

			`return args`


csv_metadata_quality/app.py: Handle Ctrl-C Instead of printing an ugly two-page stack trace. 2019-08-03 20:11:57 +02:00			`def signal_handler(signal, frame):`
			`sys.exit(1)`


Re-work as a proper standalone Python package Add a setup.py so that installation is easier and a standalone CLI script called csv-metadata-quality is provided. Now the user only needs to run this from a virtual environment inside the project directory: $ pip install . Eventually I could publish this on PyPi when I settle on a more appropriate package name. See: https://packaging.python.org/tutorials/packaging-projects/ See: https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/ 2019-07-31 16:34:36 +02:00			`def run(argv):`
Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes. 2019-07-28 19:31:57 +02:00			`args = parse_args(argv)`

csv_metadata_quality/app.py: Handle Ctrl-C Instead of printing an ugly two-page stack trace. 2019-08-03 20:11:57 +02:00			`# set the signal handler for SIGINT (^C)`
			`signal.signal(signal.SIGINT, signal_handler)`

Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6 2019-07-26 21:11:10 +02:00			`# Read all fields as strings so dates don't get converted from 1998 to 1998.0`
Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes. 2019-07-28 19:31:57 +02:00			`df = pd.read_csv(args.input_file, dtype=str)`
Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6 2019-07-26 21:11:10 +02:00
			`for column in df.columns.values.tolist():`
csv_metadata_quality/app.py: Improve comments 2019-07-29 15:24:35 +02:00			`# Fix: whitespace`
Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6 2019-07-26 21:11:10 +02:00			`df[column] = df[column].apply(fix.whitespace)`

Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly nor typing an "Enter" character after ^V worked. Grrr. 2019-07-30 19:05:12 +02:00			`# Fix: newlines`
			`if args.unsafe_fixes:`
			`df[column] = df[column].apply(fix.newlines)`

Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI. 2019-07-29 15:38:10 +02:00			`# Fix: unnecessary Unicode`
			`df[column] = df[column].apply(fix.unnecessary_unicode)`

csv_metadata_quality/app.py: Improve comments 2019-07-29 15:24:35 +02:00			`# Check: invalid multi-value separator`
Add check for invalid multi-value separators 2019-07-26 22:48:24 +02:00			`df[column] = df[column].apply(check.separators)`

Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data. 2019-07-29 16:08:49 +02:00			`# Check: suspicious characters`
Improve suspicious character detection Now it will print just the part of the metadata value that contains the suspicious character (up to 80 characters, so we don't make the line break on terminals that use 80 character width by default). Also, print the name of the field in which the metadata value is so that it is easier for the user to locate. 2019-08-09 00:22:59 +02:00			`df[column] = df[column].apply(check.suspicious_characters, field_name=column)`
Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data. 2019-07-29 16:08:49 +02:00
csv_metadata_quality/app.py: Improve comments 2019-07-29 15:24:35 +02:00			`# Fix: invalid multi-value separator`
Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose. 2019-07-28 21:53:39 +02:00			`if args.unsafe_fixes:`
			`df[column] = df[column].apply(fix.separators)`
			`# Run whitespace fix again after fixing invalid separators`
			`df[column] = df[column].apply(fix.whitespace)`
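
The "unsafe" separator fix described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the package's `fix.separators`: the helper name `fix_single_pipes` and the regular expression are hypothetical. The idea is to turn a lone `|` into DSpace's `||` multi-value separator while leaving an existing `||` untouched.

```python
import re

# Hypothetical pattern (an assumption, not taken from the package): a "|" that
# is neither preceded nor followed by another "|".
LONE_PIPE = re.compile(r"(?<!\|)\|(?!\|)")


def fix_single_pipes(value):
    """Replace lone "|" characters with the DSpace "||" separator."""
    return LONE_PIPE.sub("||", value)


print(fix_single_pipes("Kenya|Uganda"))   # Kenya||Uganda
print(fix_single_pipes("Kenya||Uganda"))  # Kenya||Uganda (already valid)
```

A rewrite like this would count as unsafe for the reason the commit message gives: a value that legitimately contains a single `|` would be changed too.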

Add fix for duplicate metadata values 2019-07-29 17:05:03 +02:00			`# Fix: duplicate metadata values`
			`df[column] = df[column].apply(fix.duplicates)`

Add support for validating subjects against AGROVOC Checks values in the dc.subject or dcterms.subject field against the AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py. See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/ See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py 2019-07-29 23:30:31 +02:00			`# Check: invalid AGROVOC subject`
Rework AGROVOC validation AGROVOC validation is now disabled by default, but can be enabled on a field-by-field basis. For example, countries and regions are also present in AGROVOC. Fields with these values can be enabled using the new `--agrovoc-fields` option. I reworked the script output to show the field name when printing an invalid term so that the user knows in which field the term is. 2019-08-01 22:51:58 +02:00			`if args.agrovoc_fields:`
			`# Identify fields the user wants to validate against AGROVOC`
			`for field in args.agrovoc_fields.split(','):`
			`if column == field:`
			`df[column] = df[column].apply(check.agrovoc, field_name=column)`
Add support for validating subjects against AGROVOC Checks values in the dc.subject or dcterms.subject field against the AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py. See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/ See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py 2019-07-29 23:30:31 +02:00
Add support for validating languages Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other". 2019-07-29 17:59:42 +02:00			`# Check: invalid language`
			`match = re.match(r'^.*?language.*$', column)`
			`if match is not None:`
			`df[column] = df[column].apply(check.language)`

csv_metadata_quality/app.py: Improve comments 2019-07-29 15:24:35 +02:00			`# Check: invalid ISSN`
csv_metadata_quality/app.py: Use regex in column match Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens. 2019-07-28 16:27:20 +02:00			`match = re.match(r'^.*?issn.*$', column)`
			`if match is not None:`
Add ISSN and ISBN checks using python-stdnum 2019-07-26 22:14:10 +02:00			`df[column] = df[column].apply(check.issn)`

csv_metadata_quality/app.py: Improve comments 2019-07-29 15:24:35 +02:00			`# Check: invalid ISBN`
csv_metadata_quality/app.py: Use regex in column match Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens. 2019-07-28 16:27:20 +02:00			`match = re.match(r'^.*?isbn.*$', column)`
			`if match is not None:`
Add ISSN and ISBN checks using python-stdnum 2019-07-26 22:14:10 +02:00			`df[column] = df[column].apply(check.isbn)`

csv_metadata_quality/app.py: Improve comments 2019-07-29 15:24:35 +02:00			`# Check: invalid date`
Add date validation I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system. 2019-07-28 15:11:36 +02:00			`match = re.match(r'^.*?date.*$', column)`
			`if match is not None:`
Add column name to output in date checks This makes it easier to understand where the error is in case a CSV has multiple date fields, for example: Missing date (dc.date.issued). Missing date (dc.date.issued[]). If you have 126 items and you get 126 "Missing date" messages then it's likely that 100 of the items have dates in one field, and the others have dates in other field. 2019-08-21 14:31:12 +02:00			`df[column] = df[column].apply(check.date, field_name=column)`
Add date validation I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system. 2019-07-28 15:11:36 +02:00
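
The date rules in the commit message above (a single value in YYYY, YYYY-MM, or YYYY-MM-DD form, with missing and multiple dates flagged) might look something like the sketch below. The helper `is_valid_issue_date` is hypothetical and the package's `check.date` may well work differently. In particular, this sketch returns a boolean, while the calls in `app.py` assign the result of each check back to the column.

```python
from datetime import datetime

# The three formats named in the commit message: YYYY, YYYY-MM and YYYY-MM-DD.
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")


def is_valid_issue_date(value):
    """Return True if value is a single date in one of DATE_FORMATS.

    Hypothetical helper for illustration only.
    """
    # Missing values (empty strings, None, or pandas NaN floats) are invalid.
    if not isinstance(value, str) or not value:
        return False

    # Multiple dates such as "1990||1991" are flagged as well.
    if "||" in value:
        return False

    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue

    return False
```

Note that `datetime.strptime` is somewhat lenient (it accepts a month without a leading zero, for example), so a stricter check would need an additional pattern match.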
Add check for uncommon filenames Generally we want people to upload documents in accessible formats like PDF, Word, Excel, and PowerPoint. This check warns if a file is using an uncommon extension. 2019-08-10 22:41:16 +02:00			`# Check: filename extension`
			`if column == 'filename':`
			`df[column] = df[column].apply(check.filename_extension)`

Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6 2019-07-26 21:11:10 +02:00			`# Write`
Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes. 2019-07-28 19:31:57 +02:00			`df.to_csv(args.output_file, index=False)`
csv_metadata_quality/app.py: Close files before exit 2019-08-04 08:10:19 +02:00
			`# Close the input and output files before exiting`
			`args.input_file.close()`
			`args.output_file.close()`
csv_metadata_quality/app.py: Explicitly exit with success 2019-08-04 08:10:37 +02:00
			`sys.exit(0)`
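
To illustrate the column-name matching used for the ISSN, ISBN, language, and date checks, here is a small standalone sketch. The pattern is the one from the listing; the helper name `is_issn_column` is an assumption added for the example.

```python
import re

# Same pattern app.py uses to decide whether a column holds ISSNs.
ISSN_COLUMN = re.compile(r"^.*?issn.*$")


def is_issn_column(column):
    """Return True if the column name contains "issn" (case-sensitive)."""
    return ISSN_COLUMN.match(column) is not None


for column in ("dc.identifier.issn", "cg.issn", "dc.title"):
    print(f"{column}: {is_issn_column(column)}")
```

Both `dc.identifier.issn` and `cg.issn` match, which is the point of matching on a substring rather than an explicit name. For column names without embedded newlines, the pattern behaves like `"issn" in column`; since it is case-sensitive, a column named `dc.identifier.ISSN` would not match.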