csv-metadata-quality/csv_metadata_quality/app.py

import argparse
import csv_metadata_quality.check as check
import csv_metadata_quality.fix as fix
import pandas as pd
import re


def parse_args(argv):
    """Parse the command line options and return the argparse namespace."""
    parser = argparse.ArgumentParser(description='Metadata quality checker and fixer.')
    # NOTE: the help text mentions XLSX, but main() only reads the file with pd.read_csv().
    parser.add_argument('--input-file', '-i', help='Path to input file. Can be UTF-8 CSV or Excel XLSX.', required=True, type=argparse.FileType('r', encoding='UTF-8'))
    parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))
    parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')
    # NOTE: argv is not passed on here, so argparse falls back to sys.argv[1:].
    args = parser.parse_args()

    return args


def main(argv):
    """Read the input CSV, run the fixes and checks on every column, and write the result as CSV."""
    args = parse_args(argv)

    # Read all fields as strings so dates don't get converted from 1998 to 1998.0
    df = pd.read_csv(args.input_file, dtype=str)

    for column in df.columns.values.tolist():
        # Fix: whitespace
        df[column] = df[column].apply(fix.whitespace)

        # Fix: unnecessary Unicode
        df[column] = df[column].apply(fix.unnecessary_unicode)

        # Check: invalid multi-value separator
        df[column] = df[column].apply(check.separators)

        # Check: suspicious characters
        df[column] = df[column].apply(check.suspicious_characters)

        # Fix: invalid multi-value separator
        if args.unsafe_fixes:
            df[column] = df[column].apply(fix.separators)
            # Run whitespace fix again after fixing invalid separators
            df[column] = df[column].apply(fix.whitespace)

        # Check: invalid ISSN
        match = re.match(r'^.*?issn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.issn)

        # Check: invalid ISBN
        match = re.match(r'^.*?isbn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.isbn)

        # Check: invalid date
        match = re.match(r'^.*?date.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.date)

    # Write
    df.to_csv(args.output_file, index=False)
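
The `--unsafe-fixes` branch calls `fix.separators`, whose body lives in another module and is not shown in this file. Below is a minimal sketch of what such a fix could look like, assuming (as the project history states) that DSpace separates multiple values with `||` and that a lone `|` is an editing mistake. The name `fix_separators` and the regular expression are illustrative assumptions, not the project's actual code.

```python
import re

# Matches a single "|" that is neither preceded nor followed by another "|".
LONE_PIPE = re.compile(r"(?<!\|)\|(?!\|)")


def fix_separators(field):
    """Hypothetical stand-in for fix.separators: turn a lone "|" into "||"."""
    # With dtype=str, pandas still represents empty cells as float NaN,
    # so anything that is not a string is passed through untouched.
    if not isinstance(field, str):
        return field
    return LONE_PIPE.sub("||", field)


print(fix_separators("Kenya|Uganda||Rwanda"))  # prints Kenya||Uganda||Rwanda
```

Because a legitimate `|` inside a value would be rewritten as well, a fix of this kind is best kept behind an opt-in flag such as `--unsafe-fixes`.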
Line history (git blame), oldest commit first:

2019-07-26 21:11:10 +02:00  Refactor as package with subpackages
    This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: `python -m csv_metadata_quality`. CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
    Lines: `import csv_metadata_quality.fix as fix`, `import pandas as pd`, `# Read all fields as strings so dates don't get converted from 1998 to 1998.0`, `for column in df.columns.values.tolist():`, `df[column] = df[column].apply(fix.whitespace)`, `# Write`

2019-07-26 22:14:10 +02:00  Add ISSN and ISBN checks using python-stdnum
    Lines: `import csv_metadata_quality.check as check`, `df[column] = df[column].apply(check.issn)`, `df[column] = df[column].apply(check.isbn)`

2019-07-26 22:48:24 +02:00  Add check for invalid multi-value separators
    Lines: `df[column] = df[column].apply(check.separators)`

2019-07-28 15:11:36 +02:00  Add date validation
    I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.
    Lines: `import re`, `match = re.match(r'^.*?date.*$', column)`, `if match is not None:`, `df[column] = df[column].apply(check.date)`

2019-07-28 16:27:20 +02:00  csv_metadata_quality/app.py: Use regex in column match
    Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens.
    Lines: `match = re.match(r'^.*?issn.*$', column)`, `match = re.match(r'^.*?isbn.*$', column)`, and the `if match is not None:` line following each

2019-07-28 19:31:57 +02:00  Add support for command line arguments
    Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes.
    Lines: `import argparse`, `def parse_args(argv):` with the parser construction and the `--input-file` and `--output-file` arguments, `args = parser.parse_args()`, `return args`, `def main(argv):`, `args = parse_args(argv)`, `df = pd.read_csv(args.input_file, dtype=str)`, `df.to_csv(args.output_file, index=False)`

2019-07-28 21:53:39 +02:00  Add "unsafe fixes" runtime option
    In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.
    Lines: `parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')`, `if args.unsafe_fixes:`, `df[column] = df[column].apply(fix.separators)`, `# Run whitespace fix again after fixing invalid separators`, `df[column] = df[column].apply(fix.whitespace)`

2019-07-29 15:24:35 +02:00  csv_metadata_quality/app.py: Improve comments
    Lines: `# Fix: whitespace`, `# Check: invalid multi-value separator`, `# Fix: invalid multi-value separator`, `# Check: invalid ISSN`, `# Check: invalid ISBN`, `# Check: invalid date`

2019-07-29 15:38:10 +02:00  Add support for fixing "unnecessary" Unicode
    These are things like non-breaking spaces, "replacement" characters, etc. that add nothing to the metadata and often cause errors during parsing or displaying in a UI.
    Lines: `# Fix: unnecessary Unicode`, `df[column] = df[column].apply(fix.unnecessary_unicode)`

2019-07-29 16:08:49 +02:00  Add check for "suspicious" characters
    These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.
    Lines: `# Check: suspicious characters`, `df[column] = df[column].apply(check.suspicious_characters)`
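
The date validation described in the history (accept YYYY, YYYY-MM, or YYYY-MM-DD; flag missing dates and multi-value dates such as "1990||1991") can be sketched with the standard library alone. This is an illustration under those stated rules, not the project's `check.date`; the names `date_problem` and `check_date` and the wording of the messages are assumptions.

```python
import re
from datetime import datetime

MULTI_VALUE_SEPARATOR = "||"
# Strict shape first, because strptime alone would also accept "1998-7-1".
DATE_SHAPE = re.compile(r"\d{4}(-\d{2}){0,2}")
# Pick the strptime format from the length of the already shape-checked string.
FORMATS = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}


def date_problem(field):
    """Return a short description of the problem, or None if the date is fine."""
    # Empty cells arrive from pandas as float NaN, not as strings.
    if not isinstance(field, str) or field.strip() == "":
        return "missing date"
    if MULTI_VALUE_SEPARATOR in field:
        return "multiple dates"
    if DATE_SHAPE.fullmatch(field) is None:
        return "invalid date"
    try:
        # Rejects impossible values such as month 13 or February 30.
        datetime.strptime(field, FORMATS[len(field)])
    except ValueError:
        return "invalid date"
    return None


def check_date(field):
    """Print a warning if needed and return the field unchanged."""
    problem = date_problem(field)
    if problem is not None:
        print(f"{problem}: {field!r}")
    return field
```

Returning the field unchanged keeps the function compatible with the `df[column] = df[column].apply(...)` pattern used in `main()`, where the result of every check is assigned back to the column.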