csv-metadata-quality/csv_metadata_quality/app.py

import argparse
import csv_metadata_quality.check as check
import csv_metadata_quality.fix as fix
import pandas as pd
import re


def parse_args(argv):
    parser = argparse.ArgumentParser(description='Metadata quality checker and fixer.')
    parser.add_argument('--input-file', '-i', help='Path to input file. Can be UTF-8 CSV or Excel XLSX.', required=True, type=argparse.FileType('r', encoding='UTF-8'))
    parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))
    parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')
    args = parser.parse_args()

    return args


def main(argv):
    args = parse_args(argv)

    # Read all fields as strings so dates don't get converted from 1998 to 1998.0
    df = pd.read_csv(args.input_file, dtype=str)

    # Run the fixes and checks below on every column
    for column in df.columns.values.tolist():
        # Run whitespace fix on all columns
        df[column] = df[column].apply(fix.whitespace)

        # Run invalid multi-value separator check on all columns
        df[column] = df[column].apply(check.separators)

        # Run invalid multi-value separator fix on all columns
        if args.unsafe_fixes:
            df[column] = df[column].apply(fix.separators)
            # Run whitespace fix again after fixing invalid separators
            df[column] = df[column].apply(fix.whitespace)

        # check if column is an issn column like dc.identifier.issn
        match = re.match(r'^.*?issn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.issn)

        # check if column is an isbn column like dc.identifier.isbn
        match = re.match(r'^.*?isbn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.isbn)

        # check if column is a date column like dc.date.issued
        match = re.match(r'^.*?date.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.date)

    # Write the result to the output file as CSV, without the DataFrame index
    df.to_csv(args.output_file, index=False)
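
The loop in `main()` chooses the ISSN, ISBN, and date checks by matching a pattern against the column name, not by comparing against an exact name. The following is a minimal sketch of that selection logic on its own. `COLUMN_PATTERNS` and `checks_for_column` are hypothetical names used only for illustration and are not part of `app.py`; the patterns are the ones used above.

```python
import re

# Hypothetical helper, not part of app.py: it reports which column-specific
# checks the loop in main() would apply to a given column name.
COLUMN_PATTERNS = {
    "issn": re.compile(r"^.*?issn.*$"),
    "isbn": re.compile(r"^.*?isbn.*$"),
    "date": re.compile(r"^.*?date.*$"),
}


def checks_for_column(column):
    """Return the names of the checks whose pattern matches the column name."""
    return [
        name
        for name, pattern in COLUMN_PATTERNS.items()
        if pattern.match(column) is not None
    ]


if __name__ == "__main__":
    for name in ("dc.identifier.issn", "cg.issn", "dc.date.issued", "dc.title"):
        print(name, checks_for_column(name))
```

Because the patterns only require the substring to appear somewhere in the name, both `dc.identifier.issn` and a possible future `cg.issn` select the ISSN check, while a column such as `dc.title` selects none of the three.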
Commits touching this file (from the blame view, oldest first):

2019-07-26 21:11:10 +02:00  Refactor as package with subpackages
    This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently it can be run like this: `python -m csv_metadata_quality`. CSV input and output paths are still hard-coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6

2019-07-26 22:14:10 +02:00  Add ISSN and ISBN checks using python-stdnum

2019-07-26 22:48:24 +02:00  Add check for invalid multi-value separators

2019-07-26 22:49:13 +02:00  csv_metadata_quality/app.py: Add comment

2019-07-28 15:11:36 +02:00  Add date validation
    I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata specifies multiple dates like "1990||1991". That is valid, but it has no practical value in our system.

2019-07-28 16:27:20 +02:00  csv_metadata_quality/app.py: Use regex in column match
    Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens.

2019-07-28 19:31:57 +02:00  Add support for command line arguments
    Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes.

2019-07-28 21:53:39 +02:00  Add "unsafe fixes" runtime option
    In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.
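
The commit messages describe two rules whose implementations live in `check.py` and `fix.py`, which are not shown here: a lone "|" is treated as a mistaken "||" separator, and issue dates are expected to be YYYY, YYYY-MM, or YYYY-MM-DD, present, and single-valued. The following is a rough sketch of how those rules could be expressed, not the project's actual code. The names `fix_separators` and `date_problems` are hypothetical. Since `app.py` assigns the result of each `apply` back to the column, the real check functions presumably return the field itself; this sketch returns a list of problem descriptions instead, purely to make the behaviour easy to inspect.

```python
import re
from datetime import datetime

SEPARATOR = "||"  # DSpace multi-value separator, per the commit message
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")  # the formats named in the commit message


def fix_separators(field):
    """Replace a lone "|" with "||" (hypothetical sketch of the unsafe fix)."""
    if not isinstance(field, str):
        # Empty cells read by pandas arrive as NaN, not str; pass them through.
        return field
    # Match a pipe that has no pipe directly before or after it.
    return re.sub(r"(?<!\|)\|(?!\|)", SEPARATOR, field)


def _parses(value, fmt):
    """Return True if value parses as a date in the given strptime format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def date_problems(field):
    """Return a list of problems found in an issue-date field (hypothetical)."""
    if not isinstance(field, str) or field.strip() == "":
        return ["missing date"]
    problems = []
    values = field.split(SEPARATOR)
    if len(values) > 1:
        problems.append("multiple dates")
    for value in values:
        if not any(_parses(value, fmt) for fmt in DATE_FORMATS):
            problems.append(f"invalid date: {value}")
    return problems


if __name__ == "__main__":
    print(fix_separators("Kenya|Uganda"))
    print(date_problems("1990||1991"))
    print(date_problems("April 1998"))
```

The separator fix is "unsafe" for the reason the commit message gives: the pattern cannot tell a mistyped separator from a "|" that is a legitimate part of the value, so it should only run when the caller opts in with `--unsafe-fixes`.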