csv-metadata-quality/csv_metadata_quality/app.py

import csv_metadata_quality.check as check
import csv_metadata_quality.fix as fix
import pandas as pd
import re

def main():
    # Read all fields as strings so dates don't get converted from 1998 to 1998.0
    #df = pd.read_csv('/home/aorth/Downloads/2019-07-26-Bioversity-Migration.csv', dtype=str)
    #df = pd.read_csv('/tmp/quality.csv', dtype=str)
    df = pd.read_csv('data/test.csv', dtype=str)

    # Apply fixes and checks column by column
    for column in df.columns.values.tolist():
        # Run whitespace fix on all columns
        df[column] = df[column].apply(fix.whitespace)

        # Run invalid multi-value separator check on all columns
        df[column] = df[column].apply(check.separators)

        if column == 'dc.identifier.issn':
            df[column] = df[column].apply(check.issn)

        if column == 'dc.identifier.isbn':
            df[column] = df[column].apply(check.isbn)

        # check if column is a date column like dc.date.issued
        match = re.match(r'^.*?date.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.date)

    # Write
    df.to_csv('/tmp/test.fixed.csv', index=False)
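
The loop above passes each cell to a function through `Series.apply` and stores whatever that function returns, so every fix or check has to accept a single field and return it. The `fix` and `check` modules are not shown here, so the following is only a minimal sketch of that contract. The names `fix_whitespace` and `check_separators` are hypothetical, and treating a lone `|` as the invalid separator is an assumption; only the `||` multi-value separator comes from the commit history.

```python
import re


def fix_whitespace(field):
    """Hypothetical stand-in for fix.whitespace: trim and collapse spaces."""
    # With dtype=str, pandas leaves empty cells as NaN (a float), so
    # anything that is not a string is passed through untouched.
    if not isinstance(field, str):
        return field

    # Clean each value of a multi-value field separately, then rejoin.
    values = [re.sub(r"\s{2,}", " ", value.strip()) for value in field.split("||")]
    return "||".join(values)


def check_separators(field):
    """Hypothetical stand-in for check.separators: warn, never modify."""
    if not isinstance(field, str):
        return field

    # Assumption: after splitting on "||", any leftover "|" is a mistake.
    if any("|" in value for value in field.split("||")):
        print(f"Invalid multi-value separator: {field}")

    # Return the field unchanged so Series.apply keeps the original data.
    return field
```

Returning the field unchanged from a check is what makes `df[column] = df[column].apply(check.separators)` safe: the assignment becomes a no-op apart from the printed warning.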
Blame for csv_metadata_quality/app.py, one entry per commit, oldest first:

2019-07-26 21:11:10 +02:00  Refactor as package with subpackages
    This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: `python -m csv_metadata_quality`. CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
    Lines: the `fix` and `pandas` imports, the `pd.read_csv` comment and the two commented-out reads, the column loop with the `fix.whitespace` call, and the comment before the final write.

2019-07-26 22:14:10 +02:00  Add ISSN and ISBN checks using python-stdnum
    Lines: the `check` import and the `dc.identifier.issn` and `dc.identifier.isbn` blocks.

2019-07-26 22:14:37 +02:00  Bring test.csv into project
    Lines: `df.to_csv('/tmp/test.fixed.csv', index=False)`

2019-07-26 22:48:24 +02:00  Add check for invalid multi-value separators
    Lines: the separator comment and the `check.separators` call.

2019-07-26 22:49:13 +02:00  csv_metadata_quality/app.py: Add comment
    Lines: `# Run whitespace fix on all columns`

2019-07-26 23:25:30 +02:00  csv_metadata_quality/app.py: Fix path to test.csv
    Lines: `df = pd.read_csv('data/test.csv', dtype=str)`

2019-07-27 22:09:16 +02:00  Main function should be "main()"
    Lines: `def main():`

2019-07-28 15:11:36 +02:00  Add date validation
    I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.
    Lines: `import re` and the date column block (`re.match(r'^.*?date.*$', column)` followed by the `check.date` call).
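
The date validation commit message names three accepted shapes (YYYY, YYYY-MM, YYYY-MM-DD) and two special cases (a missing date, and several dates joined with `||`). The body of `check.date` is not shown in this file, so the following is only a sketch of those rules under stated assumptions: the function name `describe_date_problem` and its return strings are hypothetical, and it returns a description instead of the field so the outcome is easy to inspect.

```python
import re
from datetime import datetime

# Each accepted shape pairs a strict pattern with a strptime format.
# The pattern is needed because strptime alone would accept "1990-7".
_DATE_SHAPES = [
    (re.compile(r"\d{4}"), "%Y"),
    (re.compile(r"\d{4}-\d{2}"), "%Y-%m"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
]


def describe_date_problem(field):
    """Return a short problem description, or None if the date looks fine."""
    # Empty CSV cells arrive as NaN (a float) when read with dtype=str.
    if not isinstance(field, str) or not field.strip():
        return "missing date"

    # "||" is the multi-value separator, so more than one part means
    # the record carries several dates.
    if len(field.split("||")) > 1:
        return "multiple dates"

    for pattern, date_format in _DATE_SHAPES:
        if pattern.fullmatch(field):
            try:
                # Rejects impossible values such as month 13 or day 31 in June.
                datetime.strptime(field, date_format)
            except ValueError:
                return "invalid date"
            return None

    return "invalid date"
```

A check meant for `Series.apply` in `main()` would print the description and return `field` itself, so that the column is left unchanged.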