csv-metadata-quality/csv_metadata_quality/app.py

import csv_metadata_quality.check as check
import csv_metadata_quality.fix as fix
import pandas as pd
import re

def main():
    # Read all fields as strings so dates don't get converted from 1998 to 1998.0
    df = pd.read_csv('data/test.csv', dtype=str)

    # Apply fixes and checks to every column
    for column in df.columns.values.tolist():
        # Run whitespace fix on all columns
        df[column] = df[column].apply(fix.whitespace)

        # Run invalid multi-value separator check on all columns
        df[column] = df[column].apply(check.separators)

        # check if column is an issn column like dc.identifier.issn
        match = re.match(r'^.*?issn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.issn)

        # check if column is an isbn column like dc.identifier.isbn
        match = re.match(r'^.*?isbn.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.isbn)

        # check if column is a date column like dc.date.issued
        match = re.match(r'^.*?date.*$', column)
        if match is not None:
            df[column] = df[column].apply(check.date)

    # Write the fixed CSV (the output path is still hard coded)
    df.to_csv('/tmp/test.fixed.csv', index=False)
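
The three column-name tests in main() follow one pattern: a check runs only on columns whose name contains a keyword. A minimal sketch of that idea as a standalone helper follows. The names `COLUMN_PATTERNS` and `checks_for_column` are hypothetical and not part of the package, and `re.search` is used here in place of the anchored `re.match` patterns above.

```python
import re

# Hypothetical lookup table: check name -> pattern that a column name must contain.
# Matching on a substring means both dc.identifier.issn and a future cg.issn qualify.
COLUMN_PATTERNS = {
    "issn": re.compile(r"issn"),
    "isbn": re.compile(r"isbn"),
    "date": re.compile(r"date"),
}


def checks_for_column(column):
    """Return the names of the checks whose pattern occurs in the column name."""
    return [name for name, pattern in COLUMN_PATTERNS.items() if pattern.search(column)]


if __name__ == "__main__":
    for column in ["dc.identifier.issn", "cg.issn", "dc.date.issued", "dc.title"]:
        print(column, checks_for_column(column))
```

Keeping the patterns in one table would mean a new column-specific check needs only one new entry rather than another match/if block, though whether that suits the project is a design choice for its author.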
Commit history for this file (from the blame view, oldest first):

2019-07-26 21:11:10 +02:00  Refactor as package with subpackages
    This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently can be run like this: python -m csv_metadata_quality. CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:14:10 +02:00  Add ISSN and ISBN checks using python-stdnum
2019-07-26 22:14:37 +02:00  Bring test.csv into project
2019-07-26 22:48:24 +02:00  Add check for invalid multi-value separators
2019-07-26 22:49:13 +02:00  csv_metadata_quality/app.py: Add comment
2019-07-26 23:25:30 +02:00  csv_metadata_quality/app.py: Fix path to test.csv
2019-07-27 22:09:16 +02:00  Main function should be "main()"
2019-07-28 15:11:36 +02:00  Add date validation
    I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.
2019-07-28 16:27:20 +02:00  csv_metadata_quality/app.py: Use regex in column match
    Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens.
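
The date-validation commit message describes three cases: a missing date, multiple dates joined by "||", and a value that is not YYYY, YYYY-MM, or YYYY-MM-DD. The sketch below illustrates those rules only. The function name `issue_date_problem` and its return values are assumptions made for this example, and it is not the actual `check.date` implementation, which is not shown on this page.

```python
import re
from datetime import datetime

# Accept only the three shapes named in the commit message.
DATE_SHAPE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

# strptime format to use for each accepted string length.
FORMATS = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}


def issue_date_problem(field):
    """Return a short description of the problem, or None if the date looks valid.

    Hypothetical helper for illustration only.
    """
    # pandas represents empty cells as NaN, and NaN compares unequal to itself.
    if field is None or field != field:
        return "missing date"

    text = str(field).strip()
    if text == "":
        return "missing date"

    # Multiple values are valid metadata but have no practical use for issue dates.
    if "||" in text:
        return "multiple dates"

    if not DATE_SHAPE.match(text):
        return "invalid date"

    # The shape is right; let strptime reject impossible values such as month 13.
    try:
        datetime.strptime(text, FORMATS[len(text)])
    except ValueError:
        return "invalid date"

    return None


if __name__ == "__main__":
    for value in ["1990", "1990-07", "1990-07-28", "1990||1991", "1990-13", ""]:
        print(repr(value), "->", issue_date_problem(value))
```

Checking the shape with a regular expression before calling `strptime` is deliberate: `strptime` on its own accepts a single-digit month such as "1990-7", which is not one of the three formats listed.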