Mirror of https://github.com/ilri/csv-metadata-quality.git, synced 2025-05-09 22:56:01 +02:00

Compare commits (26 commits)
c354a3687c
07f80cb37f
89d72540f1
81190d56bb
2af714fb05
cc863a6bdd
113e7cd8b6
bd984f3db5
3f4e84a638
c52b3ed131
884e8f970d
6d02f5026a
e7cb8920db
ed5612fbcf
3247495cee
7255bf4707
3aaf18c290
745306edd7
e324e321a2
232ff99898
13d5221378
3c7a9eb75b
a99fbd8a51
e801042340
62ef2a4489
9ce7dc6716
CHANGELOG.md (16 changed lines)

@@ -4,6 +4,22 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.2.2] - 2019-08-27
+### Changed
+- Output of date checks to include column names (helps debugging in case there are multiple date fields)
+
+### Added
+- Ability to exclude certain fields using `--exclude-fields`
+- Fix for missing space after a comma, ie "Orth,Alan S."
+
+### Improved
+- AGROVOC lookup code
+
 ## [0.2.1] - 2019-08-11
 ### Added
 - Check for uncommon filename extensions
 - Replacement of unneccessary Unicode characters like soft hyphens (U+00AD)

 ## [0.2.0] - 2019-08-09
 ### Added
 - Handle Ctrl-C interrupt gracefully
README.md (13 changed lines)

@@ -1,4 +1,4 @@
-# CSV Metadata Quality [](https://travis-ci.org/alanorth/csv-metadata-quality) [](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
+# CSV Metadata Quality [](https://travis-ci.org/ilri/csv-metadata-quality) [](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
 A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.

 Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
@@ -19,7 +19,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):

 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ pipenv install
 $ pipenv shell
@@ -28,7 +28,7 @@ $ pipenv shell
 Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:

 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ python3 -m venv venv
 $ source venv/bin/activate
@@ -74,6 +74,13 @@ Invalid AGROVOC (cg.coverage.country): KENYAA
 - Reporting / summary
 - Better logging, for example with INFO, WARN, and ERR levels
 - Verbose, debug, or quiet options
 - Warn if an author is shorter than 3 characters?
 - Validate dc.rights field against SPDX? Perhaps with an option like `-m spdx` to enable the spdx module?
 - Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
 - Warn if two items use the same file in `filename` column
 - Add an option to drop invalid AGROVOC subjects?
 - Add check for author names with incorrect spacing after commas, ie "Orth,Alan S."
 - Add tests for application invocation, ie `tests/test_app.py`?

 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
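The pipeline the README describes splits every cell on the DSpace "||" separator before anything else. As a rough, standalone illustration of that convention (the helper below is made up for this sketch and is not code from the repository):

```
# Sketch of the multi-value handling described above: split on the DSpace
# "||" separator and trim whitespace on each value before the more
# specialized checks run. Illustrative only.
def clean_values(field):
    return [value.strip() for value in field.split('||')]

print(clean_values('Miscellaenous ||whitespace | issues '))
# ['Miscellaenous', 'whitespace | issues'] -- the stray single '|' survives
# the split and is left for the invalid multi-value separator check to flag.
```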
csv_metadata_quality/app.py

@@ -15,6 +15,7 @@ def parse_args(argv):
     parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))
     parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')
     parser.add_argument('--version', '-V', action='version', version=f'CSV Metadata Quality v{VERSION}')
+    parser.add_argument('--exclude-fields', '-x', help='Comma-separated list of fields to skip, for example: dc.contributor.author,dc.identifier.citation')
     args = parser.parse_args()

     return args
@@ -34,6 +35,19 @@ def run(argv):
     df = pd.read_csv(args.input_file, dtype=str)

     for column in df.columns.values.tolist():
+        # Check if the user requested to skip any fields
+        if args.exclude_fields:
+            skip = False
+            # Split the list of excludes on ',' so we can test exact matches
+            # rather than fuzzy matches with regexes or "if word in string"
+            for exclude in args.exclude_fields.split(','):
+                if column == exclude and skip is False:
+                    skip = True
+            if skip:
+                print(f'Skipping {column}')
+
+                continue
+
         # Fix: whitespace
         df[column] = df[column].apply(fix.whitespace)

@@ -41,6 +55,13 @@ def run(argv):
         if args.unsafe_fixes:
             df[column] = df[column].apply(fix.newlines)

+        # Fix: missing space after comma. Only run on author and citation
+        # fields for now, as this problem is mostly an issue in names.
+        if args.unsafe_fixes:
+            match = re.match(r'^.*?(author|citation).*$', column)
+            if match is not None:
+                df[column] = df[column].apply(fix.comma_space, field_name=column)
+
         # Fix: unnecessary Unicode
         df[column] = df[column].apply(fix.unnecessary_unicode)

@@ -84,7 +105,11 @@ def run(argv):
         # Check: invalid date
         match = re.match(r'^.*?date.*$', column)
         if match is not None:
-            df[column] = df[column].apply(check.date)
+            df[column] = df[column].apply(check.date, field_name=column)

+        # Check: filename extension
+        if column == 'filename':
+            df[column] = df[column].apply(check.filename_extension)
+
     # Write
     df.to_csv(args.output_file, index=False)
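As a standalone sketch of the exact-match column skipping that the new `--exclude-fields` handling above performs (the helper and the sample DataFrame are illustrative, not code from the repository):

```
import pandas as pd

def columns_to_process(df, exclude_fields=None):
    """Illustrative only: return the columns left after honouring a
    comma-separated --exclude-fields string, using exact matches."""
    if not exclude_fields:
        return list(df.columns)

    # Split on ',' so that excluding 'dc.date' does not also skip
    # 'dc.date.issued'; this mirrors the exact-match comparison above.
    excludes = exclude_fields.split(',')

    return [column for column in df.columns if column not in excludes]

df = pd.DataFrame({'dc.contributor.author': ['Orth,Alan S.'],
                   'dc.date.issued': ['2019-08-27']})
print(columns_to_process(df, 'dc.date.issued'))  # ['dc.contributor.author']
```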
csv_metadata_quality/check.py

@@ -75,7 +75,7 @@ def separators(field):
     return field


-def date(field):
+def date(field, field_name):
     """Check if a date is valid.

     In DSpace the issue date is usually 1990, 1990-01, or 1990-01-01, but it
@@ -88,7 +88,7 @@ def date(field):
     from datetime import datetime

     if pd.isna(field):
-        print(f'Missing date.')
+        print(f'Missing date ({field_name}).')

         return
@@ -97,7 +97,7 @@ def date(field):

     # We don't allow multi-value date fields
     if len(multiple_dates) > 1:
-        print(f'Multiple dates not allowed: {field}')
+        print(f'Multiple dates not allowed ({field_name}): {field}')

         return field
@@ -123,7 +123,7 @@ def date(field):

         return field
     except ValueError:
-        print(f'Invalid date: {field}')
+        print(f'Invalid date ({field_name}): {field}')

         return field
@@ -212,7 +212,6 @@ def agrovoc(field, field_name):
     """

     from datetime import timedelta
-    import re
     import requests
     import requests_cache
@@ -222,34 +221,67 @@
     # Try to split multi-value field on "||" separator
     for value in field.split('||'):
-        # match lines beginning with words, paying attention to subjects with
-        # special characters like spaces, quotes, dashes, parentheses, etc:
-        # SUBJECT
-        # ANOTHER SUBJECT
-        # XANTHOMONAS CAMPESTRIS PV. MANIHOTIS
-        # WOMEN'S PARTICIPATION
-        # COMMUNITY-BASED FOREST MANAGEMENT
-        # INTERACCIÓN GENOTIPO AMBIENTE
-        # COCOA (PLANT)
-        pattern = re.compile(r'^[\w\-\.\'\(\)]+?[\w\s\-\.\'\(\)]+$')
+        request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'

-        if pattern.match(value):
-            request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'
-            # enable transparent request cache with thirty days expiry
-            expire_after = timedelta(days=30)
-            requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)
+        # enable transparent request cache with thirty days expiry
+        expire_after = timedelta(days=30)
+        requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)

-            request = requests.get(request_url)
+        request = requests.get(request_url)

-            # prune old cache entries
-            requests_cache.core.remove_expired_responses()
+        # prune old cache entries
+        requests_cache.core.remove_expired_responses()

-            if request.status_code == requests.codes.ok:
-                data = request.json()
+        if request.status_code == requests.codes.ok:
+            data = request.json()

-                # check if there are any results
-                if len(data['results']) == 0:
-                    print(f'Invalid AGROVOC ({field_name}): {value}')
+            # check if there are any results
+            if len(data['results']) == 0:
+                print(f'Invalid AGROVOC ({field_name}): {value}')

     return field
+
+
+def filename_extension(field):
+    """Check filename extension.
+
+    CSVs with a 'filename' column are likely meant as input for the SAFBuilder
+    tool, which creates a Simple Archive Format bundle for importing metadata
+    with accompanying PDFs or other files into DSpace.
+
+    This check warns if a filename has an uncommon extension (that is, other
+    than .pdf, .xls(x), .doc(x), ppt(x), case insensitive).
+    """
+
+    import re
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Try to split multi-value field on "||" separator
+    values = field.split('||')
+
+    # List of common filename extentions
+    common_filename_extensions = ['.pdf', '.doc', '.docx', '.ppt', '.pptx', '.xls', '.xlsx']
+
+    # Iterate over all values
+    for value in values:
+        # Assume filename extension does not match
+        filename_extension_match = False
+
+        for filename_extension in common_filename_extensions:
+            # Check for extension at the end of the filename
+            pattern = re.escape(filename_extension) + r'$'
+            match = re.search(pattern, value, re.IGNORECASE)
+
+            if match is not None:
+                # Register the match and stop checking for this filename
+                filename_extension_match = True
+
+                break
+
+        if filename_extension_match is False:
+            print(f'Filename with uncommon extension: {value}')
+
+    return field
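The body of `check.date()` sits mostly outside these hunks. As a simplified sketch of the validation its docstring describes (accept 1990, 1990-01, or 1990-01-01 and report anything else), with no claim to match the real implementation beyond that:

```
from datetime import datetime

def plausible_issue_date(value, field_name='dc.date.issued'):
    # Simplified sketch: try the three DSpace issue date shapes the
    # docstring mentions and report values that fit none of them.
    for date_format in ('%Y', '%Y-%m', '%Y-%m-%d'):
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            continue

    print(f'Invalid date ({field_name}): {value}')
    return False

plausible_issue_date('2019-07-26')   # True
plausible_issue_date('2019-07-260')  # prints: Invalid date (dc.date.issued): 2019-07-260
```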
csv_metadata_quality/fix.py

@@ -68,14 +68,17 @@ def separators(field):


 def unnecessary_unicode(field):
-    """Remove unnecessary Unicode characters.
+    """Remove and replace unnecessary Unicode characters.

     Removes unnecessary Unicode characters like:
         - Zero-width space (U+200B)
         - Replacement character (U+FFFD)
         - No-break space (U+00A0)

-    Return string with characters removed.
+    Replaces unnecessary Unicode characters like:
+        - Soft hyphen (U+00AD) → hyphen
+
+    Return string with characters removed or replaced.
     """

     # Skip fields with missing values
@@ -106,6 +109,14 @@ def unnecessary_unicode(field):
         print(f'Removing unnecessary Unicode (U+00A0): {field}')
         field = re.sub(pattern, '', field)

+    # Check for soft hyphens (U+00AD), sometimes preceeded with a normal hyphen
+    pattern = re.compile(r'\u002D*?\u00AD')
+    match = re.findall(pattern, field)
+
+    if match:
+        print(f'Replacing unnecessary Unicode (U+00AD): {field}')
+        field = re.sub(pattern, '-', field)
+
     return field
@@ -165,3 +176,27 @@ def newlines(field):
     field = field.replace('\n', '')

     return field
+
+
+def comma_space(field, field_name):
+    """Fix occurrences of commas missing a trailing space, for example:
+
+    Orth,Alan S.
+
+    This is a very common mistake in author and citation fields.
+
+    Return string with a space added.
+    """
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check for comma followed by a word character
+    match = re.findall(r',\w', field)
+
+    if match:
+        print(f'Adding space after comma ({field_name}): {field}')
+        field = re.sub(r',(\w)', r', \1', field)
+
+    return field
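Both substitutions added above are plain `re` calls and can be exercised in isolation:

```
import re

# Missing space after a comma: add a space after any comma that is directly
# followed by a word character, as fix.comma_space() does.
print(re.sub(r',(\w)', r', \1', 'Orth,Alan S.'))    # Orth, Alan S.
print(re.sub(r',(\w)', r', \1', 'Smith, Jane'))     # already fine, unchanged
print(re.sub(r',(\w)', r', \1', 'about 1,000 ha'))  # about 1, 000 ha -- \w also
# matches digits, which is presumably why this fix sits behind --unsafe-fixes
# and is only applied to author and citation columns in app.py.

# Soft hyphens: the non-greedy '\u002D*?' lets a real hyphen followed by a
# soft hyphen collapse into a single hyphen instead of doubling it.
print(re.sub(r'\u002D*?\u00AD', '-', '978-92-9043-823-\u00ad6'))  # 978-92-9043-823-6
```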
csv_metadata_quality/version.py

@@ -1 +1 @@
-VERSION = '0.2.0'
+VERSION = '0.2.2'
data/test.csv

@@ -1,23 +1,26 @@
-dc.contributor.author,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country
-Leading space,2019-07-29,,,,,
-Trailing space ,2019-07-29,,,,,
-Excessive space,2019-07-29,,,,,
-Miscellaenous ||whitespace | issues ,2019-07-29,,,,,
-Duplicate||Duplicate,2019-07-29,,,,,
-Invalid ISSN,2019-07-29,2321-2302,,,,
-Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,
-Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,
-Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,
-Invalid date,2019-07-260,,,,,
-Multiple dates,2019-07-26||2019-01-10,,,,,
-Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,
-Unnecessary Unicode,2019-07-29,,,,,
-Suspicious character||foreˆt,2019-07-29,,,,,
-Invalid ISO 639-2 language,2019-07-29,,,jp,,
-Invalid ISO 639-3 language,2019-07-29,,,chi,,
-Invalid language,2019-07-29,,,Span,,
-Invalid AGROVOC subject,2019-07-29,,,,FOREST,
+dc.contributor.author,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename
+Leading space,2019-07-29,,,,,,
+Trailing space ,2019-07-29,,,,,,
+Excessive space,2019-07-29,,,,,,
+Miscellaenous ||whitespace | issues ,2019-07-29,,,,,,
+Duplicate||Duplicate,2019-07-29,,,,,,
+Invalid ISSN,2019-07-29,2321-2302,,,,,
+Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,,
+Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,,
+Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,,
+Invalid date,2019-07-260,,,,,,
+Multiple dates,2019-07-26||2019-01-10,,,,,,
+Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,,
+Unnecessary Unicode,2019-07-29,,,,,,
+Suspicious character||foreˆt,2019-07-29,,,,,,
+Invalid ISO 639-2 language,2019-07-29,,,jp,,,
+Invalid ISO 639-3 language,2019-07-29,,,chi,,,
+Invalid language,2019-07-29,,,Span,,,
+Invalid AGROVOC subject,2019-07-29,,,,FOREST,,
 Newline (LF),2019-07-30,,,,"TANZA
-NIA",
-Missing date,,,,,,
-Invalid country,2019-08-01,,,,,KENYAA
+NIA",,
+Missing date,,,,,,,
+Invalid country,2019-08-01,,,,,KENYAA,
+Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck
+Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-92-9043-823-6,,,,
+"Missing space,after comma",2019-08-27,,,,,,
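For reference, the fixture can be loaded the same way `run()` loads its input, with every cell kept as a string (the `data/test.csv` path is the one assumed in the header above):

```
import pandas as pd

# Read the test fixture as run() does: dtype=str keeps identifiers, dates,
# and ISBNs from being coerced into numeric types.
df = pd.read_csv('data/test.csv', dtype=str)
print(df.columns.tolist())               # now ends with the new 'filename' column
print(df['filename'].dropna().tolist())  # ['file.pdf.lck']
```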
setup.py (2 changed lines)

@@ -13,7 +13,7 @@ install_requires = [

 setuptools.setup(
     name="csv-metadata-quality",
-    version="0.2.0",
+    version="0.2.2",
     author="Alan Orth",
     author_email="aorth@mjanja.ch",
     description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
@ -69,10 +69,12 @@ def test_check_missing_date(capsys):
|
||||
|
||||
value = None
|
||||
|
||||
check.date(value)
|
||||
field_name = 'dc.date.issued'
|
||||
|
||||
check.date(value, field_name)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert captured.out == f'Missing date.\n'
|
||||
assert captured.out == f'Missing date ({field_name}).\n'
|
||||
|
||||
|
||||
def test_check_multiple_dates(capsys):
|
||||
@ -80,10 +82,12 @@ def test_check_multiple_dates(capsys):
|
||||
|
||||
value = '1990||1991'
|
||||
|
||||
check.date(value)
|
||||
field_name = 'dc.date.issued'
|
||||
|
||||
check.date(value, field_name)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert captured.out == f'Multiple dates not allowed: {value}\n'
|
||||
assert captured.out == f'Multiple dates not allowed ({field_name}): {value}\n'
|
||||
|
||||
|
||||
def test_check_invalid_date(capsys):
|
||||
@ -91,10 +95,12 @@ def test_check_invalid_date(capsys):
|
||||
|
||||
value = '1990-0'
|
||||
|
||||
check.date(value)
|
||||
field_name = 'dc.date.issued'
|
||||
|
||||
check.date(value, field_name)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert captured.out == f'Invalid date: {value}\n'
|
||||
assert captured.out == f'Invalid date ({field_name}): {value}\n'
|
||||
|
||||
|
||||
def test_check_valid_date():
|
||||
@ -102,7 +108,9 @@ def test_check_valid_date():
|
||||
|
||||
value = '1990'
|
||||
|
||||
result = check.date(value)
|
||||
field_name = 'dc.date.issued'
|
||||
|
||||
result = check.date(value, field_name)
|
||||
|
||||
assert result == value
|
||||
|
||||
@ -194,3 +202,24 @@ def test_check_valid_agrovoc():
|
||||
result = check.agrovoc(value, field_name)
|
||||
|
||||
assert result == value
|
||||
|
||||
|
||||
def test_check_uncommon_filename_extension(capsys):
|
||||
'''Test uncommon filename extension.'''
|
||||
|
||||
value = 'file.pdf.lck'
|
||||
|
||||
check.filename_extension(value)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert captured.out == f'Filename with uncommon extension: {value}\n'
|
||||
|
||||
|
||||
def test_check_common_filename_extension():
|
||||
'''Test common filename extension.'''
|
||||
|
||||
value = 'file.pdf'
|
||||
|
||||
result = check.filename_extension(value)
|
||||
|
||||
assert result == value
|
||||
|
@ -56,3 +56,13 @@ def test_fix_newlines():
|
||||
ya'''
|
||||
|
||||
assert fix.newlines(value) == 'Kenya'
|
||||
|
||||
|
||||
def test_fix_comma_space():
|
||||
'''Test adding space after comma.'''
|
||||
|
||||
value = 'Orth,Alan S.'
|
||||
|
||||
field_name = 'dc.contributor.author'
|
||||
|
||||
assert fix.comma_space(value, field_name) == 'Orth, Alan S.'
|
||||
|
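The `capsys` fixture used throughout these tests is pytest's, so the new tests can be exercised directly; for example (the `tests/test_fix.py` path is the one assumed in the header above):

```
import pytest

# Run only the new comma-space test, verbosely.
pytest.main(['tests/test_fix.py::test_fix_comma_space', '-v'])
```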