Release version 0.2.2

tests/test_fix.py: Add test for missing space after comma
data/test.csv: Add sample for missing space after comma
10 changed files with 112 additions and 43 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,9 +4,20 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.2.2] - 2019-08-27
+### Changed
+- Output of date checks to include column names (helps debugging in case there are multiple date fields)
+
+### Added
+- Ability to exclude certain fields using `--exclude-fields`
+- Fix for missing space after a comma, e.g. "Orth,Alan S."
+
+### Improved
+- AGROVOC lookup code
+
 ## [0.2.1] - 2019-08-11
 ### Added
- Check for uncommon filename extensions 
+- Check for uncommon filename extensions
 - Replacement of unnecessary Unicode characters like soft hyphens (U+00AD)

 ## [0.2.0] - 2019-08-09
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# CSV Metadata Quality [![Build Status](https://travis-ci.org/alanorth/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/alanorth/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
+# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
 A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.

 Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
@@ -19,7 +19,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):

 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ pipenv install
 $ pipenv shell
@@ -28,7 +28,7 @@ $ pipenv shell
 Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:

 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ python3 -m venv venv
 $ source venv/bin/activate
@@ -74,6 +74,13 @@ Invalid AGROVOC (cg.coverage.country): KENYAA
 - Reporting / summary
 - Better logging, for example with INFO, WARN, and ERR levels
 - Verbose, debug, or quiet options
+- Warn if an author is shorter than 3 characters?
+- Validate dc.rights field against SPDX? Perhaps with an option like `-m spdx` to enable the spdx module?
+- Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
+- Warn if two items use the same file in `filename` column
+- Add an option to drop invalid AGROVOC subjects?
+- Add check for author names with incorrect spacing after commas, e.g. "Orth,Alan S."
+- Add tests for application invocation, i.e. `tests/test_app.py`?

 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
--- a/csv_metadata_quality/app.py
+++ b/csv_metadata_quality/app.py
@@ -15,6 +15,7 @@ def parse_args(argv):
    parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))
    parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')
    parser.add_argument('--version', '-V', action='version', version=f'CSV Metadata Quality v{VERSION}')
+    parser.add_argument('--exclude-fields', '-x', help='Comma-separated list of fields to skip, for example: dc.contributor.author,dc.identifier.citation')
    args = parser.parse_args()

    return args
@@ -34,6 +35,19 @@ def run(argv):
    df = pd.read_csv(args.input_file, dtype=str)

    for column in df.columns.values.tolist():
+        # Check if the user requested to skip any fields
+        if args.exclude_fields:
+            skip = False
+            # Split the list of excludes on ',' so we can test exact matches
+            # rather than fuzzy matches with regexes or "if word in string"
+            for exclude in args.exclude_fields.split(','):
+                if column == exclude and skip is False:
+                    skip = True
+            if skip:
+                print(f'Skipping {column}')
+
+                continue
+
        # Fix: whitespace
        df[column] = df[column].apply(fix.whitespace)
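
Review note: the exclusion loop in this hunk splits the option on commas and tests exact matches. The same exact-match behaviour can be sketched with a set, as below. `parse_excludes` and `should_skip` are hypothetical helper names that do not exist in the project, and stripping whitespace around each name is an added assumption, not something the patch does.

```python
def parse_excludes(exclude_fields):
    """Turn 'a,b,c' (or None) into a set of exact field names."""
    if not exclude_fields:
        return set()
    # Assumption: tolerate spaces around commas, e.g. "a, b"
    return {name.strip() for name in exclude_fields.split(',') if name.strip()}


def should_skip(column, excludes):
    """Exact match only, so 'dc.title' does not exclude 'dc.title.alternative'."""
    return column in excludes


excludes = parse_excludes('dc.contributor.author,dc.identifier.citation')
for column in ['dc.title', 'dc.contributor.author', 'dc.identifier.citation[en]']:
    if should_skip(column, excludes):
        print(f'Skipping {column}')
```

Because the match is exact, a column such as `dc.identifier.citation[en]` is not skipped unless it is listed with its full name.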

@@ -41,6 +55,13 @@ def run(argv):
        if args.unsafe_fixes:
            df[column] = df[column].apply(fix.newlines)

+        # Fix: missing space after comma. Only run on author and citation
+        # fields for now, as this problem is mostly an issue in names.
+        if args.unsafe_fixes:
+            match = re.match(r'^.*?(author|citation).*$', column)
+            if match is not None:
+                df[column] = df[column].apply(fix.comma_space, field_name=column)
+
        # Fix: unnecessary Unicode
        df[column] = df[column].apply(fix.unnecessary_unicode)
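
Review note: the comma fix in this hunk is gated on the column name containing "author" or "citation". A small sketch of which columns that pattern selects follows. The column list is illustrative only, and `is_name_like` is a hypothetical helper name.

```python
import re

# Illustrative column names, not taken from a real input file
columns = [
    'dc.contributor.author',
    'dc.identifier.citation',
    'dc.title',
    'dc.date.issued',
]


def is_name_like(column):
    """Apply the same pattern used in the hunk to decide whether to run the fix."""
    return re.match(r'^.*?(author|citation).*$', column) is not None


selected = [column for column in columns if is_name_like(column)]
print(selected)
```

For single-line column names, `re.search(r'author|citation', column)` should select the same columns with a shorter pattern.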

@@ -84,7 +105,7 @@ def run(argv):
        # Check: invalid date
        match = re.match(r'^.*?date.*$', column)
        if match is not None:
-            df[column] = df[column].apply(check.date)
+            df[column] = df[column].apply(check.date, field_name=column)

        # Check: filename extension
        if column == 'filename':
--- a/csv_metadata_quality/check.py
+++ b/csv_metadata_quality/check.py
@@ -75,7 +75,7 @@ def separators(field):
    return field


-def date(field):
+def date(field, field_name):
    """Check if a date is valid.

    In DSpace the issue date is usually 1990, 1990-01, or 1990-01-01, but it
@@ -88,7 +88,7 @@ def date(field):
    from datetime import datetime

    if pd.isna(field):
-        print(f'Missing date.')
+        print(f'Missing date ({field_name}).')

        return

@@ -97,7 +97,7 @@ def date(field):

    # We don't allow multi-value date fields
    if len(multiple_dates) > 1:
-        print(f'Multiple dates not allowed: {field}')
+        print(f'Multiple dates not allowed ({field_name}): {field}')

        return field

@@ -123,7 +123,7 @@ def date(field):

        return field
    except ValueError:
-        print(f'Invalid date: {field}')
+        print(f'Invalid date ({field_name}): {field}')

        return field
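
Review note: the date hunks above only show the changed `print` calls. As a standalone illustration of how the column name ends up in each message, here is a minimal sketch. It is not the project's implementation: `check_date` is a hypothetical name, `field is None` stands in for `pd.isna(field)`, and the list of accepted formats is an assumption based on the docstring's examples (1990, 1990-01, 1990-01-01).

```python
from datetime import datetime


def check_date(field, field_name):
    """Print a message naming the column when a date is missing or invalid."""
    if field is None:  # stand-in for pd.isna(field)
        print(f'Missing date ({field_name}).')
        return None

    # Multi-value fields use the "||" separator and are not allowed for dates
    if len(field.split('||')) > 1:
        print(f'Multiple dates not allowed ({field_name}): {field}')
        return field

    # Assumption: accept YYYY, YYYY-MM, and YYYY-MM-DD
    for fmt in ('%Y', '%Y-%m', '%Y-%m-%d'):
        try:
            datetime.strptime(field, fmt)
            return field
        except ValueError:
            continue

    print(f'Invalid date ({field_name}): {field}')
    return field


for name, value in [('dc.date.issued', None), ('dc.date.issued[]', '1990-0')]:
    check_date(value, name)
```

With two date columns in one CSV, the column name in parentheses is what distinguishes otherwise identical messages.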

@@ -212,7 +212,6 @@ def agrovoc(field, field_name):
    """

    from datetime import timedelta
-    import re
    import requests
    import requests_cache

@@ -222,35 +221,23 @@ def agrovoc(field, field_name):

    # Try to split multi-value field on "||" separator
    for value in field.split('||'):
-        # match lines beginning with words, paying attention to subjects with
-        # special characters like spaces, quotes, dashes, parentheses, etc:
-        # SUBJECT
-        # ANOTHER SUBJECT
-        # XANTHOMONAS CAMPESTRIS PV. MANIHOTIS
-        # WOMEN'S PARTICIPATION
-        # COMMUNITY-BASED FOREST MANAGEMENT
-        # INTERACCIÓN GENOTIPO AMBIENTE
-        # COCOA (PLANT)
-        pattern = re.compile(r'^[\w\-\.\'\(\)]+?[\w\s\-\.\'\(\)]+$')
+        request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'

-        if pattern.match(value):
-            request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'
+        # enable transparent request cache with thirty days expiry
+        expire_after = timedelta(days=30)
+        requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)

-            # enable transparent request cache with thirty days expiry
-            expire_after = timedelta(days=30)
-            requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)
+        request = requests.get(request_url)

-            request = requests.get(request_url)
+        # prune old cache entries
+        requests_cache.core.remove_expired_responses()

-            # prune old cache entries
-            requests_cache.core.remove_expired_responses()
+        if request.status_code == requests.codes.ok:
+            data = request.json()

-            if request.status_code == requests.codes.ok:
-                data = request.json()
-
-                # check if there are any results
-                if len(data['results']) == 0:
-                    print(f'Invalid AGROVOC ({field_name}): {value}')
+            # check if there are any results
+            if len(data['results']) == 0:
+                print(f'Invalid AGROVOC ({field_name}): {value}')

    return field
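
Review note: with the pattern filter removed, every value is interpolated into the query string as-is, including characters such as `&` or `#` that have their own meaning in a URL. One possible way to handle this, sketched here as a suggestion and not as something the patch does, is to percent-encode the term with the standard library. `build_search_url` is a hypothetical helper name, and no network request is made in this sketch.

```python
from urllib.parse import urlencode

# Endpoint copied from the hunk above
AGROVOC_SEARCH = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search'


def build_search_url(value):
    """Return the search URL with the term percent-encoded as the query parameter."""
    return AGROVOC_SEARCH + '?' + urlencode({'query': value})


for term in ['COCOA (PLANT)', "WOMEN'S PARTICIPATION", 'R&D']:
    print(build_search_url(term))
```

When using `requests`, passing `params={'query': value}` to `requests.get()` should achieve the same encoding without building the URL by hand.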

--- a/csv_metadata_quality/fix.py
+++ b/csv_metadata_quality/fix.py
@@ -176,3 +176,27 @@ def newlines(field):
        field = field.replace('\n', '')

    return field
+
+
+def comma_space(field, field_name):
+    """Fix occurrences of commas missing a trailing space, for example:
+
+    Orth,Alan S.
+
+    This is a very common mistake in author and citation fields.
+
+    Return string with a space added.
+    """
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check for comma followed by a word character
+    match = re.findall(r',\w', field)
+
+    if match:
+        print(f'Adding space after comma ({field_name}): {field}')
+        field = re.sub(r',(\w)', r', \1', field)
+
+    return field
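
Review note: to see what the substitution in `comma_space` does on a few inputs, here is a standalone sketch that reuses only the regular expression from the patch. It leaves out the `pd.isna` guard and the `print`, and the sample strings other than "Orth,Alan S." are made up for illustration.

```python
import re


def comma_space(field):
    """Insert a space after any comma that is directly followed by a word character."""
    return re.sub(r',(\w)', r', \1', field)


samples = [
    'Orth,Alan S.',            # the case from the changelog
    'Orth, Alan S.',           # already correct, left unchanged
    'Orth,Alan S.||Doe,Jane',  # multi-value field, every occurrence is fixed
    'pp. 1,000-1,200',         # digits are word characters too
]
for sample in samples:
    print(repr(sample), '->', repr(comma_space(sample)))
```

The last sample shows why the commit keeps this fix behind `--unsafe-fixes`: a thousands separator in a citation would also gain a space.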
--- a/csv_metadata_quality/version.py
+++ b/csv_metadata_quality/version.py
@@ -1 +1 @@
-VERSION = '0.2.1'
+VERSION = '0.2.2'
--- a/data/test.csv
+++ b/data/test.csv
@@ -23,3 +23,4 @@ Missing date,,,,,,,
 Invalid country,2019-08-01,,,,,KENYAA,
 Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck
 Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-92-9043-823-6,,,,
+"Missing space,after comma",2019-08-27,,,,,,
--- a/setup.py
+++ b/setup.py
@@ -13,7 +13,7 @@ install_requires = [

 setuptools.setup(
    name="csv-metadata-quality",
-    version="0.2.1",
+    version="0.2.2",
    author="Alan Orth",
    author_email="aorth@mjanja.ch",
    description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
--- a/tests/test_check.py
+++ b/tests/test_check.py
@@ -69,10 +69,12 @@ def test_check_missing_date(capsys):

    value = None

-    check.date(value)
+    field_name = 'dc.date.issued'
+
+    check.date(value, field_name)

    captured = capsys.readouterr()
-    assert captured.out == f'Missing date.\n'
+    assert captured.out == f'Missing date ({field_name}).\n'


 def test_check_multiple_dates(capsys):
@@ -80,10 +82,12 @@ def test_check_multiple_dates(capsys):

    value = '1990||1991'

-    check.date(value)
+    field_name = 'dc.date.issued'
+
+    check.date(value, field_name)

    captured = capsys.readouterr()
-    assert captured.out == f'Multiple dates not allowed: {value}\n'
+    assert captured.out == f'Multiple dates not allowed ({field_name}): {value}\n'


 def test_check_invalid_date(capsys):
@@ -91,10 +95,12 @@ def test_check_invalid_date(capsys):

    value = '1990-0'

-    check.date(value)
+    field_name = 'dc.date.issued'
+
+    check.date(value, field_name)

    captured = capsys.readouterr()
-    assert captured.out == f'Invalid date: {value}\n'
+    assert captured.out == f'Invalid date ({field_name}): {value}\n'


 def test_check_valid_date():
@@ -102,7 +108,9 @@ def test_check_valid_date():

    value = '1990'

-    result = check.date(value)
+    field_name = 'dc.date.issued'
+
+    result = check.date(value, field_name)

    assert result == value

--- a/tests/test_fix.py
+++ b/tests/test_fix.py
@@ -56,3 +56,13 @@ def test_fix_newlines():
 ya'''

    assert fix.newlines(value) == 'Kenya'
+
+
+def test_fix_comma_space():
+    '''Test adding space after comma.'''
+
+    value = 'Orth,Alan S.'
+
+    field_name = 'dc.contributor.author'
+
+    assert fix.comma_space(value, field_name) == 'Orth, Alan S.'
Author	SHA1	Message	Date
Alan Orth	c354a3687c	Release version 0.2.2	2019-08-28 00:10:17 +03:00
Alan Orth	07f80cb37f	tests/test_fix.py: Add test for missing space after comma	2019-08-28 00:08:56 +03:00
Alan Orth	89d72540f1	data/test.csv: Add sample for missing space after comma	2019-08-28 00:08:26 +03:00
Alan Orth	81190d56bb	Add fix for missing space after commas. This happens in names very often, for example in the contributor and citation fields. I will limit this to those fields for now and hide this fix behind the "unsafe fixes" option until I test it more.	2019-08-28 00:05:52 +03:00
Alan Orth	2af714fb05	README.md: Add a handful of TODOs	2019-08-27 00:12:41 +03:00
Alan Orth	cc863a6bdd	CHANGELOG.md: Add note about excluding fields	2019-08-27 00:11:22 +03:00
Alan Orth	113e7cd8b6	csv_metadata_quality/app.py: Add ability to skip fields. The user may want to skip the checking and fixing of certain fields in the input file.	2019-08-27 00:10:07 +03:00
Alan Orth	bd984f3db5	README.md: Update TravisCI badge	2019-08-22 15:07:03 +03:00
Alan Orth	3f4e84a638	README.md: Use ILRI GitHub remote	2019-08-22 14:54:12 +03:00
Alan Orth	c52b3ed131	CHANGELOG.md: Add note about AGROVOC	2019-08-21 16:37:49 +03:00
Alan Orth	884e8f970d	csv_metadata_quality/check.py: Simplify AGROVOC check. I recycled this code from a separate agrovoc-lookup.py script that checks lines in a text file to see if they are valid AGROVOC terms or not. There I was concerned about skipping comments or something I think, but we don't need to check that here. We simply check the term that is in the field and inform the user if it's valid or not.	2019-08-21 16:35:29 +03:00
Alan Orth	6d02f5026a	CHANGELOG.md: Add note about date checks	2019-08-21 15:35:46 +03:00
Alan Orth	e7cb8920db	tests/test_check.py: Update date tests	2019-08-21 15:34:52 +03:00
Alan Orth	ed5612fbcf	Add column name to output in date checks. This makes it easier to understand where the error is in case a CSV has multiple date fields, for example: Missing date (dc.date.issued). Missing date (dc.date.issued[]). If you have 126 items and you get 126 "Missing date" messages then it's likely that 100 of the items have dates in one field, and the others have dates in another field.	2019-08-21 15:31:12 +03:00
Alan Orth	3247495cee	CHANGELOG.md: Remove extra space	2019-08-11 10:43:27 +03:00