mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-05-09 22:56:01 +02:00

15 Commits

Author SHA1 Message Date
c354a3687c Release version 0.2.2 2019-08-28 00:10:17 +03:00
07f80cb37f tests/test_fix.py: Add test for missing space after comma 2019-08-28 00:08:56 +03:00
89d72540f1 data/test.csv: Add sample for missing space after comma 2019-08-28 00:08:26 +03:00
81190d56bb Add fix for missing space after commas 2019-08-28 00:05:52 +03:00
    This happens in names very often, for example in the contributor
    and citation fields. I will limit this to those fields for now and
    hide this fix behind the "unsafe fixes" option until I test it more.
2af714fb05 README.md: Add a handful of TODOs 2019-08-27 00:12:41 +03:00
cc863a6bdd CHANGELOG.md: Add note about excluding fields 2019-08-27 00:11:22 +03:00
113e7cd8b6 csv_metadata_quality/app.py: Add ability to skip fields 2019-08-27 00:10:07 +03:00
    The user may want to skip the checking and fixing of certain fields
    in the input file.
bd984f3db5 README.md: Update TravisCI badge 2019-08-22 15:07:03 +03:00
3f4e84a638 README.md: Use ILRI GitHub remote 2019-08-22 14:54:12 +03:00
c52b3ed131 CHANGELOG.md: Add note about AGROVOC 2019-08-21 16:37:49 +03:00
884e8f970d csv_metadata_quality/check.py: Simplify AGROVOC check 2019-08-21 16:35:29 +03:00
    I recycled this code from a separate agrovoc-lookup.py script that
    checks lines in a text file to see if they are valid AGROVOC terms
    or not. There I was concerned about skipping comments or something
    I think, but we don't need to check that here. We simply check the
    term that is in the field and inform the user if it's valid or not.
6d02f5026a CHANGELOG.md: Add note about date checks 2019-08-21 15:35:46 +03:00
e7cb8920db tests/test_check.py: Update date tests 2019-08-21 15:34:52 +03:00
ed5612fbcf Add column name to output in date checks 2019-08-21 15:31:12 +03:00
    This makes it easier to understand where the error is in case a CSV
    has multiple date fields, for example:

        Missing date (dc.date.issued).
        Missing date (dc.date.issued[]).

    If you have 126 items and you get 126 "Missing date" messages then
    it's likely that 100 of the items have dates in one field, and the
    others have dates in another field.
3247495cee CHANGELOG.md: Remove extra space 2019-08-11 10:43:27 +03:00
10 changed files with 112 additions and 43 deletions

CHANGELOG.md

@@ -4,9 +4,20 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.2.2] - 2019-08-27
+### Changed
+- Output of date checks to include column names (helps debugging in case there are multiple date fields)
+### Added
+- Ability to exclude certain fields using `--exclude-fields`
+- Fix for missing space after a comma, ie "Orth,Alan S."
+### Improved
+- AGROVOC lookup code
 ## [0.2.1] - 2019-08-11
 ### Added
-- Check for uncommon filename extensions
+- Check for uncommon filename extensions
 - Replacement of unneccessary Unicode characters like soft hyphens (U+00AD)
 ## [0.2.0] - 2019-08-09

README.md

@@ -1,4 +1,4 @@
-# CSV Metadata Quality [![Build Status](https://travis-ci.org/alanorth/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/alanorth/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
+# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
 A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
 Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
@@ -19,7 +19,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ pipenv install
 $ pipenv shell
@@ -28,7 +28,7 @@ $ pipenv shell
 Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:
 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ python3 -m venv venv
 $ source venv/bin/activate
@@ -74,6 +74,13 @@ Invalid AGROVOC (cg.coverage.country): KENYAA
 - Reporting / summary
 - Better logging, for example with INFO, WARN, and ERR levels
 - Verbose, debug, or quiet options
+- Warn if an author is shorter than 3 characters?
+- Validate dc.rights field against SPDX? Perhaps with an option like `-m spdx` to enable the spdx module?
+- Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
+- Warn if two items use the same file in `filename` column
+- Add an option to drop invalid AGROVOC subjects?
+- Add check for author names with incorrect spacing after commas, ie "Orth,Alan S."
+- Add tests for application invocation, ie `tests/test_app.py`?
 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
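
The README describes the tool as a pipeline that starts by splitting multi-value fields on the DSpace "||" separator and trimming whitespace. A minimal sketch of just those first two steps follows. The name `clean_field` is hypothetical and not part of the project, and the real tool applies its fixes column by column through pandas rather than on bare strings.

```python
def clean_field(field, separator="||"):
    """Split a multi-value field, trim each value, and rejoin it.

    Hypothetical helper for illustration only; not the project's API.
    """
    # Split on the DSpace multi-value separator
    values = field.split(separator)
    # Trim leading/trailing whitespace from every value
    values = [value.strip() for value in values]
    # Rejoin with the same separator so the field keeps its shape
    return separator.join(values)


print(clean_field(" Kenya ||Uganda  "))
```

Later, more specialized checks (ISSNs, ISBNs, languages) would then see already-normalized values.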

csv_metadata_quality/app.py

@@ -15,6 +15,7 @@ def parse_args(argv):
     parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))
     parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')
     parser.add_argument('--version', '-V', action='version', version=f'CSV Metadata Quality v{VERSION}')
+    parser.add_argument('--exclude-fields', '-x', help='Comma-separated list of fields to skip, for example: dc.contributor.author,dc.identifier.citation')
     args = parser.parse_args()
     return args
@@ -34,6 +35,19 @@ def run(argv):
     df = pd.read_csv(args.input_file, dtype=str)
     for column in df.columns.values.tolist():
+        # Check if the user requested to skip any fields
+        if args.exclude_fields:
+            skip = False
+            # Split the list of excludes on ',' so we can test exact matches
+            # rather than fuzzy matches with regexes or "if word in string"
+            for exclude in args.exclude_fields.split(','):
+                if column == exclude and skip is False:
+                    skip = True
+            if skip:
+                print(f'Skipping {column}')
+                continue
         # Fix: whitespace
         df[column] = df[column].apply(fix.whitespace)
@@ -41,6 +55,13 @@ def run(argv):
         if args.unsafe_fixes:
             df[column] = df[column].apply(fix.newlines)
+        # Fix: missing space after comma. Only run on author and citation
+        # fields for now, as this problem is mostly an issue in names.
+        if args.unsafe_fixes:
+            match = re.match(r'^.*?(author|citation).*$', column)
+            if match is not None:
+                df[column] = df[column].apply(fix.comma_space, field_name=column)
         # Fix: unnecessary Unicode
         df[column] = df[column].apply(fix.unnecessary_unicode)
@@ -84,7 +105,7 @@ def run(argv):
         # Check: invalid date
         match = re.match(r'^.*?date.*$', column)
         if match is not None:
-            df[column] = df[column].apply(check.date)
+            df[column] = df[column].apply(check.date, field_name=column)
         # Check: filename extension
         if column == 'filename':
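
The comment in the `--exclude-fields` hunk explains that the exclude list is split on `,` so columns are compared by exact match rather than by substring. A small sketch of that distinction, written as a standalone function: the name `should_skip` is hypothetical and does not appear in the project, which performs the comparison inline in `run()`.

```python
def should_skip(column, exclude_fields):
    """Return True if `column` exactly matches one of the excluded fields.

    `exclude_fields` is the raw comma-separated string from the command
    line, or None when the option was not given. Hypothetical helper.
    """
    if not exclude_fields:
        return False
    # Exact comparison against each listed name, not "substring in string"
    return column in exclude_fields.split(",")


excludes = "dc.contributor.author,dc.identifier.citation"
print(should_skip("dc.contributor.author", excludes))
print(should_skip("dc.contributor", excludes))
```

A plain substring test (`column in excludes`) would also match `dc.contributor`, which is presumably the fuzzy behavior the comment wants to avoid.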

csv_metadata_quality/check.py

@@ -75,7 +75,7 @@ def separators(field):
     return field
-def date(field):
+def date(field, field_name):
     """Check if a date is valid.
     In DSpace the issue date is usually 1990, 1990-01, or 1990-01-01, but it
@@ -88,7 +88,7 @@ def date(field):
     from datetime import datetime
     if pd.isna(field):
-        print(f'Missing date.')
+        print(f'Missing date ({field_name}).')
         return
@@ -97,7 +97,7 @@ def date(field):
     # We don't allow multi-value date fields
     if len(multiple_dates) > 1:
-        print(f'Multiple dates not allowed: {field}')
+        print(f'Multiple dates not allowed ({field_name}): {field}')
         return field
@@ -123,7 +123,7 @@ def date(field):
         return field
     except ValueError:
-        print(f'Invalid date: {field}')
+        print(f'Invalid date ({field_name}): {field}')
         return field
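
To see the effect of passing the column name into the date check, here is a simplified sketch that is independent of pandas. It is not the project's `check.date`: the name `check_date` is hypothetical, it returns the message instead of printing it, and it assumes the three date shapes mentioned in the docstring (1990, 1990-01, 1990-01-01) are the only accepted ones.

```python
from datetime import datetime


def check_date(value, field_name):
    """Return a problem message for `value`, or None if the date is valid.

    Simplified, hypothetical variant of the check shown in the diff.
    """
    if value is None or value == "":
        return f"Missing date ({field_name})."

    # Multi-value date fields are not allowed
    if len(value.split("||")) > 1:
        return f"Multiple dates not allowed ({field_name}): {value}"

    # Accept YYYY, YYYY-MM, or YYYY-MM-DD
    for date_format in ("%Y", "%Y-%m", "%Y-%m-%d"):
        try:
            datetime.strptime(value, date_format)
            return None
        except ValueError:
            continue

    return f"Invalid date ({field_name}): {value}"


print(check_date(None, "dc.date.issued"))
print(check_date("1990-0", "dc.date.issued[]"))
```

With the column name in each message, output from a CSV that has both `dc.date.issued` and `dc.date.issued[]` columns can be attributed to the right field.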
@@ -212,7 +212,6 @@ def agrovoc(field, field_name):
     """
     from datetime import timedelta
-    import re
     import requests
     import requests_cache
@@ -222,35 +221,23 @@ def agrovoc(field, field_name):
     # Try to split multi-value field on "||" separator
     for value in field.split('||'):
-        # match lines beginning with words, paying attention to subjects with
-        # special characters like spaces, quotes, dashes, parentheses, etc:
-        # SUBJECT
-        # ANOTHER SUBJECT
-        # XANTHOMONAS CAMPESTRIS PV. MANIHOTIS
-        # WOMEN'S PARTICIPATION
-        # COMMUNITY-BASED FOREST MANAGEMENT
-        # INTERACCIÓN GENOTIPO AMBIENTE
-        # COCOA (PLANT)
-        pattern = re.compile(r'^[\w\-\.\'\(\)]+?[\w\s\-\.\'\(\)]+$')
+        request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'
-        if pattern.match(value):
-            request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'
+        # enable transparent request cache with thirty days expiry
+        expire_after = timedelta(days=30)
+        requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)
-            # enable transparent request cache with thirty days expiry
-            expire_after = timedelta(days=30)
-            requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)
+        request = requests.get(request_url)
-            request = requests.get(request_url)
+        # prune old cache entries
+        requests_cache.core.remove_expired_responses()
-            # prune old cache entries
-            requests_cache.core.remove_expired_responses()
+        if request.status_code == requests.codes.ok:
+            data = request.json()
-            if request.status_code == requests.codes.ok:
-                data = request.json()
+            # check if there are any results
+            if len(data['results']) == 0:
+                print(f'Invalid AGROVOC ({field_name}): {value}')
-                # check if there are any results
-                if len(data['results']) == 0:
-                    print(f'Invalid AGROVOC ({field_name}): {value}')
     return field
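
The simplified check boils down to two steps: build a search URL for each value, then treat an empty `results` list as an invalid term. The sketch below isolates those two steps without any network access or caching. The function names are hypothetical, the sample response dictionaries are made up, and encoding the query with `urlencode` is a choice of this sketch (the diff interpolates the raw value into the URL).

```python
from urllib.parse import urlencode

# Endpoint as it appears in the diff
AGROVOC_SEARCH = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search"


def build_search_url(term):
    """Build the search URL for one term, percent-encoding the query."""
    return f"{AGROVOC_SEARCH}?{urlencode({'query': term})}"


def is_known_term(data):
    """Return True if the parsed JSON response lists at least one result."""
    return len(data.get("results", [])) > 0


print(build_search_url("COCOA (PLANT)"))
# Made-up response bodies, shaped only after the `data['results']` access in the diff
print(is_known_term({"results": []}))
print(is_known_term({"results": ["placeholder"]}))
```

Terms with spaces, parentheses, or accented characters (the cases the removed regex used to guard) pass through the encoder unchanged in meaning.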

View File

@@ -176,3 +176,27 @@ def newlines(field):
     field = field.replace('\n', '')
     return field
+def comma_space(field, field_name):
+    """Fix occurrences of commas missing a trailing space, for example:
+    Orth,Alan S.
+    This is a very common mistake in author and citation fields.
+    Return string with a space added.
+    """
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+    # Check for comma followed by a word character
+    match = re.findall(r',\w', field)
+    if match:
+        print(f'Adding space after comma ({field_name}): {field}')
+        field = re.sub(r',(\w)', r', \1', field)
+    return field
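
The substitution at the heart of `comma_space` can be tried on its own. The sketch below reuses the same regular expression from the diff but drops the pandas check and the logging; the function name `add_space_after_comma` is hypothetical. The last example is an illustration chosen here, not a case mentioned by the author: because `\w` also matches digits, numbers with thousands separators are rewritten too, which may be one reason to keep the fix behind the unsafe option and limited to author and citation fields.

```python
import re


def add_space_after_comma(value):
    """Insert a space after any comma directly followed by a word character."""
    return re.sub(r",(\w)", r", \1", value)


print(add_space_after_comma("Orth,Alan S."))
print(add_space_after_comma("Orth, Alan S."))
print(add_space_after_comma("1,000 cattle"))
```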

View File

@@ -1 +1 @@
-VERSION = '0.2.1'
+VERSION = '0.2.2'

data/test.csv

@@ -23,3 +23,4 @@ Missing date,,,,,,,
 Invalid country,2019-08-01,,,,,KENYAA,
 Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck
 Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,
+"Missing space,after comma",2019-08-27,,,,,,

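| Line | dc.contributor.author | birthdate | dc.identifier.issn | dc.identifier.isbn | dc.language.iso | dc.subject | cg.coverage.country | filename |
|------|-----------------------|-----------|--------------------|--------------------|-----------------|------------|---------------------|----------|
| 23 | Uncommon filename extension | 2019-08-10 | | | | | | file.pdf.lck |
| 24 | Unneccesary unicode (U+002D + U+00AD) | 2019-08-10 | | 978-­92-­9043-­823-­6 | | | | |
| 25 | Missing space,after comma | 2019-08-27 | | | | | | |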

View File

@@ -13,7 +13,7 @@ install_requires = [
 setuptools.setup(
     name="csv-metadata-quality",
-    version="0.2.1",
+    version="0.2.2",
     author="Alan Orth",
     author_email="aorth@mjanja.ch",
     description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",

tests/test_check.py

@@ -69,10 +69,12 @@ def test_check_missing_date(capsys):
     value = None
-    check.date(value)
+    field_name = 'dc.date.issued'
+    check.date(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f'Missing date.\n'
+    assert captured.out == f'Missing date ({field_name}).\n'
 def test_check_multiple_dates(capsys):
@@ -80,10 +82,12 @@ def test_check_multiple_dates(capsys):
     value = '1990||1991'
-    check.date(value)
+    field_name = 'dc.date.issued'
+    check.date(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f'Multiple dates not allowed: {value}\n'
+    assert captured.out == f'Multiple dates not allowed ({field_name}): {value}\n'
 def test_check_invalid_date(capsys):
@@ -91,10 +95,12 @@ def test_check_invalid_date(capsys):
     value = '1990-0'
-    check.date(value)
+    field_name = 'dc.date.issued'
+    check.date(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f'Invalid date: {value}\n'
+    assert captured.out == f'Invalid date ({field_name}): {value}\n'
 def test_check_valid_date():
@@ -102,7 +108,9 @@ def test_check_valid_date():
     value = '1990'
-    result = check.date(value)
+    field_name = 'dc.date.issued'
+    result = check.date(value, field_name)
     assert result == value

tests/test_fix.py

@@ -56,3 +56,13 @@ def test_fix_newlines():
 ya'''
     assert fix.newlines(value) == 'Kenya'
+def test_fix_comma_space():
+    '''Test adding space after comma.'''
+    value = 'Orth,Alan S.'
+    field_name = 'dc.contributor.author'
+    assert fix.comma_space(value, field_name) == 'Orth, Alan S.'