Mirror of https://github.com/ilri/csv-metadata-quality.git (synced 2025-05-10 07:06:00 +02:00)

**Compare commits** — 26 commits
Commits in this comparison:

- `c354a3687c`
- `07f80cb37f`
- `89d72540f1`
- `81190d56bb`
- `2af714fb05`
- `cc863a6bdd`
- `113e7cd8b6`
- `bd984f3db5`
- `3f4e84a638`
- `c52b3ed131`
- `884e8f970d`
- `6d02f5026a`
- `e7cb8920db`
- `ed5612fbcf`
- `3247495cee`
- `7255bf4707`
- `3aaf18c290`
- `745306edd7`
- `e324e321a2`
- `232ff99898`
- `13d5221378`
- `3c7a9eb75b`
- `a99fbd8a51`
- `e801042340`
- `62ef2a4489`
- `9ce7dc6716`
**CHANGELOG.md** (16 changed lines):

```diff
@@ -4,6 +4,22 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.2] - 2019-08-27
+### Changed
+- Output of date checks to include column names (helps debugging in case there are multiple date fields)
+
+### Added
+- Ability to exclude certain fields using `--exclude-fields`
+- Fix for missing space after a comma, ie "Orth,Alan S."
+
+### Improved
+- AGROVOC lookup code
+
+## [0.2.1] - 2019-08-11
+### Added
+- Check for uncommon filename extensions
+- Replacement of unneccessary Unicode characters like soft hyphens (U+00AD)
+
 ## [0.2.0] - 2019-08-09
 ### Added
 - Handle Ctrl-C interrupt gracefully
```
**README.md** (13 changed lines) — the build badge and clone URLs move to the `ilri` GitHub organization, and several TODO items are added:

````diff
@@ -1,4 +1,4 @@
-# CSV Metadata Quality [](https://travis-ci.org/alanorth/csv-metadata-quality) [](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
+# CSV Metadata Quality [](https://travis-ci.org/ilri/csv-metadata-quality) [](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
 
 A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
 
 Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
@@ -19,7 +19,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
 
 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ pipenv install
 $ pipenv shell
@@ -28,7 +28,7 @@ $ pipenv shell
 Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:
 
 ```
-$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
+$ git clone https://github.com/ilri/csv-metadata-quality.git
 $ cd csv-metadata-quality
 $ python3 -m venv venv
 $ source venv/bin/activate
@@ -74,6 +74,13 @@ Invalid AGROVOC (cg.coverage.country): KENYAA
 - Reporting / summary
 - Better logging, for example with INFO, WARN, and ERR levels
 - Verbose, debug, or quiet options
+- Warn if an author is shorter than 3 characters?
+- Validate dc.rights field against SPDX? Perhaps with an option like `-m spdx` to enable the spdx module?
+- Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
+- Warn if two items use the same file in `filename` column
+- Add an option to drop invalid AGROVOC subjects?
+- Add check for author names with incorrect spacing after commas, ie "Orth,Alan S."
+- Add tests for application invocation, ie `tests/test_app.py`?
 
 ## License
 
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
````
Command-line argument parsing — `parse_args` gains the new `--exclude-fields` option:

```diff
@@ -15,6 +15,7 @@ def parse_args(argv):
     parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8'))
     parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true')
     parser.add_argument('--version', '-V', action='version', version=f'CSV Metadata Quality v{VERSION}')
+    parser.add_argument('--exclude-fields', '-x', help='Comma-separated list of fields to skip, for example: dc.contributor.author,dc.identifier.citation')
     args = parser.parse_args()
 
     return args
```
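The new option can be exercised in isolation. A minimal sketch of the parser with just the flags shown in the diff (the real parser has more arguments, and this trimmed version is only illustrative):

```python
import argparse

# Trimmed-down sketch of the CLI parser: only the --unsafe-fixes and
# --exclude-fields flags from the diff, not the full application parser.
parser = argparse.ArgumentParser(description='CSV Metadata Quality')
parser.add_argument('--unsafe-fixes', '-u', action='store_true')
parser.add_argument('--exclude-fields', '-x',
                    help='Comma-separated list of fields to skip')

# Parse a sample command line instead of sys.argv
args = parser.parse_args(['-u', '-x', 'dc.contributor.author,dc.subject'])
print(args.unsafe_fixes)    # True
print(args.exclude_fields)  # dc.contributor.author,dc.subject
```

Note that `--exclude-fields` arrives as a single comma-separated string; the splitting happens later in `run()`.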
The `run()` function gains the field-exclusion logic at the top of the per-column loop:

```diff
@@ -34,6 +35,19 @@ def run(argv):
     df = pd.read_csv(args.input_file, dtype=str)
 
     for column in df.columns.values.tolist():
+        # Check if the user requested to skip any fields
+        if args.exclude_fields:
+            skip = False
+            # Split the list of excludes on ',' so we can test exact matches
+            # rather than fuzzy matches with regexes or "if word in string"
+            for exclude in args.exclude_fields.split(','):
+                if column == exclude and skip is False:
+                    skip = True
+            if skip:
+                print(f'Skipping {column}')
+
+                continue
+
         # Fix: whitespace
         df[column] = df[column].apply(fix.whitespace)
 
```
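The point of splitting on `,` and comparing with `==` is that excludes match whole column names only. A standalone sketch of the same exact-match behaviour (the column names here are illustrative, not from the repository's test data):

```python
# Sketch of the --exclude-fields skip logic: split the comma-separated
# excludes once and keep only exact column-name matches.
def excluded_columns(columns, exclude_fields):
    """Return the columns that would be skipped for a given --exclude-fields value."""
    if not exclude_fields:
        return []
    excludes = exclude_fields.split(',')
    # Exact comparison, so 'dc.date' does not accidentally match 'dc.date.issued'
    return [column for column in columns if column in excludes]

columns = ['dc.contributor.author', 'dc.date.issued', 'dc.subject']
print(excluded_columns(columns, 'dc.contributor.author,dc.subject'))
# → ['dc.contributor.author', 'dc.subject']
```

A substring test (`if exclude in column`) would instead skip `dc.date.issued` when the user asked to exclude only `dc.date`, which is why the code insists on exact matches.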
The new comma-space fix is wired into `run()` as an unsafe fix, gated to author and citation columns:

```diff
@@ -41,6 +55,13 @@ def run(argv):
         if args.unsafe_fixes:
             df[column] = df[column].apply(fix.newlines)
 
+        # Fix: missing space after comma. Only run on author and citation
+        # fields for now, as this problem is mostly an issue in names.
+        if args.unsafe_fixes:
+            match = re.match(r'^.*?(author|citation).*$', column)
+            if match is not None:
+                df[column] = df[column].apply(fix.comma_space, field_name=column)
+
         # Fix: unnecessary Unicode
         df[column] = df[column].apply(fix.unnecessary_unicode)
 
```
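The column-name gate is a plain regex over the column name. Checked in isolation (column names are illustrative):

```python
import re

# The comma-space fix only runs on columns whose names contain
# "author" or "citation", using the same regex as the diff above.
def is_name_like_column(column):
    return re.match(r'^.*?(author|citation).*$', column) is not None

print(is_name_like_column('dc.contributor.author'))   # True
print(is_name_like_column('dc.identifier.citation'))  # True
print(is_name_like_column('dc.date.issued'))          # False
```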
The date check now receives the column name, and the new filename-extension check is wired in for the `filename` column:

```diff
@@ -84,7 +105,11 @@ def run(argv):
         # Check: invalid date
         match = re.match(r'^.*?date.*$', column)
         if match is not None:
-            df[column] = df[column].apply(check.date)
+            df[column] = df[column].apply(check.date, field_name=column)
+
+        # Check: filename extension
+        if column == 'filename':
+            df[column] = df[column].apply(check.filename_extension)
 
     # Write
     df.to_csv(args.output_file, index=False)
```
`check.date` now takes the field name so its messages identify the column:

```diff
@@ -75,7 +75,7 @@ def separators(field):
     return field
 
 
-def date(field):
+def date(field, field_name):
    """Check if a date is valid.
 
    In DSpace the issue date is usually 1990, 1990-01, or 1990-01-01, but it
@@ -88,7 +88,7 @@ def date(field):
    from datetime import datetime
 
    if pd.isna(field):
-        print(f'Missing date.')
+        print(f'Missing date ({field_name}).')
 
        return
 
@@ -97,7 +97,7 @@ def date(field):
 
    # We don't allow multi-value date fields
    if len(multiple_dates) > 1:
-        print(f'Multiple dates not allowed: {field}')
+        print(f'Multiple dates not allowed ({field_name}): {field}')
 
        return field
 
@@ -123,7 +123,7 @@ def date(field):
 
        return field
    except ValueError:
-        print(f'Invalid date: {field}')
+        print(f'Invalid date ({field_name}): {field}')
 
        return field
 
```
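The docstring mentions the three DSpace-style date shapes the check accepts. A simplified standalone sketch of that validation (this is an assumption about the parsing strategy, using `strptime` against the three formats; the real `check.date` also prints per-column messages and returns the field):

```python
from datetime import datetime

# Accept the three issue-date shapes mentioned in the docstring:
# 1990, 1990-01, and 1990-01-01. True when any format parses cleanly.
def is_valid_issue_date(value):
    for fmt in ('%Y', '%Y-%m', '%Y-%m-%d'):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

print(is_valid_issue_date('1990'))        # True
print(is_valid_issue_date('1990-01-01'))  # True
print(is_valid_issue_date('1990-0'))      # False
```

Because `strptime` rejects both trailing unconverted text and out-of-range components, `1990-0` and the fixture value `2019-07-260` both fail every format.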
`check.agrovoc` — the `re` import goes away along with the subject pre-filter:

```diff
@@ -212,7 +212,6 @@ def agrovoc(field, field_name):
    """
 
    from datetime import timedelta
-    import re
    import requests
    import requests_cache
 
```
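The change removes a subject-shaped pre-filter from the AGROVOC lookup, so every value is now sent to the API. For reference, the removed pattern and the example subjects its comments listed can be exercised standalone:

```python
import re

# The (now removed) pre-filter that AGROVOC values had to pass before
# the HTTP lookup: word-like subjects that may contain spaces, quotes,
# dashes, dots, and parentheses. The subjects are the examples from the
# removed comments.
pattern = re.compile(r'^[\w\-\.\'\(\)]+?[\w\s\-\.\'\(\)]+$')

example_subjects = [
    "XANTHOMONAS CAMPESTRIS PV. MANIHOTIS",
    "WOMEN'S PARTICIPATION",
    "COMMUNITY-BASED FOREST MANAGEMENT",
    "INTERACCIÓN GENOTIPO AMBIENTE",
    "COCOA (PLANT)",
]

for subject in example_subjects:
    print(bool(pattern.match(subject)))  # True for each

# Values that start with whitespace never matched the first character class
print(bool(pattern.match(' FOREST')))  # False
```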
The lookup itself now runs for every value, and a new `filename_extension` check is added:

```diff
@@ -222,34 +221,67 @@ def agrovoc(field, field_name):
 
    # Try to split multi-value field on "||" separator
    for value in field.split('||'):
-        # match lines beginning with words, paying attention to subjects with
-        # special characters like spaces, quotes, dashes, parentheses, etc:
-        # SUBJECT
-        # ANOTHER SUBJECT
-        # XANTHOMONAS CAMPESTRIS PV. MANIHOTIS
-        # WOMEN'S PARTICIPATION
-        # COMMUNITY-BASED FOREST MANAGEMENT
-        # INTERACCIÓN GENOTIPO AMBIENTE
-        # COCOA (PLANT)
-        pattern = re.compile(r'^[\w\-\.\'\(\)]+?[\w\s\-\.\'\(\)]+$')
-
-        if pattern.match(value):
-            request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'
-
-            # enable transparent request cache with thirty days expiry
-            expire_after = timedelta(days=30)
-            requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)
-
-            request = requests.get(request_url)
-
-            # prune old cache entries
-            requests_cache.core.remove_expired_responses()
-
-            if request.status_code == requests.codes.ok:
-                data = request.json()
-
-                # check if there are any results
-                if len(data['results']) == 0:
-                    print(f'Invalid AGROVOC ({field_name}): {value}')
+        request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'
+
+        # enable transparent request cache with thirty days expiry
+        expire_after = timedelta(days=30)
+        requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after)
+
+        request = requests.get(request_url)
+
+        # prune old cache entries
+        requests_cache.core.remove_expired_responses()
+
+        if request.status_code == requests.codes.ok:
+            data = request.json()
+
+            # check if there are any results
+            if len(data['results']) == 0:
+                print(f'Invalid AGROVOC ({field_name}): {value}')
 
    return field
+
+
+def filename_extension(field):
+    """Check filename extension.
+
+    CSVs with a 'filename' column are likely meant as input for the SAFBuilder
+    tool, which creates a Simple Archive Format bundle for importing metadata
+    with accompanying PDFs or other files into DSpace.
+
+    This check warns if a filename has an uncommon extension (that is, other
+    than .pdf, .xls(x), .doc(x), ppt(x), case insensitive).
+    """
+
+    import re
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Try to split multi-value field on "||" separator
+    values = field.split('||')
+
+    # List of common filename extentions
+    common_filename_extensions = ['.pdf', '.doc', '.docx', '.ppt', '.pptx', '.xls', '.xlsx']
+
+    # Iterate over all values
+    for value in values:
+        # Assume filename extension does not match
+        filename_extension_match = False
+
+        for filename_extension in common_filename_extensions:
+            # Check for extension at the end of the filename
+            pattern = re.escape(filename_extension) + r'$'
+            match = re.search(pattern, value, re.IGNORECASE)
+
+            if match is not None:
+                # Register the match and stop checking for this filename
+                filename_extension_match = True
+
+                break
+
+        if filename_extension_match is False:
+            print(f'Filename with uncommon extension: {value}')
+
+    return field
```
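The extension test anchors each pattern at the end of the string, so a lock file like `file.pdf.lck` is flagged even though it contains `.pdf`. The core of the check, condensed into a standalone sketch:

```python
import re

# Standalone version of the uncommon-extension test: a filename passes
# only when it *ends* with one of the common document extensions,
# compared case-insensitively.
COMMON_EXTENSIONS = ['.pdf', '.doc', '.docx', '.ppt', '.pptx', '.xls', '.xlsx']

def has_common_extension(filename):
    return any(
        re.search(re.escape(ext) + r'$', filename, re.IGNORECASE)
        for ext in COMMON_EXTENSIONS
    )

print(has_common_extension('report.PDF'))    # True
print(has_common_extension('file.pdf.lck'))  # False
print(has_common_extension('slides.pptx'))   # True
```

`re.escape` matters here: without it the `.` in each extension would match any character, so `filexpdf` would wrongly pass.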
`fix.unnecessary_unicode` now also replaces soft hyphens:

```diff
@@ -68,14 +68,17 @@ def separators(field):
 
 
 def unnecessary_unicode(field):
-    """Remove unnecessary Unicode characters.
+    """Remove and replace unnecessary Unicode characters.
 
    Removes unnecessary Unicode characters like:
        - Zero-width space (U+200B)
        - Replacement character (U+FFFD)
        - No-break space (U+00A0)
 
-    Return string with characters removed.
+    Replaces unnecessary Unicode characters like:
+        - Soft hyphen (U+00AD) → hyphen
+
+    Return string with characters removed or replaced.
    """
 
    # Skip fields with missing values
@@ -106,6 +109,14 @@ def unnecessary_unicode(field):
        print(f'Removing unnecessary Unicode (U+00A0): {field}')
        field = re.sub(pattern, '', field)
 
+    # Check for soft hyphens (U+00AD), sometimes preceeded with a normal hyphen
+    pattern = re.compile(r'\u002D*?\u00AD')
+    match = re.findall(pattern, field)
+
+    if match:
+        print(f'Replacing unnecessary Unicode (U+00AD): {field}')
+        field = re.sub(pattern, '-', field)
+
    return field
 
 
```
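The soft-hyphen pattern consumes any run of ordinary hyphens (U+002D) directly before the soft hyphen (U+00AD), so "hyphen plus soft hyphen" collapses to a single hyphen rather than doubling up. A minimal demonstration:

```python
import re

# Same pattern as the diff: optional ordinary hyphens followed by a
# soft hyphen, replaced by a single visible hyphen.
pattern = re.compile(r'\u002D*?\u00AD')

print(re.sub(pattern, '-', 'Pover-\u00adty'))  # hyphen + soft hyphen → Pover-ty
print(re.sub(pattern, '-', 'co\u00adoperation'))  # lone soft hyphen → co-operation
```

This is the situation exercised by the new "Unneccesary unicode (U+002D + U+00AD)" row in the test CSV, where an ISBN contains a hyphen followed by a soft hyphen.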
The new `fix.comma_space` function:

```diff
@@ -165,3 +176,27 @@ def newlines(field):
    field = field.replace('\n', '')
 
    return field
+
+
+def comma_space(field, field_name):
+    """Fix occurrences of commas missing a trailing space, for example:
+
+    Orth,Alan S.
+
+    This is a very common mistake in author and citation fields.
+
+    Return string with a space added.
+    """
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check for comma followed by a word character
+    match = re.findall(r',\w', field)
+
+    if match:
+        print(f'Adding space after comma ({field_name}): {field}')
+        field = re.sub(r',(\w)', r', \1', field)
+
+    return field
```
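The substitution inserts a space before any word character that follows a comma, keeping the character itself via a backreference. A standalone sketch, including a case that shows why the fix is "unsafe" and gated to author/citation columns:

```python
import re

# Same substitution as fix.comma_space: ",X" becomes ", X" for any
# word character X. Note it would also rewrite "1,000" as "1, 000",
# which is one reason the fix is restricted to name-like columns.
def add_space_after_comma(value):
    return re.sub(r',(\w)', r', \1', value)

print(add_space_after_comma('Orth,Alan S.'))  # → Orth, Alan S.
print(add_space_after_comma('1,000 ha'))      # → 1, 000 ha
```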
Version bump:

```diff
@@ -1 +1 @@
-VERSION = '0.2.0'
+VERSION = '0.2.2'
```
The test fixture CSV — every row gains a trailing `filename` column, and three new rows exercise the new checks and fixes:

```diff
@@ -1,23 +1,26 @@
-dc.contributor.author,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country
+dc.contributor.author,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename
-Leading space,2019-07-29,,,,,
+Leading space,2019-07-29,,,,,,
-Trailing space ,2019-07-29,,,,,
+Trailing space ,2019-07-29,,,,,,
-Excessive space,2019-07-29,,,,,
+Excessive space,2019-07-29,,,,,,
-Miscellaenous ||whitespace | issues ,2019-07-29,,,,,
+Miscellaenous ||whitespace | issues ,2019-07-29,,,,,,
-Duplicate||Duplicate,2019-07-29,,,,,
+Duplicate||Duplicate,2019-07-29,,,,,,
-Invalid ISSN,2019-07-29,2321-2302,,,,
+Invalid ISSN,2019-07-29,2321-2302,,,,,
-Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,
+Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,,
-Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,
+Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,,
-Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,
+Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,,
-Invalid date,2019-07-260,,,,,
+Invalid date,2019-07-260,,,,,,
-Multiple dates,2019-07-26||2019-01-10,,,,,
+Multiple dates,2019-07-26||2019-01-10,,,,,,
-Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,
+Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,,
-Unnecessary Unicode,2019-07-29,,,,,
+Unnecessary Unicode,2019-07-29,,,,,,
-Suspicious character||foreˆt,2019-07-29,,,,,
+Suspicious character||foreˆt,2019-07-29,,,,,,
-Invalid ISO 639-2 language,2019-07-29,,,jp,,
+Invalid ISO 639-2 language,2019-07-29,,,jp,,,
-Invalid ISO 639-3 language,2019-07-29,,,chi,,
+Invalid ISO 639-3 language,2019-07-29,,,chi,,,
-Invalid language,2019-07-29,,,Span,,
+Invalid language,2019-07-29,,,Span,,,
-Invalid AGROVOC subject,2019-07-29,,,,FOREST,
+Invalid AGROVOC subject,2019-07-29,,,,FOREST,,
 Newline (LF),2019-07-30,,,,"TANZA
-NIA",
+NIA",,
-Missing date,,,,,,
+Missing date,,,,,,,
-Invalid country,2019-08-01,,,,,KENYAA
+Invalid country,2019-08-01,,,,,KENYAA,
+Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck
+Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-92-9043-823-6,,,,
+"Missing space,after comma",2019-08-27,,,,,,
```
**setup.py** (2 changed lines):

```diff
@@ -13,7 +13,7 @@ install_requires = [
 
 setuptools.setup(
     name="csv-metadata-quality",
-    version="0.2.0",
+    version="0.2.2",
     author="Alan Orth",
     author_email="aorth@mjanja.ch",
     description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
```
The date-check tests are updated for the new `field_name` argument:

```diff
@@ -69,10 +69,12 @@ def test_check_missing_date(capsys):
 
    value = None
 
-    check.date(value)
+    field_name = 'dc.date.issued'
+
+    check.date(value, field_name)
 
    captured = capsys.readouterr()
-    assert captured.out == f'Missing date.\n'
+    assert captured.out == f'Missing date ({field_name}).\n'
 
 
 def test_check_multiple_dates(capsys):
@@ -80,10 +82,12 @@ def test_check_multiple_dates(capsys):
 
    value = '1990||1991'
 
-    check.date(value)
+    field_name = 'dc.date.issued'
+
+    check.date(value, field_name)
 
    captured = capsys.readouterr()
-    assert captured.out == f'Multiple dates not allowed: {value}\n'
+    assert captured.out == f'Multiple dates not allowed ({field_name}): {value}\n'
 
 
 def test_check_invalid_date(capsys):
@@ -91,10 +95,12 @@ def test_check_invalid_date(capsys):
 
    value = '1990-0'
 
-    check.date(value)
+    field_name = 'dc.date.issued'
+
+    check.date(value, field_name)
 
    captured = capsys.readouterr()
-    assert captured.out == f'Invalid date: {value}\n'
+    assert captured.out == f'Invalid date ({field_name}): {value}\n'
 
 
 def test_check_valid_date():
@@ -102,7 +108,9 @@ def test_check_valid_date():
 
    value = '1990'
 
-    result = check.date(value)
+    field_name = 'dc.date.issued'
+
+    result = check.date(value, field_name)
 
    assert result == value
 
```
New tests for the filename-extension check:

```diff
@@ -194,3 +202,24 @@ def test_check_valid_agrovoc():
    result = check.agrovoc(value, field_name)
 
    assert result == value
+
+
+def test_check_uncommon_filename_extension(capsys):
+    '''Test uncommon filename extension.'''
+
+    value = 'file.pdf.lck'
+
+    check.filename_extension(value)
+
+    captured = capsys.readouterr()
+    assert captured.out == f'Filename with uncommon extension: {value}\n'
+
+
+def test_check_common_filename_extension():
+    '''Test common filename extension.'''
+
+    value = 'file.pdf'
+
+    result = check.filename_extension(value)
+
+    assert result == value
```
New test for the comma-space fix:

```diff
@@ -56,3 +56,13 @@ def test_fix_newlines():
 ya'''
 
    assert fix.newlines(value) == 'Kenya'
+
+
+def test_fix_comma_space():
+    '''Test adding space after comma.'''
+
+    value = 'Orth,Alan S.'
+
+    field_name = 'dc.contributor.author'
+
+    assert fix.comma_space(value, field_name) == 'Orth, Alan S.'
```