mirror of
https://github.com/ilri/csv-metadata-quality.git
synced 2024-12-22 12:12:18 +01:00
Always fix invalid multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
This is no longer classified as "unsafe" as I have yet to see a case where this was intentional, and it always causes issues when you import the data into a DSpace repository.
parent f00a07e2cd
commit 1008acf35e
@@ -4,6 +4,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Unreleased
+### Changed
+- Fixing invalid multi-value separators like `|` and `|||` is no longer
+  classified as "unsafe" as I have yet to see a case where this was intentional
+
 ## [0.4.6] - 2021-03-11
 ### Added
 - Validation of dcterms.license field against SPDX license identifiers
12
README.md
@@ -15,7 +15,7 @@ If you use the DSpace CSV metadata quality checker please cite:
 - Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
 - Validation of licenses against the list of [SPDX license identifiers](https://spdx.org/licenses)
 - Fix leading, trailing, and excessive (ie, more than one) whitespace
-- Fix invalid and unnecessary multi-value separators (`|`) using `--unsafe-fixes`
+- Fix invalid and unnecessary multi-value separators (`|`)
 - Fix problematic newlines (line feeds) using `--unsafe-fixes`
 - Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
@@ -55,14 +55,14 @@ To validate and clean a CSV file you must specify input and output files using t
 $ csv-metadata-quality -i data/test.csv -o /tmp/test.csv
 ```
 
-## Unsafe Fixes
-You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will attempt to fix invalid multi-value separators, remove newlines, and perform Unicode normalization.
+## Invalid Multi-Value Separators
+While it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. This utility will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
 
-### Invalid Multi-Value Separators
-This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
-
 This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
 
+## Unsafe Fixes
+You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines and perform Unicode normalization.
+
 ### Newlines
 This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
 
@@ -111,7 +111,6 @@ def run(argv):
 df[column] = df[column].apply(check.suspicious_characters, field_name=column)
 
 # Fix: invalid and unnecessary multi-value separators
-if args.unsafe_fixes:
 df[column] = df[column].apply(fix.separators, field_name=column)
 # Run whitespace fix again after fixing invalid separators
 df[column] = df[column].apply(fix.whitespace, field_name=column)
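To make the behaviour described in the README hunk concrete, here is a minimal sketch of the separator fix followed by the second whitespace pass. It is not the project's actual `fix.separators` or `fix.whitespace` code: the names `fix_separators`, `fix_whitespace`, and `clean_field` are hypothetical, and treating any run of `|` characters as one intended `||` separator is an illustrative assumption based on the `Kenya|Tanzania` and `Kenya||Tanzania||` examples above.

```python
import re

SEPARATOR = "||"  # multi-value separator, as in the README's `Kenya||Tanzania` example


def fix_whitespace(field: str) -> str:
    """Strip each value and collapse runs of whitespace to a single space."""
    values = [re.sub(r"\s{2,}", " ", v.strip()) for v in field.split(SEPARATOR)]
    return SEPARATOR.join(values)


def fix_separators(field: str) -> str:
    """Rewrite invalid separators (`|`, `|||`) as `||` and drop empty values.

    Assumption: any run of one or more `|` characters was meant as a
    single multi-value separator.
    """
    parts = [p for p in re.split(r"\|+", field) if p != ""]
    return SEPARATOR.join(parts)


def clean_field(field: str) -> str:
    """Apply the fixes unconditionally, mirroring the order in the hunk above."""
    field = fix_whitespace(field)
    field = fix_separators(field)
    # Fixing separators can expose new leading or trailing whitespace
    # around individual values, so the whitespace fix runs a second time.
    return fix_whitespace(field)


if __name__ == "__main__":
    print(clean_field("Kenya|Tanzania"))     # Kenya||Tanzania
    print(clean_field("Kenya||Tanzania||"))  # Kenya||Tanzania
    print(clean_field("Kenya | Tanzania"))   # Kenya||Tanzania
```

The third call shows why the whitespace pass is repeated: `Kenya | Tanzania` is a single value until the separator is corrected, after which each of the two values still carries a stray space.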