mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-05-09 22:56:01 +02:00

Add check for "suspicious" characters

These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
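
The check described above could look roughly like the following. This is only a minimal sketch, not the code from this commit: the function names, the exact set of characters, and the wording of the warning are all assumptions made for illustration.

```python
import re
from typing import List, Optional

# Assumed set of "suspicious" standalone characters (hypothetical, not
# taken from the commit): acute accent, modifier circumflex, small tilde,
# and diaeresis.
SUSPICIOUS_CHARACTERS = "\u00b4\u02c6\u02dc\u00a8"
SUSPICIOUS_RE = re.compile(f"[{SUSPICIOUS_CHARACTERS}]")


def find_suspicious_characters(value: Optional[str]) -> List[str]:
    """Return every suspicious standalone character found in value."""
    # Skip missing or empty values
    if not value:
        return []
    return SUSPICIOUS_RE.findall(value)


def check_field(value: Optional[str], field_name: str) -> bool:
    """Print a warning if value contains suspicious characters.

    The value is never modified, because the intended accented letter
    cannot be recovered reliably.
    """
    found = find_suspicious_characters(value)
    if found:
        print(f"Suspicious character ({field_name}): {value}")
    return bool(found)
```

Calling `check_field("fore\u02c6t", "title")` prints a warning and returns `True`, while a correctly encoded value such as `"for\u00eat"` passes silently.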
2019-07-29 17:08:49 +03:00
parent 8047a57cc5
commit fa4fa3491b
5 changed files with 39 additions and 0 deletions


@@ -11,6 +11,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
- Fix leading, trailing, and excessive whitespace ✓
- Fix invalid multi-value separators ("|") using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc ✓
- Check for "suspicious" characters that could indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt" ✓
## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):