mirror of
https://github.com/ilri/csv-metadata-quality.git
synced 2025-05-09 14:46:00 +02:00
Add support for removing newlines
This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't fig- ure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) direc- tly or typing an "Enter" character after ^V worked. Grrr.
This commit is contained in:
@ -10,6 +10,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
|
||||
- Validate subjects against the AGROVOC REST API
|
||||
- Fix leading, trailing, and excessive (ie, more than one) whitespace
|
||||
- Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
|
||||
- Fix problematic newlines (line feeds) using `--unsafe-fixes`
|
||||
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
|
||||
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
|
||||
- Remove duplicate metadata values
|
||||
|
Reference in New Issue
Block a user