mirror of
https://github.com/ilri/csv-metadata-quality.git
synced 2024-11-25 23:28:18 +01:00
Add support for removing newlines
This was tricky because of the nature of newlines. We are actually removing Unix line feeds (U+000A) here, because Windows carriage returns are already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly nor typing an "Enter" character after ^V worked. Grrr.
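As a rough illustration of the point about stripping (not code from this commit): Python's `str.strip()` removes whitespace, including carriage returns and line feeds, only at the ends of a string, so a line feed in the middle of a value survives it. The sketch also shows that a value with an embedded line feed can be built with the `\n` escape or `chr(0x0A)` instead of being typed in an editor. The variable names are made up for the example.

```python
# A value with an embedded Unix line feed and a trailing Windows line ending.
raw = "TANZA\nNIA\r\n"

# str.strip() only removes leading and trailing whitespace.
stripped = raw.strip()
print(repr(stripped))  # 'TANZA\nNIA'

# The embedded line feed (U+000A) has to be removed explicitly.
cleaned = stripped.replace("\n", "")
print(cleaned)  # TANZANIA

# The same character written as an escape and built from its code point.
assert "Ken\nya" == "Ken" + chr(0x0A) + "ya"
```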
parent 346e66ca98
commit 40d5f7d81b
@@ -10,6 +10,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 - Validate subjects against the AGROVOC REST API
 - Fix leading, trailing, and excessive (ie, more than one) whitespace
 - Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
+- Fix problematic newlines (line feeds) using `--unsafe-fixes`
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
 - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
 - Remove duplicate metadata values
@@ -25,6 +25,10 @@ def main(argv):
         # Fix: whitespace
         df[column] = df[column].apply(fix.whitespace)
 
+        # Fix: newlines
+        if args.unsafe_fixes:
+            df[column] = df[column].apply(fix.newlines)
+
         # Fix: unnecessary Unicode
         df[column] = df[column].apply(fix.unnecessary_unicode)
 
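The hunk above only runs the newline fix when `args.unsafe_fixes` is set. A minimal sketch of that gating follows, assuming an argparse-style parser (the diff does not show how `args` is built) and using a hypothetical `remove_line_feeds` helper in place of `fix.newlines`:

```python
import argparse


def remove_line_feeds(value):
    """Hypothetical stand-in for fix.newlines (no missing-value handling)."""
    return value.replace("\n", "")


parser = argparse.ArgumentParser()
# argparse exposes the "--unsafe-fixes" flag as the attribute "unsafe_fixes".
parser.add_argument("--unsafe-fixes", action="store_true")


def clean(values, argv):
    """Apply the newline fix to each value only when the flag is given."""
    args = parser.parse_args(argv)
    if args.unsafe_fixes:
        values = [remove_line_feeds(v) for v in values]
    return values


print(clean(["TANZA\nNIA"], []))                  # ['TANZA\nNIA']
print(clean(["TANZA\nNIA"], ["--unsafe-fixes"]))  # ['TANZANIA']
```

Without the flag the value is left untouched, which matches the README's description of this as an unsafe fix.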
@@ -134,3 +134,34 @@ def duplicates(field):
     new_field = '||'.join(new_values)
 
     return new_field
+
+
+def newlines(field):
+    """Fix newlines.
+
+    Single metadata values should not span multiple lines because this is not
+    rendered properly in DSpace's XMLUI and even causes issues during import.
+
+    Implementation note: this currently only detects Unix line feeds (0x0a).
+    This is essentially when a user presses "Enter" to move to the next line.
+    Other newlines like the Windows carriage return are already handled with
+    the string stripping performed in the whitespace fixes.
+
+    Confusingly, in Vim '\n' matches a line feed when searching, but you must
+    use '\r' to *insert* a line feed, ie in a search and replace expression.
+
+    Return string with newlines removed.
+    """
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check for Unix line feed (LF)
+    match = re.findall(r'\n', field)
+
+    if match:
+        print(f'Removing newline: {field}')
+        field = field.replace('\n', '')
+
+    return field
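The detect-then-replace logic in `newlines` can be tried on its own. The sketch below is a standalone approximation, not the project's function: `strip_line_feeds` is a hypothetical name, and the `pd.isna` check is replaced by a plain `None` check so that the example needs only the standard library.

```python
import re


def strip_line_feeds(field):
    """Remove Unix line feeds from one value, reporting when any were found."""
    # Simplified missing-value check (the original uses pd.isna).
    if field is None:
        return None

    # re.findall returns one entry per line feed; an empty list is falsy.
    matches = re.findall(r"\n", field)

    if matches:
        print(f"Removing {len(matches)} line feed(s): {field!r}")
        field = field.replace("\n", "")

    return field


print(strip_line_feeds("Ken\nya"))  # Kenya
```

Only U+000A is targeted here, so `strip_line_feeds("a\r\nb")` returns `"a\rb"`: a carriage return inside a value is left for other fixes to handle.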
@@ -17,3 +17,5 @@ Invalid ISO 639-2 language,2019-07-29,,,jp,
 Invalid ISO 639-3 language,2019-07-29,,,chi,
 Invalid language,2019-07-29,,,Span,
 Invalid AGROVOC subject,2019-07-29,,,,FOREST
+Newline,2019-07-30,,,,"TANZA
+NIA"
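The new test row spans two physical lines but is still a single CSV record, because the line feed sits inside a quoted field. As a small check of that, here is a sketch using Python's standard `csv` module (the project itself reads files with Pandas, so this is only an illustration):

```python
import csv
import io

# The added row, with the line feed written as an explicit escape.
data = 'Newline,2019-07-30,,,,"TANZA\nNIA"\n'

# newline="" leaves line endings untouched so the csv module sees them as-is.
rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(rows))          # 1
print(repr(rows[0][-1]))  # 'TANZA\nNIA'
```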
@@ -47,3 +47,12 @@ def test_fix_duplicates():
     value = 'Kenya||Kenya'
 
     assert fix.duplicates(value) == 'Kenya'
+
+
+def test_fix_newlines():
+    '''Test fixing newlines.'''
+
+    value = '''Ken
+ya'''
+
+    assert fix.newlines(value) == 'Kenya'