Add support for removing newlines

This was tricky because of the nature of newlines. We are actually removing Unix line feeds (U+000A) here, because Windows carriage returns are already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly nor typing an "Enter" character after ^V worked. Grrr.
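
As an alternative to the Vim search and replace described above, a fixture row with an embedded line feed could also be generated from Python. This is only a sketch, not how the test data in this commit was produced: it uses the standard library `csv` module, and the `row` list simply mirrors the "Newline" row added to `data/test.csv`.

```python
import csv
import io

# Hypothetical fixture row mirroring the one added to data/test.csv.
row = ["Newline", "2019-07-30", "", "", "", "TANZA\nNIA"]

buffer = io.StringIO()
# lineterminator="\n" keeps the output Unix-style. The writer quotes the
# last field automatically because it contains a line feed.
csv.writer(buffer, lineterminator="\n").writerow(row)
csv_text = buffer.getvalue()

# Reading the text back shows that the line feed survives inside the
# quoted field, which is the situation the new fix is meant to handle.
parsed = next(csv.reader(io.StringIO(csv_text)))
```

Here `csv_text` holds the two physical lines shown in the `data/test.csv` hunk, and `parsed[-1]` still contains the line feed.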
2025-08-23 21:31:50 +02:00 · 2019-07-30 20:05:12 +03:00
parent 346e66ca98
commit 40d5f7d81b
5 changed files with 47 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -10,6 +10,7 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
 - Validate subjects against the AGROVOC REST API
 - Fix leading, trailing, and excessive (ie, more than one) whitespace
 - Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
+- Fix problematic newlines (line feeds) using `--unsafe-fixes`
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
 - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
 - Remove duplicate metadata values
--- a/csv_metadata_quality/app.py
+++ b/csv_metadata_quality/app.py
@@ -25,6 +25,10 @@ def main(argv):
        # Fix: whitespace
        df[column] = df[column].apply(fix.whitespace)

+        # Fix: newlines
+        if args.unsafe_fixes:
+            df[column] = df[column].apply(fix.newlines)
+
        # Fix: unnecessary Unicode
        df[column] = df[column].apply(fix.unnecessary_unicode)

--- a/csv_metadata_quality/fix.py
+++ b/csv_metadata_quality/fix.py
@@ -134,3 +134,34 @@ def duplicates(field):
    new_field = '||'.join(new_values)

    return new_field
+
+
+def newlines(field):
+    """Fix newlines.
+
+    Single metadata values should not span multiple lines because this is not
+    rendered properly in DSpace's XMLUI and even causes issues during import.
+
+    Implementation note: this currently only detects Unix line feeds (0x0a).
+    This is essentially when a user presses "Enter" to move to the next line.
+    Other newlines like the Windows carriage return are already handled with
+    the string stripping performed in the whitespace fixes.
+
+    Confusingly, in Vim '\n' matches a line feed when searching, but you must
+    use '\r' to *insert* a line feed, ie in a search and replace expression.
+
+    Return string with newlines removed.
+    """
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check for Unix line feed (LF)
+    match = re.findall(r'\n', field)
+
+    if match:
+        print(f'Removing newline: {field}')
+        field = field.replace('\n', '')
+
+    return field
--- a/data/test.csv
+++ b/data/test.csv
@@ -17,3 +17,5 @@ Invalid ISO 639-2 language,2019-07-29,,,jp,
 Invalid ISO 639-3 language,2019-07-29,,,chi,
 Invalid language,2019-07-29,,,Span,
 Invalid AGROVOC subject,2019-07-29,,,,FOREST
+Newline,2019-07-30,,,,"TANZA
+NIA"
--- a/tests/test_fix.py
+++ b/tests/test_fix.py
@@ -47,3 +47,12 @@ def test_fix_duplicates():
    value = 'Kenya||Kenya'

    assert fix.duplicates(value) == 'Kenya'
+
+
+def test_fix_newlines():
+    '''Test fixing newlines.'''
+
+    value = '''Ken
+ya'''
+
+    assert fix.newlines(value) == 'Kenya'
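
The split between the whitespace fix and the newline fix can be illustrated with plain string methods. The sketch below assumes the whitespace fix relies on `str.strip()`, as the docstring suggests (its code is not part of this diff), and `strip_then_remove_line_feeds` is a hypothetical helper, not a function in `fix.py`.

```python
def strip_then_remove_line_feeds(value: str) -> str:
    """Hypothetical helper: strip both ends, then drop embedded line feeds."""
    stripped = value.strip()
    return stripped.replace("\n", "")


trailing_crlf = "Kenya\r\n"  # line ending left over at the end of a value
embedded_lf = "Ken\nya"      # user pressed "Enter" in the middle of a value

# str.strip() removes whitespace, including "\r" and "\n", from the ends.
assert trailing_crlf.strip() == "Kenya"

# It does not touch a line feed in the middle of the value.
assert embedded_lf.strip() == "Ken\nya"

# Removing "\n" explicitly covers that remaining case.
assert strip_then_remove_line_feeds(embedded_lf) == "Kenya"
```

Note that `str.strip()` only touches the ends of a string, so a carriage return in the middle of a value would need separate handling.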