Add support for removing newlines

This was tricky because of the nature of newlines. We are actually removing Unix line feeds (U+000A) here, because Windows carriage returns are already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly nor typing an "Enter" character after ^V worked. Grrr.
2025-08-23 05:11:49 +02:00 · 2019-07-30 20:05:12 +03:00
parent 346e66ca98
commit 40d5f7d81b
5 changed files with 47 additions and 0 deletions
--- a/csv_metadata_quality/fix.py
+++ b/csv_metadata_quality/fix.py
@@ -134,3 +134,34 @@ def duplicates(field):
    new_field = '||'.join(new_values)

    return new_field
+
+
+def newlines(field):
+    """Fix newlines.
+
+    Single metadata values should not span multiple lines because this is not
+    rendered properly in DSpace's XMLUI and even causes issues during import.
+
+    Implementation note: this currently only detects Unix line feeds (0x0a).
+    This is essentially when a user presses "Enter" to move to the next line.
+    Other newlines like the Windows carriage return are already handled with
+    the string stripping performed in the whitespace fixes.
+
+    Confusingly, in Vim '\n' matches a line feed when searching, but you must
+    use '\r' to *insert* a line feed, i.e. in a search and replace expression.
+
+    Return string with newlines removed.
+    """
+
+    # Skip fields with missing values
+    if pd.isna(field):
+        return
+
+    # Check for Unix line feed (LF)
+    match = re.findall(r'\n', field)
+
+    if match:
+        print(f'Removing newline: {field}')
+        field = field.replace('\n', '')
+
+    return field
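
The behavior added in this hunk can be tried in isolation with a minimal sketch like the one below. It is not the project's code: `remove_line_feeds` is a hypothetical stand-in for `newlines()`, and the missing-value check uses the standard library instead of `pd.isna` so that the example runs without pandas. The last line assumes that the whitespace fix mentioned in the commit message relies on `str.strip()`, which this diff does not show.

```python
import math


def remove_line_feeds(field):
    """Return field with Unix line feeds (U+000A) removed.

    Hypothetical stand-in for newlines() in the diff. Missing values are
    detected without pandas so the sketch has no third-party dependencies.
    """
    # Treat None and float NaN as missing (the diff uses pd.isna here)
    if field is None or (isinstance(field, float) and math.isnan(field)):
        return None

    # A plain membership test is enough to detect a line feed
    if "\n" in field:
        print(f"Removing newline: {field!r}")
        field = field.replace("\n", "")

    return field


# A value that spans two lines is joined back together
cleaned = remove_line_feeds("Ken\nya")

# A carriage return inside the value is not touched by this function
interior_cr = remove_line_feeds("Ken\rya")

# Assumption: the whitespace fix uses str.strip(), which trims CR and LF
# only at the start and end of the string
stripped = " Kenya\r\n".strip()
```

Note that `str.strip()` only trims characters at the ends of a string, so under the assumption above a carriage return in the middle of a value would survive both fixes. Whether that case is handled depends on the whitespace fix itself, which is outside this diff.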