csv_metadata_quality/app.py: read fields as strings

I suspect this undermines the PyArrow backend performance gains introduced in Pandas 2.0.0, but we sometimes deal with messy data and must be able to rely on fields being strings.
2025-07-21 13:33:01 +02:00 · 2023-06-12 10:38:05 +03:00
parent f3fb1ff7fb
commit d21d2621e3
1 changed file with 2 additions and 1 deletion
--- a/csv_metadata_quality/app.py
+++ b/csv_metadata_quality/app.py
@@ -73,7 +73,8 @@ def run(argv):
    # set the signal handler for SIGINT (^C)
    signal.signal(signal.SIGINT, signal_handler)

-    df = pd.read_csv(args.input_file, dtype_backend="pyarrow")
+    # Read all fields as strings so dates don't get converted from 1998 to 1998.0
+    df = pd.read_csv(args.input_file, dtype_backend="pyarrow", dtype="str")

    # Check if the user requested to skip any fields
    if args.exclude_fields:
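
The effect described in the code comment can be reproduced with a small sketch. It assumes pandas's default (NumPy) backend rather than `dtype_backend="pyarrow"`, so it runs without PyArrow installed; type inference under the PyArrow backend may differ. The sample data and the helper names `read_inferred` and `read_as_strings` are hypothetical and not part of `csv_metadata_quality`.

```python
import io

import pandas as pd

# Hypothetical sample data: a year column with one missing value.
CSV_TEXT = "title,date\nFirst,1998\nSecond,\nThird,2001\n"


def read_inferred(text: str) -> pd.DataFrame:
    """Read CSV text and let pandas infer the column types."""
    return pd.read_csv(io.StringIO(text))


def read_as_strings(text: str) -> pd.DataFrame:
    """Read CSV text with every field kept as a string."""
    return pd.read_csv(io.StringIO(text), dtype="str")


inferred = read_inferred(CSV_TEXT)
as_strings = read_as_strings(CSV_TEXT)

# With inference, the missing value forces the column to float64,
# so the year 1998 is stored as the float 1998.0.
print(inferred["date"].iloc[0])

# With dtype="str", the year keeps its original text form.
print(as_strings["date"].iloc[0])
```

In this sketch the empty field is still read as a missing value in both cases; only the non-empty fields are affected by `dtype="str"`.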