I have a CSV file that contains errors. The most common one is a premature line break.
But I don’t know the best way to remove it. If I read the file line by line with
    with open("test.csv", "r") as reader:
        test = reader.read().splitlines()
the wrong structure is already in my variable. Is this still the right approach? Should I loop over test and build a copy, or can I modify test directly while iterating over it?
I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with one. So maybe counting separators would be an alternative way to solve it?
EDIT: I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;
    for line in lines:
        if "Foobar" in line:
            line = line.replace("Foobar", "")
        if ";\n" in line:
            line = line.replace(";\n", ";")
The only thing that remains is rows that begin with a ; since for those I need to go back one entry in the list:

    Col_a;Col_b;Col_c;Col_d
    2021;Foobar;Bla
    ;Blub
Blub belongs in the row above.
Here’s a simple Python script to merge lines until you have the desired number of fields.
    import sys

    sep = ';'
    fields = 4
    collected = []

    for line in sys.stdin:
        new = line.rstrip('\n').split(sep)
        if collected:
            # Continuation line: its first piece completes the last field so far
            collected[-1] += new[0]
            collected.extend(new[1:])
        else:
            collected = new
        if len(collected) < fields:
            # Row is still incomplete; keep reading
            continue
        print(sep.join(collected))
        collected = []
This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost. The separator and the number of fields can be edited in the variables at the top; exposing these as command-line parameters is left as an exercise.
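As a rough sketch of that exercise, the same logic can be wrapped in a generator and driven by argparse. The names merge_rows, --sep and --fields are made up for illustration, and this variant makes one assumption the script above does not: a short final row is printed rather than dropped.

```python
import argparse
import sys


def merge_rows(lines, sep=";", fields=4):
    """Yield rows, joining physical lines until each has `fields` fields."""
    collected = []
    for line in lines:
        new = line.rstrip("\n").split(sep)
        if collected:
            # Continuation line: its first piece completes the last field.
            collected[-1] += new[0]
            collected.extend(new[1:])
        else:
            collected = new
        if len(collected) >= fields:
            yield sep.join(collected)
            collected = []
    if collected:
        # Assumption: emit a short final row instead of silently losing it.
        yield sep.join(collected)


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Merge prematurely broken CSV lines."
    )
    parser.add_argument("--sep", default=";", help="field separator")
    parser.add_argument("--fields", type=int, default=4, help="fields per row")
    args = parser.parse_args(argv)
    for row in merge_rows(sys.stdin, args.sep, args.fields):
        print(row)
```

Calling main() at the bottom of the file turns this into a filter that reads standard input, for example with --fields 4 and test.csv redirected into it. Keeping the merging in merge_rows also makes it easy to run on a list of lines read with readlines().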
If you wanted to keep the newlines, it would not be too hard to strip the newline only from the last field of each completed row, and use csv.writer to write the fields back out as properly quoted CSV.
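One possible way to do that is sketched below, not as a definitive implementation. The function names merge_keep_newlines and write_rows are hypothetical, and the sample data is taken from the question. Each line is split without stripping, so a premature line break stays inside its field, and only the final field of a completed row loses its newline.

```python
import csv
import io


def merge_keep_newlines(lines, sep=";", fields=4):
    """Yield lists of fields, keeping premature line breaks inside the field."""
    collected = []
    for line in lines:
        new = line.split(sep)  # the newline stays attached to the last piece
        if collected:
            collected[-1] += new[0]
            collected.extend(new[1:])
        else:
            collected = new
        if len(collected) >= fields:
            # Only the real end-of-row newline is removed.
            collected[-1] = collected[-1].rstrip("\n")
            yield collected
            collected = []


def write_rows(rows, out, sep=";"):
    """Write rows as CSV; csv.writer quotes fields that contain a newline."""
    writer = csv.writer(out, delimiter=sep, lineterminator="\n")
    writer.writerows(rows)


sample = ["Col_a;Col_b;Col_c;Col_d\n", "2021;Foobar;Bla\n", ";Blub\n"]
buffer = io.StringIO()
write_rows(merge_keep_newlines(sample), buffer)
print(buffer.getvalue(), end="")
```

Here the field Bla keeps its trailing line break and is written inside double quotes, so a CSV reader will see it as one field of a single row. As in the original script, an incomplete final row is dropped.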