I have two csv files, and I want to combine them into one csv file. Assume the two files are A.csv and B.csv; I already know there are some conflicts between them. For example, both have two columns, ID and name: in A.csv, ID "12345" has name "Jack", while in B.csv, ID "12345" has name "Tom". So there is a conflict where the same ID has different names. I want to keep ID "12345", choose the name from A.csv, and abandon the name from B.csv. How can I do that?
Here is some code I have tried. It can combine the two csv files, but it cannot deal with the conflicts – or, more precisely, it cannot choose the definite value from A.csv:
```python
import pandas as pd
import glob

def merge(csv_list, outputfile):
    for input_file in csv_list:
        f = open(input_file, 'r', encoding='utf-8')
        data = pd.read_csv(f, error_bad_lines=False)
        data.to_csv(outputfile, mode='a', index=False)
    print('Combine Completed')

def distinct(file):
    df = pd.read_csv(file, header=None)
    datalist = df.drop_duplicates()
    datalist.to_csv('result_new_month01.csv', index=False, header=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = glob.glob('*.csv')
    output_csv_path = 'result.csv'
    print(csv_list)
    merge(csv_list, output_csv_path)
    distinct(output_csv_path)
```
P.S. English is not my native language. Please excuse my syntax error.
Will put down my comments here as an answer:
The problem with your `merge` function is that you're reading each file and writing it out to the same result.csv in append mode without doing any check for duplicates. To check for duplicates, the rows need to be in the same dataframe. Also, your code combines every CSV file it finds, not necessarily just A.csv and B.csv. So when you say "I want to choose name from A.csv, and abandon name from B.csv", it looks like you really mean "keep the first occurrence".
Anyway, fix your `merge()` function to keep reading files into a list of dataframes – with A.csv being first. Then use @gold_cy's answer to concatenate the dataframes, keeping only the first occurrence of each ID.
Or, in your `distinct()` function, instead of `datalist = df.drop_duplicates()`, put `datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True)` – but this should be done after the read-loop: instead of writing out a CSV full of duplicates, first drop the dupes and then write out to csv.
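To see why `keep="first"` resolves the conflict in A.csv's favour, here's a minimal sketch on made-up dataframes standing in for A.csv and B.csv (`df_a` and `df_b` are illustrative names, not from the question):

```python
import pandas as pd

# Toy stand-ins for A.csv and B.csv; ID 12345 appears in both with different names.
df_a = pd.DataFrame({"ID": [12345, 67890], "name": ["Jack", "Alice"]})
df_b = pd.DataFrame({"ID": [12345, 11111], "name": ["Tom", "Bob"]})

# Concatenate with df_a first; keep="first" keeps df_a's row for any duplicate ID.
merged = (
    pd.concat([df_a, df_b])
    .drop_duplicates("ID", keep="first")
    .reset_index(drop=True)
)
print(merged)
# ID 12345 keeps name "Jack" from df_a; "Tom" from df_b is dropped.
```

Because `keep="first"` works purely on row order, the order in which you concatenate the frames is what decides which file "wins".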
So here’s the change using the first method:
```python
import pandas as pd
import glob

def merge_distinct(csv_list, outputfile):
    all_frames = []  # list of dataframes
    for input_file in csv_list:
        # skip your file-open line and pass those params to pd.read_csv
        # (on pandas >= 1.3, use on_bad_lines='skip' instead of error_bad_lines=False)
        data = pd.read_csv(input_file, encoding='utf-8', error_bad_lines=False)
        all_frames.append(data)  # append to list of dataframes
    print('Combine Completed')
    final = pd.concat(all_frames).drop_duplicates("ID", keep="first").reset_index(drop=True)
    final.to_csv(outputfile, index=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = sorted(glob.glob('*.csv'))  # sort the list
    output_csv_path = 'result.csv'
    print(csv_list)
    merge_distinct(csv_list, output_csv_path)
```
- Instead of doing `f = open(...)`, just pass those arguments to `pd.read_csv()`.
- Why are you writing out the final csv with `header=False`? That's useful to have.
- `glob.glob()` doesn't guarantee any ordering (it depends on the filesystem), so I've used `sorted()`, which sorts in ASCII/Unicode code-point order. File-system ordering is not the same as sorting in Python.
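To make that ordering concrete, here's a quick sketch with made-up filenames (they're illustrative, not from your directory):

```python
# sorted() orders strings by Unicode code point, so 'A.csv' precedes 'B.csv',
# which in turn precedes 'result.csv'.
files = sorted(["result.csv", "B.csv", "A.csv"])
print(files)  # ['A.csv', 'B.csv', 'result.csv']

# Note: code-point order puts ALL uppercase letters before lowercase ones,
# so 'A.csv' also sorts before 'b.csv' – worth knowing if your filenames mix case.
print(sorted(["b.csv", "A.csv"]))  # ['A.csv', 'b.csv']
```

This is what guarantees that A.csv's rows land first in the concatenated frame, so `keep="first"` prefers its values.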