Is there a function that can choose data from a specified CSV file when there are conflicts while combining two CSV files?

I have two CSV files, and I want to combine them into one. Assume the two files are A.csv and B.csv; I already know there are some conflicts between them. For example, both have two columns, ID and name: in A.csv, ID “12345” has name “Jack”, while in B.csv, ID “12345” has name “Tom”. So there is a conflict: the same ID has different names. I want to keep ID “12345” with the name from A.csv, and abandon the name from B.csv. How can I do that?

Here is some code I have tried. It can combine the two CSV files, but it cannot deal with the conflicts; more precisely, it cannot choose a definite value from A.csv:

import pandas as pd
import glob

def merge(csv_list, outputfile):
    for input_file in csv_list:
        f = open(input_file, 'r', encoding='utf-8')
        data = pd.read_csv(f, error_bad_lines=False)
        data.to_csv(outputfile, mode='a', index=False)
    print('Combine Completed')

def distinct(file):
    df = pd.read_csv(file, header=None)
    datalist = df.drop_duplicates()
    datalist.to_csv('result_new_month01.csv', index = False, header = False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = glob.glob('*.csv')
    output_csv_path = 'result.csv'
    print(csv_list)
    merge(csv_list, output_csv_path)
    distinct(output_csv_path)

P.S. English is not my native language. Please excuse my syntax error.

Answer

Will put down my comments here as an answer:

The problem with your merge function is that you’re reading one file at a time and appending it to the same result.csv without doing any check for duplicates. In order to check for duplicates, the rows need to be in the same dataframe. Also, from your code, you’re combining every CSV file in the directory, not necessarily just A.csv and B.csv. So when you say “I want to choose name from A.csv, and abandon name from B.csv”, it looks like what you really mean is “keep the first occurrence”.

Anyway, fix your merge() function to read the files into a list of dataframes – with A.csv being first. Then use @gold_cy’s answer to concatenate the dataframes, keeping only the first occurrence of each ID.

Or, in your distinct() function, instead of datalist = df.drop_duplicates(), use datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True). Better yet, do this after the read-loop: instead of writing out a CSV full of duplicates, first drop the dupes and then write out the result to CSV.
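Here’s a tiny sketch of what keep="first" does, using made-up frames that match your example:

```python
import pandas as pd

# stand-ins for A.csv (read first) and B.csv
a = pd.DataFrame({"ID": ["12345"], "name": ["Jack"]})
b = pd.DataFrame({"ID": ["12345"], "name": ["Tom"]})

combined = pd.concat([a, b])
# keep="first" keeps the row from whichever frame came first in the concat
deduped = combined.drop_duplicates("ID", keep="first").reset_index(drop=True)
print(deduped)
# ID 12345 keeps "Jack"; the conflicting row from B.csv is dropped
```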

So here’s the change using the first method:

import pandas as pd
import glob

def merge_distinct(csv_list, outputfile):
    all_frames = []  # list of dataframes
    for input_file in csv_list:
        # skip your file-open line and pass those params to pd.read_csv
        # (on_bad_lines='skip' replaces error_bad_lines=False, removed in pandas 2.0)
        data = pd.read_csv(input_file, encoding='utf-8', on_bad_lines='skip')
        all_frames.append(data)  # append to list of dataframes
    print('Combine Completed')
    
    final = pd.concat(all_frames).drop_duplicates("ID", keep="first").reset_index(drop=True)
    final.to_csv(outputfile, index=False, header=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = sorted(glob.glob('*.csv'))  # sort the list
    output_csv_path = 'result.csv'
    print(csv_list)
    merge_distinct(csv_list, output_csv_path)

Notes:

  • Instead of doing f = open(...), just pass those arguments to pd.read_csv().
  • Why are you writing out the final CSV with header=False? The header is useful to have.
  • glob.glob() doesn’t guarantee any ordering (it depends on the filesystem), so I’ve used sorted() above.
  • Note that filesystem ordering can differ from sorted(), which sorts by ASCII/Unicode code point – that’s what puts A.csv before B.csv here.
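To check the behaviour end to end, here’s a small self-contained run (the A.csv/B.csv contents are made up to match your example, written to a temp directory so nothing on disk is touched):

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # write out two conflicting files: ID 12345 is "Jack" in A, "Tom" in B
    pd.DataFrame({"ID": ["12345", "67890"], "name": ["Jack", "Alice"]}).to_csv(
        os.path.join(tmp, "A.csv"), index=False)
    pd.DataFrame({"ID": ["12345", "11111"], "name": ["Tom", "Bob"]}).to_csv(
        os.path.join(tmp, "B.csv"), index=False)

    csv_list = sorted(glob.glob(os.path.join(tmp, "*.csv")))  # A.csv sorts first
    frames = [pd.read_csv(f) for f in csv_list]
    final = pd.concat(frames).drop_duplicates("ID", keep="first").reset_index(drop=True)

print(final)
# ID 12345 gets "Jack" from A.csv; the non-conflicting IDs from both files survive
```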