I’m new to Python. I searched other questions here and didn’t find the exact situation I’m running into.
I need to be able to read in the contents of File-A and pull out the matching lines from File-B.
I know how to do this in PowerShell, but on large files it is very slow and I am trying to learn Python.
File-A contains just loan numbers – An 8 to 10 digit number per line File-A can contain 1 to thousands of lines
File-B can contain 1 to thousands of lines and has more data in it but each line will start with the same 8 to 10 digit loan number.
I need to read in File-A and find the matching line in File-B and write out those matching lines to a new file (all are text files)
Example contents of File-A – no spaces – 1 loan per line
272991 272992 272993
Example contents of File-B
272991~20210129~\Serv1LOC7675309867530 16618272991.pdf~0 272992~20210129~\Serv1LOC7675309867530 16618272992.pdf~0 272993~20210129~\Serv1LOC7675309867530 16618272993.pdf~0
Is someone able to assist, point me in the right direction or better yet, provide a working solution?
Here is what I have tried so far but all it does is create the new PulledLoans.txt file with nothing in it
import os # os.system('cls') os.chdir('C:\Temp\') print(os.getcwd()) # read file loanFile = 'Loans.txt' SourceFile = 'Orig.txt' NewFile = 'PulledLoans.txt' with open(loanFile, 'r') as content, open(SourceFile, 'r') as source: # print(content.readlines()) for loan in content: # print(loan, end='') if loan in source: print('found loan') with open(SourceFile) as dfile, open(loanFile) as ifile: lines = "n".join(set(dfile.read().splitlines()) & set(ifile.read().splitlines())) print(lines) with open(NewFile, 'w') as ofile: ofile.write(lines)
First, read everything from
fileB into a dictionary where the keys are the identifier, and the values are the entire line
file_b_data = dict() with open("fileB") as f_b: for line in f_b: line = line.strip() # Remove whitespace at start and end if not line: continue # If the line is blank, skip row = line.split("~") # Split by ~ identifier = row # First element is the identifier file_b_data[identifier] = line # Set the value of the dictionary
Next, read the lines from
fileA and get the matching values from your dictionary
with open("fileA") as f_a, open("outfile", "w") as f_w: for identifier in f_a: identifier = identifier.strip() if not identifier: continue if identifier in file_b_data: # Check that the identifier exists in previously read data out_line = file_b_data[identifier] + "n" # Get the value from the dict f_w.write(out_line) # Write it to output file
Alternatively, you could use the
pandas module to read all of
fileB into a dataframe, and then find the correct rows.
import pandas as pd file_b_data = pd.read_csv("fileB.txt", sep="~", names=["identifier", "date", "path", "something"], index_col=0)
Which gives us this dataframe:
identifier date path something 272991 20210129 \Serv1LOC7675309867530 16618272991.pdf 0 272992 20210129 \Serv1LOC7675309867530 16618272992.pdf 0 272993 20210129 \Serv1LOC7675309867530 16618272993.pdf 0
fileA: (I removed 272992 to illustrate that it actually works)
file_a_data = pd.read_csv("fileA.txt", names="identifier")
identifier 0 272991 1 272993
Then, look for these indices in
wanted_ids = file_a_data['identifiers'] wanted_rows = file_b_data.loc[wanted_ids, :] wanted_rows.to_csv("out_file.txt", sep="~",header=None)
which will write this file: (notice the
272992 row is missing because it wasn’t in
272991~20210129~\Serv1LOC7675309867530 16618272991.pdf~0 272993~20210129~\Serv1LOC7675309867530 16618272993.pdf~0