How to extract text in one file based from text in another file using Python

I’m new to Python. I searched other questions here and didn’t find the exact situation I’m running into.

I need to be able to read in the contents of File-A and pull out the matching lines from File-B.

I know how to do this in PowerShell, but on large files it is very slow and I am trying to learn Python.

File-A contains just loan numbers – An 8 to 10 digit number per line File-A can contain 1 to thousands of lines

File-B can contain 1 to thousands of lines and has more data in it but each line will start with the same 8 to 10 digit loan number.

I need to read in File-A and find the matching line in File-B and write out those matching lines to a new file (all are text files)

Example contents of File-A – no spaces – 1 loan per line

272991
272992
272993

Example contents of File-B

272991~20210129~\Serv1LOC767530986753016618272991.pdf~0
272992~20210129~\Serv1LOC767530986753016618272992.pdf~0
272993~20210129~\Serv1LOC767530986753016618272993.pdf~0

Is someone able to assist, point me in the right direction or better yet, provide a working solution?

Here is what I have tried so far but all it does is create the new PulledLoans.txt file with nothing in it

import os
# os.system('cls')
os.chdir('C:\Temp\')
print(os.getcwd())
# read file
loanFile = 'Loans.txt'
SourceFile = 'Orig.txt'
NewFile = 'PulledLoans.txt'

with open(loanFile, 'r') as content, open(SourceFile, 'r') as source:
    # print(content.readlines())
    for loan in content:
        # print(loan, end='')
        if loan in source:
            print('found loan')

with open(SourceFile) as dfile, open(loanFile) as ifile:
    lines = "n".join(set(dfile.read().splitlines()) & set(ifile.read().splitlines()))
    print(lines)
    
with open(NewFile, 'w') as ofile:
    ofile.write(lines)

Answer

First, read everything from fileB into a dictionary where the keys are the identifier, and the values are the entire line

file_b_data = dict()

with open("fileB") as f_b:
    for line in f_b:
        line = line.strip() # Remove whitespace at start and end
        if not line:
            continue # If the line is blank, skip

        row = line.split("~") # Split by ~
        identifier = row[0]   # First element is the identifier
        file_b_data[identifier] = line # Set the value of the dictionary

Next, read the lines from fileA and get the matching values from your dictionary

with open("fileA") as f_a, open("outfile", "w") as f_w:
    for identifier in f_a:
        identifier = identifier.strip()
        if not identifier:
            continue
        if identifier in file_b_data: # Check that the identifier exists in previously read data
            out_line = file_b_data[identifier] + "n" # Get the value from the dict
            f_w.write(out_line) # Write it to output file

Alternatively, you could use the pandas module to read all of fielA and fileB into a dataframe, and then find the correct rows.

import pandas as pd

file_b_data = pd.read_csv("fileB.txt", sep="~", names=["identifier", "date", "path", "something"], index_col=0)

Which gives us this dataframe:

identifier date     path                                         something
272991     20210129 \Serv1LOC767530986753016618272991.pdf 0
272992     20210129 \Serv1LOC767530986753016618272992.pdf 0
272993     20210129 \Serv1LOC767530986753016618272993.pdf 0

Same for fileA: (I removed 272992 to illustrate that it actually works)

file_a_data = pd.read_csv("fileA.txt", names="identifier")

giving us

   identifier
0      272991
1      272993

Then, look for these indices in file_b_data:

wanted_ids = file_a_data['identifiers']
wanted_rows = file_b_data.loc[wanted_ids, :]
wanted_rows.to_csv("out_file.txt", sep="~",header=None)

which will write this file: (notice the 272992 row is missing because it wasn’t in fileA)

272991~20210129~\Serv1LOC767530986753016618272991.pdf~0
272993~20210129~\Serv1LOC767530986753016618272993.pdf~0