How can I iterate until all entries are in a given column?

I am trying to apply a while statement to my code in order to run it until all the elements in the lists below (in the column Check) are in column Source.

My code so far is:

while set_condition: # placeholder: this is the condition I need to set
     newCol = pd.Series(list(set(df['Check']) - set(df['Source']))) # elements of Check that are not yet in the column Source
     newList1 = newCol.apply(my_function) # this function generates the lists in Check, which is why I need a while statement
     df = pd.concat([df, pd.DataFrame({'Source': newCol, 'Check': newList1})], ignore_index=True) # append the results as new rows
     df = df.explode('Check')

I will give you an example of the process and of how my_function works: let’s say that I have my initial dataset

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]

After exploding Check column and appending the results to Source, I will have

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]
dog     [] # this will be filled in after applying the function
cat     [] # this will be filled in after applying the function
bird    [] # this will be filled in after applying the function

Every element in the lists should be added to the Source column before applying the function. When I apply the function, I populate the lists of the new elements; so, for example, I can have

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]
dog     [mouse, fish]  # they are filled in
cat     [mouse]
bird    [elephant, penguin]
fish    [dog]

Since fish and penguin are not in Source, I will need to run the code again in order to get the expected output (all the elements in the lists are already in the Source column):

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]
dog     [mouse, fish] 
cat     [mouse]
bird    [elephant, penguin]
fish    [dog]
penguin [bird]

As both dog and bird are already in Source, I will not need to apply the function again: all the lists are populated with elements that are already in the Source column, so the code can stop running.

What I would like to do is stop the loop once all the elements in the lists are in the column Source and the function has been applied to populate all the lists.

Thank you for all the help you will provide.

Answer

If you are repeating the loop until there are no more rows to add to the DataFrame, that is the same as saying that all of the elements of df['Check'] are found in df['Source']. You have to calculate that every loop anyway, so why not use it to break out of the loop?

df = df.explode('Check') # one element per row, so the values are hashable. NOTE: I will use this to my advantage in the next suggested solution
while True: # loop forever!
     diff = set(df['Check'].dropna()) - set(df['Source']) # dropna: exploding an empty list leaves NaN
     if len(diff) == 0:
         break # done!
     newCol = pd.Series(list(diff))
     newList1 = newCol.apply(my_function)
     new_rows = pd.DataFrame({'Source': newCol, 'Check': newList1}).explode('Check')
     df = pd.concat([df, new_rows], ignore_index=True) # DataFrame.append was removed in pandas 2.0

Because every append copies the whole DataFrame, growing it inside the loop is expensive in both time and memory. You might want to consider building the columns as plain lists first, then building the DataFrame all at once outside of the loop. df['Check'] is going to end up exploded anyway, so start by exploding and build onto those lists:

df = df.explode('Check')
check = df['Check'].tolist()       # Append to this list as we iterate
source = df['Source'].tolist()     # Append to this list as we iterate
unique_source = set(source)
diff = set(df['Check'].dropna()) - unique_source  # Iterate until this is empty (NaN comes from empty lists)
while len(diff) > 0:
    new_source = list(diff)    # fix an order so sources and their lists stay aligned
    new_check = [my_function(x) for x in new_source] # a list of lists
    check.extend(new_check)    # Add the lists as-is, but explode later
    source.extend(new_source)  # Keep track of the new sources for the DataFrame...
    unique_source.update(diff) # and keep track of the unique sources for efficiency
    flat_check = set(x for sublist in new_check for x in sublist)
    diff = flat_check - unique_source  # We only have to check the new elements!

df = pd.DataFrame({"Check": check, "Source": source}).explode("Check") # build the entire DataFrame at once
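To see this approach run end to end, here is a minimal sketch on the example data from the question. The real my_function is not shown in the question, so the LOOKUP dictionary and the stand-in my_function below are assumptions made purely for illustration, and expand_sources is a hypothetical wrapper name.

```python
import pandas as pd

# Hypothetical stand-in for the asker's my_function: a plain dictionary lookup.
LOOKUP = {
    "dog": ["mouse", "fish"],
    "cat": ["mouse"],
    "bird": ["elephant", "penguin"],
    "fish": ["dog"],
    "penguin": ["bird"],
}


def my_function(name):
    """Return the Check list for a source (assumed behaviour)."""
    return LOOKUP.get(name, [])


def expand_sources(df):
    """Add sources until every Check element also appears in Source."""
    exploded = df.explode("Check")
    check = exploded["Check"].tolist()
    source = exploded["Source"].tolist()
    unique_source = set(source)
    # dropna: exploding an empty list leaves NaN, which is not a real element
    diff = set(exploded["Check"].dropna()) - unique_source
    while diff:
        new_source = sorted(diff)  # fixed order keeps sources and lists aligned
        new_check = [my_function(x) for x in new_source]
        check.extend(new_check)
        source.extend(new_source)
        unique_source.update(new_source)
        diff = {x for sub in new_check for x in sub} - unique_source
    out = pd.DataFrame({"Source": source, "Check": check})
    return out.explode("Check", ignore_index=True)


df = pd.DataFrame({
    "Source": ["mouse", "horse", "tiger", "elephant"],
    "Check": [["dog", "horse", "cat"], ["mouse", "elephant"], [], ["horse", "bird"]],
})
result = expand_sources(df)
print(result)
```

With this stand-in, the loop runs twice: first for bird, cat and dog, then for fish and penguin. It stops once nothing new turns up.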

There are a lot of ways you can play with this structure to get the structure of the DataFrame you want. For instance, if you don’t want to explode df['Check'], just keep around the original version of df at the beginning of this example and append the new data to that:

new_df = df.explode('Check')
unique_source = set(new_df['Source'])
diff = set(new_df['Check'].dropna()) - unique_source # dropna: empty lists explode to NaN
source = [] # append to empty lists
check = []  # append to empty lists
while len(diff) > 0:
    # ...

df = pd.concat([df, pd.DataFrame({"Check": check, "Source": source})], ignore_index=True) # keep the unexploded columns
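The loop body is elided above. One possible way to complete it, mirroring the earlier loop, is sketched below on the question's example data. As before, LOOKUP and the stand-in my_function are assumptions for illustration only, not the asker's real function.

```python
import pandas as pd

# Hypothetical stand-in for the asker's my_function: a plain dictionary lookup.
LOOKUP = {
    "dog": ["mouse", "fish"],
    "cat": ["mouse"],
    "bird": ["elephant", "penguin"],
    "fish": ["dog"],
    "penguin": ["bird"],
}


def my_function(name):
    """Return the Check list for a source (assumed behaviour)."""
    return LOOKUP.get(name, [])


df = pd.DataFrame({
    "Source": ["mouse", "horse", "tiger", "elephant"],
    "Check": [["dog", "horse", "cat"], ["mouse", "elephant"], [], ["horse", "bird"]],
})

# The exploded copy is only used to find out which elements are missing.
new_df = df.explode("Check")
unique_source = set(new_df["Source"])
diff = set(new_df["Check"].dropna()) - unique_source
source = []  # new sources only
check = []   # their (unexploded) lists
while len(diff) > 0:
    new_source = sorted(diff)  # fixed order keeps sources and lists aligned
    new_check = [my_function(x) for x in new_source]
    source.extend(new_source)
    check.extend(new_check)
    unique_source.update(new_source)
    diff = {x for sub in new_check for x in sub} - unique_source

# The original rows keep their lists; the new rows are added underneath.
new_rows = pd.DataFrame({"Source": source, "Check": check})
result = pd.concat([df, new_rows], ignore_index=True)
print(result)
```

Each row of the result still holds a list in Check, which matches the layout of the expected output in the question.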