drop duplicates with nested data (graph)

I have the following mapping table:

Sample data:

import pandas as pd
from numpy import nan
d = {'start': {0: 4, 1: 3, 2: 2, 3: 1, 4: 12, 5: 11, 6: 23, 7: 22, 8: 21}, 'name': {0: 'Vitamin',  1: 'Vitamin D',  2: 'Vitamin D3',  3: 'Colecalciferol',  4: 'Vitamin D2',  5: 'Ergocalcifero',  6: 'Vitamin K',  7: 'Vitamin K2',  8: 'Menachinon'}, 'end': {0: nan,  1: 4.0,  2: 3.0,  3: 2.0,  4: 3.0,  5: 12.0,  6: 4.0,  7: 23.0,  8: 22.0}}
df = pd.DataFrame(d)

l1 = ['Colecalciferol', 'Vitamin D']
l2 = ['Colecalciferol', 'Ergocalcifero', 'Vitamin D3']

Expected output:

l1 = ['Colecalciferol']
l2 = ['Colecalciferol', 'Ergocalcifero']

What I tried:

import networkx as nx
G = nx.Graph()
G = nx.from_pandas_edgelist(df, 'start', 'end', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source=1).reverse()

print(list(T))
# [1, 2.0, 3.0, 4.0, nan]

This essentially shows the successors of a term (here start 1, ‘Colecalciferol’), but I think I actually need the ancestors of a term, not the successors.


Goal:

  • I want to remove duplicates, including duplicates across higher- and lower-level terms. For example, ‘Colecalciferol’ is a ‘Vitamin D3’, which is a ‘Vitamin D’.

  • Therefore, in example l1, I want to remove ‘Vitamin D’ and keep only the lowest-level term, so that its more specific information is preserved.

Answer

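You were pretty close! Here’s a way to make your graph approach work. Because each edge runs from a term (start) to its parent (end), a node’s predecessors are the more specific terms below it. So we simply check whether the node has any predecessor: if it does, it isn’t a lowest-level term and we don’t keep it.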

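import networkx as nx

# Edges run from each term (start) to its parent term (end).
G = nx.from_pandas_edgelist(df, 'start', 'end', create_using=nx.DiGraph())

filtered_l1 = []
for elmt in l1:
    # Look up the node id of the term; .iloc[0] avoids calling int() on a Series.
    node = int(df.loc[df['name'] == elmt, 'start'].iloc[0])
    # No incoming edges means no lower-level term points to this node.
    if G.in_degree(node) == 0:
        filtered_l1.append(elmt)
print(filtered_l1)
# ['Colecalciferol']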

The for loop above can be condensed into a one-liner:

filtered_l1 = [x for x in l1 if G.in_degree(int(df.loc[df['name'] == x, 'start'].iloc[0])) == 0]

A simpler approach that drops the networkx dependency entirely is to check whether a term’s start appears as the end of any other row; if it does, the term isn’t bottom-level and gets filtered out:

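# Every value in 'end' is the id of a term that has at least one lower-level term.
all_ends = set(df['end'].dropna())
filtered_l1 = [x for x in l1 if int(df.loc[df['name'] == x, 'start'].iloc[0]) not in all_ends]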
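
Note that both approaches drop every term that has a lower-level term anywhere in the table, so a list containing only ‘Vitamin D’ would come back empty. If the intent is instead to drop a term only when one of its lower-level terms is also in the same list, something along the following lines might work. This is a minimal, dependency-free sketch: the row tuples simply restate the sample data, the use of None for the missing end is an assumption, and the helper names (ancestors, drop_covered_terms) are made up for illustration.

```python
# Rows of the mapping table as (start, name, end); end=None marks the top-level term.
rows = [
    (4, 'Vitamin', None), (3, 'Vitamin D', 4), (2, 'Vitamin D3', 3),
    (1, 'Colecalciferol', 2), (12, 'Vitamin D2', 3), (11, 'Ergocalcifero', 12),
    (23, 'Vitamin K', 4), (22, 'Vitamin K2', 23), (21, 'Menachinon', 22),
]
start_of = {name: start for start, name, _ in rows}   # name -> start id
parent_of = {start: end for start, _, end in rows}    # start id -> parent id


def ancestors(start):
    """Return the ids of all higher-level terms above `start`."""
    found = set()
    parent = parent_of.get(start)
    # Walk up the parent chain; the `not in found` check guards against cycles.
    while parent is not None and parent not in found:
        found.add(parent)
        parent = parent_of.get(parent)
    return found


def drop_covered_terms(names):
    """Drop a term only if a lower-level term of it is also in `names`."""
    covered = set()
    for name in names:
        covered |= ancestors(start_of[name])
    return [name for name in names if start_of[name] not in covered]


print(drop_covered_terms(['Colecalciferol', 'Vitamin D']))
print(drop_covered_terms(['Colecalciferol', 'Ergocalcifero', 'Vitamin D3']))
print(drop_covered_terms(['Vitamin D']))
```

With the sample data this gives the expected output for l1 and l2, while a list holding only ‘Vitamin D’ is left unchanged because none of its lower-level terms are present. Starting from the DataFrame, the two dictionaries could presumably be built from df['name'], df['start'] and df['end'] instead of the hard-coded rows.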