Get parent of nested array in recursive loop

I am trying to loop through PDF bookmarks and save them in an object, including their parent bookmark.

Thanks to PyPDF4 i can get those bookmarks as arrays within arrays.

For example:

[]
 0: Root 1
 1: []
     0: Child layer 1.1
     1: Child layer 1.2
     2: []
         0: Child layer 1.2.1
         1: Child layer 1.2.2
         1: Child layer 1.2.3
 2: Root 2

Since these arrays vary from file to file, I dont know how this array is structured. So I ended up choosing a recursive function to save them.

Function

def __iterate_through_bookmarks(outlines, bookmarks, layer = 0, parent = None):
  layer += 1
  print(layer)

  if isinstance(outlines, list): 
    for item in outlines:
      __iterate_through_bookmarks(item, bookmarks, layer, parent)
    return bookmarks
    

  print(outlines.title)
  bookmarks.append(bookmark(outlines, parent))

I experimented a bit, but i could not get it right. But seeing how the layer counter in the function gets the layers right gives me the hope, that this is possible.

Layer output:

1
2
Root 1
2
3
Child layer 1.1
3
Child layer 1.2
3
4
Child layer 1.2.1
4
Child layer 1.2.2
4
Child layer 1.2.3
2
Root 2

The child layers 1.x have the same layer counter because of how the recursive function works. However, i cannot find a solution for how to save their (identical) parent in an object.

The end goal is to return an array with bookmark objects. A class that stores title, position and parent of the outlines.

Does anyone have a suggestion?

Answer

With an explicit ‘children’ field

Recursively browse a nested list of dicts, where each dict has a “name” and a “children” field.

nested_list = [
  {'name': 'act IV',
   'children': [{'name': 'scene 4',
                 'children': [{'name': 'dual seduction',
                               'children': []},
                             ]}
               ]},
  {'name': 'act I',
   'children': [{'name': 'scene 2',
                 'children': [{'name': 'inconstancy praise',
                               'children': []},
                              {'name': 'sganarelle monologue',
                               'children': []}
                             ]}
               ]},
]

def get_list_of_bookmarks(nested_list, parent_name=None):
  bookmark_list = []
  for bookmark in nested_list:
    bookmark_list.append({'name': bookmark['name'], 'parent': parent_name})
    bookmark_list.extend(get_list_of_bookmarks(bookmark['children'], parent_name=bookmark['name']))
  return bookmark_list

print(get_list_of_bookmarks(nested_list))
# [{'name': 'act IV', 'parent': None},
#  {'name': 'scene 4', 'parent': 'act IV'},
#  {'name': 'dual seduction', 'parent': 'scene 4'},
#  {'name': 'act I', 'parent': None},
#  {'name': 'scene 2', 'parent': 'act I'},
#  {'name': 'inconstancy praise', 'parent': 'scene 2'},
#  {'name': 'sganarelle monologue', 'parent': 'scene 2'}]

Or alternatively, if you want to store a reference to the parent, rather than just the parent’s name:

def get_list_of_bookmarks(nested_list, parent=None):
  bookmark_list = []
  for bookmark_with_children in nested_list:
    bookmark_with_parent = {'name': bookmark_with_children['name'], 'parent': parent}
    bookmark_list.append(bookmark_with_parent)
    bookmark_list.extend(get_list_of_bookmarks(bookmark_with_children['children'], parent=bookmark_with_parent))
  return bookmark_list

print(get_list_of_bookmarks(nested_list))
# [{'name': 'act IV', 'parent': None},
# {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}},
# {'name': 'dual seduction', 'parent': {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}}},
# {'name': 'act I', 'parent': None},
# {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}},
# {'name': 'inconstancy praise', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}},
# {'name': 'sganarelle monologue', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}}]

Note that the output of print looks like it’s very redundant and full of copies, but it’s not. Every bookmark contains a reference to its parent; it does not contain of copy of its parent.

With no explicit ‘children’ field: order is relevant

Now assuming you don’t have an explicit 'children' field, and your nested list is simply a list of lists; sublists are considered children of the previous element.

nested_list = [
  'act IV',
  ['scene 4', ['dual seduction']],
  'act I',
  ['scene 2', ['inconstancy praise', 'sganarelle monologue']],
]

We can draw inspiration from the related question: Flatten an irregular list of lists.

from collections.abc import Iterable

def flatten_with_depth(l, depth=0):
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten_with_depth(el, depth=depth+1)
        else:
            yield (el, depth)
# flatten_with_depth() adapted from flatten() at https://stackoverflow.com/a/2158532/3080723

print(list(flatten_with_depth(nested_list)))
# [('act IV', 0), ('scene 4', 1), ('dual seduction', 2), ('act I', 0), ('scene 2', 1), ('inconstancy praise', 2), ('sganarelle monologue', 2)]

Now the list of bookmarks is flattened, and with each bookmark we have its depths. The parent of a given bookmark at depth depth is the closest previous bookmark at depth depth-1. An efficient way to easily find the parents is to maintain a stack of the ancestors in the current branch.

def get_parents_knowing_depths(list_with_depths):
  ancestor_stack = []
  result = []
  for (bookmark_name, depth) in list_with_depths:
    if depth == 0:
      bookmark = {'name': bookmark_name, 'parent': None}
      ancestor_stack = [bookmark]
    else:
      while len(ancestor_stack) > depth:
        ancestor_stack.pop()
      bookmark = {'name': bookmark_name, 'parent': ancestor_stack[-1]}
      ancestor_stack.append(bookmark)
    result.append(bookmark)
  return result

print(get_parents_knowing_depths(flatten_with_depth(nested_list)))
# [{'name': 'act IV', 'parent': None},
#  {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}},
#  {'name': 'dual seduction', 'parent': {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}}},
#  {'name': 'act I', 'parent': None},
#  {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}},
#  {'name': 'inconstancy praise', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}},
#  {'name': 'sganarelle monologue', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}}]