PySpark > DataFrame with multiple array columns into multiple rows with one value each

We have a PySpark DataFrame with several columns containing arrays with multiple values. Our goal is to turn each value in these arrays into its own row, while keeping the other columns. So, starting with something like this:

data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
]

That is:

+---+------+------+
|id |list_a|list_b|
+---+------+------+
|A  |[a, c]|[1, 5]|
|B  |[a, b]|null  |
|C  |[]    |[1]   |
+---+------+------+

We would like to end up having:

+---+-----+-----+
|id |col_a|col_b|
+---+-----+-----+
|A  |a    |null |
|A  |c    |null |
|A  |null |1    |
|A  |null |5    |
|B  |a    |null |
|B  |b    |null |
|C  |null |1    |
+---+-----+-----+

We are thinking about several approaches:

  1. prefixing each value with a column indicator, merging all the arrays into a single one, exploding it, and then reorganizing the values into separate columns (see the sketch after this list)
  2. splitting the dataframe into several, one per array column, exploding that array column in each, and then concatenating the dataframes
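
For concreteness, here is a rough sketch of what we mean by approach 1 (a sketch only, assuming Spark 3.1+ for transform, a df built from the data above, and an active SparkSession named spark; the col_a/col_b names are just for illustration):

from pyspark.sql import functions as F

df = spark.createDataFrame(data, ["id", "list_a", "list_b"])

# tag each value with its source column, treating null arrays as empty
tagged = df.select(
    "id",
    F.explode(
        F.concat(
            F.coalesce(F.transform("list_a", lambda x: F.concat(F.lit("a:"), x)),
                       F.array().cast("array<string>")),
            F.coalesce(F.transform("list_b", lambda x: F.concat(F.lit("b:"), x)),
                       F.array().cast("array<string>")),
        )
    ).alias("tagged"),
)

# route each tagged value back into its own column
result = tagged.select(
    "id",
    F.when(F.col("tagged").startswith("a:"), F.expr("substring(tagged, 3)")).alias("col_a"),
    F.when(F.col("tagged").startswith("b:"), F.expr("substring(tagged, 3)")).alias("col_b"),
)

It produces the desired shape, but the tag-and-untag round trip is exactly the kind of complexity we would like to avoid.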

But all of these smell like dirty, complex, error-prone, and inefficient workarounds.

Does anyone have an idea about how to solve this in an elegant manner?

Answer

In case both columns list_a and list_b could be null, I would add a 4th case to the dataset:

data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
    ("D", None, None),
]
# assuming an active SparkSession available as spark
df = spark.createDataFrame(data, ["id", "list_a", "list_b"])

I would then split the original df into 3 (both lists null, list_a exploded, and list_b exploded) and then merge them back with unionByName:

from pyspark.sql.functions import col, explode_outer, lit

# rows where both arrays are null
dfnulls = (
    df.filter(col("list_a").isNull() & col("list_b").isNull())
    .withColumn("list_a", lit(None))
    .withColumn("list_b", lit(None))
)

# one row per value of list_a
df1 = (
    df.withColumn("list_a", explode_outer(col("list_a")))
    .withColumn("list_b", lit(None))
    .filter(col("list_a").isNotNull())
)

# one row per value of list_b
df2 = (
    df.withColumn("list_b", explode_outer(col("list_b")))
    .withColumn("list_a", lit(None))
    .filter(col("list_b").isNotNull())
)

merged_df = df1.unionByName(df2).unionByName(dfnulls)

merged_df.show()

+---+------+------+
| id|list_a|list_b|
+---+------+------+
|  A|     a|  null|
|  A|     c|  null|
|  B|     a|  null|
|  B|     b|  null|
|  A|  null|     1|
|  A|  null|     5|
|  C|  null|     1|
|  D|  null|  null|
+---+------+------+
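
Note that a union does not guarantee any particular row order; if you want the rows grouped by id as in the desired output, add an explicit sort, e.g.:

merged_df.orderBy("id").show()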