pyspark — best way to sum values in column of type Array(StringType()) after splitting

I have a dataframe like this:

name | scores
Dan  |  [1_10, 2_5, 3_2, 4_12.5]
Ann  |  [2_12.4, 3_4.5, 5_9.3]
Jon  |  [2_1.7]
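
For reference, a minimal way to reproduce this sample (assuming scores is array&lt;string&gt;):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('Dan', ['1_10', '2_5', '3_2', '4_12.5']),
     ('Ann', ['2_12.4', '3_4.5', '5_9.3']),
     ('Jon', ['2_1.7'])],
    ['name', 'scores'],  # scores is inferred as array<string>
)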

For each row, I want to extract the numeric value from every item in the scores column (split each string on the underscore and take index 1) and sum those values per row.

My expected output looks like this:

name | Total
Dan  |  29.5
Ann  |  26.2
Jon  |  1.7 

My dataframe is very large; in the worst case, the array column can contain millions of items. An explode-based solution is not working out for me because the dataframe becomes huge after the explode.

My driver is small and I can’t afford to run a UDF to solve this.

Can an RDD or map help here? If so, how do I use it efficiently? I'm running PySpark 2.3, by the way.

Answer

You can use the aggregate and transform higher-order functions (available in Spark 2.4+):

# Spark 2.4+: extract the numeric part of each element, then sum within the array
df2 = df.selectExpr('name', """
    aggregate(
        transform(scores, x -> double(split(x, '_')[1])),
        double(0),
        (acc, x) -> acc + x
    ) as Total
""")

For older Spark versions, such as your 2.3, try explode and group by:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'scores', F.explode('scores')  # one row per array element
).withColumn(
    'scores', F.split('scores', '_')[1].cast('double')  # keep the numeric part
).groupBy('name').agg(F.sum('scores').alias('Total'))
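
Keep in mind that the explode materializes one row per array element before the groupBy, which is exactly the size blow-up you described, so this is included mainly for completeness.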

An RDD solution:

from pyspark.sql import Row

# Row(**kwargs) sorts its fields alphabetically in Spark < 3.0, so reorder explicitly
df2 = df.rdd.map(
    lambda r: Row(name=r.name, Total=sum(float(i.split('_')[1]) for i in r.scores))
).toDF().select('name', 'Total')

df2.show()
+----+-----+
|name|Total|
+----+-----+
| Dan| 29.5|
| Ann| 26.2|
| Jon|  1.7|
+----+-----+