I have a dataframe like this,
name | scores
Dan  | [1_10, 2_5, 3_2, 4_12.5]
Ann  | [2_12.4, 3_4.5, 5_9.3]
Jon  | [2_1.7]
For each row, I want to extract the numeric value from every item in the scores column (the items are strings; split each on the underscore and take index 1) and sum them up per row.
My expected answer will look like this,
name | Total
Dan  | 29.5
Ann  | 26.2
Jon  | 1.7
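(For clarity, the per-row arithmetic I'm after is just this, shown in plain Python on the first row:)

```python
def row_total(scores):
    # Each item looks like "<id>_<value>": split on "_" and keep the value at index 1.
    return sum(float(s.split('_')[1]) for s in scores)

row_total(["1_10", "2_5", "3_2", "4_12.5"])  # 10 + 5 + 2 + 12.5 = 29.5
```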
My dataframe is huge; the array column can contain millions of items in the worst case. Explode-based solutions are not working out for me because of the size of the dataframe after the explode.
My driver is small, so I can't afford to run a UDF to solve this.
Can an RDD or map help here? If so, how do I use it efficiently? I'm running PySpark 2.3, btw.
You can use the aggregate and transform higher-order functions (available in Spark SQL from 2.4 onwards). Note that you need to take element [1] of each split and cast it to double before summing:

    df2 = df.selectExpr(
        'name',
        """
        aggregate(
            transform(scores, x -> double(split(x, '_')[1])),
            double(0),
            (acc, x) -> acc + x
        ) as Total
        """
    )
For older Spark versions, try explode and group by:

    import pyspark.sql.functions as F

    df2 = (
        df.withColumn('score', F.explode('scores'))
          .withColumn('score', F.split('score', '_').getItem(1).cast('double'))
          .groupBy('name')
          .agg(F.sum('score').alias('Total'))
    )
An RDD solution (note the [1] index after the split, to pick out the value):

    from pyspark.sql import Row

    df2 = df.rdd.map(
        lambda r: Row(name=r.name,
                      Total=sum(float(i.split('_')[1]) for i in r.scores))
    ).toDF()

    df2.show()
    +----+-----+
    |name|Total|
    +----+-----+
    | Dan| 29.5|
    | Ann| 26.2|
    | Jon|  1.7|
    +----+-----+
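If you want to sanity-check the mapped lambda without a Spark session, the same logic runs in plain Python (with Row access swapped for dict access here):

```python
rows = [
    {"name": "Dan", "scores": ["1_10", "2_5", "3_2", "4_12.5"]},
    {"name": "Ann", "scores": ["2_12.4", "3_4.5", "5_9.3"]},
    {"name": "Jon", "scores": ["2_1.7"]},
]

# Mirrors the r.name / r.scores access inside the RDD map.
totals = {r["name"]: sum(float(i.split('_')[1]) for i in r["scores"])
          for r in rows}
```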