Cannot resolve column name when pivoting in PySpark

Here’s my dataset

reloadmonthly

DataFrame[year: string, month: string, msisdn: string, reload_min: double, reload_max: double, reload_avg: double, reload_sum: double, rembal_min: string, rembal_max: string, rembal_avg: double, rembal_sum: double, period: string, application_type: string, periodloan: string, ix: string, last_x_month: double]

reloadmonthly.show(2)

+----+-----+-------------+----------+----------+----------+----------+----------+----------+----------+----------+------+----------------+----------+---+------------+
|year|month|       msisdn|reload_min|reload_max|reload_avg|reload_sum|rembal_min|rembal_max|rembal_avg|rembal_sum|period|application_type|periodloan| ix|last_x_month|
+----+-----+-------------+----------+----------+----------+----------+----------+----------+----------+----------+------+----------------+----------+---+------------+
|2019|   10| 628176789488|    5000.0|    5000.0|    5000.0|    5000.0|    5189.0|    5189.0|    5189.0|    5189.0|201910|              10|    202001|  1|         1.0|
|2019|   10|6281802031321|   25000.0|   25000.0|   25000.0|   25000.0|   25633.0|   25633.0|   25633.0|   25633.0|201910|             100|    202001|  1|         2.0|
+----+-----+-------------+----------+----------+----------+----------+----------+----------+----------+----------+------+----------------+----------+---+------------+
only showing top 2 rows

Here’s my code

reloadid = reloadmonthly.dropDuplicates(["msisdn"])
reloadid = reloadid.join(

    packetmonthly.withColumn("p", F.expr("concat('reload_sum_l', last_x_month)"))
    .groupBy("msisdn")
    .pivot("p")
    .sum("reload_sum"),

    on=["msisdn"],
    how="left_outer",
)

Here’s the error message

AnalysisException: 'Cannot resolve column name "reload_sum" among (year, month, msisdn, packet_min, packet_max, packet_avg, packet_sum, period, application_type, periodloan, ix, last_x_month, p);'

Answer

You are doing the pivot before the join, on the DataFrame you pass as the join's argument. So you are pivoting packetmonthly, which, according to the error message, has no reload_sum column (that column appears in reloadmonthly). I edited your code to highlight the part where the pivot happens inside the join.
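A quick way to confirm this before running the pivot is to compare the columns you need against the columns the DataFrame actually has. The sketch below is only illustrative: missing_columns is a hypothetical helper, not part of PySpark, and it works on a plain list of names such as the one returned by packetmonthly.columns. The list is copied from the error message, without the derived column p.

```python
def missing_columns(available, required):
    """Return the names in `required` that are not in `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Column names of packetmonthly, as listed in the AnalysisException above.
# With a real DataFrame you would pass packetmonthly.columns instead.
packet_cols = [
    "year", "month", "msisdn", "packet_min", "packet_max", "packet_avg",
    "packet_sum", "period", "application_type", "periodloan", "ix",
    "last_x_month",
]

# Columns the groupBy / pivot / sum chain in the question relies on.
needed = ["msisdn", "last_x_month", "reload_sum"]

print(missing_columns(packet_cols, needed))
# ['reload_sum']
```

An empty list would mean every column used by the groupBy, pivot and sum exists on that DataFrame.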

Maybe you just need to do the join before the pivot. I cannot really test this because you did not give the definition of packetmonthly, but the code should look like this:

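from pyspark.sql import functions as F

reloadid = (
    # Keep only the join key and the value to aggregate, so that the columns
    # both DataFrames share (year, month, last_x_month, ...) are not duplicated.
    reloadid.select("msisdn", "reload_sum")
    .join(
        packetmonthly,
        on=["msisdn"],
        how="left_outer",
    )
    # last_x_month now comes from packetmonthly only
    .withColumn("p", F.expr("concat('reload_sum_l', last_x_month)"))
    .groupBy("msisdn")
    .pivot("p")
    .sum("reload_sum")
)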
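
One thing to watch with a join on msisdn only: every other column name that exists in both DataFrames is kept twice in the result, and referring to such a column by name (for example last_x_month) can then fail as ambiguous. Here is a small sketch for listing those names. It assumes plain column lists like the ones returned by reloadid.columns and packetmonthly.columns, and shared_columns is a hypothetical helper, not a PySpark function.

```python
def shared_columns(left, right, keys):
    """Return non-key column names present on both sides, in `left` order."""
    right = set(right)
    keys = set(keys)
    return [name for name in left if name in right and name not in keys]


# Columns of reloadmonthly, copied from the schema in the question.
reload_cols = [
    "year", "month", "msisdn", "reload_min", "reload_max", "reload_avg",
    "reload_sum", "rembal_min", "rembal_max", "rembal_avg", "rembal_sum",
    "period", "application_type", "periodloan", "ix", "last_x_month",
]

# Columns of packetmonthly, copied from the error message (without "p").
packet_cols = [
    "year", "month", "msisdn", "packet_min", "packet_max", "packet_avg",
    "packet_sum", "period", "application_type", "periodloan", "ix",
    "last_x_month",
]

print(shared_columns(reload_cols, packet_cols, keys=["msisdn"]))
# ['year', 'month', 'period', 'application_type', 'periodloan', 'ix', 'last_x_month']
```

Selecting only the columns you need from one side before the join, or renaming the shared ones, avoids the clash.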