PySpark: Concatenate columns that share a name prefix, in sorted column order

I am trying to concatenate string columns whose names start with the same prefix. The code below does this, but it joins the values in the DataFrame's column order rather than in sorted column-name order. How can I sort the columns by name before concatenating them?

DataFrame:

import pyspark.sql.functions as f

df = spark.createDataFrame([
    ("kd", "gr", "hd", "ae", "nw"),
    ("zj", "sd", "mw", "op", "le"),
    ("ct", "wm", "kr", "vs", "qz"),],
    ("main", "main1", "main3", "main2", "main4")
)

+----+-----+-----+-----+-----+
|main|main1|main3|main2|main4|
+----+-----+-----+-----+-----+
|  kd|   gr|   hd|   ae|   nw|
|  zj|   sd|   mw|   op|   le|
|  ct|   wm|   kr|   vs|   qz|
+----+-----+-----+-----+-----+

Expected result:

+----+-----+-----+-----+-----+--------------+
|main|main1|main3|main2|main4|        result|
+----+-----+-----+-----+-----+--------------+
|  kd|   gr|   hd|   ae|   nw|kd_gr_ae_hd_nw|
|  zj|   sd|   mw|   op|   le|zj_sd_op_mw_le|
|  ct|   wm|   kr|   vs|   qz|ct_wm_vs_kr_qz|
+----+-----+-----+-----+-----+--------------+

Current code and output:

df = df.withColumn('result', f.concat_ws(
    '_', *[c for c in df.columns if c.startswith("main")]))

df.show()

+----+-----+-----+-----+-----+--------------+
|main|main1|main3|main2|main4|        result|
+----+-----+-----+-----+-----+--------------+
|  kd|   gr|   hd|   ae|   nw|kd_gr_hd_ae_nw|
|  zj|   sd|   mw|   op|   le|zj_sd_mw_op_le|
|  ct|   wm|   kr|   vs|   qz|ct_wm_kr_vs_qz|
+----+-----+-----+-----+-----+--------------+

Answer

You can sort the column names before concatenating them. Python's sorted() orders the names lexicographically, which places main2 before main3 here.

The code would be as follows:

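# Collect the columns whose names start with "main"
col_names = [c for c in df.columns if c.startswith("main")]

# Sort the names so the values are joined in column-name order
sorted_names = sorted(col_names)

df = df.withColumn('result', f.concat_ws('_', *sorted_names))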
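
One caveat: sorted() compares names as plain strings, so a column such as main10 would sort before main2. The question's data has no such column, so this is only a sketch for that hypothetical case. The helper natural_key and the sample column list below are illustrative assumptions, not part of the original code.

```python
import re

def natural_key(name):
    """Hypothetical helper: split a name into text and number chunks.

    re.split with a capturing group returns text at even positions and
    digit runs at odd positions, so the digit runs can be compared as ints.
    """
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Assumed column list, standing in for df.columns
columns = ["main", "main1", "main10", "main3", "main2", "other"]
col_names = [c for c in columns if c.startswith("main")]

plain_sorted = sorted(col_names)                      # string order
natural_sorted = sorted(col_names, key=natural_key)   # numeric-aware order

print(plain_sorted)
print(natural_sorted)
```

The naturally sorted list can then be passed to f.concat_ws('_', *natural_sorted) in the same way as sorted_names above.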
