I am trying to concatenate string columns. The code below does this, but it concatenates the columns in their existing order rather than sorted by column name. How can I sort the columns by name and then concatenate them?
Dataframe
import pyspark.sql.functions as f

df = spark.createDataFrame([
    ("kd", "gr", "hd", "ae", "nw"),
    ("zj", "sd", "mw", "op", "le"),
    ("ct", "wm", "kr", "vs", "qz"),],
    ("main", "main1", "main3", "main2", "main4")
)

+----+-----+-----+-----+-----+
|main|main1|main3|main2|main4|
+----+-----+-----+-----+-----+
|  kd|   gr|   hd|   ae|   nw|
|  zj|   sd|   mw|   op|   le|
|  ct|   wm|   kr|   vs|   qz|
+----+-----+-----+-----+-----+
Expected result:
+----+-----+-----+-----+-----+--------------+
|main|main1|main3|main2|main4|        result|
+----+-----+-----+-----+-----+--------------+
|  kd|   gr|   hd|   ae|   nw|kd_gr_ae_hd_nw|
|  zj|   sd|   mw|   op|   le|zj_sd_op_mw_le|
|  ct|   wm|   kr|   vs|   qz|ct_wm_vs_kr_qz|
+----+-----+-----+-----+-----+--------------+
Current code and output:
df = df.withColumn('result', f.concat_ws(
    '_', *[c for c in df.columns if c.startswith("main")]))
df.show()

+----+-----+-----+-----+-----+--------------+
|main|main1|main3|main2|main4|        result|
+----+-----+-----+-----+-----+--------------+
|  kd|   gr|   hd|   ae|   nw|kd_gr_hd_ae_nw|
|  zj|   sd|   mw|   op|   le|zj_sd_mw_op_le|
|  ct|   wm|   kr|   vs|   qz|ct_wm_kr_vs_qz|
+----+-----+-----+-----+-----+--------------+
Answer
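You can sort the column names before concatenating them. Python's built-in sorted() orders the names lexicographically, which gives main, main1, main2, main3, main4 for this DataFrame.
The code would be as follows: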
col_names = [c for c in df.columns if c.startswith("main")]
sorted_names = sorted(col_names)
df = df.withColumn('result', f.concat_ws('_', *sorted_names))
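One caveat worth noting: sorted() compares strings character by character, so if the DataFrame ever had a column such as main10 (hypothetical, not in the question), it would sort before main2. A minimal sketch of a numeric-aware sort key follows, using only plain Python so it can be tried without a Spark session. The helper name natural_key and the sample column list are assumptions for illustration.

```python
import re

def natural_key(name):
    """Split a name into text and integer parts so 'main10' sorts after 'main2'."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

# Hypothetical column list; in the question this would be df.columns.
columns = ["main", "main1", "main3", "main2", "main10", "main4"]

col_names = [c for c in columns if c.startswith("main")]

plain_sorted = sorted(col_names)                      # lexicographic order
natural_sorted = sorted(col_names, key=natural_key)   # numeric-aware order

print(plain_sorted)
print(natural_sorted)
```

The resulting natural_sorted list can be passed to f.concat_ws('_', *natural_sorted) in the same way as sorted_names above. For the five columns in the question, both orderings give the same result.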