How to cast all columns of Spark Dataset to String in Java without withColumn?

I’ve tried the solution using withColumn specified here:

How to cast all columns of Spark dataset to string using Java

But that solution takes a severe performance hit for a huge number of columns (1k–6k): each withColumn call adds another projection to the query plan, and the job runs for more than 6 hours before getting aborted.

Alternatively, I’m trying to use map to cast as below, but I get an error:

MapFunction<Column, Column> mapFunction = (c) -> {
    return c.cast("string");
};      

dataset = dataset.map(mapFunction, Encoders.bean(Column.class));

Error with the above snippet:

The method map(Function1<Row,U>, Encoder<U>) in the type Dataset<Row> is not applicable for the arguments (MapFunction<Column,Column>, Encoder<Column>)

The error occurs because Dataset<Row>.map transforms one Row at a time into a value of type U, so it expects a MapFunction<Row, U>; a MapFunction<Column, Column> (paired with Encoders.bean(Column.class)) cannot satisfy that signature. Casting is a column-level operation, so it has to be expressed through select rather than a row-level map.

Import used:

import org.apache.spark.api.java.function.MapFunction;
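For completeness: map can only be made to type-check by operating on whole rows, which means hand-converting every field and supplying a Row encoder for an all-string schema. A rough sketch of that idea (the stringify helper and the commented Spark wiring are illustrative, not working answer code; note that plain toString() does not match Spark's CAST semantics for every type):

```java
import java.util.Arrays;

public class RowToString {
    // Convert every field value of a row to its String form (null stays null).
    // Caveat: toString() is close to, but not identical to, Spark's CAST
    // behavior for some types (dates, decimals, binary).
    static Object[] stringify(Object[] values) {
        Object[] out = new Object[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] == null) ? null : values[i].toString();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(stringify(new Object[]{1, null, 2.5})));
    }

    // Hypothetical Spark wiring (sketch only, not compiled here): build an
    // all-string schema, then map each Row through the helper with a
    // matching row encoder, e.g.:
    //
    //   MapFunction<Row, Row> toStrings = row -> {
    //       Object[] vals = new Object[row.length()];
    //       for (int i = 0; i < row.length(); i++) vals[i] = row.get(i);
    //       return RowFactory.create(stringify(vals));
    //   };
    //   dataset = dataset.map(toStrings, RowEncoder.apply(stringSchema));
}
```

Even when wired up correctly, this deserializes and reserializes every row, so the select-based answer below is the better fit for wide datasets.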

Answer

Found the below solution for anyone looking for this: cast every column in a single select, so the plan contains one projection instead of thousands of chained withColumn steps:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
import scala.collection.JavaConversions;

String[] strColNameArray = dataset.columns();
List<Column> colNames = new ArrayList<>();
for (String strColName : strColNameArray) {
    colNames.add(new Column(strColName).cast("string"));
}
dataset = dataset.select(JavaConversions.asScalaBuffer(colNames));
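If the Scala interop is undesirable (JavaConversions is deprecated in newer Scala versions), the same single-projection plan can be built with selectExpr. A sketch, with CastAllColumns and castExprs as illustrative names:

```java
import java.util.Arrays;

public class CastAllColumns {
    // Build one "CAST(`col` AS STRING) AS `col`" expression per column.
    // Backticks guard names containing spaces or dots; the alias keeps
    // the original column name instead of "CAST(col AS STRING)".
    static String[] castExprs(String[] columnNames) {
        return Arrays.stream(columnNames)
                .map(c -> "CAST(`" + c + "` AS STRING) AS `" + c + "`")
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // With a real Dataset<Row> this would be applied as:
        //   dataset = dataset.selectExpr(castExprs(dataset.columns()));
        for (String e : castExprs(new String[]{"id", "amount"})) {
            System.out.println(e);
        }
    }
}
```

Alternatively, Dataset.select has a Column... varargs overload usable directly from Java, so dataset.select(colNames.toArray(new Column[0])) also avoids the JavaConversions bridge.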
