Most effective way to transform a Spark SQL DataFrame into a list of POJOs

Assume you have the following Spark DataFrame extracted from Cassandra:

DataFrame df = cassandraSqlContext.sql(query);

with the following contents:

+-----------------+------+-----------------+-----------------------------------------------------+
|assetid          |tslice|deviceid         |value                                                |
+-----------------+------+-----------------+-----------------------------------------------------+
|085eb9c6-8a16-...|201509|085eb9c6-8a16-...|Map(xval -> 120000, type -> xsd:double, yval -> 53.0)|
|085eb9c6-8a16-...|201509|085eb9c6-8a16-...|Map(xval -> 120000, type -> xsd:double, yval -> 53.0)|
|085eb9c6-8a16-...|201509|085eb9c6-8a16-...|Map(xval -> 120000, type -> xsd:double, yval -> 53.0)|
    ...

I would like to transform this DataFrame into a list of Java beans structured as follows:

public class DataItem {
    private UUID assetID;
    private int tslice;
    private UUID deviceID;
    private Value value;

    // getters, setters...
}

and

public class Value {
    private double xval;
    private String type;
    private double yval;

    // getters, setters...
}

What is the best way to do that in Spark, both in terms of performance and conciseness?

Thanks!

Answer

If you only have access to the DataFrame and want to convert it to a list of POJOs, you should collect the DataFrame and iterate over the resulting list of org.apache.spark.sql.Row objects to populate the list of POJOs.
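For example, here is a minimal sketch of that approach. It assumes the column order shown above, that the UUID columns come back as strings, that the value column is a map of string entries, and that DataItem and Value expose conventional setters for the fields shown (all assumptions about your schema and beans):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.spark.sql.Row;

// Collect the DataFrame on the driver and map each Row to a DataItem.
List<Row> rows = df.collectAsList();
List<DataItem> items = new ArrayList<>(rows.size());
for (Row row : rows) {
    DataItem item = new DataItem();
    // Indexes follow the column order shown above: assetid, tslice, deviceid, value.
    item.setAssetID(UUID.fromString(row.getString(0)));
    item.setTslice(row.getInt(1));
    item.setDeviceID(UUID.fromString(row.getString(2)));

    // The value column is a map of strings; parse the numeric entries explicitly.
    Map<String, String> raw = row.getJavaMap(3);
    Value value = new Value();
    value.setXval(Double.parseDouble(raw.get("xval")));
    value.setType(raw.get("type"));
    value.setYval(Double.parseDouble(raw.get("yval")));
    item.setValue(value);

    items.add(item);
}

Note that collect brings the whole result set to the driver, so this only makes sense when the result comfortably fits in driver memory.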

Or

You can use the Spark Cassandra Connector, which provides methods to create a JavaRDD of your bean class directly; collecting that RDD gives you the list of POJOs.

Code:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.SparkContextJavaFunctions;

// Wrap the SparkContext with the connector's Java API.
SparkContextJavaFunctions functions = CassandraJavaUtil.javaFunctions(sparkContext);
// Map each Cassandra row to a DataItem bean (columns are matched to bean properties by name).
JavaRDD<DataItem> cassandraRowsRDD = functions.cassandraTable("keyspace", "table_name",
                           CassandraJavaUtil.mapRowTo(DataItem.class));
// The required list of POJOs.
List<DataItem> items = cassandraRowsRDD.collect();
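If you only need part of the table, the same connector API lets you push a projection and filter down to Cassandra before collecting. A sketch under the same assumed keyspace and table names (the where clause is illustrative and must target a column your schema can filter on, e.g. a clustering or indexed column):

List<DataItem> partial = functions
        .cassandraTable("keyspace", "table_name", CassandraJavaUtil.mapRowTo(DataItem.class))
        .select("assetid", "tslice", "deviceid", "value") // read only the needed columns
        .where("tslice = ?", 201509)                      // filter server-side, not on the driver
        .collect();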
