This recipe is inspired by http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema and all rights are owned by their respective owners.

In Spark SQL, the simplest way to create a SchemaRDD is by using a Scala case class. Spark uses reflection to figure out the fields and build the schema.
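For reference, here is a minimal sketch of that case-class approach (the Person class, its field names, and the file path are illustrative assumptions, not part of this recipe):

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext._  // brings in the implicit conversion from an RDD of case classes to a SchemaRDD
scala> case class Person(firstName: String, lastName: String, age: Int)
scala> val people = sc.textFile("person.txt").map(_.split(",")).map(p => Person(p(0), p(1), p(2).toInt))
scala> people.registerTempTable("people")
scala> sql("select * from people").foreach(println)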

There are several cases, however, where you would not want to take this route. One of them is the case class limitation that it can support only 22 fields.

Let’s look at an alternative approach: specifying the schema programmatically. Let’s take a simple comma-separated file, person.txt, which has three fields: firstName, lastName, and age.
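For illustration, the contents of person.txt could look something like the following (the names and ages are made-up sample values, not taken from the original recipe):

John,Doe,35
Jane,Doe,32
Sam,Smith,27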

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext._
scala> val person = sc.textFile("hdfs://localhost:9000/user/hduser/person")
scala> import org.apache.spark.sql._
scala> val schema = StructType(Array(StructField("firstName", StringType, true), StructField("lastName", StringType, true), StructField("age", IntegerType, true)))
scala> val rowRDD = person.map(_.split(",")).map(p => Row(p(0), p(1), p(2).toInt))
scala> val personSchemaRDD = sqlContext.applySchema(rowRDD, schema)
scala> personSchemaRDD.registerTempTable("person")
scala> sql("select * from person").foreach(println)
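As a side note, in Spark 1.3 and later SchemaRDD became DataFrame and applySchema was replaced by createDataFrame; a roughly equivalent sketch (assuming the same rowRDD and field layout as above) would be:

scala> import org.apache.spark.sql.types._
scala> val schema = StructType(Array(StructField("firstName", StringType, true), StructField("lastName", StringType, true), StructField("age", IntegerType, true)))
scala> val personDF = sqlContext.createDataFrame(rowRDD, schema)
scala> personDF.registerTempTable("person")
scala> sqlContext.sql("select * from person").show()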