Spark SQL is great at executing SQL, but sometimes you want to stay at the RDD level. That is where the DSL comes in. How exactly to use Spark's DSL can be confusing, and this recipe is an attempt to reduce that confusion.

Let’s get the pleasantries out of the way first, i.e., creating the SQLContext and doing the imports:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext._
scala> import org.apache.spark.sql.catalyst.plans._

We are going to use the age-old employee-department example here. The goal is to print information about each department and the employees in it. Department information should be printed even if a department has no employees, which is why we will need a left outer join.

Creating the Dept case class, creating a Scala List, converting it into an RDD of Dept objects, and then adding a symbol alias:

scala> case class Dept(dept_id:String,dept_name:String)
scala> val dept = sc.parallelize(List( ("DEPT01","Information Technology"), ("DEPT02","WHITE HOUSE"),("DEPT03","EX-PRESIDENTS OFFICE"),("DEPT04","SALES"))).map( d => Dept(d._1,d._2)).as('dept)
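If you want to sanity-check what we just built, the implicit conversion brought in by import sqlContext._ has already turned this RDD into a SchemaRDD, so you can inspect it right away (an optional check, not part of the recipe):

scala> // optional: look at the inferred schema and the data
scala> dept.printSchema()
scala> dept.collect().foreach(println)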

Creating the Emp case class, creating a Scala List, converting it into an RDD of Emp objects, and then adding a symbol alias:

scala> case class Emp(first_name:String,last_name:String,dept_id:String)
scala> val emp = sc.parallelize(List( ("Rishi","Yadav","DEPT01"),("Barack","Obama","DEPT02"),("Bill","Clinton","DEPT03"))).map( e => Emp(e._1,e._2,e._3)).as('emp)
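As an aside, the same DSL also lets you filter and project without writing any SQL. Assuming the symbol and literal implicits from import sqlContext._ are in scope (which is how this DSL is normally used), a quick sketch like the following should list the employees of DEPT01:

scala> // optional: filter and project with the DSL instead of SQL
scala> emp.where('dept_id === "DEPT01").select('first_name, 'last_name).foreach(println)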

Now doing the left outer join and selecting the columns we want to print:

scala> val alldepts = dept.join(emp,LeftOuter,Some("dept.dept_id".attr === "emp.dept_id".attr)).select("dept.dept_id".attr,'dept_name,'first_name,'last_name)
scala> alldepts.foreach(println)
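Since foreach runs on the workers, the output order is not guaranteed; collecting first brings everything back to the driver. Either way, each department should appear once, and DEPT04 (SALES), which has no employees, should still show up with nulls in the employee columns. That is the left outer join doing its job. The exact row formatting may differ, but the output should look roughly like this:

scala> alldepts.collect().foreach(println)
[DEPT01,Information Technology,Rishi,Yadav]
[DEPT02,WHITE HOUSE,Barack,Obama]
[DEPT03,EX-PRESIDENTS OFFICE,Bill,Clinton]
[DEPT04,SALES,null,null]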