Spark SQL does not support arithmetic on dates and timestamps, so things like duration become tough to calculate. That said, in Spark everything is ultimately an RDD, and that is a hidden weapon which can always be used when higher-level functionality is limited.

Let’s take a case where we are getting two dates in String format from either a text file or a Parquet file.

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext._
scala> sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
scala> val myfile = sqlContext.parquetFile("datafolder")
scala> myfile.registerTempTable("mytable")
scala> val resultsRDD = sql("select old_date, new_date from mytable")
scala> resultsRDD.foreach(println)
[2014-01-01 10:04:00,2014-11-12 01:56:48]

Now you obviously cannot subtract strings. But what about timestamps? Let’s cast the strings to timestamps and try the subtraction in SQL:

scala> sql("select cast(new_date as timestamp) - cast(old_date as timestamp) from mytable").foreach(println)

Unfortunately, that results in an error:

java.lang.RuntimeException: Type TimestampType does not support numeric operations
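The limitation is only in Spark SQL’s expression layer; plain java.sql.Timestamp on the JVM handles these values just fine. A quick check in the REPL (getTime returns epoch milliseconds as a Long):

scala> import java.sql.Timestamp
scala> Timestamp.valueOf("2014-11-12 01:56:48").getTime - Timestamp.valueOf("2014-01-01 10:04:00").getTime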

So what’s the solution? Let’s summon the power of the RDD: map over the rows and do the arithmetic ourselves.

scala> import java.sql.Timestamp
scala> val durationRDD = resultsRDD.map(e => Timestamp.valueOf(e.getString(1)).getTime - Timestamp.valueOf(e.getString(0)).getTime)
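Note that durationRDD holds plain milliseconds, since getTime returns epoch milliseconds. If you want something friendlier, say whole days, one more map does it (a small sketch; divide by whatever unit you need):

scala> val durationInDays = durationRDD.map(_ / (1000L * 60 * 60 * 24))
scala> durationInDays.foreach(println)

For the sample row above, that works out to 314 full days.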

If your query returns more columns, keep them alongside the duration in a tuple and calculate the duration from just the two date fields.

It would look something like this:

scala> import java.sql.Timestamp
scala> val durationRDD = resultsRDD.map(e => (Timestamp.valueOf(e.getString(1)).getTime - Timestamp.valueOf(e.getString(0)).getTime, e.getInt(2), e.getLong(3)))
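If you want to get back to SQL from here, one option (a sketch; the case class name and fields are made up for illustration) is to wrap the tuples in a case class and register the result as a table again, relying on the implicit RDD-to-SchemaRDD conversion pulled in by the import sqlContext._ we did earlier:

scala> case class Duration(millis: Long, count: Int, total: Long)
scala> val durations = durationRDD.map(d => Duration(d._1, d._2, d._3))
scala> durations.registerTempTable("durations")
scala> sql("select avg(millis) from durations").foreach(println)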

If you have any feedback, please send to bigdata@infoobjects.com.

 
