SqlContext and SchemaRDD in Spark 1.2

Please note that SchemaRDD in Spark 1.2 has been replaced by DataFrames in Spark 1.3. SQLContext can be used to load underlying data in JSON and Parquet format, like:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import...
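
A minimal sketch of how this looks in the Spark 1.2 shell, assuming JSON and Parquet files at the paths shown (the file names are illustrative):

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val people = sqlContext.jsonFile("people.json")    // load JSON into a SchemaRDD
scala> people.registerTempTable("people")                 // expose it to SQL
scala> sqlContext.sql("SELECT name FROM people").collect()
scala> val logs = sqlContext.parquetFile("logs.parquet")  // Parquet loads the same way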
Spark with Avro

This recipe is based on the Databricks spark-avro library; all rights to this library are owned by Databricks. Download the spark-avro library:
$ wget https://github.com/databricks/spark-avro/archive/master.zip
$ unzip master.zip
$ cd spark-avro-master
$ sbt...
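
Once built, the library can be used from the shell. A sketch under the assumption that sbt produced the jar named below and that an Avro file episodes.avro exists (both names illustrative); avroFile is the helper spark-avro added to SQLContext in the SchemaRDD era:

$ bin/spark-shell --jars spark-avro_2.10-0.1.jar

scala> import com.databricks.spark.avro._
scala> val episodes = sqlContext.avroFile("episodes.avro")  // SchemaRDD with the schema read from the Avro file
scala> episodes.registerTempTable("episodes")
scala> sqlContext.sql("SELECT * FROM episodes LIMIT 5").collect()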
Spark SQL: Calculating Duration – Timeformat to Date

Spark SQL does not support a date type, so things like duration become tough to calculate. That said, in Spark everything is an RDD, and that is a hidden weapon which can always be used when higher-level functionality is limited. Let's take a case where we are...
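
A minimal sketch of that drop-to-RDD approach, assuming rows carrying start and end times as strings in a hypothetical yyyy-MM-dd HH:mm:ss format:

import java.text.SimpleDateFormat

val events = sc.parallelize(Seq(
  ("job1", "2014-12-01 10:00:00", "2014-12-01 10:45:30")))

val durations = events.map { case (id, start, end) =>
  // Parse on the workers; SimpleDateFormat is not thread-safe, so build it per record
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val seconds = (fmt.parse(end).getTime - fmt.parse(start).getTime) / 1000
  (id, seconds)  // plain RDD code computes what the SQL layer cannot
}

durations.collect()  // Array((job1,2730))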
RDD, SchemaRDD, MySQL and Joins

An RDD, being the unit of computation in Spark, is capable of all the complex operations which are traditionally done using complex queries in databases. It's sometimes difficult to find the exact steps to perform these operations, so this blog is an attempt in that...
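
As a flavour of what the post covers, here is a sketch that pulls a MySQL table in through JdbcRDD and joins it with another pair RDD; the connection string, table, and file names are illustrative, and the MySQL JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Pull (user_id, name) rows out of MySQL; JdbcRDD needs two '?' bound markers in the query.
val users = new JdbcRDD(sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass"),
  "SELECT user_id, name FROM users WHERE user_id >= ? AND user_id <= ?",
  1, 100000, 4,  // lower bound, upper bound, number of partitions
  rs => (rs.getInt(1), rs.getString(2)))

// A second dataset keyed on the same user_id, e.g. page views per user.
val views = sc.textFile("views.csv").map(_.split(",")).map(a => (a(0).toInt, a(1)))

// A plain RDD join replaces what would be a SQL join inside the database.
val joined = users.join(views)  // RDD[(Int, (String, String))]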
Demystifying Compression

Compression has an important role to play in Big Data technologies: it makes both storage and transport of data more efficient. Why, then, are there so many compression formats, and what do we have to balance when deciding which compression format...
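
One place this balance shows up directly in Spark is when writing output. A sketch saving the same RDD with two codecs that sit at opposite ends of the trade-off (paths are illustrative):

import org.apache.hadoop.io.compress.{BZip2Codec, GzipCodec}

val data = sc.textFile("input.txt")

// Gzip compresses fast and well, but a .gz file cannot be split across tasks on read.
data.saveAsTextFile("out-gzip", classOf[GzipCodec])

// Bzip2 is slower and CPU-hungry, but splittable, so large files still parallelize on read.
data.saveAsTextFile("out-bzip2", classOf[BZip2Codec])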