Spark Recipes
The power of Spark as a Big Data compute platform does not need any explanation (at least not in a cookbook). This cookbook contains recipes to solve common day-to-day needs in Spark. If you have any feedback about these recipes, please send it to bigdata@infoobjects.com.
Kafka Connect: Connecting JDBC Source Using MySQL
Notice: Confluent Platform is the trademark and property of Confluent Inc. Kafka 0.9.0 comes with Kafka Connect. Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to quickly define...
Spark Streaming: Window vs Batch
Spark Streaming is a micro-batch-based streaming library. What that means is that streaming data is divided into batches based on a time slice called the batch interval. Every batch gets converted into an RDD, and this continuous stream of RDDs is represented as a DStream....
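As a minimal sketch of the difference (not the recipe's full code), the snippet below builds a DStream with a 2-second batch interval and then counts the same stream over a 10-second window sliding every 4 seconds; the socket source on localhost:9999 and the durations are assumptions for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowVsBatch").setMaster("local[2]")
// Batch interval: every 2 seconds the data received so far becomes one RDD of the DStream
val ssc = new StreamingContext(conf, Seconds(2))

// Hypothetical source: lines arriving on a local socket
val lines = ssc.socketTextStream("localhost", 9999)

// Per-batch count: one value per 2-second batch
val batchCounts = lines.count()

// Windowed count: each result looks back over the last 10 seconds, sliding every 4 seconds
val windowCounts = lines.window(Seconds(10), Seconds(4)).count()

batchCounts.print()
windowCounts.print()

ssc.start()
ssc.awaitTermination()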
Spark MLlib: Dense vs Sparse Vectors
Let's use a house as a vector with the following features: square footage, last sold price, lot size, number of rooms, number of bathrooms, year built, zip code, tennis court, pool, jacuzzi, sports court. Now let's put in some values: Square Footage => 4450, Last Sold Price =>...
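To make the dense/sparse distinction concrete, here is a small sketch with MLlib vectors; the numeric values below are placeholders (the post's actual numbers are in the full recipe), and the yes/no amenities are encoded as 1.0/0.0.

import org.apache.spark.mllib.linalg.Vectors

// Dense vector: every position is stored, including the zeros
// (sq ft, price, lot size, rooms, baths, year, zip, tennis court, pool, jacuzzi, sports court)
val houseDense = Vectors.dense(4450.0, 2200000.0, 10000.0, 8.0, 4.0, 2004.0, 95070.0, 0.0, 1.0, 1.0, 0.0)

// Sparse vector: only the non-zero positions (indices) and their values are stored
val houseSparse = Vectors.sparse(
  11,
  Array(0, 1, 2, 3, 4, 5, 6, 8, 9),
  Array(4450.0, 2200000.0, 10000.0, 8.0, 4.0, 2004.0, 95070.0, 1.0, 1.0))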
Different Parts of Hive
Hive has five parts: HiveQL as the language, Hive as a command-line tool which you can call the client, the Hive Metastore (port 9083), the Hive warehouse, and HiveServer2 (port 10000). HiveQL: HiveQL is a query language similar to SQL but supports a limited set of SQL functions. It also adds a few...
Different Ways of Setting AWS Credentials in Spark
To connect to S3 from Spark you need two environment variables for security credentials: AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY. There are three ways to set them up. 1. In the .bashrc file, add the following two lines at the end. Replace these dummy values with the...
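Besides environment variables, one hedged illustration is setting the credentials programmatically on the SparkContext's Hadoop configuration; the s3n property names are the classic ones, and the key values and bucket below are dummies.

// Dummy credentials; replace with your own (never commit real keys)
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AKIAIOSFODNN7EXAMPLE")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")

// Once set, S3 paths can be read like any other Hadoop path
val logs = sc.textFile("s3n://my-bucket/path/to/logs")
logs.count()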
Spark: Creating Machine Learning Pipelines Using Spark ML
Spark ML is a new library which uses DataFrames as its first-class citizen, as opposed to RDDs. Another feature of Spark ML is that it helps combine multiple machine learning algorithms into a single pipeline. You can read more about DataFrames here. Spark ML uses...
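As a hedged sketch of what a pipeline looks like (the classic tokenizer / hashing TF / logistic regression chain from the Spark ML docs, not necessarily the exact one used in this recipe), assuming a DataFrame with "label" and "text" columns:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy training data: a label plus free text
val training = sqlContext.createDataFrame(Seq(
  (1.0, "spark is fast"),
  (0.0, "hadoop map reduce"))).toDF("label", "text")

// Each stage's output column feeds the next stage's input column
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The pipeline chains all three stages into a single estimator
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)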
Spark Submit With SBT
This recipe assumes sbt is installed and that you have already gone over the MySQL with Spark recipe. I am a big fan of the Spark shell. The biggest proof is the Spark Cookbook, which has all recipes in the form of a collection of single commands on the Spark shell. It makes it easy to...
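For orientation, a minimal build.sbt sketch for packaging an application to launch with spark-submit; the project name, Scala version, and Spark version are illustrative for the era of this recipe, and "provided" keeps the Spark jars out of the package because spark-submit supplies them at runtime.

// build.sbt (minimal sketch; names and versions are illustrative)
name := "spark-recipes"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

After sbt package, the jar produced under target/scala-2.10/ is what you hand to spark-submit along with your main class.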
Spark: Accumulators
Accumulators are Spark's answer to MapReduce counters, but they do much more than that. Let's start with a simple example which uses an accumulator for a simple count. scala> val ac = sc.accumulator(0) scala> val words =...
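A hedged completion of that count example, assuming a hypothetical input file words.txt; the accumulator is incremented on the executors and read back on the driver.

val ac = sc.accumulator(0)
val words = sc.textFile("words.txt").flatMap(_.split("\\s+"))

// Increment the accumulator as a side effect of an action
words.foreach(word => ac += 1)

// The merged value is available on the driver once the action has run
println("total words: " + ac.value)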
Spark: Calculating Correlation Using RDD of Vectors
Correlation is a relationship between two variables such that when one changes, the other changes as well. The correlation coefficient measures how strong that relationship is on a scale from -1 to 1: 0 means there is no correlation at all, while 1 means perfect positive correlation, i.e., if the first variable doubles, the second...
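A minimal sketch of computing the correlation matrix over an RDD of vectors with MLlib's Statistics.corr; the two-column toy data (size, price) is made up for illustration.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Each vector is one observation: (size, price)
val data = sc.parallelize(Seq(
  Vectors.dense(1500.0, 300000.0),
  Vectors.dense(2000.0, 400000.0),
  Vectors.dense(2500.0, 500000.0),
  Vectors.dense(3000.0, 600000.0)))

// Pearson correlation matrix across the columns of the RDD[Vector]
val corrMatrix = Statistics.corr(data, "pearson")
println(corrMatrix)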
Spark: Connecting to Amazon EC2
The Apache Spark installation comes bundled with the spark-ec2 script, which makes it easy to create Spark instances on EC2. This recipe covers connecting to EC2 using this script. Log in to your Amazon AWS account. Click on Security Credentials under your account name in...
Spark: Connecting to a JDBC Data Source Using DataFrames
So far in Spark, JdbcRDD has been the right way to connect to a relational data source. From Spark 1.4 onwards there is a built-in data source available to connect to a JDBC source using DataFrames. DataFrames: Spark introduced DataFrames in version 1.3 and enriched...
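A hedged sketch in the Spark 1.4 reader style; the MySQL URL, table name, and credentials are placeholders, and the MySQL connector jar is assumed to be on the classpath.

import java.util.Properties

val url = "jdbc:mysql://localhost:3306/retail"
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Load a table as a DataFrame through the built-in JDBC data source
val person = sqlContext.read.jdbc(url, "person", props)
person.printSchema()
person.show()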
Journey of Schema Into Big Data World
Schemas have a long history in the software world. Most of their popularity should be attributed to relational databases like Oracle. In the RDBMS world, schemas are your best friend: you create a schema, do operations using the schema, and hardly care about the underlying storage...
Spark: JDBC Using DataFrames
From Spark 1.3 onwards, JdbcRDD is not recommended, as DataFrames have support for loading data over JDBC. Let us look at a simple example in this recipe. Using JdbcRDD with Spark is slightly confusing, so I thought about putting together a simple use case to explain the functionality. Most...
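In the Spark 1.3 style, the same thing is done through the generic load() with the "jdbc" source; the connection details below are placeholders.

val person = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/retail?user=root&password=secret",
  "dbtable" -> "person",
  "driver" -> "com.mysql.jdbc.Driver"))

// From here it behaves like any other DataFrame
person.registerTempTable("person")
sqlContext.sql("SELECT * FROM person").show()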
Spark: DataFrames And Parquet
This recipe works with Spark 1.3 onward. Apache Parquet as a file format has garnered significant attention recently. Let's say you have a table with 100 columns; most of the time you are going to access only 3-10 of them. In a row-oriented format, all columns are scanned...
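A short sketch of round-tripping a DataFrame through Parquet using the Spark 1.4 reader/writer API (in 1.3 the equivalents are jsonFile and saveAsParquetFile); the file paths and the "name" column are placeholders.

// Load some JSON, write it out as columnar Parquet
val people = sqlContext.read.json("people.json")
people.write.parquet("people.parquet")

// Reading back: a query touching one column only scans that column's data
val fromParquet = sqlContext.read.parquet("people.parquet")
fromParquet.select("name").show()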
DataFrames With Apache Spark
Apache Spark 1.3 introduced a new concept called DataFrames, and it was developed further in version 1.4. DataFrames also complement Project Tungsten, a new initiative to improve CPU performance in Spark. Motivation behind DataFrames: Spark so far has had a...
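As a quick taste of the API, a hedged spark-shell sketch creating a DataFrame from a case class and filtering it; the data is made up.

case class Person(name: String, age: Int)

// In spark-shell; the implicits provide toDF on RDDs of case classes
import sqlContext.implicits._

val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25))).toDF()

people.printSchema()
people.filter(people("age") > 26).show()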
Spark SQL: Left Outer Join With DSL
Spark SQL is great at executing SQL, but sometimes you want to stay at the RDD level. That is where the DSL comes in. How exactly to use the DSL with Spark can be confusing, and this recipe is an attempt to reduce that confusion. Let's get the pleasantries done first, i.e.,...
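A hedged sketch using the DataFrame flavor of the DSL (Spark 1.3+); the recipe itself may use the earlier SchemaRDD DSL, and the customer/order data below is made up.

import sqlContext.implicits._

// Not every customer has placed an order, which is exactly what a left outer join preserves
val customers = Seq((1, "Alice"), (2, "Bob"), (3, "Carol")).toDF("cust_id", "name")
val orders = Seq((101, 1, 250.0), (102, 1, 75.0), (103, 2, 30.0)).toDF("order_id", "cust_id", "amount")

// Keeps every customer row, with nulls on the order side where no match exists
val joined = customers.join(orders, customers("cust_id") === orders("cust_id"), "left_outer")
joined.show()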
Spark SQL: SchemaRDD: Programmatically Specifying Schema
This recipe is inspired by http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema and all rights are owned by their respective owners. In Spark SQL, the best way to create a SchemaRDD is by using a Scala case class. Spark...
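Sketched here with the Spark 1.3+ createDataFrame API (in the SchemaRDD era the equivalent call is applySchema); the input file people.txt with "name,age" lines is an assumption.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build the schema as data instead of relying on a case class
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", StringType, nullable = true)))

val rowRDD = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD of Rows and query it with SQL
val people = sqlContext.createDataFrame(rowRDD, schema)
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").show()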
SqlContext and SchemaRDD in Spark 1.2
Please note that SchemaRDD in Spark 1.2 has been replaced by DataFrames in Spark 1.3. SqlContext can be used to load underlying data in JSON and Parquet format like: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> import sqlContext._...
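For example, with the 1.2-era API (file paths are placeholders), JSON and Parquet both load into SchemaRDDs that can be registered and queried:

val people = sqlContext.jsonFile("people.json")
val sales = sqlContext.parquetFile("sales.parquet")

people.registerTempTable("people")
sqlContext.sql("SELECT * FROM people").collect().foreach(println)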
Spark with Avro
This recipe is based on the Databricks spark-avro library and all rights to this library are owned by Databricks. Download the spark-avro library: $ wget https://github.com/databricks/spark-avro/archive/master.zip $ unzip master.zip $ cd spark-avro-master $ sbt...
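Once the jar is built, a hedged sketch of reading and writing Avro from the Spark shell (1.4-style reader/writer, with the spark-avro jar on the classpath, e.g. via --jars); episodes.avro is a placeholder path to an existing Avro file.

import com.databricks.spark.avro._

val episodes = sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
episodes.printSchema()

// Write the DataFrame back out as Avro
episodes.write.format("com.databricks.spark.avro").save("episodes_copy")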
Spark SQL: Calculating Duration – Timeformat to Date
Spark SQL does not support a date type, so things like duration become tough to calculate. That said, in Spark everything is an RDD, and that's a hidden weapon which can always be used when higher-level functionality is limited. Let's take a case where we are getting two...
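A hedged RDD-level sketch of that idea, assuming each line carries an id plus start and end timestamps in yyyy-MM-dd HH:mm:ss format; the file name and layout are made up.

import java.text.SimpleDateFormat

val fmt = "yyyy-MM-dd HH:mm:ss"

// e.g. "event42,2015-03-01 10:15:00,2015-03-01 11:45:00"
val durations = sc.textFile("events.csv").map { line =>
  val Array(id, start, end) = line.split(",")
  // SimpleDateFormat is not thread-safe, so build it inside the task
  val sdf = new SimpleDateFormat(fmt)
  val minutes = (sdf.parse(end).getTime - sdf.parse(start).getTime) / 1000 / 60
  (id, minutes)
}

durations.collect().foreach(println)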
RDD, SchemaRDD, MySQL and Joins
The RDD, being the unit of compute in Spark, is capable of doing all the complex operations which are traditionally done using complex queries in databases. It's sometimes difficult to find the exact steps to perform these operations, so this blog is an attempt in that direction...
Demystifying Compression
Compression has an important role to play in Big Data technologies. It makes both storage and transport of data more efficient. Then why are there so many compression formats, and what do we have to balance when deciding which compression format...
Spark SQL: Parquet
This recipe works with Spark 1.2. For the Spark 1.3 version, please click here. Apache Parquet, as a file format, has garnered significant attention recently. Let's say you have a table with 100 columns; most of the time you are going to access only 3-10 of them. In a row...
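Since this recipe targets Spark 1.2, a sketch with the 1.2-era SchemaRDD API; the case class, file paths, and column layout are illustrative.

case class Person(firstName: String, lastName: String, age: Int)

// Implicit conversion from an RDD of case classes to a SchemaRDD
import sqlContext.createSchemaRDD

val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1), p(2).trim.toInt))

// Save as columnar Parquet, then load it back and query it
people.saveAsParquetFile("people.parquet")
val loaded = sqlContext.parquetFile("people.parquet")
loaded.registerTempTable("people")
sqlContext.sql("SELECT firstName FROM people").collect().foreach(println)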
Spark Core: sc.textFile vs sc.wholeTextFiles
While loading an RDD from source data, there are two choices which look similar: scala> val movies = sc.textFile("movies") scala> val movies = sc.wholeTextFiles("movies") sc.textFile: SparkContext's textFile method, i.e., sc.textFile in the Spark shell, creates an RDD...
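The practical difference shows up in the element type and granularity; a small sketch assuming a "movies" directory of text files:

// textFile: one element per line, and a single file may be split across partitions
val lines = sc.textFile("movies")              // RDD[String]

// wholeTextFiles: one element per file, as a (path, entire content) pair
val files = sc.wholeTextFiles("movies")        // RDD[(String, String)]

println(lines.count())   // total number of lines across all files
println(files.count())   // number of files in the directory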
Spark SQL: JdbcRDD
Using JdbcRDD with Spark is slightly confusing, so I thought about putting together a simple use case to explain the functionality. Most probably you'll use it with spark-submit, but I have put it here in spark-shell to make it easier to illustrate. Database Preparation: We are going to...
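For orientation before the full recipe, a hedged JdbcRDD sketch against a hypothetical MySQL person table; the connection details, query, and bounds are placeholders, and the query must keep the two '?' placeholders that JdbcRDD uses to partition the key range.

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val personRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/retail", "root", "secret"),
  "SELECT first_name, last_name FROM person WHERE person_id >= ? AND person_id <= ?",
  1, 1000, 3,                                   // lower bound, upper bound, number of partitions
  r => (r.getString("first_name"), r.getString("last_name")))

personRDD.collect().foreach(println)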
Spark SQL: SqlContext vs HiveContext
There are two ways to create a context in Spark SQL. SqlContext: scala> import org.apache.spark.sql._ scala> val sqlContext = new SQLContext(sc) HiveContext: scala> import org.apache.spark.sql.hive._ scala> val hc = new HiveContext(sc) Though most of the...
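As one small illustration of why you might pick HiveContext, it can run HiveQL against tables registered in the Hive metastore; the table name below is a placeholder.

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT COUNT(*) FROM my_hive_table").collect().foreach(println)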