Spark Recipes

The power of Spark as a Big Data compute platform does not need any explanation (at least not in a cookbook). This cookbook contains recipes to solve common day-to-day needs in Spark. If you have any feedback about these recipes, please send it to bigdata@infoobjects.com.

HBase Client in a Kerberos Enabled Cluster

Use case: the HBase servers are in a Kerberos-enabled cluster, and the HBase servers (Masters and RegionServers) are configured to use authentication to connect to ZooKeeper. Assumption: HBase plus a secured ZooKeeper. This Java code snippet can be used to connect to HBase configured...
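For reference, here is a minimal sketch (in Scala, against the same HBase client API) of what such a login-plus-connect typically looks like; the quorum, principal, keytab path and table name are placeholders, not values from the recipe:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.security.UserGroupInformation

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")          // placeholder quorum
conf.set("hadoop.security.authentication", "kerberos")
conf.set("hbase.security.authentication", "kerberos")

// Log in from the keytab before opening the HBase connection.
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("hbaseuser@EXAMPLE.COM", "/etc/security/keytabs/hbaseuser.keytab")

val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("my_table"))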

AWS: CIDR blocks explained

One part of working with a public cloud is that a developer needs to understand far more than he or she is used to. These are the same parameters that are facets of distributed computing. The following is a fun animation to explain its different parts. In essence there are 4...
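As a quick aside on the arithmetic behind CIDR notation (not part of the animation): a /n block contains 2^(32-n) IPv4 addresses, for example:

// Number of IPv4 addresses in a CIDR block of the given prefix length.
def cidrBlockSize(prefixLength: Int): Long = 1L << (32 - prefixLength)

cidrBlockSize(16)   // 65536 addresses, e.g. 10.0.0.0/16 (a typical VPC)
cidrBlockSize(24)   // 256 addresses,   e.g. 10.0.1.0/24 (a typical subnet)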

Amazon DynamoDB Sizing Demystified

Amazon DynamoDB uses the partition as its unit of storage (partitions are automatically replicated across three AZs). When it comes to sizing, there are two considerations: data size and network throughput. A partition can contain a maximum of 10 GB of data, so if you have 50...
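To make the data-size side of that arithmetic concrete, a tiny sketch (only the 10 GB per-partition limit comes from the recipe; the sample sizes are made up):

// Partitions needed purely by data size, given the 10 GB per-partition limit.
def partitionsByDataSize(dataSizeGB: Double): Long = math.ceil(dataSizeGB / 10.0).toLong

partitionsByDataSize(50)   // 5 partitions
partitionsByDataSize(4)    // 1 partition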

Spark: Inferring Schema Using Case Classes

To make this recipe one should know about its main ingredient, and that is case classes. These are special classes in Scala, and the main spice of this ingredient is that all the grunt work which is needed in Java can be done in a case class in one line of code. Spark uses...
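A minimal sketch of the idea, assuming Spark 1.3+ with a sqlContext available in the shell; the Person class and people.txt file are illustrative:

case class Person(name: String, age: Int)       // one line: fields, types, equals, toString, ...

val people = sc.textFile("people.txt")          // lines like "Barack,55"
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

import sqlContext.implicits._
val peopleDF = people.toDF()                    // schema inferred from the case class
peopleDF.printSchema()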

Logistic Regression with Spark MLlib

Dataset: diabetes data from https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data. Load it into the medical_data folder in HDFS. scala> import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS scala>...
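A hedged sketch of how such a recipe typically continues with the MLlib API, assuming the last column of the Pima file is the 0/1 label and the first eight columns are features:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("medical_data/pima-indians-diabetes.data")
  .map(_.split(","))
  .map(v => LabeledPoint(v.last.toDouble, Vectors.dense(v.init.map(_.toDouble))))

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)
model.predict(Vectors.dense(6, 148, 72, 35, 0, 33.6, 0.627, 50))   // predict for one sample row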

Kafka Connect: Connecting JDBC Source Using MySQL

Notice: Confluent Platform is the trademark and property of Confluent Inc. Kafka 0.9.0 comes with Kafka Connect. Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to quickly define...
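For orientation, a JDBC source connector is usually configured with a properties file roughly like the following (the connector class and option names come from the Confluent JDBC connector; the database, column and topic prefix here are placeholders):

name=mysql-source-connector
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://localhost:3306/demo?user=root&password=secret
mode=incrementing
incrementing.column.name=id
topic.prefix=mysql-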


Spark Streaming: Window vs Batch

Spark Streaming is a microbatch-based streaming library. What that means is that streaming data is divided into batches based on a time slice called the batch interval. Every batch gets converted into an RDD, and this continuous stream of RDDs is represented as a DStream....
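A small sketch of the two knobs, assuming a socket source on localhost:9999; the batch interval fixes the size of each RDD, while a window looks back across several batches:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))            // batch interval: one RDD every 2 seconds
val lines = ssc.socketTextStream("localhost", 9999)

// Window: operate on the last 30 seconds, sliding every 10 seconds.
// Both durations must be multiples of the batch interval.
val counts = lines.flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

counts.print()
ssc.start()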


Spark MLlib: Dense vs Sparse Vectors

Let's use a house as a vector with the following features: square footage, last sold price, lot size, number of rooms, number of bathrooms, year built, zip code, tennis court, pool, jacuzzi, sports court. Now let's put in some values: Square Footage => 4450, Last Sold Price =>...
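Roughly, the same house in MLlib terms; apart from the 4450 square feet given above, the numbers are placeholders:

import org.apache.spark.mllib.linalg.Vectors

// Dense: every position gets a value, including the zeros for missing amenities.
val houseDense = Vectors.dense(4450, 660000, 8000, 4, 2, 1994, 95054, 0, 0, 0, 0)

// Sparse: the size plus only the non-zero (index, value) pairs.
val houseSparse = Vectors.sparse(11, Array(0, 1, 2, 3, 4, 5, 6),
  Array(4450.0, 660000.0, 8000.0, 4.0, 2.0, 1994.0, 95054.0))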


Different Parts of Hive

Hive has 5 parts: HiveQL as the language; Hive as a command-line tool, which you can call the client; the Hive MetaStore (port 9083); the Hive warehouse; and HiveServer2 (port 10000). HiveQL: HiveQL is a query language similar to SQL but supports a limited set of SQL functions. It also adds a few...
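As an illustration of where HiveServer2's port 10000 fits in, a hedged Scala sketch of a JDBC client; it assumes the hive-jdbc driver is on the classpath, and the host, database and user are placeholders:

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
val rs = conn.createStatement().executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()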


Different Ways of Setting AWS Credentials in Spark

To connect to S3 from Spark you need two environment variables for security credentials, and they are AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY. There are three ways to set them up. 1. In the .bashrc file, add the following two lines at the end. Replace these dummy values with the...
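One of those ways, sketched in the Spark shell with AWS's documented dummy credentials (the bucket name is a placeholder):

// Hadoop configuration properties used by the s3n:// filesystem; values are dummies.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AKIAIOSFODNN7EXAMPLE")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")

val data = sc.textFile("s3n://my-bucket/my-data.txt")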


Spark: Creating Machine Learning Pipelines Using Spark ML

Spark ML is a new library which uses DataFrames as its first-class citizen, as opposed to RDDs. Another feature of Spark ML is that it helps in combining multiple machine learning algorithms into a single pipeline. You can read more about DataFrames here. Spark ML uses...
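A minimal sketch of such a pipeline; the trainingDF with "text" and "label" columns is assumed for illustration, not part of the recipe excerpt:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Each stage transforms the DataFrame; the Pipeline chains them together.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)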


Spark Submit With SBT

This recipe assumes sbt is installed and that you have already gone over the MySQL with Spark recipe. I am a big fan of the Spark shell. The biggest proof is the Spark Cookbook, which has all recipes in the form of a collection of single commands on the Spark shell. It makes it easy to...
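The core of such a setup usually looks something like this; the project name, package, class and versions below are placeholders, not the recipe's actual values. A minimal build.sbt with Spark marked as provided:

name := "spark-mysql-app"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

Then package and submit:

$ sbt package
$ spark-submit --class com.example.MyApp target/scala-2.10/spark-mysql-app_2.10-1.0.jar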


Spark: Accumulators

Accumulators are Spark's answer to MapReduce counters, but they do much more than that. Let's start with a simple example which uses accumulators for a simple count. scala> val ac = sc.accumulator(0) scala> val words =...
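A guess at how that snippet typically continues (the words file is a placeholder):

val ac = sc.accumulator(0)                        // driver-side counter
val words = sc.textFile("words").flatMap(_.split(" "))
words.foreach(w => ac += 1)                       // executors can only add to it
ac.value                                          // only the driver can read the total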


Spark: Calculating Correlation Using RDD of Vectors

Correlation is a relationship between two variables such that if one changes, the other also changes. Correlation measures how strong this change is on a scale from 0 to 1 (in magnitude): 0 means there is no correlation at all, while 1 means perfect correlation, i.e., if the first variable doubles, the second...
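A minimal sketch with MLlib's Statistics.corr over an RDD of Vectors; the observations are made up so that the second column is exactly double the first:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 4.0),
  Vectors.dense(3.0, 6.0)))

// Pairwise Pearson correlation matrix; here every entry is 1.0.
val corrMatrix = Statistics.corr(observations, "pearson")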


Spark: Connecting to Amazon EC2

The Apache Spark installation comes bundled with the spark-ec2 script, which makes it easy to create Spark instances on EC2. This recipe will cover connecting to EC2 using this script. Log in to your Amazon AWS account and click on Security Credentials under your account name in...
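Once the credentials are in hand, launching a cluster with the bundled script looks roughly like this (the key pair, key file, cluster size and cluster name are placeholders, and the dummy credentials are AWS's documented examples):

$ export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
$ export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
$ ./spark-ec2 -k my-key-pair -i ~/my-key-pair.pem -s 2 launch my-spark-cluster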


Spark: Connecting To A JDBC Data-Source Using Dataframes

So far in Spark, JdbcRDD has been the right way to connect to a relational data source. From Spark 1.4 onwards there is a built-in data source available to connect to a JDBC source using DataFrames. DataFrames: Spark introduced DataFrames in version 1.3 and enriched...
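A minimal sketch of the Spark 1.4 read API; the MySQL URL, credentials and table are placeholders, and the JDBC driver must be on the classpath:

val url = "jdbc:mysql://localhost:3306/demo?user=root&password=secret"

val df = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "person")
  .load()

df.show()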


Journey of Schema Into Big Data World

Schemas have a long history in the software world. Most of their popularity should be attributed to relational databases like Oracle. In the RDBMS world, schemas are your best friend. You create a schema, do operations using the schema, and hardly care about the underlying storage...


Spark: JDBC Using DataFrames

From Spark 1.3 onwards, JdbcRDD is not recommended, as DataFrames have support for loading JDBC data. Let us look at a simple example in this recipe. Using JdbcRDD with Spark is slightly confusing, so I thought about putting together a simple use case to explain the functionality. Most...
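In Spark 1.3 itself the same load can be expressed with SQLContext.load, which was later replaced by read.format; the URL and table below are placeholders:

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/demo?user=root&password=secret",
  "dbtable" -> "person"))

df.show()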


Spark: DataFrames And JDBC

From Spark 1.3 onward, JdbcRDD is not recommended, as DataFrames have support for loading JDBC data. Let us look at a simple example in this recipe. Using JdbcRDD with Spark is slightly confusing, so I thought about putting together a simple use case to explain the functionality. Most...


Spark: DataFrames And Parquet

This recipe works with Spark 1.3 onward. Apache Parquet as a file format has garnered significant attention recently. Let's say you have a table with 100 columns; most of the time you are going to access 3-10 of them. In a row-oriented format, all columns are scanned...
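A minimal round-trip sketch in the Spark 1.3 API; the Person rows and the people.parquet path are illustrative:

import sqlContext.implicits._
case class Person(name: String, age: Int)

val people = sc.parallelize(Seq(Person("Barack", 55), Person("Bill", 69))).toDF()
people.saveAsParquetFile("people.parquet")          // columnar on disk

val peopleParquet = sqlContext.parquetFile("people.parquet")
peopleParquet.registerTempTable("people")
sqlContext.sql("select name from people").show()    // only the name column is read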


DataFrames With Apache Spark

Apache Spark 1.3 introduced a new concept called DataFrames, and it was developed further in version 1.4. DataFrames also complement Project Tungsten, a new initiative to improve CPU performance in Spark. Motivation behind DataFrames: Spark so far has had a...


SparkSql: Left Outer Join With DSL

Spark SQL is great at executing SQL, but sometimes you want to stick to the RDD level. That is where the DSL comes in. How exactly to use Spark with the DSL sometimes becomes confusing, and this recipe is an attempt to reduce that confusion. Let's get done with the pleasantries first, i.e.,...
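For orientation, a left outer join in the DataFrame DSL looks roughly like this; the Person/Address tables are made up for illustration:

import sqlContext.implicits._

case class Person(pid: Int, name: String)
case class Address(pid: Int, city: String)

val persons = sc.parallelize(Seq(Person(1, "Barack"), Person(2, "Bill"))).toDF()
val addresses = sc.parallelize(Seq(Address(1, "Washington"))).toDF()

// Keeps every person, with nulls where no address matched.
val joined = persons.join(addresses, persons("pid") === addresses("pid"), "left_outer")
joined.show()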


Spark SQL: SchemaRDD: Programmatically Specifying Schema

This recipe is inspired by http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema and all rights are owned by their respective owners. In Spark SQL, the best way to create a SchemaRDD is by using a Scala case class. Spark...
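A minimal sketch of the programmatic route, shown here in the Spark 1.3+ style (createDataFrame); in the Spark 1.2 SchemaRDD era the equivalent call is applySchema. The people.txt file is illustrative:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Build the schema by hand instead of inferring it from a case class.
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))

val rowRDD = sc.textFile("people.txt")          // lines like "Barack,55"
  .map(_.split(","))
  .map(p => Row(p(0), p(1).trim.toInt))

val peopleDF = sqlContext.createDataFrame(rowRDD, schema)
peopleDF.registerTempTable("people")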


SqlContext and SchemaRDD in Spark 1.2

Please note that SchemaRDD in Spark 1.2 has been replaced by DataFrames in Spark 1.3. SqlContext can be used to load underlying data in JSON and Parquet format like: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> import sqlContext._...
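Continuing that snippet, a small sketch of the 1.2-style loaders (the file names are placeholders); both calls return a SchemaRDD:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val people = sqlContext.jsonFile("people.json")
val sales = sqlContext.parquetFile("sales.parquet")

people.registerTempTable("people")
sqlContext.sql("select * from people").collect()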


Spark with Avro

This recipe is based on the Databricks spark-avro library and all rights to this library are owned by Databricks. Download the spark-avro library: $wget https://github.com/databricks/spark-avro/archive/master.zip $unzip master.zip $cd spark-avro-master $sbt...
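Once the library is built and on the shell's classpath, usage is roughly as follows (assuming Spark 1.4+; the episodes.avro path is a placeholder):

val df = sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
df.show()

// Writing a DataFrame back out as Avro.
df.write.format("com.databricks.spark.avro").save("episodes_copy")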


Spark SQL: Calculating Duration – Timeformat to Date

Spark SQL does not support a date type, so things like duration become tough to calculate. That said, in Spark everything is an RDD, and that is a hidden weapon which can always be used when higher-level functionality is limited. Let's take a case where we are getting two...
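A hedged sketch of the RDD-level approach, assuming hypothetical records of the form "id,start,end" with timestamps like 2015-06-01 10:30:00:

import java.text.SimpleDateFormat

val durations = sc.textFile("events.csv").map { line =>
  val Array(id, start, end) = line.split(",")
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")   // not thread-safe, so one per record
  val millis = fmt.parse(end).getTime - fmt.parse(start).getTime
  (id, millis / 1000)                                     // duration in seconds
}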


RDD, SchemaRDD, MySQL and Joins

The RDD, being the unit of compute in Spark, is capable of doing all the complex operations which are traditionally done using complex queries in databases. It is sometimes difficult to find the exact steps to perform these operations, so this blog is an attempt in that direction...
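As a tiny taste of that, a join at the plain RDD level on pair RDDs keyed by id (the data is made up):

val persons = sc.parallelize(Seq((1, "Barack"), (2, "Bill")))
val cities = sc.parallelize(Seq((1, "Washington"), (2, "New York")))

// Inner join on the key, the RDD-level equivalent of a SQL join.
val joined = persons.join(cities)        // RDD[(Int, (String, String))]
joined.collect().foreach(println)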


Demystifying Compression

Compression has an important role to play in Big Data technologies. It makes both storage and transport of data more efficient. So why are there so many compression formats, and what are the things we have to balance when making a decision about which compression format...


Spark SQL: Parquet

This recipe works with Spark 1.2. For the Spark 1.3 version, please click here. Apache Parquet, as a file format, has garnered significant attention recently. Let's say you have a table with 100 columns; most of the time you are going to access 3-10 of them. In a row...


Spark Core: sc.textFile vs sc.wholeTextFiles

While loading an RDD from source data, there are two choices which look similar: scala> val movies = sc.textFile("movies") scala> val movies = sc.wholeTextFiles("movies") sc.textFile: SparkContext's textFile method, i.e., sc.textFile in the Spark shell, creates an RDD...
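To make the difference concrete, the two calls return differently shaped RDDs:

// One element per line, across however many files are in the directory.
val lines: org.apache.spark.rdd.RDD[String] = sc.textFile("movies")

// One element per file: (full file path, entire file content).
val files: org.apache.spark.rdd.RDD[(String, String)] = sc.wholeTextFiles("movies")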


Spark SQL: JdbcRDD

Using JdbcRDD with Spark is slightly confusing, so I thought about putting together a simple use case to explain the functionality. Most probably you'll use it with spark-submit, but I have put it here in spark-shell to make it easier to illustrate. Database preparation: we are going to...
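For reference, the JdbcRDD constructor looks roughly like this; the MySQL URL, table and bounds are placeholders, the driver jar must be on the classpath, and the query must carry the two "?" bind variables JdbcRDD uses to partition the key range:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val data = new JdbcRDD(sc,
  () => DriverManager.getConnection("jdbc:mysql://localhost:3306/demo?user=root&password=secret"),
  "select id, name from person where ? <= id and id <= ?",
  1, 100, 3,                                    // lowerBound, upperBound, numPartitions
  r => (r.getInt("id"), r.getString("name")))

data.collect().foreach(println)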


Spark SQL: SqlContext vs HiveContext

There are two ways to create a context in Spark SQL: SqlContext: scala> import org.apache.spark.sql._ scala> val sqlContext = new SQLContext(sc) HiveContext: scala> import org.apache.spark.sql.hive._ scala> val hc = new HiveContext(sc) Though most of the...
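Roughly, the practical difference (table and file names are placeholders; the read API shown is Spark 1.4+): HiveContext can query tables already defined in the Hive metastore, while SQLContext only sees tables you register yourself.

hc.sql("select count(*) from my_hive_table").show()

val people = sqlContext.read.json("people.json")
people.registerTempTable("people")
sqlContext.sql("select * from people").show()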
