Spark/Kafka/Hadoop & AWS PaaS Services Cook Book

The power of Spark as a Big Data compute platform needs no explanation (at least not in a cookbook). This cookbook contains recipes that solve common day-to-day needs in Spark. To keep things simple, the recipes are written to run on the Spark shell, which provides a SparkContext as the val sc. In a Scala program you can initialize a SparkContext yourself:


import org.apache.spark.{SparkConf, SparkContext}

// appName appears in the Spark UI; master is e.g. "local[*]" or a cluster URL
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sparkContext = new SparkContext(conf)

If you have any feedback about these recipes, please submit your feedback to

HBase Client in a Kerberos Enabled Cluster

Use-case: HBase servers are in a Kerberos-enabled cluster, and the HBase servers (Masters and RegionServers) are configured to authenticate when connecting to Zookeeper. Assumption: HBase + secured Zookeeper. This Java code snippet can be used to connect to HBase configured with secured Zookeeper/RPC. The HBase client can be on a remote node or inside ...
Read More
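Before client code can authenticate to a secured Zookeeper, the JVM typically needs a JAAS configuration (passed via -Djava.security.auth.login.config). A minimal sketch; the keytab path and principal below are placeholder values, not ones from this recipe:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/hbase-client.keytab"
  principal="hbase-client@EXAMPLE.COM";
};
```

The Client section is the login context the Zookeeper client library looks up by default.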


Principal A principal is an IAM entity that is allowed to interact with AWS resources. A principal can be a human or an application. There are three types of principals: Root User ...
Read More

AWS: Elastic Load Balancing

Elastic Load Balancing is a highly available service that distributes incoming traffic across EC2 instances. Types of Load Balancers Internet-facing load balancer Takes requests from clients over the internet and passes them to EC2 instances. The load balancer gets a DNS name which clients can use to send requests. Internal Load Balancers Internal load balancers take ...
Read More
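The distribution behavior described above can be sketched with a toy round-robin dispatcher (names like Backend and RoundRobin are illustrative only, not part of any AWS API):

```scala
// Toy round-robin dispatcher, mimicking how a load balancer
// spreads incoming requests across registered EC2 instances.
case class Backend(id: String)

class RoundRobin(backends: Vector[Backend]) {
  private var next = 0
  def pick(): Backend = {
    val b = backends(next % backends.size)
    next += 1
    b
  }
}

val lb = new RoundRobin(Vector(Backend("i-1"), Backend("i-2"), Backend("i-3")))
// six requests cycle twice through the three instances
val assigned = (1 to 6).map(_ => lb.pick().id)
```

A real load balancer also health-checks instances and removes unhealthy ones from rotation; this sketch only shows the distribution idea.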

AWS: CIDR blocks explained

One aspect of working with the public cloud is that a developer needs to understand far more than he or she is used to. These are the same parameters that are facets of distributed computing. Following is a fun animation to explain the different parts of it. In essence there are 4 parts ...
Read More
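As a quick numeric companion to the post above, here is a small Scala sketch that splits a CIDR string into its network address and prefix length and computes how many IPv4 addresses it covers (pure arithmetic, no AWS APIs involved):

```scala
// Split a CIDR block like "10.0.0.0/16" into its network part and
// prefix length, and compute how many IPv4 addresses it spans.
def parseCidr(cidr: String): (String, Int, Long) = {
  val Array(network, prefixStr) = cidr.split("/")
  val prefix = prefixStr.toInt
  val addresses = 1L << (32 - prefix) // 2^(number of host bits)
  (network, prefix, addresses)
}

// a /16 leaves 16 host bits, i.e. 65536 addresses
val (net, prefix, count) = parseCidr("10.0.0.0/16")
```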

TensorFrames: TensorFlow with Spark

TensorFlow is a deep learning library. Tensors Tensors are inspired by tensors in physics, which are multidimensional objects in hyperspace. In the same way, you can consider a vector a 1-d tensor and a matrix a 2-d tensor. In TensorFlow these dimensions are called ranks. So to restate ...
Read More
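The rank terminology can be made concrete without TensorFlow: the rank is the number of dimensions, and the shape lists the size of each one. A plain Scala sketch using nested collections:

```scala
// Rank = number of dimensions (nesting depth); shape = size of each dimension.
val vector = Vector(1.0, 2.0, 3.0)            // rank 1, shape (3)
val matrix = Vector(Vector(1.0, 2.0, 3.0),
                    Vector(4.0, 5.0, 6.0))    // rank 2, shape (2, 3)

val vectorShape = List(vector.size)
val matrixShape = List(matrix.size, matrix.head.size)
```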

Streaming Twitter Data to Cassandra Using Spark Streaming

We are going to use DSE installed on the InfoObjects big data sandbox. In the cqlsh shell, create a new keyspace twitter: cqlsh> create keyspace twitter with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }; Switch to the keyspace: cqlsh> use twitter; Create the tweets table: cqlsh:twitter> create table tweets(insert_time bigint primary key, tweet text); Now let's switch ...
Read More

Cassandra: Compaction simplified

In Cassandra, SSTables are periodically compacted. This process looks complicated, but it is similar to checkpointing in HDFS, i.e. essentially a housekeeping process. Data from Memtables gets flushed to SSTables periodically. At some point the data in these SSTables needs to be reconciled; this is where compaction comes into ...
Read More
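The reconciliation step can be sketched as merging several SSTable-like maps where, for each key, the cell with the newest write timestamp wins. This is a toy model only; real compaction also handles tombstones, merges on disk, and more:

```scala
// Each toy "SSTable" maps a row key to (value, writeTimestamp).
type SSTable = Map[String, (String, Long)]

// Compaction: merge tables, keeping the newest write for every key.
def compact(tables: Seq[SSTable]): SSTable =
  tables.flatten
    .groupBy(_._1)
    .map { case (key, cells) => key -> cells.map(_._2).maxBy(_._2) }

val older: SSTable = Map("user1" -> ("alice", 100L), "user2" -> ("bob", 100L))
val newer: SSTable = Map("user1" -> ("alicia", 200L))
// user1 resolves to the newer value; user2 survives untouched
val merged = compact(Seq(older, newer))
```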

Amazon DynamoDB sizing demystified

Amazon DynamoDB uses the partition as its unit of storage (partitions are automatically replicated across three AZs). When it comes to sizing there are two considerations: data size and network throughput. A partition can contain a maximum of 10 GB of data, so if you have 50 GB of data to be stored, ...
Read More
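The size-based part of the calculation is simple arithmetic: with a 10 GB cap per partition (the figure from the excerpt above), 50 GB needs at least 5 partitions. A sketch that ignores the throughput dimension:

```scala
// Minimum partitions needed for a given data size,
// with each partition holding at most 10 GB.
def partitionsForSize(dataGb: Double, partitionCapGb: Double = 10.0): Int =
  math.ceil(dataGb / partitionCapGb).toInt

val p = partitionsForSize(50.0) // 50 GB at 10 GB per partition
```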

Serialization frameworks demystified

Before getting into the details of different serialization formats like Avro, Thrift, and Protocol Buffers, let's try to see why we need serialization at all. Since computer science concepts are boring, let's start with an analogy and then get back to the tech. Let's start with a hypothetical teleport project. The ...
Read More
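Whatever the format, the round trip is the same: object to bytes, bytes back to an equal object. That core idea can be shown with plain Java serialization from Scala (used here only to illustrate the concept, not as a recommendation over Avro/Thrift/Protobuf):

```scala
import java.io._

// Serialize an object to bytes, then deserialize it back.
case class Message(id: Int, text: String) extends Serializable

def toBytes(m: Message): Array[Byte] = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(m)
  out.close()
  buf.toByteArray
}

def fromBytes(bytes: Array[Byte]): Message = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  in.readObject().asInstanceOf[Message]
}

// restored is an equal but distinct copy of original
val original = Message(1, "teleport me")
val restored = fromBytes(toBytes(original))
```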

Spark: Inferring schema using case classes

To make this recipe one should know about its main ingredient, and that is case classes. These are special classes in Scala, and the main spice of this ingredient is that all the grunt work which is needed in Java can be done in case classes in one line of code ...
Read More
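To see the "one line replaces Java grunt work" point concretely: a single case class declaration gives you a constructor, getters, equals/hashCode, toString, and copy for free. (Inferring a DataFrame schema from such a class additionally needs a Spark session, omitted in this plain Scala illustration.)

```scala
// One line buys: constructor, getters, equals/hashCode, toString, copy.
case class Person(name: String, age: Int)

val p1 = Person("Neha", 30)
val p2 = p1.copy(age = 31)        // non-destructive update
val sameAsP1 = Person("Neha", 30) // structural equality, not reference
```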