HBase Client in a Kerberos Enabled Cluster

Use case: HBase servers are in a Kerberos-enabled cluster, and the HBase servers (Masters and RegionServers) are configured to authenticate when connecting to ZooKeeper. Assumption: HBase plus a secured ZooKeeper. This Java code snippet can be used to connect to HBase configured...
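The Java snippet itself is truncated above, but a client in this setup typically also needs a JAAS configuration so the ZooKeeper client can log in with Kerberos. A minimal sketch, assuming a hypothetical keytab path and principal (passed to the JVM via -Djava.security.auth.login.config):

```
// jaas.conf — the "Client" section is the one the ZooKeeper client reads;
// the keytab path and principal below are illustrative
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/hbase-client.keytab"
  principal="hbase-user@EXAMPLE.COM";
};
```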
AWS: CIDR blocks explained

One part of working with the public cloud is that a developer needs to understand far more than he or she is used to. These are the same parameters that are facets of distributed computing. The following is a fun animation explaining its different parts. In essence there are 4...
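The core arithmetic behind a CIDR block is small enough to sketch: a /n block contains 2^(32 − n) IPv4 addresses. A minimal Scala sketch (the /16 and /24 block sizes are illustrative):

```scala
// Addresses in an IPv4 CIDR block: 2^(32 - prefixLength)
def cidrSize(prefixLength: Int): Long = 1L << (32 - prefixLength)

println(cidrSize(16))  // a /16 block, e.g. a whole VPC: 65536 addresses
println(cidrSize(24))  // a /24 block, e.g. one subnet: 256 addresses
```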
Amazon DynamoDB Sizing Demystified

Amazon DynamoDB uses the partition as its unit of storage (partitions are automatically replicated across three AZs). When it comes to sizing there are two considerations: data size and network throughput. A partition can contain a maximum of 10 GB of data, so if you have 50...
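The 10 GB cap makes the storage side of sizing a one-line calculation: the partition count is the data size divided by 10 GB, rounded up. A minimal Scala sketch (the 50 GB figure is illustrative):

```scala
// Partitions required by storage alone: ceil(dataGB / 10),
// since one partition holds at most 10 GB
def partitionsBySize(dataGB: Double): Int = math.ceil(dataGB / 10.0).toInt

println(partitionsBySize(50.0))  // 50 GB of data needs 5 partitions
```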
Spark: Inferring Schema Using Case Classes

To make this recipe, one should know about its main ingredient, and that is case classes. These are special classes in Scala, and the main spice of this ingredient is that all the grunt work needed in Java can be done with a case class in one line of code. Spark uses...
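As a sketch of that "one line" claim, using a hypothetical House class: a single case class declaration gives you a constructor, field accessors, structural equals/hashCode, toString, and copy, all of which would be boilerplate in Java (Spark's schema inference reads the fields of such classes by reflection):

```scala
// One line replaces the constructor, getters, equals, hashCode,
// toString, and copy boilerplate that Java would need
case class House(squareFootage: Int, zipCode: String)

val h1 = House(4450, "95054")
val h2 = h1.copy(squareFootage = 4500)  // copy with one field changed

println(h1)                          // House(4450,95054)
println(h1 == House(4450, "95054"))  // structural equality: true
```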
Logistic Regression with Spark MLlib

Dataset: diabetes data from https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data. Load it into the medical_data folder in HDFS. scala> import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS...
Kafka Connect: Connecting JDBC Source Using Mysql

Notice: Confluent Platform is the trademark and property of Confluent Inc. Kafka 0.9.0 comes with Kafka Connect. Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to quickly define...
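As a sketch, a JDBC source in this setup is driven by a properties file. The keys below are from Confluent's JDBC source connector; the connection details and topic prefix are placeholder values:

```
# jdbc-source.properties — connection URL, credentials, and
# topic prefix are illustrative
name=mysql-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:mysql://localhost:3306/demo?user=root&password=secret
mode=incrementing
incrementing.column.name=id
topic.prefix=mysql-
```

With mode=incrementing, the connector tracks the largest value it has seen in the given column and only fetches rows with a larger value on each poll.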
Spark Streaming: Window vs Batch

Spark Streaming is a microbatch-based streaming library. What that means is that streaming data is divided into batches based on a time slice called the batch interval. Every batch gets converted into an RDD, and this continuous stream of RDDs is represented as a DStream....
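The batch-interval idea can be sketched without the Spark API: bucket event timestamps by integer-dividing by the interval, where each bucket plays the role of one RDD in the DStream (the timestamps and the 1-second interval are illustrative):

```scala
// Each event lands in the microbatch numbered timestamp / interval;
// in Spark Streaming every such bucket becomes one RDD of the DStream
def batchOf(timestampMs: Long, batchIntervalMs: Long): Long =
  timestampMs / batchIntervalMs

val events  = Seq(100L, 900L, 1100L, 1950L, 2100L)
val batches = events.groupBy(batchOf(_, 1000L))

println(batches(0L))  // List(100, 900): the first 1-second batch
```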
Spark MLLib: Dense vs Sparse Vectors

Let’s use a house as a vector with the following features: square footage, last sold price, lot size, number of rooms, number of bathrooms, year built, zip code, tennis court, pool, jacuzzi, sports court. Now let’s put in some values: square footage => 4450, last sold price...
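The contrast can be sketched in plain Scala, mirroring what MLlib's Vectors.dense and Vectors.sparse(size, indices, values) store. Only the 4450 square-footage value comes from the excerpt; the rest is illustrative, with the zero slots standing for absent features like a tennis court:

```scala
// Dense: every slot stored, zeros included
val dense = Array(4450.0, 0.0, 0.0, 0.0)

// Sparse: only the non-zero entries, as parallel index/value
// arrays plus the overall size
val size    = 4
val indices = Array(0)
val values  = Array(4450.0)

// Expanding the sparse form reproduces the dense vector
val expanded = Array.fill(size)(0.0)
indices.zip(values).foreach { case (i, v) => expanded(i) = v }
println(expanded.sameElements(dense))  // true
```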
Different Parts of Hive

Hive has 5 parts: HiveQL as the language; Hive as a command-line tool, which you can call the client; the Hive MetaStore (port 9083); the Hive warehouse; and HiveServer2 (port 10000). HiveQL is a query language similar to SQL but supports a limited set of SQL functions. It also adds a few...
Different Ways of Setting AWS Credentials in Spark

To connect to S3 from Spark you need two environment variables for security credentials, and they are: AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY. There are three ways to set them up. 1. In the .bashrc file, add the following two lines at the end. Replace these dummy values with the...
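A sketch of option 1, with AWS's documented dummy credentials standing in for real ones:

```shell
# Appended to the end of ~/.bashrc — replace these dummy values
export AWS_ACCESS_KEY=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

After editing, run `source ~/.bashrc` (or open a new shell) so that Spark picks the variables up.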