Spark Streaming: Window vs Batch

Spark Streaming is a micro-batch-based streaming library. That means streaming data is divided into batches based on a time slice called the batch interval. Every batch gets converted into an RDD, and this continuous stream of RDDs is represented as a DStream....
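The batching idea can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark's implementation: the function name `micro_batches` and the sample events are hypothetical, and each inner list merely stands in for the RDD that Spark would build for one batch interval.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches by time slice.

    Each returned list stands in for the RDD of one batch interval.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events in [k * interval, (k + 1) * interval) share batch k.
        batches[int(timestamp // batch_interval)].append(value)
    # The time-ordered sequence of batches plays the role of the DStream.
    return [batches[k] for k in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
stream = micro_batches(events, batch_interval=1.0)
print(stream)  # [['a', 'b'], ['c'], ['d', 'e']]
```

One simplification to keep in mind: this sketch skips intervals in which no event arrived, whereas a real streaming job still produces a (possibly empty) batch for every interval.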
Spark MLLib: Dense vs Sparse Vectors

Let’s use a house as a vector with the following features: square footage, last sold price, lot size, number of rooms, number of bathrooms, year built, zip code, tennis court, pool, jacuzzi, sports court. Now let’s put in some values: Square Footage => 4450, Last Sold price...
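To make the dense/sparse distinction concrete, here is a small plain-Python sketch rather than MLlib code. The helpers `to_sparse` and `to_dense` are hypothetical names, and every value in `house` except the square footage of 4450 is invented for illustration. The `(size, indices, values)` layout mirrors the arguments that MLlib's `Vectors.sparse` takes.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full vector, filling the omitted positions with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Feature order follows the list above. Only 4450 comes from the text;
# all other values are made up for this example.
house = [4450.0, 2600000.0, 40000.0, 4.0, 3.0, 1987.0, 95070.0,
         0.0, 1.0, 0.0, 0.0]

size, indices, values = to_sparse(house)
print(size, indices)  # 11 [0, 1, 2, 3, 4, 5, 6, 8]
```

A dense vector stores every position, zeros included, while the sparse form stores only the non-zero positions and their indices, so it pays off mainly when most entries (here, the amenity flags) are zero.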
Different Parts of Hive

Hive has 5 parts: HiveQL as the language; Hive as a command-line tool, which you can call the client; Hive MetaStore (port: 9083); Hive warehouse; HiveServer2 (port: 10000). HiveQL: HiveQL is a query language similar to SQL, but it supports only a limited set of SQL functions. It also adds a few...
Different Ways of Setting AWS Credentials in Spark

To connect to S3 from Spark, you need two environment variables for security credentials: AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY. There are three ways to set them up. 1. In the .bashrc file, add the following two lines at the end. Replace these dummy values with the...