Real-time Streaming of Big Data

As users saw the power of Spark and in-memory analytics, they became more and more impatient (in a good way). They want all workloads to be handled in one layer (which Spark does well with its comprehensive suite of libraries), and they want answers from all of them now, with low-latency, near-real-time responses. This has put streaming front and center of the big data game.

Three Tenets

Streaming-first approach

A streaming-first approach means that all data is streamed and analyzed live as it arrives, rather than being stored in HDFS first and analyzed later.
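As a concrete illustration, here is a minimal Spark Streaming sketch of the streaming-first idea: events from a live socket source are analyzed the moment they arrive, with no intermediate landing in HDFS. The host, port, and word-count analysis are placeholder assumptions, not a prescribed pipeline.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingFirst {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingFirst").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Live TCP source; host and port are placeholders
    val events = ssc.socketTextStream("localhost", 9999)

    // Analysis runs on the live stream itself; nothing is written
    // to HDFS before being analyzed
    val counts = events.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```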

Streaming-only approach

A streaming-only approach means that, unlike the Lambda architecture, there is no parallel processing of batch and streaming workloads. All data is streamed first, and the next course of action is decided based on the priority (hotness) of the data.
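A minimal sketch of this single-path idea in Spark Streaming follows; the HOT prefix, the hot/cold split, and the archive path are hypothetical illustrations. The point is that every record flows through one streaming path, and persistence is a downstream decision rather than a parallel batch layer.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingOnly {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingOnly").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    val events = ssc.socketTextStream("localhost", 9999)

    events.foreachRDD { rdd =>
      // Single path: every record flows through the stream first.
      // Here "hotness" is faked with a HOT prefix for illustration.
      val tagged = rdd.map(line => (line.startsWith("HOT"), line)).cache()

      // Hot data gets immediate action; here we just count it
      println(s"hot records this batch: ${tagged.filter(_._1).count()}")

      // Cold data is archived after the fact, not re-processed
      // by a separate batch layer (archive path is a placeholder)
      tagged.filter(!_._1).map(_._2)
        .saveAsTextFile(s"hdfs:///archive/batch-${System.currentTimeMillis}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```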

In-memory storage

Data lands in memory and stays in memory for subsequent processing, and even for delivery to reporting engines. If data needs to be retained, it is persisted to data stores after the fact.
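One way this tenet might look in Spark Streaming is sketched below: data is received straight into executor memory, a windowed view stays in RAM for reporting queries, and durable persistence happens only after the fact. The source, window sizes, and output path are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object InMemoryFirst {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InMemoryFirst").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receive straight into executor memory (no on-disk replication)
    val events = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY)

    // Keep the last 10 minutes in memory so reporting queries hit RAM
    val recent = events.window(Minutes(10), Seconds(10))
    recent.persist(StorageLevel.MEMORY_ONLY)
    recent.count().print()

    // Persist to durable storage only after the fact, if retention
    // is needed (output prefix is a placeholder)
    events.saveAsTextFiles("hdfs:///retained/events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```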

Some Definitions

How to define (near)real-time?

Everyone has a different definition of real-time, but the generally accepted standard is that anything with sub-second latency is real-time, while latency measured in seconds is near real-time. Though algorithmic trading may require true real-time performance, most enterprise needs are met with near real-time. In fact, a lot of enterprise data is currently processed in hours, so near real-time is already a big jump.

Streaming Data Ingestion

The first step is to ingest data with the lowest latency possible. Even with a dedicated dark-fiber link, you are limited by the speed of light (which is not bad). For most WAN-based data ingestion, latency is in the hundreds of milliseconds.
Once data arrives, it needs to be broken into digestible pieces for fast processing. These micro-batches are determined by how often the incoming data is sliced, an interval known as the batch interval.
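In Spark Streaming, for example, the batch interval is fixed when the StreamingContext is created, and it sets the floor on end-to-end latency (roughly the batch interval plus per-batch processing time). A minimal sketch, with illustrative interval values:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

object BatchIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchInterval").setMaster("local[2]")

    // Choose the slicing granularity: sub-second pushes toward
    // real-time, a few seconds gives near real-time; the values
    // here are illustrative, not recommendations
    val interval =
      if (args.headOption.contains("realtime")) Milliseconds(500)
      else Seconds(2)

    // The batch interval is fixed at context creation; all incoming
    // data is sliced into micro-batches of this duration
    val ssc = new StreamingContext(conf, interval)

    // ...define sources and transformations here, then:
    // ssc.start(); ssc.awaitTermination()
  }
}
```

Shorter intervals reduce latency but add scheduling overhead per batch, so the batch interval is effectively the tuning knob between the real-time and near-real-time regimes described above.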