Spark Streaming is a microbatch-based streaming library. Streaming data is divided into batches based on a time slice called the batch interval. Every batch gets converted into an RDD, and this continuous stream of RDDs is represented as a DStream.
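To make this concrete, here is a minimal sketch of creating a DStream with a 10-second batch interval (the local master, socket source, host, and port are placeholder choices):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("BatchIntervalExample").setMaster("local[2]")
// Every 10 seconds of incoming data becomes one batch, i.e. one RDD in the DStream
val ssc = new StreamingContext(conf, Seconds(10))

// Each batch of lines read from the socket arrives as one RDD
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()
ssc.awaitTermination()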

Sometimes we need to know what happened in the last n seconds every m seconds. As a simple example, let's say the batch interval is 10 seconds and we need to know what happened in the last 60 seconds every 30 seconds. Here 60 seconds is called the window length and 30 seconds the slide interval. Let's say the first 6 batches are A, B, C, D, E, F, which make up the first window. After 30 seconds the second window is formed, containing D, E, F, G, H, I. As you can see, 3 batches are common between the first and second windows.
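Continuing the sketch above, the window length and slide interval are passed directly to the windowing operations (the word-count logic is just an illustration):

// With a 10-second batch interval, this window covers the last 6 batches
// and emits a new windowed RDD every 3 batches (30 seconds)
val last60s = lines.window(Seconds(60), Seconds(30))
last60s.count().print()

// The same window expressed as an incremental per-key reduce
val windowedWordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(30))
windowedWordCounts.print()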

One thing to remember about windows is that Spark holds the whole window in memory. In the first window it will combine RDDs A to F using the union operator to create one big RDD. This takes 6 times the memory, which is fine if that's what you need. In some cases, though, you only need to carry some state from batch to batch. This can be accomplished using updateStateByKey. Spark 1.6 adds the mapWithState method, which has better performance.
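Here is a minimal sketch of both stateful approaches for a running word count; the checkpoint path is a placeholder, and checkpointing must be enabled for stateful operations:

// Required for updateStateByKey and mapWithState
ssc.checkpoint("/tmp/spark-checkpoint")

val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// updateStateByKey: carries a running count per key from batch to batch,
// but touches every key in the state on every batch
val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}

// mapWithState (Spark 1.6+): same idea, but only processes keys that appear
// in the current batch, which is where the performance gain comes from
import org.apache.spark.streaming.{State, StateSpec}

val spec = StateSpec.function { (word: String, one: Option[Int], state: State[Int]) =>
  val newCount = state.getOption.getOrElse(0) + one.getOrElse(0)
  state.update(newCount)
  (word, newCount)
}
val statefulCounts = pairs.mapWithState(spec)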
Spark 2.0 is going to introduce Structured Streaming, where the stream is represented as a streaming Dataset/DataFrame, which will make it very easy to run queries against it.
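As a rough sketch of what running a query against a streaming Dataset looks like in Spark 2.0's Structured Streaming (the socket source, host, and port are again placeholder choices):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()
import spark.implicits._

// The stream is exposed as an unbounded DataFrame/Dataset
val streamingLines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Ordinary Dataset operations express the query
val wordCounts = streamingLines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the updated counts to the console
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()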
