2016 – The Year of Fast Data

Many technologies change so fast that the names given to them become misnomers. Big data is one such technology: it is no longer big but fast. Most enterprises do not have petabytes of data, but they do have data that moves very fast. In other words, of the three Vs of big data (volume, velocity and variety), velocity has become predominant.

A glimpse into the history

Big data started with the creation of Hadoop by Doug Cutting. Hadoop was based on two papers from Google (a web-scale company), one on the Google File System and one on MapReduce, and it subsequently grew at Yahoo (another web-scale company).

Facebook (yet another web-scale company) created Hive so that Hadoop could be accessed through a SQL-like interface. LinkedIn and Netflix (both web-scale companies) have also contributed to the ecosystem in various ways.

While these web-scale companies provided perfect lab environments, a few startups took charge of both implementing and contributing to the open-source effort, the most notable being Cloudera, MapR and Hortonworks.

Transition from web-scale to enterprise play

For big data technologies, specifically Hadoop, to gain wider adoption, they had to meet enterprise needs. Enterprises are nothing like web-scale companies. This led to a long period in which enterprises had a split-brain approach to big data adoption. One part of the company, full of geeks, would be excited by the promise of big data, arrange some funding and start playing with it in its own silo. The rest of the company would not touch it, as it did not look enterprise-ready.

Enterprises have cross-cutting concerns such as security and data governance. These concerns, though important, were not really the stumbling block. The stumbling block was latency: Hadoop was just too slow. Enterprises had become used to the low-latency experience provided by database technologies, and they wanted the same experience from big data.

It all changed when memory became first-class storage. SAP saw this coming when it introduced SAP HANA for enterprise data around five years ago. Apache Spark has done the same for the big data ecosystem. Spark uses the memory of commodity worker nodes for both storage and compute. Spark works well with Hadoop as the underlying storage layer but is not limited to it.

Though 2015 was the year a lot of companies moved their Hadoop workloads to Spark, most of these workloads were still batch ETL/ELT type loads.

2016: I want it all and I want it now

As companies saw the power of Spark and in-memory analytics, they got more and more impatient (in a good way). In 2016 they want all workloads to be handled (which Spark does well with its comprehensive suite of libraries), and they want answers from all of them now (low latency, near real time). This has put streaming front and center in the big data game.

Another trend that has fueled the need for streaming is IoT. The evolution of IoT technologies (IoT = sensor data) has created a practically unlimited number of streaming use cases. You can now do (near) real-time analytics on data coming from programmable logic controllers (PLCs), airplanes, offshore drilling platforms and so on.

“Regarding streaming for IoT data there are plenty of opportunities to leverage the stream of IIoT (Industrial IoT) data and run monitoring rules against in real time opening new opportunities to bring value to our customers (plant floor, manufacturing, etc.). I see an evolution where features like security, multi-tenancy for the data and a high level UI to configure the rules will move the intelligence to the crowd (crowdsourcing analytics for the ‘fast torrential stream’).”

Juan Asenjo, Principal Engineer and IoT Evangelist, Rockwell Automation
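
To make the idea of running monitoring rules against a sensor stream concrete, here is a minimal sketch in plain Python. It is an illustration only, not how any particular IIoT product or Spark itself does it: the `Reading` and `Rule` types, the metric names and the thresholds are all hypothetical, and a real deployment would read from a message bus instead of an in-memory list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Reading:
    """One sensor measurement (hypothetical shape, for illustration)."""
    sensor_id: str
    metric: str
    value: float


@dataclass(frozen=True)
class Rule:
    """A named condition checked against readings of one metric."""
    name: str
    metric: str
    predicate: Callable[[float], bool]


def monitor(readings: Iterable[Reading],
            rules: List[Rule]) -> Iterator[Tuple[str, Reading]]:
    """Yield (rule name, reading) as soon as a reading violates a rule.

    Readings are processed one at a time as they arrive, so nothing is
    batched or written to disk first.
    """
    for reading in readings:
        for rule in rules:
            if rule.metric == reading.metric and rule.predicate(reading.value):
                yield rule.name, reading


# Assumed example rules and data; names and thresholds are made up.
rules = [
    Rule("overheat", "temperature_c", lambda v: v > 90.0),
    Rule("low_pressure", "pressure_kpa", lambda v: v < 200.0),
]
stream = [
    Reading("plc-1", "temperature_c", 72.5),
    Reading("plc-1", "temperature_c", 95.1),
    Reading("rig-7", "pressure_kpa", 180.0),
]
alerts = list(monitor(stream, rules))
```

Because the rules are plain data, a high-level UI of the kind the quote describes could in principle build the `rules` list without anyone touching the processing loop.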

Enterprise systems are not far behind. For two decades, enterprise data destined for analytics (which in reality is streaming data, e.g. point-of-sale data) was transformed into cubes and processed in data warehousing systems. This batching and slowing down of data was needed because the technology could not keep up. Now, with in-memory analytics, all this data can be processed directly in memory and results produced in real time.
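
As a rough sketch of the difference, the snippet below aggregates point-of-sale events in memory as they arrive, emitting per-store totals each time a fixed time window closes, instead of waiting for a nightly batch load. It is plain Python with hypothetical names (`windowed_totals`, the `Sale` tuple), and it assumes events arrive in timestamp order; a production streaming engine would also have to handle late and out-of-order data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# (timestamp in seconds, store id, sale amount) -- hypothetical event shape
Sale = Tuple[int, str, float]


def windowed_totals(sales: Iterable[Sale],
                    window_seconds: int) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (window start, totals per store) whenever a window closes.

    Assumes sales arrive ordered by timestamp. Only the current window
    is held in memory.
    """
    current_start = None
    totals: Dict[str, float] = defaultdict(float)
    for timestamp, store, amount in sales:
        start = timestamp - timestamp % window_seconds
        if current_start is not None and start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, dict(totals)
            totals = defaultdict(float)
        current_start = start
        totals[store] += amount
    if current_start is not None:
        # Flush the last, still-open window when the input ends.
        yield current_start, dict(totals)


# Assumed example data: five sales across three one-minute windows.
sales = [(0, "a", 10.0), (30, "b", 5.0), (59, "a", 2.5),
         (60, "a", 1.0), (125, "b", 4.0)]
results = list(windowed_totals(sales, window_seconds=60))
```

Each result is available as soon as its window ends, which is the low-latency behavior the cube-and-warehouse approach could not offer.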

Summary

Looking into the crystal ball for 2016, I see the following trends clearly emerging:

  • Memory becoming first-class storage. All data lands in memory first and is then pipelined for various needs.
  • Streaming data use cases arriving at an accelerating pace. I want it all and I want it now.
  • Traditional data warehouses being disrupted and slowly replaced by an in-memory analytics layer.
  • Big data integration (I love to call it plumbing) becoming the primary consulting play.

At InfoObjects, working at the bleeding edge of big data, we see a consistent pattern emerging. This consistency is good for both us and our clients, and it is creating a new design pattern for streaming analytics. In fact, in 2015 a design pattern called the Kappa architecture became very popular; it has now taken a back seat to a streaming-first, streaming-only approach.