Is data locality really a virtue?

Hadoop started with data locality as one of its primary features. Compute happens on the node where the data is stored, which reduces the amount of data that needs to be shuffled over the network. Since every commodity machine has some basic compute power, you do not need specialized hardware, and that brings the cost down to a fraction of what it would be otherwise.

This worked well for on-prem installations of Hadoop, but gradually the cloud gained more and more traction. My initial impression was that AWS was meant for startups that need to ramp up fast. That was true, but it was myopic. The cloud in general, and AWS in particular, is attracting the attention of companies of every size.

Coming back to data locality as a virtue: I think it is a virtue when your workload is storage intensive. What we see with most of our customers is that their workload is compute intensive. When your workload is compute heavy, you load the dataset once and then run multiple operations on it, so the one-time cost of loading looks reasonable. Spark, which caches datasets in memory, makes this even easier, as jobs always see the in-memory copy anyway.
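As a minimal sketch of that pattern, the Spark snippet below loads a dataset once from object storage, caches it, and then runs several operations against the in-memory copy. The bucket path, column names, and app name are hypothetical placeholders, not taken from any real workload.

```scala
import org.apache.spark.sql.SparkSession

object CacheOnceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-once-compute-many")
      .getOrCreate()

    // Hypothetical S3 path and columns -- substitute your own dataset.
    val events = spark.read.parquet("s3a://my-bucket/events/")

    // Pay the network cost of pulling from remote storage once,
    // then keep the dataset in cluster memory for later operations.
    events.cache()

    // Multiple compute-heavy operations now run against the cached copy.
    val dailyCounts = events.groupBy("event_date").count()
    val distinctUsers = events.select("user_id").distinct().count()

    dailyCounts.show()
    println(s"Distinct users: $distinctUsers")

    spark.stop()
  }
}
```

The point is that when the same dataset feeds many computations, the single remote load is amortized across all of them, which is why separating storage from compute can still be reasonable.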

Critics can argue that this is like going back to the SAN/NAS world. On the surface it may look that way, but it really isn't, because compute is still very much distributed. The battle between on-prem and cloud has been a long one, but the cloud seems to be heading towards victory.