Hadoop Distributed File System

Hadoop has two parts:

  • HDFS (Hadoop Distributed File System) – a robust file system written in Java that distributes, stores and replicates large data files across multiple nodes. Its built-in scalability and reliability eliminate the need for expensive RAID storage (a minimal API sketch follows this list).
  • MapReduce – a framework for mapping and distributing computational processing across the nodes of the cluster (or a large server farm), and then collecting and combining the partial results into the final analytic output.

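To make the HDFS side concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:8020) and the path /data/example.txt are assumed values for illustration only; in a real cluster the address would normally come from core-site.xml.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; usually picked up from core-site.xml instead.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/example.txt");

                // Write a small file; HDFS replicates its blocks across DataNodes automatically.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
                }

                // Read the file back and copy its contents to stdout.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }
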
As computational workloads grew more demanding, MapReduce's glory faded. HDFS, on the other hand, has withstood the test of time. The most disruptive aspect of HDFS is the cost of storage. Storage in a traditional EDW system such as Teradata, Netezza or Oracle can run anywhere between $10,000 and $60,000 per TB, while HDFS storage costs roughly $100-$300 per TB. That is about a 100x difference, and it is disruptive.
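
A quick back-of-the-envelope calculation makes the gap concrete. The per-TB figures are the mid-range of the numbers quoted above; the 500 TB warehouse size is an assumed example, not from the original text.

    public class StorageCostComparison {
        public static void main(String[] args) {
            double warehouseTb = 500;     // assumed warehouse size for illustration
            double edwPerTb   = 30_000;   // mid-range of the $10,000-$60,000/TB quoted above
            double hdfsPerTb  = 200;      // mid-range of the $100-$300/TB quoted above

            System.out.printf("EDW:   $%,.0f%n", warehouseTb * edwPerTb);   // ~$15,000,000
            System.out.printf("HDFS:  $%,.0f%n", warehouseTb * hdfsPerTb);  // ~$100,000
            System.out.printf("Ratio: %.0fx%n", edwPerTb / hdfsPerTb);      // ~150x
        }
    }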

Some blogs on the Internet quote much higher figures than these. I have used smaller numbers on the assumption that prices have come down over time and will keep coming down, even for traditional EDWs.

The biggest cost savings in HDFS come from the combination of commodity hardware and open source software. Any machine with some storage capacity and processing power is a good fit. Commodity does not mean desktop machines, though; they still have to be server-class machines with a typical configuration like this:

  • 24-48 GB RAM
  • 4-10 TB of disk, preferably spread across 2-4 spindles (see the configuration sketch after this list)
  • 8-16 core processors
  • Gigabit Ethernet (preferably 10 gig)

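As a sketch of how those spindles are actually exposed to HDFS, the snippet below sets the standard dfs.datanode.data.dir and dfs.replication properties programmatically. The mount points /disk1 through /disk4 are assumed examples, and in practice these values would normally live in hdfs-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class DataNodeDiskLayout {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // One directory per physical spindle; the DataNode spreads block writes across them.
            // The mount points below are assumed examples.
            conf.set("dfs.datanode.data.dir",
                     "/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data,/disk4/dfs/data");

            // Three replicas (the HDFS default) spread across DataNodes replace RAID-style redundancy.
            conf.setInt("dfs.replication", 3);

            System.out.println("data dirs   = " + conf.get("dfs.datanode.data.dir"));
            System.out.println("replication = " + conf.getInt("dfs.replication", 3));
        }
    }
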
In HDFS, many premium features like the following come bundled in:

  • NameNode High Availability (HA)
  • Federation
  • Snapshots (a minimal snapshot sketch follows this list)
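
As one example of those bundled features, here is a minimal sketch of taking an HDFS snapshot through the Java FileSystem API. The path /data and the snapshot name nightly-backup are assumptions, and an administrator must first enable snapshots on the directory (hdfs dfsadmin -allowSnapshot /data) before this will succeed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SnapshotExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address, as in the earlier sketch.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path dir = new Path("/data");

                // Requires that an admin has already run: hdfs dfsadmin -allowSnapshot /data
                Path snapshot = fs.createSnapshot(dir, "nightly-backup");
                System.out.println("Created snapshot at " + snapshot);

                // Snapshots appear as read-only images under /data/.snapshot/<name>
                // and can be removed when no longer needed.
                fs.deleteSnapshot(dir, "nightly-backup");
            }
        }
    }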