Schemas have a long history in the software world. Much of their popularity can be attributed to relational databases such as Oracle.
In the RDBMS world, schemas are your best friend. You create a schema, perform operations through it, and hardly care about the underlying storage (unless you are a DBA). The schema in fact needs to be created before even a single byte of data is written, which is why the relational approach is called schema-on-write.
In the Big Data world, specifically Hadoop, this changed. Hadoop is primarily a storage platform, and with the diminishing popularity of MapReduce it has become even more so. Hadoop uses an approach called schema-on-read, in which you do not bother about the schema until you are ready to read the data. When it comes to reading the data, you need to figure out a schema, because you have to give some structure to even the most unstructured data before you can make any use of it.
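To make schema-on-read concrete, here is a minimal sketch in plain Scala (no Hadoop or Spark involved). The raw lines stand in for data that was written without any declared structure; the schema lives only in the reading code. The names Person and parsePerson and the sample data are hypothetical, chosen purely for illustration.

```scala
import scala.util.Try

// Schema-on-read sketch: the stored data is plain text with no declared structure.
// Person and parsePerson are hypothetical names used only for illustration.
object SchemaOnReadSketch {
  // The "schema" exists only in the reading code, not in the stored data.
  final case class Person(firstName: String, lastName: String, age: Int)

  // Apply the schema to one raw line; return None if the line does not fit it.
  def parsePerson(line: String): Option[Person] =
    line.split(",").map(_.trim) match {
      case Array(first, last, age) =>
        Try(age.toInt).toOption.map(a => Person(first, last, a))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // Nothing checked these lines when they were written: one is malformed.
    val rawLines = Seq("Ada,Lovelace,36", "not a person record", "Alan,Turing,41")
    // Structure is imposed only now, at read time; bad lines are dropped.
    rawLines.flatMap(parsePerson).foreach(println)
  }
}
```

Note that the malformed line is only discovered when the data is read. In a schema-on-write system it would have been rejected at insert time.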
In Hadoop, data is stored in the form of blocks on slave machines. A block, which is 64 MB by default (128 MB in Hadoop 2) and can be as big as 1 GB in a lot of deployments, needs to be sliced into digestible pieces. Each digestible piece is called a record, and the format used to slice the data is called an InputFormat. The image below gives a good idea of it using pizza eating as an analogy.
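The slicing idea can be sketched in a few lines of plain Scala. Hadoop's TextInputFormat hands each line to the mapper as a record whose key is the byte offset of the line and whose value is the line's text. The function below imitates that on an in-memory string. It is a simplified illustration, not Hadoop's implementation: sliceIntoRecords is a hypothetical name, and the sketch ignores details such as lines that span block boundaries.

```scala
import java.nio.charset.StandardCharsets

// Simplified imitation of line-based record slicing; not Hadoop code.
object RecordSlicingSketch {
  // Slice a chunk of text into (byteOffset, line) records, one per line.
  def sliceIntoRecords(block: String): List[(Long, String)] =
    if (block.isEmpty) Nil
    else {
      // split drops the empty string that follows a trailing newline
      val lines = block.split("\n").toList
      // Running byte offset of each line start: line bytes plus one newline byte.
      val offsets = lines.scanLeft(0L) { (offset, line) =>
        offset + line.getBytes(StandardCharsets.UTF_8).length + 1
      }
      // scanLeft yields one extra trailing offset; zip discards it.
      offsets.zip(lines)
    }

  def main(args: Array[String]): Unit =
    sliceIntoRecords("alpha\nbeta\ngamma\n").foreach {
      case (offset, line) => println(s"$offset -> $line")
    }
}
```

For the input "alpha\nbeta\ngamma\n" this yields the records (0, alpha), (6, beta) and (11, gamma).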
The InputFormat, which by default is TextInputFormat, can be thought of as the earliest form of schema in Hadoop MapReduce. Users were not much interested in writing MapReduce code directly, and that led to the creation of Hive. Hive had its own schema-on-read, but that was a real relational schema, also known as metadata. Hive stores this metadata in a separate relational store called the metastore.
Hive definitely made work easier, but managing the schema separately was still not a big data developer's favorite task.
Slowly came formats like Avro and Parquet, which embed the schema in the same files that hold the data. These formats bring plenty of benefits of their own, but the embedded schema is what separated them from earlier formats. JSON, which became immensely popular on its own (mostly as a replacement for XML), brings the same benefit of schema combined with data and is also used extensively in big data systems.
This trend is also reflected in the evolution of Apache Spark, the big data compute framework InfoObjects has a specific focus on. Spark started with the Resilient Distributed Dataset (RDD) as the unit of compute as well as the unit of in-memory storage. Spark had to address the need to work with relational queries early on. Initially it came up with Shark, which was Hive on Spark. Shark had challenges, and that led to the creation of Spark SQL a year back.
Spark SQL initially had a unit of compute called SchemaRDD, which was essentially an RDD with a schema put on top of it. From Spark 1.3 onwards, SchemaRDDs evolved into DataFrames. Now a lot of work is being done in Spark to treat DataFrames as first-class objects, the latest being the Spark ML module.
In summary, the evolution of storage into Parquet-type formats and the evolution of compute into DataFrames have made the life of a big data developer very easy. Now one command loads a JSON file into a DataFrame, and another saves it in Parquet format.
scala> val df = sqlContext.load("hdfs://localhost:9000/user/hduser/person", "json")
scala> df.select("first_name", "last_name").save("person.parquet", "parquet")