Hive has 5 parts

  1. HiveQL as the language
  2. Hive as a command-line tool, which you can call the client
  3. Hive MetaStore (port: 9083)
  4. Hive warehouse
  5. HiveServer2 (port: 10000)

HiveQL is a query language similar to SQL, but it supports only a limited set of SQL functions. It also adds a few features that do not exist in SQL. In Spark, the HiveContext we used supports a subset of HiveQL features.

Hive as a tool
Hive as a command-line tool converts Hive queries into MapReduce jobs. Because this is very slow, it is hardly used anymore.
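
As a rough illustration of how the command-line client is driven, here is a minimal Python sketch that builds the argument list for the `hive` binary using its `-e` (inline query) and `-f` (script file) options. The helper names `hive_cli_command` and `run_hive` are hypothetical, and the sketch assumes a `hive` executable is on the PATH of the machine where it runs.

```python
import shutil
import subprocess


def hive_cli_command(query=None, script=None):
    """Build the argv for the Hive CLI (hypothetical helper).

    Exactly one of `query` (inline HiveQL, passed with -e) or
    `script` (path to a .hql file, passed with -f) must be given.
    """
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    if query is not None:
        return ["hive", "-e", query]
    return ["hive", "-f", script]


def run_hive(query):
    """Run an inline query through the Hive CLI and return its stdout.

    Assumes a `hive` binary is installed; raises if it is not found.
    """
    if shutil.which("hive") is None:
        raise RuntimeError("hive CLI not found on PATH")
    result = subprocess.run(
        hive_cli_command(query=query),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_hive("SHOW TABLES")` on a cluster node would hand the query to the CLI, which is where the translation into MapReduce jobs happens.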

Hive Metastore
By default the Hive Metastore is stored in Derby, but in production instances it is almost always stored in MySQL. The Metastore holds the schema definitions of the data underlying Hive tables.
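
To make the Derby-to-MySQL switch more concrete, the sketch below renders the `javax.jdo.option.*` connection properties that hive-site.xml uses to point the Metastore at a database. It is only an illustration: the function name `metastore_site_xml` is hypothetical, the host, database name, user and password are placeholders, port 3306 is simply the MySQL default, and the driver class name depends on which MySQL connector version is installed.

```python
import xml.etree.ElementTree as ET


def metastore_site_xml(host, database, user, password):
    """Render hive-site.xml properties for a MySQL-backed Metastore.

    All argument values are placeholders supplied by the caller.
    """
    props = {
        # JDBC URL of the database that holds the Metastore tables
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{host}:3306/{database}",
        # Assumption: older connector class name; newer connectors differ
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(metastore_site_xml("dbhost", "metastore", "hive", "secret"))
```

Other tools then reach the Metastore service itself over Thrift, typically with a URI of the form thrift://<host>:9083.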

With the advent of file formats like Parquet, which store the schema along with the data, the need for a separate metastore is shrinking. That said, a common metadata layer shared among tools has its own value, e.g. if you would like a common metadata layer between Hadoop and SAP HANA.

Hive warehouse
The Hive warehouse is just a segregated location in HDFS that is owned and managed by the Hive user. By default this location is /user/hive/warehouse. This gives a convenient location to store all structured data.
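
The layout under that directory follows a simple convention for managed tables: tables in the default database sit directly under the warehouse root, while tables in any other database sit under a `<database>.db` directory. The sketch below computes that default path. The function name `table_location` is hypothetical, and a table created with an explicit LOCATION would not follow this rule.

```python
import posixpath

# Default warehouse root mentioned above
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_location(table, database="default", root=WAREHOUSE_ROOT):
    """Return the default HDFS directory of a managed Hive table.

    Hive treats identifiers as case-insensitive and stores them in
    lowercase, so names are lowercased before building the path.
    """
    table = table.lower()
    database = database.lower()
    if database == "default":
        return posixpath.join(root, table)
    return posixpath.join(root, database + ".db", table)


if __name__ == "__main__":
    print(table_location("sales"))
    print(table_location("orders", "retail"))
```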

HiveServer2
You need a way to connect to Hive (metastore + warehouse) from other applications. HiveServer2 provides a service-level interface for that. It uses Thrift as its protocol.
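
Client applications usually address HiveServer2 through a JDBC URL with the `jdbc:hive2://` scheme. The sketch below builds such a URL and does a plain TCP check that something is listening on the port. Both helper names are hypothetical, the host is a placeholder, and the check only confirms that the port accepts connections, not that the service behind it is actually HiveServer2.

```python
import socket


def hiveserver2_jdbc_url(host, port=10000, database="default"):
    """Build a JDBC URL for HiveServer2 (10000 is the port listed above)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def is_listening(host, port=10000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "hive-host" is a placeholder hostname
    print(hiveserver2_jdbc_url("hive-host"))
    print(is_listening("hive-host"))
```

The same URL can be passed to the Beeline client, for example `beeline -u jdbc:hive2://hive-host:10000/default`.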

Contributed by Spark Training class of Feb 16