Spark SQL: Parquet

This recipe works with Spark 1.2. For the Spark 1.3 version, please click here. Apache Parquet, as a file format, has garnered significant attention recently. Say you have a table with 100 columns, but most of the time you access only 3-10 of them. In row...
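The columnar benefit the excerpt alludes to can be sketched with the Spark 1.2-era API; the `Person` case class and the `people.parquet` path are hypothetical stand-ins for a wide table:

```scala
// Minimal sketch (Spark 1.2-style API, hypothetical data): write a
// dataset as Parquet, then read back only the column a query needs.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

case class Person(name: String, age: Int, city: String)
val people = sc.parallelize(Seq(Person("alice", 30, "NYC")))

people.saveAsParquetFile("people.parquet")  // columnar layout on disk

// A columnar reader scans only the `name` column, not entire rows.
val parquetFile = sqlContext.parquetFile("people.parquet")
parquetFile.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").collect()
```

With 100 columns, a row-oriented format would still read every row in full; Parquet touches only the bytes of the projected columns.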
Spark Core: sc.textFile vs sc.wholeTextFiles

While loading an RDD from source data, there are two choices that look similar: scala> val movies = sc.textFile("movies") scala> val movies = sc.wholeTextFiles("movies") sc.textFile SparkContext's textFile method, i.e., sc.textFile...
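The difference between the two calls shows up in the element type of the resulting RDD; a minimal sketch (the `movies` directory path is the example's own):

```scala
// sc.textFile: one record per LINE, pooled across all files in the path.
val lines: org.apache.spark.rdd.RDD[String] =
  sc.textFile("movies")

// sc.wholeTextFiles: one record per FILE, as (fileName, fileContent)
// pairs, so each file keeps its identity instead of dissolving into lines.
val files: org.apache.spark.rdd.RDD[(String, String)] =
  sc.wholeTextFiles("movies")
```

wholeTextFiles is handy for many small files whose origin matters (e.g., per-file parsing), while textFile suits large line-oriented data.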
Spark SQL: JdbcRDD

Using JdbcRDD with Spark is slightly confusing, so I thought I'd put together a simple use case to explain the functionality. Most probably you'll use it with spark-submit, but I have put it here in spark-shell to make it easier to illustrate. Database Preparation We are going...
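The confusing part is usually the constructor; here is a hedged sketch against a hypothetical MySQL table `person(id INT, name VARCHAR)` — the JDBC URL, credentials, and bounds are all assumptions, not the post's actual setup:

```scala
// Sketch of JdbcRDD (hypothetical table and connection details).
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass")
  },
  // The query MUST contain exactly two '?' placeholders; Spark fills
  // them with per-partition lower/upper bounds on the key.
  "SELECT id, name FROM person WHERE id >= ? AND id <= ?",
  1,    // lowerBound of the key range
  100,  // upperBound of the key range
  3,    // numPartitions: the range is split three ways
  r => (r.getInt(1), r.getString(2))  // map each ResultSet row to a tuple
)
rdd.collect().foreach(println)
```

The two `?` placeholders are the piece most people miss: without them, every partition would run the same unbounded query.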
Spark SQL: SqlContext vs HiveContext

There are two ways to create a context in Spark SQL: SqlContext: scala> import org.apache.spark.sql._ scala> val sqlContext = new SQLContext(sc) HiveContext: scala> import org.apache.spark.sql.hive._ scala> val hc = new HiveContext(sc) Though most of the...
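The two entry points side by side, as a brief sketch (Spark 1.x; spark-shell provides `sc`):

```scala
// SQLContext: plain Spark SQL, no Hive dependency required.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

// HiveContext: a superset of SQLContext that additionally understands
// HiveQL and Hive UDFs and can talk to the Hive metastore
// (requires the Hive classes on the classpath).
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
```

Because HiveContext is a superset, code written against SQLContext generally runs unchanged against a HiveContext.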