From Spark 1.3 onward, JdbcRDD is not recommended, as DataFrames can load data over JDBC directly.
Let us look at a simple example in this recipe.
JdbcRDD with Spark is slightly confusing, so I thought I would put together a simple use case to explain the functionality. Most probably you will use it with spark-submit, but I am demonstrating it in spark-shell for ease of illustration.
We are going to load a person table from a MySQL database. Here is the DDL:
CREATE TABLE `person` (
  `person_id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(30) DEFAULT NULL,
  `last_name` varchar(30) DEFAULT NULL,
  `gender` char(1) DEFAULT NULL,
  PRIMARY KEY (`person_id`)
)
The following is an example of the kind of data we load:
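The original rows are not reproduced here; as a sketch, a handful of INSERT statements like the following (the names are made up purely for illustration) would populate the table:

INSERT INTO person (first_name, last_name, gender) VALUES ('John', 'Smith', 'M');
INSERT INTO person (first_name, last_name, gender) VALUES ('Jane', 'Doe', 'F');
INSERT INTO person (first_name, last_name, gender) VALUES ('Alex', 'Jones', 'M');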
I am a fan of oversimplification when it comes to examples, as you can see.
We have to make the MySQL JDBC driver JAR available to spark-shell. You can download it from http://dev.mysql.com/downloads/connector/j/ if it's not already available.
Getting our hands dirty on spark-shell
Launch the Spark shell with the MySQL driver on the classpath:
$ spark-shell --driver-class-path /path-to-mysql-jar/mysql-connector-java-*.*.**-bin.jar --jars /path-to-mysql-jar/mysql-connector-java-*.*.**-bin.jar
scala> val personDF = sqlContext.load("jdbc",Map("url" -> "jdbc:mysql://localhost:3306/hadoopdb?user=hduser&password=******","dbtable" -> "person"))
Now query the results:
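A minimal sketch of inspecting and querying the loaded DataFrame with the Spark 1.3 API (the temp table name person is our own choice):

scala> // Print the schema inferred from the JDBC metadata
scala> personDF.printSchema
scala> // Show the first rows of the table
scala> personDF.show
scala> // Register as a temp table so it can be queried with SQL
scala> personDF.registerTempTable("person")
scala> sqlContext.sql("select * from person where gender = 'M'").show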
Write it to HDFS:
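A minimal sketch, assuming HDFS is reachable at hdfs://localhost:9000 and the target directory does not already exist (both are assumptions for illustration):

scala> // Save the DataFrame's underlying RDD as plain text files on HDFS
scala> personDF.rdd.saveAsTextFile("hdfs://localhost:9000/user/hduser/person")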
InfoObjects Spark Sandbox comes fully loaded with Hadoop, Spark, and more than 10 other ecosystem technologies. Please email firstname.lastname@example.org to request download access to the Sandbox.