Spark ML is a newer library that treats DataFrames, rather than RDDs, as its first-class citizens. Spark ML also helps you combine multiple machine learning algorithms into a single pipeline. You can read more about DataFrames here.

Spark ML uses Transformers to turn one DataFrame into another DataFrame, while an Estimator represents a machine learning algorithm that learns from data. The input to an Estimator is therefore a DataFrame, and its output is a Transformer. With Estimators training on the data and Transformers making predictions at each stage, you have the recipe for a machine learning pipeline.
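
To make the Estimator/Transformer contract concrete, here is a minimal sketch in plain Scala. It does not use Spark at all: `ToyTransformer`, `ToyEstimator`, `MeanCenterer`, and the `Table` type are hypothetical stand-ins invented for illustration, and a "table" is simply a list of rows.

```scala
// Toy model of the Estimator/Transformer contract.
// These types are hypothetical stand-ins, NOT Spark's actual classes.
object PipelineSketch {
  // A row maps column names to values; a table is a list of rows.
  type Row = Map[String, Double]
  type Table = List[Row]

  // A transformer turns one table into another.
  trait ToyTransformer {
    def transform(data: Table): Table
  }

  // An estimator learns from a table and returns a transformer.
  trait ToyEstimator {
    def fit(data: Table): ToyTransformer
  }

  // Example estimator: learns the mean of column "x". The fitted
  // transformer appends a column "centered" holding x minus that mean.
  object MeanCenterer extends ToyEstimator {
    def fit(data: Table): ToyTransformer = {
      val mean = data.map(row => row("x")).sum / data.size
      new ToyTransformer {
        def transform(d: Table): Table =
          d.map(row => row + ("centered" -> (row("x") - mean)))
      }
    }
  }
}
```

Fitting `PipelineSketch.MeanCenterer` on rows with x values 1.0 and 3.0 learns a mean of 2.0, so transforming a row with x = 5.0 adds centered = 3.0. The shape is the same as in the recipe that follows: `fit` consumes training data and hands back something whose `transform` adds a column.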

How is this recipe made?
Suppose we want to predict whether someone is a basketball player from their height and weight, using LogisticRegression as the machine learning algorithm. Logistic regression is yet another exciting recipe in the cookbook.
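
As a rough illustration of what a fitted logistic regression model does with a height and weight pair, the sketch below computes a weighted sum of the features, passes it through the sigmoid function, and compares the result with a threshold. The object name `LogisticSketch` and any weights you pass in are assumptions for illustration; they are not values learned by Spark.

```scala
// Minimal logistic regression scoring in plain Scala (no Spark).
// The weights and intercept are supplied by the caller; nothing is trained here.
object LogisticSketch {
  // Sigmoid function: maps any real number into the interval (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of the positive class for one feature vector.
  def probability(weights: Array[Double], intercept: Double,
                  features: Array[Double]): Double = {
    require(weights.length == features.length,
      "weights and features must have the same length")
    val z = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    sigmoid(z)
  }

  // Predict 1.0 when the probability exceeds the threshold, otherwise 0.0.
  def predict(weights: Array[Double], intercept: Double,
              features: Array[Double], threshold: Double = 0.5): Double =
    if (probability(weights, intercept, features) > threshold) 1.0 else 0.0
}
```

With made-up weights `Array(0.1, 0.02)` and intercept `-11.0`, a person who is 80 inches tall and weighs 250 lbs gets a weighted sum of about 2, which the sigmoid maps above 0.5, so `predict` returns 1.0. A person who is 70 inches tall and weighs 150 lbs gets about -1 and is predicted 0.0.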

1. Start the Spark Shell:

 $ spark-shell

2. Do the imports:

scala> import org.apache.spark.mllib.linalg.{Vector, Vectors}

scala> import org.apache.spark.mllib.regression.LabeledPoint

scala> import org.apache.spark.ml.classification.LogisticRegression

3. Create a labeled point for Lebron, who is a basketball player, is 80 inches tall, and weighs 250 lbs:

scala> val lebron = LabeledPoint(1.0, Vectors.dense(80.0, 250.0))

4. Create a labeled point for Tim, who is not a basketball player, is 70 inches tall, and weighs 150 lbs:

scala> val tim = LabeledPoint(0.0, Vectors.dense(70.0, 150.0))

5. Create a labeled point for Brittany, who is a basketball player, is 80 inches tall, and weighs 207 lbs:

scala> val brittany = LabeledPoint(1.0, Vectors.dense(80.0, 207.0))

6. Create a labeled point for Stacey, who is not a basketball player, is 65 inches tall, and weighs 120 lbs:

scala> val stacey = LabeledPoint(0.0, Vectors.dense(65.0, 120.0))

7. Create the training RDD:

scala> val trainingRDD = sc.parallelize(List(lebron, tim, brittany, stacey))

8. Create a training DataFrame:

scala> val trainingDF = trainingRDD.toDF

9. Create a LogisticRegression estimator:

scala> val estimator = new LogisticRegression

10. Create a transformer by fitting the estimator to the training DataFrame:

scala> val transformer = estimator.fit(trainingDF)

11. Now, let’s create some test data. John is 90 inches tall, weighs 270 lbs, and is a basketball player:

scala> val john = Vectors.dense(90.0, 270.0)

12. Create another test record. Tom is 62 inches tall, weighs 150 lbs, and is not a basketball player:

scala> val tom = Vectors.dense(62.0, 150.0)

13. Create the test RDD:

scala> val testRDD = sc.parallelize(List(john, tom))

14. Create a features case class:

scala> case class Feature(v: Vector)

15. Map the testRDD to an RDD for features:

scala> val featuresRDD = testRDD.map(v => Feature(v))

16. Convert featuresRDD into a DataFrame with column name “features”:

scala> val featuresDF = featuresRDD.toDF("features")

17. Transform featuresDF by adding the predictions column to it:

scala> val predictionsDF = transformer.transform(featuresDF)

18. Print the predictionsDF:

scala> predictionsDF.foreach(println)

19. As you can see, predictionsDF contains three new columns (rawPrediction, probability, and prediction) in addition to features. Let’s select only features and prediction:

scala> val shorterpredictionsDF = predictionsDF.select("features", "prediction")

20. Rename the prediction column to isbasketballplayer:

scala> val playerDF = shorterpredictionsDF.toDF("features", "isbasketballplayer")

21. Print the schema for playerDF:

scala> playerDF.printSchema
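
The steps above target the spark-shell of an older Spark release, where spark.ml estimators accepted mllib vectors. From Spark 2.0 onward, spark.ml estimators expect vectors from `org.apache.spark.ml.linalg` and the entry point is `SparkSession`. The following is a sketch of the same recipe as a standalone program under those assumptions. The object name `BasketballSketch`, the application name, and the `local[*]` master are illustrative choices, not part of the original recipe.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Sketch of the recipe for Spark 2.x or later; names here are illustrative.
object BasketballSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val spark = SparkSession.builder()
      .appName("BasketballSketch")
      .master("local[*]")
      .getOrCreate()

    // Training data: label (1.0 = basketball player), then height and weight.
    val trainingDF = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(80.0, 250.0)), // Lebron
      (0.0, Vectors.dense(70.0, 150.0)), // Tim
      (1.0, Vectors.dense(80.0, 207.0)), // Brittany
      (0.0, Vectors.dense(65.0, 120.0))  // Stacey
    )).toDF("label", "features")

    // Fitting the estimator yields a transformer (the trained model).
    val transformer = new LogisticRegression().fit(trainingDF)

    // Test data: a single "features" column holding John and Tom.
    val featuresDF = spark.createDataFrame(Seq(
      Vectors.dense(90.0, 270.0), // John
      Vectors.dense(62.0, 150.0)  // Tom
    ).map(Tuple1.apply)).toDF("features")

    // Add predictions, keep two columns, and rename the prediction column.
    val playerDF = transformer.transform(featuresDF)
      .select("features", "prediction")
      .withColumnRenamed("prediction", "isbasketballplayer")

    playerDF.show()
    playerDF.printSchema()

    spark.stop()
  }
}
```

Building the DataFrames from tuples avoids the LabeledPoint and Feature case classes, and `withColumnRenamed` renames only the one column instead of restating every column name through `toDF`.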

 

 

 

 
