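Correlation is a relationship between two variables in which a change in one is accompanied by a change in the other. The correlation coefficient measures the strength and direction of this relationship on a scale from -1 to 1. A value of 0 means there is no linear correlation at all, 1 means a perfect positive correlation (the points lie exactly on an upward-sloping straight line, for example when the second variable doubles every time the first one doubles), and -1 means a perfect negative correlation.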
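To make the definition concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Scala, with no Spark involved. The object name PearsonSketch, the helper pearson, and the toy data are illustrative assumptions and are not part of the recipe or of MLlib.

```scala
object PearsonSketch {
  // Pearson correlation coefficient of two equally sized samples.
  // The result lies in [-1, 1]; it is NaN if either sample has zero variance.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.size == ys.size, "samples must be non-empty and of equal size")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val dx = xs.map(_ - meanX)                 // deviations from the mean of x
    val dy = ys.map(_ - meanY)                 // deviations from the mean of y
    val cov  = dx.zip(dy).map { case (a, b) => a * b }.sum
    val varX = dx.map(d => d * d).sum
    val varY = dy.map(d => d * d).sum
    cov / math.sqrt(varX * varY)
  }

  // Toy data (hypothetical): one series that moves with x, one that moves against it.
  val x: Seq[Double]       = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
  val rising: Seq[Double]  = x.map(v => 2.0 * v)
  val falling: Seq[Double] = x.map(v => 10.0 - v)
}
```

Calling PearsonSketch.pearson(PearsonSketch.x, PearsonSketch.rising) should give a value close to 1, and using PearsonSketch.falling instead should give a value close to -1.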

In this recipe, we will cover how to calculate a correlation matrix using an RDD of Vectors.

  1. Let’s get the imports out of the way:
     scala> import org.apache.spark.mllib.linalg._
     scala> import org.apache.spark.mllib.stat.Statistics
  2. Create an RDD of Vectors containing house size and house price:
     scala> val sp = sc.parallelize(List(
              // each vector is (house size, house price)
              Vectors.dense(2100, 1620000),
              Vectors.dense(2300, 1690000),
              Vectors.dense(2046, 1400000),
              Vectors.dense(4314, 2000000),
              Vectors.dense(1244, 1060000),
              Vectors.dense(4608, 3830000),
              Vectors.dense(2173, 1230000),
              Vectors.dense(2750, 2400000),
              Vectors.dense(4010, 3280000),
              Vectors.dense(1959, 1480000)
            ))
  3. Calculate the correlation matrix (Statistics.corr uses the Pearson method by default):
     scala> val corr = Statistics.corr(sp)
  4. Print the matrix corr:
     scala> println(corr)
     1.0                 0.8593856666378791
     0.8593856666378791  1.0
  5. Let’s put it in a more understandable format:
                  house size          house price
     house size   1.0                 0.8593856666378791
     house price  0.8593856666378791  1.0
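
As a cross-check that does not need Spark, the following minimal sketch recomputes the matrix from the same house data in plain Scala. It assumes the Pearson method, which is the default for Statistics.corr. The names CorrelationMatrixCheck, pearson, and correlationMatrix are illustrative and are not part of MLlib.

```scala
object CorrelationMatrixCheck {
  // The two columns of the recipe's vectors, split out as plain sequences.
  val houseSize: Seq[Double] =
    Seq(2100, 2300, 2046, 4314, 1244, 4608, 2173, 2750, 4010, 1959).map(_.toDouble)
  val housePrice: Seq[Double] =
    Seq(1620000, 1690000, 1400000, 2000000, 1060000,
        3830000, 1230000, 2400000, 3280000, 1480000).map(_.toDouble)

  // Pearson correlation of two equally sized, non-constant samples.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.size == ys.size, "samples must be non-empty and of equal size")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val dx = xs.map(_ - meanX)
    val dy = ys.map(_ - meanY)
    val cov = dx.zip(dy).map { case (a, b) => a * b }.sum
    cov / math.sqrt(dx.map(d => d * d).sum * dy.map(d => d * d).sum)
  }

  // Entry (i)(j) is the correlation between column i and column j,
  // so the matrix is symmetric with values close to 1 on the diagonal.
  def correlationMatrix(columns: Seq[Seq[Double]]): Seq[Seq[Double]] =
    columns.map(a => columns.map(b => pearson(a, b)))

  val matrix: Seq[Seq[Double]] = correlationMatrix(Seq(houseSize, housePrice))
}
```

Printing the rows with CorrelationMatrixCheck.matrix.foreach(row => println(row.mkString("  "))) should show off-diagonal entries that agree with the Spark result above to within floating-point rounding.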