
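Correlation is a relationship between two variables such that when one changes, the other tends to change as well. The correlation coefficient measures the strength of this relationship on a scale from -1 to 1. A value of 0 means there is no linear correlation at all, 1 means perfect positive correlation (for example, if the first variable doubles, the second also doubles), and -1 means perfect negative correlation (one variable falls as the other rises).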
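
To make the definition concrete, here is a minimal sketch in plain Scala (no Spark required) of the Pearson coefficient, which is the measure Spark's Statistics.corr computes by default. The helper name pearson is made up for illustration and is not part of any library; the sample values are the house sizes and prices used in this recipe.

```scala
// Hypothetical helper (not a Spark API): Pearson correlation of two equal-length samples.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.nonEmpty && xs.size == ys.size, "samples must be non-empty and of equal length")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  // Deviation of each observation from its mean
  val dx = xs.map(_ - meanX)
  val dy = ys.map(_ - meanY)
  // Sum of paired deviation products, divided by the product of the two spreads
  val cov = dx.zip(dy).map { case (a, b) => a * b }.sum
  val spreadX = math.sqrt(dx.map(a => a * a).sum)
  val spreadY = math.sqrt(dy.map(b => b * b).sum)
  cov / (spreadX * spreadY)
}

// House sizes and house prices from this recipe
val sizes  = Seq[Double](2100, 2300, 2046, 4314, 1244, 4608, 2173, 2750, 4010, 1959)
val prices = Seq[Double](1620000, 1690000, 1400000, 2000000, 1060000,
                         3830000, 1230000, 2400000, 3280000, 1480000)

val r = pearson(sizes, prices)
println(r)
```

The printed value should agree with the off-diagonal entry of the correlation matrix in this recipe (about 0.859), up to floating-point rounding.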

In this recipe, we will cover how to calculate a correlation matrix using an RDD of Vectors.

1. Let’s get imports out of our way
scala> import org.apache.spark.mllib.linalg._
scala> import org.apache.spark.mllib.stat.Statistics

2. Create an RDD of Vector containing house size and house price
scala> val sp = sc.parallelize(List(
Vectors.dense(2100,1620000),
Vectors.dense(2300,1690000),
Vectors.dense(2046,1400000),
Vectors.dense(4314,2000000),
Vectors.dense(1244,1060000),
Vectors.dense(4608,3830000),
Vectors.dense(2173,1230000),
Vectors.dense(2750,2400000),
Vectors.dense(4010,3280000),
Vectors.dense(1959,1480000)
))

3. Calculate the correlation matrix (Statistics.corr uses the Pearson method by default)
scala> val corr = Statistics.corr(sp)

4. Print the matrix corr
scala> println(corr)
1.0                 0.8593856666378791
0.8593856666378791  1.0

5. Let’s put it in a more understandable format
              house size          house price
house size    1.0                 0.8593856666378791
house price   0.8593856666378791  1.0
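
The labelled layout in step 5 can also be generated in code. The following is only a sketch: labelled is a hypothetical helper, not a Spark API, and it takes the matrix as a plain lookup function so that it runs without Spark. The column widths are arbitrary choices.

```scala
// Hypothetical helper: render a square matrix with row and column labels.
// cell(i, j) must return the entry at row i, column j.
def labelled(names: Seq[String], cell: (Int, Int) => Double): String = {
  val labelWidth = names.map(_.length).max + 2
  val cellWidth = 20
  // Left-justify s in a field of width w
  def pad(s: String, w: Int): String = ("%-" + w + "s").format(s)
  val header = pad("", labelWidth) + names.map(pad(_, cellWidth)).mkString
  val rows = names.indices.map { i =>
    pad(names(i), labelWidth) +
      names.indices.map(j => pad(cell(i, j).toString, cellWidth)).mkString
  }
  // Drop trailing padding on each line
  (header +: rows).map(_.replaceAll("\\s+$", "")).mkString("\n")
}

// Values copied from the matrix printed in step 4
val values = Array(
  Array(1.0, 0.8593856666378791),
  Array(0.8593856666378791, 1.0)
)
val table = labelled(Seq("house size", "house price"), (i, j) => values(i)(j))
println(table)
```

In the Spark shell, the same helper could be given the matrix directly, for example labelled(Seq("house size", "house price"), (i, j) => corr(i, j)), since an MLlib Matrix supports lookup by row and column index.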