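Correlation is a relationship between two variables such that when one changes, the other tends to change as well. The correlation coefficient measures the strength and direction of this linear relationship on a scale from -1 to 1. A value of 0 means there is no linear correlation at all, 1 means perfect positive correlation (for example, the second variable is always a fixed multiple of the first, so if the first doubles, the second also doubles), and -1 means perfect negative correlation.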
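
For reference, the most common measure, and the default used by Spark's `Statistics.corr`, is Pearson's coefficient. As a sketch of the standard definition, assume $n$ paired observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$ (these symbols are introduced here for illustration and do not appear elsewhere in the recipe):

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator captures how the two variables move together around their means, and the denominator rescales it by the spread of each variable, which is why the result does not depend on the units of measurement.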

In this recipe, we will cover how to calculate a correlation matrix using an RDD of Vectors.

1. Let’s get the imports out of the way:
```
scala> import org.apache.spark.mllib.linalg._
scala> import org.apache.spark.mllib.stat.Statistics
```
2. Create an RDD of vectors, each containing a house size and a house price:
```
scala> val sp = sc.parallelize(List(
         Vectors.dense(2100, 1620000),
         Vectors.dense(2300, 1690000),
         Vectors.dense(2046, 1400000),
         Vectors.dense(4314, 2000000),
         Vectors.dense(1244, 1060000),
         Vectors.dense(4608, 3830000),
         Vectors.dense(2173, 1230000),
         Vectors.dense(2750, 2400000),
         Vectors.dense(4010, 3280000),
         Vectors.dense(1959, 1480000)
       ))
```
3. Calculate the correlation matrix:
`scala> val corr = Statistics.corr(sp)`
4. Print the matrix:
```
scala> println(corr)
1.0                 0.8593856666378791
0.8593856666378791  1.0
```
5. Let’s put it in a more understandable format:

| | house size | house price |
|---|---|---|
| house size | 1.0 | 0.8593856666378791 |
| house price | 0.8593856666378791 | 1.0 |
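
As a sanity check, the off-diagonal coefficient can be reproduced without Spark. The following is a minimal sketch in plain Scala that applies the textbook Pearson formula to the same house sizes and prices. The `PearsonCheck` object and the `pearson` helper are illustrative names made up for this example, not part of MLlib, and the sketch is not meant to reflect how Spark computes the result internally.

```scala
// Plain-Scala cross-check of the Pearson coefficient; no Spark required.
// `PearsonCheck` and `pearson` are hypothetical names, not MLlib API.
object PearsonCheck {

  // Pearson's r: the sum of products of deviations from the means,
  // divided by the square root of the product of the summed squared deviations.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.size == ys.size,
      "inputs must be non-empty and of equal length")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val dx = xs.map(_ - meanX)   // deviations of x from its mean
    val dy = ys.map(_ - meanY)   // deviations of y from its mean
    val sumXY = dx.zip(dy).map { case (a, b) => a * b }.sum
    val sumXX = dx.map(a => a * a).sum
    val sumYY = dy.map(b => b * b).sum
    sumXY / math.sqrt(sumXX * sumYY)
  }

  // The same house sizes and prices used in the recipe.
  val sizes: Seq[Double] = Seq[Double](
    2100, 2300, 2046, 4314, 1244, 4608, 2173, 2750, 4010, 1959)
  val prices: Seq[Double] = Seq[Double](
    1620000, 1690000, 1400000, 2000000, 1060000,
    3830000, 1230000, 2400000, 3280000, 1480000)

  def main(args: Array[String]): Unit = {
    // Print the coefficient rounded to four decimal places.
    println(f"r = ${pearson(sizes, prices)}%.4f")
  }
}
```

Running `PearsonCheck.main` should give a value of about 0.8594, in agreement with the off-diagonal entries of the matrix returned by `Statistics.corr`. The diagonal entries are 1.0 because every variable is perfectly correlated with itself.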