Let’s represent a house as a vector with the following features:

Square footage
Last sold price
Lot size
Number of rooms
Number of bathrooms
Year built
Zip code
Tennis court
Pool
Jacuzzi
Sports court

Now let’s assign some values:

Square footage => 4450

Last sold price => 2600000

Lot size => 40000

Number of rooms => 4

Number of bathrooms => 4

Year built => 1978

Zip code => 95070

Tennis court => YES

Pool => YES

Jacuzzi => YES

Sports court => NO

Now let’s build a dense vector for this house.

$ spark-shell

scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}

Now we have an issue here. MLlib expects every feature to be represented as a Double, so we’ll convert the nominal values to doubles: 1.0 for YES and 0.0 for NO.

scala> val myHouse = Vectors.dense(4450d,2600000d,40000d,4.0,4.0,1978.0,95070d,1.0,1.0,1.0,0.0)
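
The YES/NO conversion above was done by hand. A minimal sketch of the same step with a small helper, in plain Scala, might look like this. The names `yesNoToDouble`, `amenities`, and `houseFeatures` are hypothetical and not part of MLlib, and the numeric values are the ones listed for the first house.

```scala
// Hypothetical helper (not part of MLlib): encode a YES/NO flag as a Double.
def yesNoToDouble(flag: String): Double =
  if (flag.trim.equalsIgnoreCase("YES")) 1.0 else 0.0

// Flags in feature order: tennis court, pool, jacuzzi, sports court.
val amenities: Seq[Double] = Seq("YES", "YES", "YES", "NO").map(yesNoToDouble)

// Numeric features first, then the encoded flags, 11 values in total.
val houseFeatures: Array[Double] =
  Array(4450d, 2600000d, 40000d, 4.0, 4.0, 1978.0, 95070d) ++ amenities
```

Since `Vectors.dense` also accepts an `Array[Double]`, the resulting array could be passed to it directly as `Vectors.dense(houseFeatures)`.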

Now let’s take another house that lacks many of these features. Does it make sense to store a value for every feature, or should we store only the non-zero values?
Let’s look at the second house.

Square footage => 1650

Last sold price => 500000

Lot size => 800

Number of rooms => 3

Number of bathrooms => 3

Year built => 2009

Zip code => 95054

Tennis court => NO

Pool => NO

Jacuzzi => NO

Sports court => NO

The best way to represent it is as a sparse vector, which stores only the indices and values of the non-zero entries:

scala> Vectors.sparse(11,Array(0,1,2,3,4,5,6),Array(1650d,500000d,800d,3.0,3.0,2009.0,95054d))
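
To make the sparse representation concrete, here is a sketch that builds the second house as a sparse vector and expands it back to a plain array. It assumes the same 11-feature layout as the first house (so the vector size is 11, with the four amenity flags left as implied zeros) and uses the listed sale price of 500000. The names `anotherHouse`, `fromPairs`, and `expanded` are illustrative only.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// 11 features in total; only positions 0 to 6 hold non-zero values.
val anotherHouse: Vector = Vectors.sparse(
  11,
  Array(0, 1, 2, 3, 4, 5, 6),
  Array(1650d, 500000d, 800d, 3.0, 3.0, 2009d, 95054d))

// Equivalent construction from (index, value) pairs.
val fromPairs: Vector = Vectors.sparse(11, Seq(
  (0, 1650d), (1, 500000d), (2, 800d), (3, 3.0),
  (4, 3.0), (5, 2009d), (6, 95054d)))

// Expanding to an array fills the unstored positions (7 to 10) with 0.0.
val expanded: Array[Double] = anotherHouse.toArray
```

Both constructors describe the same vector; the pair-based form can be easier to read when only a few scattered indices are set.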