Let’s represent a house as a vector with the following features:

Square footage
Last sold price
Lot size
Number of rooms
Number of bathrooms
Year built
Zip code
Tennis court
Pool
Jacuzzi
Sports court
Now let’s assign some values:

Square footage => 4450

Last sold price => 2600000

Lot size => 40000

Number of rooms => 4

Number of bathrooms => 4

Year built => 1978

Zip code => 95070

Tennis court => Yes

Pool => Yes

Jacuzzi => Yes

Sports court => No

Now let’s build a dense vector for this house.

$ spark-shell

scala> import org.apache.spark.mllib.linalg.{Vector, Vectors}

Now we have an issue here: MLlib expects every feature to be represented as a Double, so we’ll convert the nominal Yes/No values to doubles (Yes => 1.0, No => 0.0).
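As a quick sketch of that conversion (flagToDouble is our own hypothetical helper, not part of MLlib):

scala> def flagToDouble(flag: String): Double = if (flag.equalsIgnoreCase("yes")) 1.0 else 0.0

scala> flagToDouble("Yes")   // 1.0
scala> flagToDouble("No")    // 0.0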

scala> val myHouse = Vectors.dense(4450d, 2600000d, 40000d, 4.0, 4.0, 1978.0, 95070d, 1.0, 1.0, 1.0, 0.0)
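As a sanity check, the vector can be inspected directly in the shell; size and apply(i) are standard methods on MLlib’s Vector:

scala> myHouse.size   // 11, one slot per feature
scala> myHouse(2)     // 40000.0, the lot size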

Now let’s take another house, one that does not have many of these features. Does it make sense to assign a value to every slot, or should we store only the non-zero values? Consider this house:

Square footage => 1650

Last sold price => 500000

Lot size => 800

Number of rooms => 3

Number of bathrooms => 3

Year built => 2009

Zip code => 95054

Tennis court => No

Pool => No

Jacuzzi => No

Sports court => No

The best way to represent it is as a sparse vector, which stores only the vector’s size (11), the indices of the non-zero entries, and the values at those indices. The four amenity features are all No (0.0), so only the first seven slots need to be stored:

scala> val smallHouse = Vectors.sparse(11, Array(0, 1, 2, 3, 4, 5, 6), Array(1650d, 500000d, 800d, 3.0, 3.0, 2009.0, 95054.0))
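As a quick check (using the val name smallHouse introduced above), toArray expands the sparse vector back to the full 11-slot layout; the four amenity slots come back as 0.0:

scala> smallHouse.toArray   // Array(1650.0, 500000.0, 800.0, 3.0, 3.0, 2009.0, 95054.0, 0.0, 0.0, 0.0, 0.0)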