Apache Spark 2: Data Processing and Real-Time Analytics

VectorAssembler

Before we start with the actual machine learning algorithm, we need to apply one final transformation: we have to create one additional feature column that combines all the columns we want the machine learning algorithm to consider. This is done by org.apache.spark.ml.feature.VectorAssembler as follows:

import org.apache.spark.ml.feature.VectorAssembler

// Combine the listed input columns into a single vector column named "features"
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "field2", "field3", "field4"))
  .setOutputCol("features")

This transformer adds exactly one column, called features, to the resulting DataFrame; the column is of the org.apache.spark.ml.linalg.Vector type. In other words, the features column created by the VectorAssembler contains the values of all the defined input columns (in this case, colorVec, field2, field3, and field4) encoded in a single vector object for each row. This is the input format that the Apache Spark ML algorithms expect.
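To make the effect concrete, the following is a minimal sketch of applying the assembler to a small DataFrame, meant for spark-shell or a Scala script. The sample rows, the local SparkSession settings, and the name assembled are assumptions made for illustration; only the column names and the VectorAssembler configuration come from the text above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Assumption: a local session for experimentation (spark-shell already provides one)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("VectorAssemblerSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: colorVec stands in for an already encoded
// two-element vector column, the other fields are plain numeric columns
val df = Seq(
  (Vectors.dense(1.0, 0.0), 1.0, 2.0, 3.0),
  (Vectors.dense(0.0, 1.0), 4.0, 5.0, 6.0)
).toDF("colorVec", "field2", "field3", "field4")

val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "field2", "field3", "field4"))
  .setOutputCol("features")

// transform() returns a new DataFrame with the features column appended;
// the original columns are kept unchanged
val assembled = vectorAssembler.transform(df)
assembled.printSchema()
assembled.select("features").show(false)
```

With this sample data, each features vector has five elements: the two entries of colorVec followed by field2, field3, and field4, in the order given to setInputCols.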