scikit-learn Cookbook (Second Edition)

There's more...

In the preceding multi-output regression, you may be concerned about the dummy variable trap: collinearity among the output columns. If you keep all three output columns, you implicitly allow a fourth option: that a flower is none of the three types. To avoid the trap, drop the last column and assume that every flower must be one of the three types. This is a safe assumption here, because no training example falls outside the three flower types.
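The column drop described above can be sketched as follows. This is a minimal illustration, not the recipe's own code: it assumes the iris data loaded with load_iris and builds the indicator columns with np.eye rather than with whichever encoder the preceding recipe used. The names Y_full, Y_reduced, and implied_last are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
y = iris.target  # integer labels 0, 1, 2 for the three flower types

# One indicator column per flower type (assumed encoding, for illustration).
Y_full = np.eye(3)[y]

# Each row sums to 1, so any one column is determined by the other two.
row_sums = Y_full.sum(axis=1)

# Drop the last column to remove the redundancy.
Y_reduced = Y_full[:, :-1]

# The dropped column is still recoverable: a row of zeros in Y_reduced
# means the flower belongs to the third type.
implied_last = 1 - Y_reduced.sum(axis=1)

print(Y_full.shape, Y_reduced.shape)
```

A regression fitted on Y_reduced predicts two outputs; under the assumption that every flower is one of the three types, the third follows from the other two.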

There are other ways to create categorical variables in scikit-learn and Python. The DictVectorizer class is a good option if you would like to limit your project's dependencies to scikit-learn alone and your encoding scheme is fairly simple. If you require more sophisticated categorical encoding, patsy is a very good option.
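As a rough sketch of the DictVectorizer option, the example below encodes a few hand-written records. The records and their field names (species, petal_width) are made up for illustration and are not taken from the recipe.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one string-valued field and one numeric field.
records = [
    {"species": "setosa", "petal_width": 0.2},
    {"species": "versicolor", "petal_width": 1.3},
    {"species": "virginica", "petal_width": 2.0},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

# String values become one indicator column per category;
# numeric values pass through unchanged.
feature_names = dv.feature_names_

print(feature_names)
print(X)
```

DictVectorizer keeps one column for every category it sees, so if the encoded columns feed a linear model as inputs, the same dummy variable consideration applies and you may want to drop one of them yourself.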