scikit-learn Cookbook (Second Edition)
How to do it...

As with scaling, there are two ways to binarize features in scikit-learn:

  • preprocessing.binarize
  • preprocessing.Binarizer

The Boston dataset's target variable is the median value of houses, in thousands of dollars. This dataset is good for testing regression and other continuous predictors, but consider a situation where we simply want to predict whether a house's value is above the overall mean.

  1. To do this, we use the mean as the threshold value. If a value is greater than the mean, produce a 1; otherwise (including a value exactly equal to the mean), produce a 0:
from sklearn import preprocessing
# y is assumed to hold the target as a 2D column, for example boston.target.reshape(-1, 1)
new_target = preprocessing.binarize(y, threshold=boston.target.mean())
new_target[:5]

array([[ 1.],
       [ 0.],
       [ 1.],
       [ 1.],
       [ 1.]])
  2. This was easy, but let's check to make sure it worked correctly:
(y[:5] > y.mean()).astype(int)

array([[1], [0], [1], [1], [1]])
  3. Given how simple this operation is in NumPy, it's fair to ask why you would use scikit-learn's built-in functionality. Pipelines, covered in the Putting it all together with pipelines recipe, help to explain this; in anticipation, let's use the Binarizer class. Note that recent scikit-learn releases require threshold to be passed as a keyword argument:
binar = preprocessing.Binarizer(threshold=y.mean())
new_target = binar.fit_transform(y)
new_target[:5]

array([[ 1.], [ 0.], [ 1.], [ 1.], [ 1.]])
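
The steps above can be condensed into one self-contained sketch. It is only an illustration under stated assumptions: the data is a synthetic stand-in (the names values, labels, and model are hypothetical, not from the recipe), and BernoulliNB is chosen here merely as a convenient classifier for 0/1 features. It checks that the function, the class, and the NumPy comparison agree, shows that a value equal to the threshold maps to 0, and uses Binarizer as a pipeline step:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, binarize

# Synthetic stand-in for a continuous column; this is NOT the Boston data.
rng = np.random.default_rng(0)
values = rng.normal(loc=22.0, scale=9.0, size=(100, 1))
threshold = values.mean()

# The function, the class, and the plain NumPy comparison should agree.
by_function = binarize(values, threshold=threshold)
by_class = Binarizer(threshold=threshold).fit_transform(values)
by_numpy = (values > threshold).astype(float)

# The comparison is strict: a value equal to the threshold maps to 0.
edge = binarize(np.array([[1.0], [2.0], [3.0]]), threshold=2.0)

# Because Binarizer is a transformer, it can serve as a pipeline step.
# The labels are derived from the same threshold purely for illustration.
labels = by_numpy.ravel().astype(int)
model = Pipeline([
    ("binarize", Binarizer(threshold=threshold)),
    ("classify", BernoulliNB()),
])
model.fit(values, labels)
predictions = model.predict(values)
```

Binarizer learns nothing during fit, so fit_transform here behaves like the binarize function; the class form mainly matters because a pipeline expects objects with fit and transform methods.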