
How to do it...

Similar to scaling, there are two ways to binarize features in scikit-learn:

preprocessing.binarize
preprocessing.Binarizer

The Boston dataset's target variable is the median value of houses in thousands of dollars. This dataset is good for testing regression and other continuous predictors, but consider a situation where we simply want to predict whether a house's value is more than the overall mean.

To do this, we use the mean as the threshold value. If a value is greater than the mean, produce a 1; if it is less than or equal to the mean, produce a 0:

from sklearn import preprocessing
# y is assumed to hold boston.target as a 2D column, as binarize expects 2D input
new_target = preprocessing.binarize(y, threshold=boston.target.mean())
new_target[:5]

array([[ 1.],
       [ 0.],
       [ 1.],
       [ 1.],
       [ 1.]])

This was easy, but let's check to make sure it worked correctly:

(y[:5] > y.mean()).astype(int)

array([[1],
       [0],
       [1],
       [1],
       [1]])
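
The same check can be sketched without the Boston data. The values below are made up for illustration (they are not taken from the dataset), and the names sk_result, np_result, and boundary are hypothetical. The sketch also shows what happens to a value that sits exactly on the threshold:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical house values in thousands of dollars, shaped as one column
y = np.array([[24.0], [21.6], [34.7], [33.4], [36.2], [10.0]])
threshold = y.mean()

# scikit-learn: values strictly greater than the threshold become 1
sk_result = preprocessing.binarize(y, threshold=threshold)

# NumPy equivalent of the same rule
np_result = (y > threshold).astype(int)

# The two agree element by element (float 1.0 compares equal to int 1)
agree = np.array_equal(sk_result, np_result)
print(agree)

# A value equal to the threshold is mapped to 0, not 1
boundary = preprocessing.binarize(np.array([[1.0, 2.0, 3.0]]), threshold=2.0)
print(boundary)
```

Note that binarize returns an array with the input's floating-point dtype, while the NumPy comparison cast to int returns integers; the values match even though the dtypes differ.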

Given how simple the operation is in NumPy, it's fair to ask why you would use scikit-learn's built-in functionality at all. The answer lies in pipelines, covered in the Putting it all together with pipelines recipe, where a transformer object can be chained with other steps. In anticipation of this, let's use the Binarizer class:

binar = preprocessing.Binarizer(threshold=y.mean())
new_target = binar.fit_transform(y)
new_target[:5]

array([[ 1.],
       [ 0.],
       [ 1.],
       [ 1.],
       [ 1.]])
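
To give a rough idea of why the class form matters, here is a minimal sketch of a Binarizer used as a pipeline step. Everything in it is an assumption made for illustration: the data is synthetic, the step names are arbitrary, and BernoulliNB is just one estimator that happens to work well with 0/1 features. Note also that a pipeline transforms the feature matrix, so this sketch binarizes features rather than the target used above:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Synthetic data: three continuous features, label driven by the first one
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
labels = (X[:, 0] > 0).astype(int)

# The Binarizer is applied to X automatically before the model sees it
pipe = Pipeline([
    ("binarize", Binarizer(threshold=0.0)),
    ("model", BernoulliNB()),
])
pipe.fit(X, labels)

# New data passes through the same thresholding at prediction time
preds = pipe.predict(X[:5])
print(preds)
```

Because the threshold lives inside the pipeline, the same preprocessing is applied at both fit and predict time without any extra bookkeeping, which is harder to guarantee with a one-off NumPy comparison.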