scikit-learn Cookbook (Second Edition)

Getting ready

The first thing to do when learning how to impute missing values is to create some missing values. NumPy's Boolean masking makes this simple:

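from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
iris_X = iris.data
# Boolean mask with the same shape as the data; each entry is True with probability 0.25
masking_array = np.random.binomial(1, 0.25, iris_X.shape).astype(bool)
# Overwrite the masked entries with NaN to simulate missing values
iris_X[masking_array] = np.nan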

To unravel this a bit, in case NumPy isn't too familiar: NumPy lets you index an array with another array. So, to create the random missing data, we build a random Boolean array with the same shape as the iris dataset, and then assign np.nan through that mask, which changes only the entries where the mask is True. Because the mask is random, your masking_array will almost certainly differ from the one shown here.
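As a minimal sketch of the mask assignment described above, the following uses a tiny hand-written array and mask instead of the iris data. The names X and mask and their values are illustrative assumptions, not part of the recipe:

```python
import numpy as np

# A small hypothetical array standing in for the iris data.
X = np.arange(6, dtype=float).reshape(2, 3)

# Boolean mask of the same shape; True marks the entries to blank out.
mask = np.array([[True, False, False],
                 [False, False, True]])

# Assignment through the mask touches only the True positions.
X[mask] = np.nan

# Count how many entries are now missing.
n_missing = int(np.isnan(X).sum())
print(X)
print(n_missing)
```

Only the two positions where mask is True become NaN; every other entry keeps its original value.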

To make sure this works, inspect the first five rows of the mask and of the data (since we're using a random mask, your output will probably not match exactly):

masking_array[:5]

array([[ True, False, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False],
       [False, False, False,  True]], dtype=bool)

iris_X[:5]

array([[ nan,  3.5,  1.4,  nan],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ nan,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  nan]])
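Comparing the two outputs by eye works for five rows. One possible way to check the whole array, and to make the mask reproducible, is sketched below. It is an assumption-laden variant rather than the recipe's own code: it uses a seeded generator from np.random.default_rng, a random stand-in array named X with the same shape as the iris data (150 by 4), and a copy so that the original values stay intact.

```python
import numpy as np

# Seeded generator so the mask is the same on every run (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Hypothetical stand-in with the same shape as the iris feature matrix.
X = rng.normal(size=(150, 4))

# Work on a copy so the original array keeps its values.
X_missing = X.copy()

# Each entry is masked with probability 0.25.
mask = rng.binomial(1, 0.25, size=X.shape).astype(bool)
X_missing[mask] = np.nan

# The NaN positions should coincide exactly with the mask.
mask_matches = np.array_equal(np.isnan(X_missing), mask)
print(mask_matches)

# Fraction of entries that ended up missing; expected to be near 0.25.
print(mask.mean())
```

Working on a copy is a design choice worth considering here: assigning directly into iris.data, as the recipe does, overwrites the loaded values, so the dataset has to be reloaded to get them back.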