scikit-learn Cookbook (Second Edition)

How it works...

Let's walk through how scikit-learn produces the regression dataset by looking at the source code (with some modifications for clarity). Any variable not defined here is assumed to take its default value from make_regression.

It's actually surprisingly simple to follow. First, a random array is generated with the size specified when the function is called:

X = np.random.randn(n_samples, n_features)

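Given the basic dataset, the ground-truth coefficients used to build the targets are then generated. Only the first n_informative features receive nonzero coefficients: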

ground_truth = np.zeros((n_features, n_targets))
ground_truth[:n_informative, :] = 100*np.random.rand(n_informative, n_targets)
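The two steps so far can be tried on their own. The following is a minimal sketch, not the library's code: it fills in the documented make_regression defaults (n_samples=100, n_features=100, n_informative=10, n_targets=1) and uses a seeded np.random.default_rng generator, which is an assumption made here for reproducibility.

```python
import numpy as np

# Values matching make_regression's documented defaults
n_samples, n_features, n_informative, n_targets = 100, 100, 10, 1

# Seeded generator: an assumption of this sketch, chosen for reproducibility
rng = np.random.default_rng(0)

# Feature matrix: one row per sample, one column per feature
X = rng.standard_normal((n_samples, n_features))

# Coefficient matrix: one row per feature, one column per target.
# Only the first n_informative rows are filled; the rest stay at zero.
ground_truth = np.zeros((n_features, n_targets))
ground_truth[:n_informative, :] = 100 * rng.random((n_informative, n_targets))

print(X.shape, ground_truth.shape)
```

Note that the coefficient matrix has n_features rows, so that it can be multiplied by X in the next step.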

The dot product of X and ground_truth is taken to get the final target values. Bias, if any, is added at this time:

y = np.dot(X, ground_truth) + bias

The dot product is simply a matrix multiplication. So, the final target array y has n_samples rows, one for each row of the dataset, and n_targets columns, one for each target variable.
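A small example makes the shapes concrete. This is an illustrative sketch with hand-picked coefficients (the sizes and values are assumptions, not library defaults): five samples, three features, and two targets, with the third feature left uninformative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)

X = rng.standard_normal((5, 3))  # 5 samples, 3 features

# 3 features by 2 targets; the last feature has zero weight for both targets
ground_truth = np.array([[2.0, 0.0],
                         [0.0, 1.0],
                         [0.0, 0.0]])

y = np.dot(X, ground_truth)
print(y.shape)  # (5, 2)

# Each target column is a weighted sum of the feature columns
first_target_matches = np.allclose(y[:, 0], 2.0 * X[:, 0])
second_target_matches = np.allclose(y[:, 1], X[:, 1])
```

Because the third row of ground_truth is all zeros, the third feature has no effect on either target, which is exactly what an uninformative feature means here.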

Due to NumPy's broadcasting, bias can be a scalar value, and this value will be added to every sample. Finally, it's a simple matter of adding any noise and shuffling the dataset. Voila, we have a dataset that's perfect for testing regression.
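The remaining steps can be sketched as follows. This is a simplified illustration rather than the library's implementation: the sizes, the bias and noise values, and the seeded generator are all assumptions chosen for the example, and only the samples are shuffled.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)
n_samples, n_features, n_targets = 6, 4, 2
bias, noise = 5.0, 0.1  # illustrative values; make_regression defaults both to 0.0

X = rng.standard_normal((n_samples, n_features))
ground_truth = 100 * rng.random((n_features, n_targets))

# The scalar bias is broadcast to every element of the (n_samples, n_targets) array
y_clean = np.dot(X, ground_truth) + bias

# Gaussian noise with standard deviation `noise` is added to the targets
y = y_clean + rng.normal(scale=noise, size=y_clean.shape)

# Shuffle rows of X and y with the same permutation so each sample keeps its target
perm = rng.permutation(n_samples)
X_shuffled, y_shuffled = X[perm], y[perm]
```

Using one permutation for both arrays is the important design point: shuffling X and y independently would break the relationship between features and targets.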