
How it works...
Let's walk through how scikit-learn produces the regression dataset by taking a look at the source code (with some modifications for clarity). Any undefined variables are assumed to take the default values of make_regression.
It's actually surprisingly simple to follow. First, a random array is generated with the size specified when the function is called:
X = np.random.randn(n_samples, n_features)
Given the basic dataset, the target dataset is then generated:
ground_truth = np.zeros((n_features, n_targets))
ground_truth[:n_informative, :] = 100*np.random.rand(n_informative, n_targets)
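To see what this step produces, here is a minimal sketch with small, hypothetical sizes (these are not make_regression's defaults). Only the first n_informative rows of ground_truth receive nonzero coefficients, so only those features will actually influence the target:

```python
import numpy as np

# Hypothetical sizes chosen for illustration only
n_features, n_informative, n_targets = 5, 2, 1

ground_truth = np.zeros((n_features, n_targets))
ground_truth[:n_informative, :] = 100 * np.random.rand(n_informative, n_targets)

# The rows past n_informative stay zero, so those features are noise-only.
print(ground_truth.ravel())
```

The coefficient matrix has one row per feature and one column per target, which is exactly the shape needed for the dot product in the next step.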
The dot product of X and ground_truth is taken to get the final target values. Bias, if any, is added at this time:
y = np.dot(X, ground_truth) + bias
The dot product is simply a matrix multiplication, so multiplying the (n_samples, n_features) matrix X by the (n_features, n_targets) coefficient matrix gives a target array with n_samples rows (one per sample) and n_targets columns (one per target variable).
Due to NumPy's broadcasting, bias can be a scalar value, and this value will be added to every sample. Finally, it's a simple matter of adding any noise and shuffling the dataset. Voila, we have a dataset that's perfect for testing regression.
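Putting the steps above together, here is a minimal end-to-end sketch of the procedure. The sizes, noise scale, and the use of a seeded RandomState are illustrative assumptions, not scikit-learn's actual implementation:

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded for reproducibility (an assumption)

# Hypothetical sizes for illustration
n_samples, n_features, n_informative, n_targets = 200, 10, 3, 1
bias, noise = 2.0, 1.0

# Step 1: random design matrix
X = rng.randn(n_samples, n_features)

# Step 2: coefficients, nonzero only for the informative features
ground_truth = np.zeros((n_features, n_targets))
ground_truth[:n_informative, :] = 100 * rng.rand(n_informative, n_targets)

# Step 3: targets via matrix multiplication; the scalar bias broadcasts
y = np.dot(X, ground_truth) + bias

# Step 4: add Gaussian noise and shuffle samples consistently in X and y
y += rng.normal(scale=noise, size=y.shape)
indices = rng.permutation(n_samples)
X, y = X[indices], y[indices]

print(X.shape, y.shape)  # (200, 10) (200, 1)
```

Because the targets are (noisy) linear functions of X, an ordinary least-squares fit on this data should recover coefficients close to ground_truth, which is what makes it a convenient regression test bed.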