L1 penalty in action
To see how the L1 penalty works, we can use a simulated linear regression problem. The code for the rest of this chapter is in Chapter3/overfitting.R. We simulate the data using a highly correlated set of predictors:
library(MASS)    ## provides mvrnorm() for simulating correlated predictors
library(glmnet)  ## provides cv.glmnet(), used later in this section

set.seed(1234)
## five predictors: the first four are very highly correlated with each
## other (.99+), while the fifth is only weakly correlated (.10) with the rest
X <- mvrnorm(n = 200, mu = c(0, 0, 0, 0, 0),
  Sigma = matrix(c(
    1, .9999, .99, .99, .10,
    .9999, 1, .99, .99, .10,
    .99, .99, 1, .99, .10,
    .99, .99, .99, 1, .10,
    .10, .10, .10, .10, 1
  ), ncol = 5))
## outcome: intercept 3, true coefficients 1, 1, 1, 1, 0, noise sd = .5
y <- rnorm(200, 3 + X %*% matrix(c(1, 1, 1, 1, 0)), .5)
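To confirm that the simulated predictors really are correlated in the way just described, we can inspect the sample correlation matrix (this quick check is an added illustration, not part of the chapter's listing):
## the first four predictors should correlate near 1 with each other,
## and only around .10 with the fifth
round(cor(X), 2)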
Next, we can fit an OLS regression model to the first 100 cases, and then fit the lasso using the glmnet() function from the glmnet package. This function can fit either the L1 penalty or the L2 penalty (discussed in the next section); which one is used is controlled by the alpha argument. When alpha = 1, it applies the L1 penalty (that is, lasso), and when alpha = 0, it applies the L2 penalty (that is, ridge regression). Further, because we do not know in advance which value of lambda to pick, we can evaluate a range of options and tune this hyperparameter automatically using cross-validation with the cv.glmnet() function. We can then plot the resulting object to see the cross-validated mean squared error across a range of lambda values, which helps us select an appropriate level of regularization:
m.ols <- lm(y[1:100] ~ X[1:100, ])
## alpha = 1 requests the L1 (lasso) penalty; lambda is tuned by cross-validation
m.lasso.cv <- cv.glmnet(X[1:100, ], y[1:100], alpha = 1)
plot(m.lasso.cv)
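The fitted cv.glmnet object also stores the penalty values chosen by cross-validation. As a small added illustration (not in the original listing), we can print the lambda that minimizes the cross-validated error and the more conservative one-standard-error choice that coef() and predict() use by default:
m.lasso.cv$lambda.min  ## lambda with the lowest cross-validated MSE
m.lasso.cv$lambda.1se  ## largest lambda within one standard error of that minimum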
One thing that we can see from the graph is that, when the penalty gets too high, the cross-validated error increases. Indeed, the lasso seems to do well only with very low lambda values, which may indicate that the lasso does not improve out-of-sample performance (generalizability) much for this dataset. For the sake of this example we will continue, but in actual use this might give us pause to consider whether the lasso is really helping. Finally, we can compare the OLS coefficients with those from the lasso:
cbind(OLS = coef(m.ols), Lasso = coef(m.lasso.cv)[, 1])
                OLS Lasso
(Intercept)   2.958  2.99
X[1:100, ]1  -0.082  1.41
X[1:100, ]2   2.239  0.71
X[1:100, ]3   0.602  0.51
X[1:100, ]4   1.235  1.17
X[1:100, ]5  -0.041  0.00
Notice that the OLS coefficients are noisier and that, in the lasso, the coefficient for predictor 5 is penalized all the way to 0. Recall from the simulated data that the true coefficients are 3, 1, 1, 1, 1, and 0. The OLS estimate is much too low for the first predictor and much too high for the second, whereas the lasso estimates are closer to the true values throughout. This suggests that lasso regression generalizes better than OLS regression for this dataset.
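We can check that claim directly. As a small added illustration (the held-out cases 101 to 200 were simulated above but never used in fitting), we can score both models on those cases and compare mean squared errors; if the lasso's coefficient estimates are indeed better, its out-of-sample error should be lower:
## predictions for the held-out cases 101-200
yhat.ols <- cbind(1, X[101:200, ]) %*% coef(m.ols)      ## manual OLS predictions (intercept column + slopes)
yhat.lasso <- predict(m.lasso.cv, newx = X[101:200, ])  ## uses lambda.1se by default
## out-of-sample mean squared error for each model
mean((y[101:200] - yhat.ols)^2)
mean((y[101:200] - yhat.lasso)^2)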