Ridge regression
The package we're using will be glmnet. I like it because it has a built-in cross-validation function, standardizes the input features, and returns coefficients on their original scale, so it's quite easy to implement. If you standardize your features yourself, you can specify standardize = FALSE in the function. Either way, don't run features that aren't standardized as the results will be undesirable as the regularization won't be applied evenly. If you do standardize on your own, I recommend utilizing the vtreat package functions as we did in Chapter 2, Linear Regression, specifying scale = TRUE in the prepare() function. This will help us apply the centering and scaling values from your training data to the test/validation sets.
I'll let glmnet handle the standardizing, and we can begin with a 70/30 train/test split:
> set.seed(1066)
> index <- caret::createDataPartition(sim_df$y, p = 0.7, list = F)
> train <- sim_df[index, ]
> test <- sim_df[-index, ]
Now, glmnet requires that your features are input as a matrix, and if you're doing a classification problem, the response is a factor. This code handles the requirement:
> x <- as.matrix(train[, -17])
> y <- as.factor(train$y)
For the function to train the algorithm, there're a couple of things you can specify. Here, I'll execute five-fold cross-validation—the loss function for training, which in the case of classification can be class for misclassification errors or auc for the area under the curve. I'll go with auc, and leave it up to you to assess. Since this is ridge regression, our alpha will be equal to 0.
Accordingly, we'll set the family argument to binomial. This makes the function run a logistic regression instead of its standard linear regression. The following is the code to train the ridge regression algorithm:
> set.seed(1999)
> ridge <- glmnet::cv.glmnet(
x,
y,
nfolds = 5,
type.measure = "auc",
alpha = 0,
family = "binomial"
)
To begin, glmnet offers a number of different plots. The default plot shows the relationship of the log of the lambda values searched and its relation to the loss function, in our case auc:
> plot(ridge)
The output of the preceding code is as follows:
We see log(Lambda) on the x axis and AUC on the y axis. At the top of the plot is a series of the value 16. This tracks the number of non-zero coefficients corresponding to log(Lamda). We'll see how that changes with LASSO. The two dotted vertical lines show the log(Lambda) value with the maximum AUC and the log(Lamda) value with the maximum AUC within one standard error of the maximum, for the left and right lines respectively.
Let's review what those actual lambda values are:
> ridge$lambda.min
[1] 0.01216653
> ridge$lambda.1se
[1] 0.04475312
Recall that, if lambda were equal to zero, there would be no regularization penalty at all.
To see the coefficients, run this code:
> coef(ridge, s = "lambda.1se")
17 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.535798579
TwoFactor1 -0.541881256
TwoFactor2 0.530637287
Linear1 -0.005472570
Linear2 -0.506143897
Linear3 0.454702486
Linear4 -0.316847306
Linear5 0.182733133
Linear6 0.070036471
Nonlinear1 0.354214422
Nonlinear2 0.238778841
Nonlinear3 0.322499067
Noise1 -0.028226796
Noise2 0.002973271
Noise3 0.014767631
Noise4 0.038038078
random1 -0.237527142
- Calculate the odds by exponentiation, for example, exp(coef)
Calculate the probability with the formula, probability = odds / 1+ odds
Notice that the noise features and Linear1, which are irrelevant to making a prediction, are close, but not equal to, zero. The algorithm puts a larger coefficient on Linear5 versus Linear6. By the way, those features are on the same scale, so a direct comparison is possible. To predict the probabilities with a glmnet model, be sure to specify type = "response" and which lambda value to use. I recommend starting with using the lambda.1se value to prevent overfitting. But you can experiment accordingly:
> ridge_pred <-
data.frame(predict(ridge, newx = x, type = "response", s = "lambda.1se"))
Like in the previous chapter on logistic regression, a plot of the probability distributions by class is in order:
> classifierplots::density_plot(y, ridge_pred$X1)
The output of the preceding code is as follows:
There seems to be an excellent separation in the probabilities above 50%.
Let's see the AUC:
> Metrics::auc(y, ridge_pred$X1)
[1] 0.8632982
The AUC is above 0.86. This brings us to the question of whether or not this will remain consistent on the test data:
> ridge_test <-
data.frame(predict(ridge, newx = as.matrix(test[, -17]),
type = 'response'), s = "lambda.1se")
> Metrics::auc(test$y, ridge_test$X1)
[1] 0.8706708
> Metrics::logLoss(test$y, ridge_test$X1)
[1] 0.4307592
> classifierplots::density_plot(test$y, ridge_test$X1)
The output of the preceding code is as follows:
There's very consistent performance between the train and test data. The AUC is now above 0.87, and we have a benchmark log-loss of 0.4307592. You can try different k-folds, a different loss function, and even different random seeds to see how the model changes. For now, we need to move on to the next algorithm, LASSO.