Deep Learning with R for Beginners

Use case – improving out-of-sample model performance using dropout

Dropout is a novel approach to regularization that is particularly valuable for large and complex deep neural networks. For a much more detailed exploration of dropout in deep neural networks, see Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). The concept behind dropout is actually quite straightforward: during the training of the model, units (for example, input and hidden neurons) are probabilistically dropped, along with all connections to and from them.

For example, the following diagram shows what might happen at each step of training for a model where hidden neurons and their connections are dropped with a probability of 1/3 at each epoch. Once a node is dropped, its connections to the next layer are also dropped. In the following diagram, the grayed-out nodes and dashed connections are the ones that were dropped. It is important to note that the choice of nodes that are dropped changes from epoch to epoch:

Figure 3.12: Dropout applied to a layer for different epochs

One way to think about dropout is that it forces models to be more robust to perturbations. Although many neurons are included in the full model, they are not all simultaneously present during training, so neurons must operate somewhat more independently than they would otherwise. Another way of viewing dropout is this: if you have a large model with N weights between hidden neurons, but 50% are dropped during training, then although all N weights will be used at some stage of training, you have effectively halved the expected model complexity, as the average number of active weights is halved. This reduction in effective complexity helps to prevent overfitting of the data. Because of this feature, if the proportion of neurons retained is p, Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014) recommend scaling up the target model size by 1/p in order to end up with a roughly equally complex model.
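To make the mechanics concrete, here is a minimal sketch of dropout applied by hand to the activations of a single hidden layer for one epoch. The activation matrix h and the 1/3 drop probability are invented purely for illustration; in practice, a library such as deepnet handles this internally:

## Illustrative only: dropout applied by hand to one hidden layer for one epoch
set.seed(1234)
h <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)  # made-up activations: 5 cases x 4 hidden neurons
p.drop <- 1/3                                  # probability of dropping each hidden neuron

## draw a fresh Bernoulli mask each epoch: 1 = kept, 0 = dropped
keep <- rbinom(ncol(h), size = 1, prob = 1 - p.drop)
h.dropped <- sweep(h, 2, keep, "*")            # dropped neurons contribute nothing downstream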

During model testing/scoring, neurons are not usually dropped, because averaging the predictions of every possible thinned network would be computationally impractical. Instead, we can use an approximate average based on a single neural network whose weights are scaled by each neuron's probability of being retained (that is, p). This is usually taken care of by the deep learning library.
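Continuing the toy example above, the test-time approximation amounts to rescaling the learned weights by the retention probability. The weight matrix W below is again made up for illustration, and deepnet performs the equivalent step for you:

## Illustrative only: test-time rescaling of weights by the retention probability
p.keep <- 1 - p.drop                           # probability that a hidden neuron was kept
W <- matrix(rnorm(4 * 3), nrow = 4, ncol = 3)  # made-up weights: 4 hidden -> 3 output neurons

## keeping all neurons but shrinking their outgoing weights by p.keep approximates
## averaging over the many "thinned" networks seen during training
W.test <- W * p.keep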

In addition to working well, this approximate weight re-scaling is a fairly trivial calculation. Thus, the primary computational cost of dropout comes from the fact that a model with more neurons and weights must be used because so many (a commonly recommended value is around 50% for hidden neurons) are dropped during each training update.

Although dropout is easy to implement, a larger model may be required to compensate. To speed up training, a higher learning rate can be used so that fewer epochs are required. One potential downside of combining these approaches is that, with fewer neurons active at each update and a faster learning rate, some weights may become quite large. Fortunately, dropout can be used along with other forms of regularization, such as an L1 or L2 penalty. Taken together, the result is a large model that can quickly (owing to the faster learning rate) explore a broad parameter space, but is regularized through dropout and a penalty that keeps the weights in check.
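The following toy gradient step shows how these pieces fit together for one layer's weights: the dropout mask switches off the update for dropped neurons, the learning rate scales the step, and an L2 penalty (weight decay) shrinks the weights. The gradient, penalty strength, and object names here are all invented for illustration and are not the internals of deepnet:

## Illustrative only: one update combining dropout, a high learning rate, and an L2 penalty
lr     <- 0.8                                       # relatively high learning rate
lambda <- 1e-4                                      # L2 penalty (weight decay) strength
W      <- matrix(rnorm(4 * 3), nrow = 4, ncol = 3)  # made-up weights, as before
grad   <- matrix(rnorm(4 * 3), nrow = 4, ncol = 3)  # pretend gradient from backpropagation

keep <- rbinom(nrow(W), size = 1, prob = 0.5)       # dropout mask over the 4 hidden neurons
## dropped neurons receive no update this step; the penalty pulls the rest toward zero
W <- W - lr * keep * (grad + lambda * W)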

To show the use of dropout in a neural network, we will return to the Modified National Institute of Standards and Technology (MNIST) dataset (which we downloaded in Chapter 2, Training a Prediction Model) that we worked with previously. We will use the nn.train() function from the deepnet package, as it allows for dropout. As in the previous chapter, we will run the four models in parallel to reduce the time this takes. Specifically, we compare four models, two with and two without dropout regularization, each with either 40 or 80 hidden neurons. For dropout, we specify the proportion to drop out separately for the hidden and visible (input) units. Based on the rule of thumb that about 50% of hidden units (and 80% of visible units) should be kept, we specify the dropout proportions as 0.5 and 0.2, respectively:

## Fit Models
## Four networks: 40 or 80 hidden neurons, each without and with dropout
nn.models <- foreach(i = 1:4, .combine = 'c') %dopar% {
  set.seed(1234)
  list(nn.train(
    x = as.matrix(digits.X),
    y = model.matrix(~ 0 + digits.y),      # one-hot encode the digit labels
    hidden = c(40, 80, 40, 80)[i],
    activationfun = "tanh",
    learningrate = 0.8,
    momentum = 0.5,
    numepochs = 150,
    output = "softmax",
    hidden_dropout = c(0, 0, .5, .5)[i],   # drop 50% of hidden units in models 3 and 4
    visible_dropout = c(0, 0, .2, .2)[i])) # drop 20% of input units in models 3 and 4
}
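Note that %dopar% only runs in parallel when a backend has been registered, which we did in the previous chapter. If you are starting from a fresh R session, something along the following lines is needed first; the choice of doParallel and a four-worker cluster here is just one reasonable setup, not the only one:

## One way to register a parallel backend for %dopar% (adjust the worker count to your machine)
library(parallel)
library(doParallel)
library(foreach)
library(deepnet)

cl <- makeCluster(4)
clusterEvalQ(cl, library(deepnet))  # make nn.train() available on the workers
registerDoParallel(cl)

Alternatively, passing .packages = "deepnet" to foreach() loads the package on the workers for that loop only; either way, remember to call stopCluster(cl) when you are finished.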

Next, we can loop through the models to obtain predicted values and get the overall model performance:

nn.yhat <- lapply(nn.models, function(obj) {
  encodeClassLabels(nn.predict(obj, as.matrix(digits.X)))
})

perf.train <- do.call(cbind, lapply(nn.yhat, function(yhat) {
  ## encodeClassLabels() returns labels 1-10, so subtract 1 to match the 0-9 digits
  caret::confusionMatrix(xtabs(~ I(yhat - 1) + digits.y))$overall
}))
colnames(perf.train) <- c("N40", "N80", "N40_Reg", "N80_Reg")
options(digits = 4)
perf.train
                  N40    N80 N40_Reg N80_Reg
Accuracy       0.9478 0.9622  0.9278  0.9400
Kappa          0.9420 0.9580  0.9197  0.9333
AccuracyLower  0.9413 0.9565  0.9203  0.9331
AccuracyUpper  0.9538 0.9673  0.9348  0.9464
AccuracyNull   0.1126 0.1126  0.1126  0.1126
AccuracyPValue 0.0000 0.0000  0.0000  0.0000
McnemarPValue     NaN    NaN     NaN     NaN

When evaluating the models on the in-sample training data, it seems that those without regularization perform better than those with regularization. Of course, the real test comes with the testing, or holdout, data:

nn.yhat.test <- lapply(nn.models, function(obj) {
  encodeClassLabels(nn.predict(obj, as.matrix(test.X)))
})

perf.test <- do.call(cbind, lapply(nn.yhat.test, function(yhat) {
  caret::confusionMatrix(xtabs(~ I(yhat - 1) + test.y))$overall
}))
colnames(perf.test) <- c("N40", "N80", "N40_Reg", "N80_Reg")

perf.test
                  N40    N80 N40_Reg N80_Reg
Accuracy       0.8890 0.8520  0.8980  0.9030
Kappa          0.8765 0.8352  0.8864  0.8920
AccuracyLower  0.8679 0.8285  0.8776  0.8830
AccuracyUpper  0.9078 0.8734  0.9161  0.9206
AccuracyNull   0.1180 0.1180  0.1180  0.1180
AccuracyPValue 0.0000 0.0000  0.0000  0.0000
McnemarPValue     NaN    NaN     NaN     NaN

The testing data highlights that the in-sample performance was overly optimistic (accuracy = 0.9622 in the training data versus 0.8520 in the testing data for the 80-neuron, non-regularized model). We can also see the advantage of the regularized models for both the 40- and 80-neuron architectures. Although both still perform worse on the testing data than they did on the training data, they perform on a par with, or better than, the equivalent non-regularized models on the testing data. The difference is particularly striking for the 80-neuron model: the best-performing model on the test data is its regularized version.

Although these numbers are by no means record-setting, they do show the value of using dropout, or regularization more generally, and how one might go about trying to tune the model and dropout parameters to improve the ultimate testing performance.
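For example, one simple next step would be to hold the architecture fixed and compare a few dropout settings on the holdout data. The grid and the object names below (dropout.grid, test.acc) are arbitrary choices for illustration, and the run will take a while:

## Hypothetical grid over hidden-unit dropout proportions for the 80-neuron model
dropout.grid <- c(.3, .5, .7)

test.acc <- sapply(dropout.grid, function(p) {
  set.seed(1234)
  fit <- nn.train(
    x = as.matrix(digits.X),
    y = model.matrix(~ 0 + digits.y),
    hidden = 80,
    activationfun = "tanh",
    learningrate = 0.8,
    momentum = 0.5,
    numepochs = 150,
    output = "softmax",
    hidden_dropout = p,
    visible_dropout = .2)
  yhat <- encodeClassLabels(nn.predict(fit, as.matrix(test.X)))
  caret::confusionMatrix(xtabs(~ I(yhat - 1) + test.y))$overall["Accuracy"]
})

names(test.acc) <- paste0("drop_", dropout.grid)
test.acc

A more thorough search would also vary the visible-unit dropout and the number of hidden neurons, ideally using a separate validation split rather than the final test set.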