Adopting regularization for avoiding overfitting
Intuitively, a good machine learning model should achieve low error on training data. Mathematically, this is equivalent to minimizing the loss function on the training data given the machine learning model built. This is expressed by the following formula.:
However, this might not be enough. A model can become excessively complex in order to capture all the relations inherently expressed by the training data. This increase of complexity might have two negative consequences. First, a complex model might require a significant amount of time to be executed. Second, a complex model can achieve very good performance on training data—because all the inherent relations in trained data are memorized, but not so good performance on validation data—as the model is not able to generalize on fresh unseen data. Again, learning is more about generalization than memorization. The following graph represents a typical loss function decreasing on both validation and training sets. However, a certain point the loss on validation starts to increase because of overfitting:
As a rule of thumb, if during the training we see that the loss increases on validation, after an initial decrease, then we have a problem of model complexity that overfits training. Indeed, overfitting is the word used in machine learning for concisely describing this phenomenon.
In order to solve the overfitting problem, we need a way to capture the complexity of a model, that is, how complex a model can be. What could be the solution? Well, a model is nothing more than a vector of weights. Therefore the complexity of a model can be conveniently represented as the number of nonzero weights. In other words, if we have two models, M1 and M2, achieving pretty much the same performance in terms of loss function, then we should choose the simplest model that has the minimum number of nonzero weights. We can use a hyperparameter ⅄>=0 for controlling what the importance of having a simple model is, as in this formula:
There are three different types of regularizations used in machine learning:
- L1 regularization (also known as lasso): The complexity of the model is expressed as the sum of the absolute values of the weights
- L2 regularization (also known as ridge): The complexity of the model is expressed as the sum of the squares of the weights
- Elastic net regularization: The complexity of the model is captured by a combination of the two preceding techniques
Note that the same idea of regularization can be applied independently to the weights, to the model, and to the activation.
Therefore, playing with regularization can be a good way to increase the performance of a network, in particular when there is an evident situation of overfitting. This set of experiments is left as an exercise for the interested reader.
Note that Keras supports both l1, l2, and elastic net regularizations. Adding regularization is easy; for instance, here we have a l2 regularizer for kernel (the weight W):
from keras import regularizers model.add(Dense(64, input_dim=64, kernel_regularizer=regularizers.l2(0.01)))
A full description of the available parameters is available at: https://keras.io/regularizers/.