Hands-On Python Deep Learning for the Web

Overfitting and underfitting

When an ML model performs very well on the training data but poorly on data from the test set or validation set, the phenomenon is referred to as overfitting. There can be several reasons for this; the following are the most common:

  • The model is very complex with respect to the data. A decision tree with many levels and a neural network with many layers are good examples of model complexity in this case.
  • The data has lots of features but very few training instances.
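The first cause above can be sketched with a small synthetic example (not from the book): a high-degree polynomial fitted to a handful of noisy points will match the training data almost exactly but generalize poorly, while a simpler model matching the true trend does not. The data and model choices here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: a simple linear trend plus noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(scale=0.2, size=10)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial fit on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A degree-9 polynomial (complex model) can pass through nearly
# every training point, while a degree-1 model matches the true trend.
complex_fit = np.polyfit(x_train, y_train, deg=9)
simple_fit = np.polyfit(x_train, y_train, deg=1)

# The complex model's training error is tiny, but its test error
# is noticeably worse: the signature of overfitting.
print(mse(complex_fit, x_train, y_train), mse(complex_fit, x_test, y_test))
print(mse(simple_fit, x_train, y_train), mse(simple_fit, x_test, y_test))
```

The complex model "memorizes" the noise in the training points, which is exactly the high-variance behavior described next.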

In ML literature, the problem of overfitting is also treated as a problem of high variance. Regularization is the most widely used approach to prevent overfitting.
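As an illustration of regularization, here is a minimal sketch (not from the book) of L2-regularized linear regression, often called ridge regression, using its closed-form solution. The penalty strength and synthetic data are arbitrary choices for demonstration; the point is that a larger penalty shrinks the learned weights, constraining the model's capacity to overfit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: 20 samples, 5 features.
X = rng.normal(size=(20, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=20)

def ridge(X, y, lam):
    # Closed-form L2-regularized least squares:
    # w = (X^T X + lam * I)^(-1) X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_plain = ridge(X, y, 0.0)   # ordinary least squares
w_reg = ridge(X, y, 10.0)    # regularized: weights are shrunk

print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

The same idea appears in deep learning as weight decay, where the L2 penalty is applied to a network's parameters during training.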

We have already discussed the concept of bias. A model has a low bias if it performs well on the training data, that is, the model is not making too many assumptions about the data to infer its representation. If the model fails miserably on the training data, it is said that the model has a high bias and the model is underfitting. There can be many reasons for underfitting as well. The following are the most common ones in this case:

  • The model is too simple to learn the underlying representation of the data given to it.
  • The features of the data have not been engineered well before feeding them to the ML model. This process is popularly known as feature engineering.
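Both causes above can be seen in one small synthetic example (an illustration, not the book's code): a straight line is too simple to capture a quadratic relationship and underfits, but engineering a squared feature lets the same linear method fit the data well.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a quadratic relationship plus a little noise.
x = np.linspace(-1, 1, 50)
y = x ** 2 + rng.normal(scale=0.05, size=50)

# A straight line is too simple for this data: high bias, underfitting.
line = np.polyfit(x, y, deg=1)
err_line = float(np.mean((np.polyval(line, x) - y) ** 2))

# Feature engineering: add a squared feature (and a bias column),
# and the same linear least-squares method now fits well.
X = np.column_stack([x, x ** 2, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
err_feat = float(np.mean((X @ w - y) ** 2))

print(err_line, err_feat)
```

The model family (linear least squares) is unchanged; only the input representation improved, which is the essence of feature engineering.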
Based on this discussion, we can draw a very useful conclusion: an ML model that is overfitting might be suffering from the issue of high variance, whereas an underfitting model might be suffering from the issue of high bias.

The discussion of overfitting and underfitting remains incomplete without the following diagram (shown by Andrew Ng during his flagship course, Machine Learning):

The preceding diagram beautifully illustrates underfitting and overfitting in terms of curves fitted through the data points. It also gives us an idea of a model that generalizes well, that is, performs well on both the train and test sets. The model prediction line in blue is way off the samples, leading to underfitting, while in the case of overfitting, the model captures all points in the training data but does not yield a model that would perform well on data outside the training set.

Often, the idea of learning representations of the data is treated as a problem of approximating a function that best describes the data. Since a function can easily be plotted graphically, as in the preceding diagram, this gives rise to the idea of curve fitting. The sweet spot between underfitting and overfitting, where a model generalizes well, is called a good fit.
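One common way to search for that sweet spot is to hold out part of the data for validation and pick the model complexity that minimizes the validation error. The following sketch (synthetic data and parameter choices are my own, not from the book) does this for polynomial degree:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: a smooth nonlinear function plus noise.
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=40)

# Hold out every other point for validation.
x_tr, y_tr = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def val_error(deg):
    """Fit a polynomial of the given degree on the training split
    and measure its error on the held-out validation split."""
    coeffs = np.polyfit(x_tr, y_tr, deg)
    return float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

errors = {deg: val_error(deg) for deg in range(1, 10)}
best = min(errors, key=errors.get)
print(best, errors[best])
```

Degrees that are too low underfit (high validation error from high bias); degrees that are too high overfit (validation error rises again from high variance); the minimum sits at the good fit.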