Hands-On Python Deep Learning for the Web

Model evaluation

We have trained an ML model, but how well will it perform on data it has never seen before? We answer this question using model evaluation.

Different machine learning algorithms call for different evaluation metrics. 

For supervised learning methods, we usually use the following (a short sketch of computing these metrics appears after this list):

  • The confusion matrix, which is a matrix consisting of four values: True Positive, False Positive, True Negative, and False Negative
  • Accuracy, precision, recall, and F1-score (these are all byproducts of the confusion matrix)
  • The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) metric
  • R-squared (the coefficient of determination), Root Mean Square Error (RMSE), the F-statistic, the Akaike Information Criterion (AIC), and p-values, specifically for regression models
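
Here is a minimal sketch of how these classification metrics can be computed with scikit-learn; the synthetic dataset and the logistic regression model are illustrative assumptions, not part of the book's projects:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Illustrative data and model (assumptions for this sketch only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print(confusion_matrix(y_test, y_pred))             # [[TN, FP], [FN, TP]]
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))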

Throughout this book, we will incorporate these metrics to evaluate our models. Although these are the most common evaluation metrics for both ML and DL, there are more specific evaluation metrics for particular domains, and we will get to those as we go along.

It is worth mentioning here that we often fall into the trap of the accuracy paradox in the case of classification problems where the data is imbalanced. In these cases, classification accuracy tells only one part of the story: it gives the percentage of correct predictions out of the total number of predictions made. This metric fails miserably on imbalanced datasets because accuracy does not capture how well a model predicts the minority instances of the dataset, which is usually the original problem: predicting the uncommon class(es).
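
To make the accuracy paradox concrete, here is a minimal sketch; the 95:5 class ratio and the majority-class baseline are illustrative assumptions:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# 950 negatives and 50 positives; the positives are the uncommon class we care about
y_true = np.array([0] * 950 + [1] * 50)
# A useless "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))                 # 0.95 -- looks impressive
print("Recall  :", recall_score(y_true, y_pred, zero_division=0))  # 0.0  -- misses every positive
print("F1-score:", f1_score(y_true, y_pred, zero_division=0))      # 0.0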

The following are the most commonly used metrics for evaluating unsupervised methods such as clustering (again, a short sketch follows the list):

  • Silhouette coefficients
  • Sum of squared errors
  • Homogeneity, completeness, and the V-measure
  • The Calinski-Harabasz index
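
Here is a minimal sketch of how these clustering metrics can be computed with scikit-learn; the synthetic blobs and the choice of k-means with three clusters are illustrative assumptions:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             homogeneity_completeness_v_measure)

X, y_true = make_blobs(n_samples=500, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

print("Silhouette coefficient     :", silhouette_score(X, labels))
print("Sum of squared errors (SSE):", kmeans.inertia_)
print("Calinski-Harabasz index    :", calinski_harabasz_score(X, labels))
# Homogeneity, completeness, and the V-measure require ground-truth labels
print(homogeneity_completeness_v_measure(y_true, labels))
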
The evaluation/error metrics remain the same whether they are computed on a train set, a test set, or a validation set. We cannot jump to a conclusion merely by looking at the performance of a model on the train set.
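
As a minimal sketch of why train-set performance alone can mislead, the following compares train and test accuracy for an over-flexible model; the unpruned decision tree and the noisy synthetic data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Label noise (flip_y) makes it easy for a deep tree to overfit the train set
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Train accuracy:", accuracy_score(y_train, tree.predict(X_train)))  # close to 1.0
print("Test accuracy :", accuracy_score(y_test, tree.predict(X_test)))    # noticeably lower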