Model evaluation
We have trained an ML model, but how well will it perform on data it has never seen before? We answer this question through model evaluation.
Different machine learning algorithms call for different evaluation metrics.
For supervised learning methods, we usually use the following:
- The confusion matrix, which for a binary classifier consists of four counts: True Positives, False Positives, True Negatives, and False Negatives
- Accuracy, precision, recall, and the F1-score (all of which are derived from the confusion matrix)
- The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) metric
- R-squared (the coefficient of determination), Root Mean Square Error (RMSE), the F-statistic, the Akaike Information Criterion (AIC), and p-values, specifically for regression models
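To make these metrics concrete, here is a minimal sketch using scikit-learn; the label arrays (y_true, y_pred, y_score) and the regression values are hypothetical, made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score,
                             r2_score, mean_squared_error)

# Hypothetical binary classification outputs
y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

# Confusion matrix: rows are actual classes, columns are predicted classes
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))

# The ROC AUC is computed from scores/probabilities, not hard predictions
print("ROC AUC  :", roc_auc_score(y_true, y_score))

# Hypothetical regression outputs for R-squared and RMSE
y_actual = [3.0, 2.5, 4.1, 5.0]
y_fitted = [2.8, 2.7, 4.0, 4.6]
print("R-squared:", r2_score(y_actual, y_fitted))
print("RMSE     :", np.sqrt(mean_squared_error(y_actual, y_fitted)))
```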
Throughout this book, we will be incorporating these metrics to evaluate our models. Although these are the most common evaluation metrics for both ML and DL, more specialized metrics exist for particular domains; we will cover those as we go along.
It is worth mentioning here that we often fall into the trap of the accuracy paradox in classification problems where the data is imbalanced. In these cases, classification accuracy tells only part of the story: it gives the percentage of correct predictions out of the total number of predictions made. Accuracy fails miserably on imbalanced datasets because it does not capture how well a model predicts the minority instances of the dataset, which is usually the original problem (predicting the uncommon class(es)).
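The following toy sketch illustrates the accuracy paradox: on a hypothetical dataset with 95 negatives and 5 positives, a model that always predicts the majority class still reports 95% accuracy while never detecting the rare class:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 negative samples, 5 positive samples
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print("recall  :", recall_score(y_true, y_pred))    # 0.0  -- misses every positive
```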
The following are the most commonly used metrics for evaluating unsupervised methods such as clustering:
- Silhouette coefficients
- Sum of squared errors
- Homogeneity, completeness, and the V-measure
- The Calinski-Harabasz index
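As a sketch of these clustering metrics, the snippet below fits k-means to a synthetic blob dataset with scikit-learn; the dataset and the choice of k=3 are assumptions made purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             homogeneity_completeness_v_measure)

# Hypothetical dataset with three well-separated blobs
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = km.labels_

# Internal metrics: need only the data and the cluster assignments
print("silhouette       :", silhouette_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))

# The within-cluster sum of squared errors, exposed by KMeans as inertia_
print("SSE (inertia)    :", km.inertia_)

# External metrics: require ground-truth labels, when they are available
h, c, v = homogeneity_completeness_v_measure(y_true, labels)
print("homogeneity={:.3f} completeness={:.3f} V-measure={:.3f}".format(h, c, v))
```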
The evaluation/error metrics remain the same whether they are computed on a train set, a test set, or a validation set; however, we cannot jump to conclusions by looking only at a model's performance on the train set.
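As a quick illustration of why a held-out set matters, the sketch below compares the same metric on the train and test splits; the dataset and the unpruned decision tree are arbitrary choices for this example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# An unpruned decision tree can memorize the train set almost perfectly
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # typically ~1.0
print("test accuracy :", model.score(X_test, y_test))    # noticeably lower
```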