Model selection
What are we to conclude from all this work? We have the confusion matrices and error rates from our models to guide us, but we can get a little more sophisticated when it comes to selecting the classification models. An effective tool for a classification model comparison is the Receiver Operating Characteristic (ROC) chart. Very simply, ROC is a technique for visualizing, organizing, and selecting classifiers based on their performance (Fawcett, 2006). On the ROC chart, the y-axis is the True Positive Rate (TPR) and the x-axis is the False Positive Rate (FPR). The following are the calculations, which are quite simple:
TPR = Positives correctly classified / total positives
FPR = Negatives incorrectly classified / total negatives
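As a quick illustration of how these two rates fall out of a standard 2 x 2 confusion matrix, consider the following sketch; the counts here are purely hypothetical and are not taken from our models:
> TP <- 60; FN <- 5; FP <- 3; TN <- 140 #hypothetical counts
> TPR <- TP / (TP + FN) #positives correctly classified / total positives
> FPR <- FP / (FP + TN) #negatives incorrectly classified / total negatives
> c(TPR = TPR, FPR = FPR)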
Plotting the ROC results will generate a curve, and thus you are able to produce the Area Under the Curve (AUC). The AUC provides you with an effective indicator of performance, and it can be shown that the AUC is equal to the probability that the observer will correctly identify the positive case when presented with a randomly chosen pair of cases in which one case is positive and one case is negative (Hanley JA & McNeil BJ, 1982). In our case, we will just switch the observer with our algorithms and evaluate accordingly.
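If you want to convince yourself of this pairwise interpretation, the following toy simulation (not part of the chapter's analysis; all names and values are made up) compares the proportion of positive/negative pairs in which the positive case scores higher against the AUC reported by the ROCR package, which is introduced just below:
> set.seed(123)
> sim.actuals <- rbinom(200, 1, 0.5) #simulated labels
> sim.probs <- ifelse(sim.actuals == 1, rbeta(200, 4, 2), rbeta(200, 2, 4)) #simulated scores
> pos <- sim.probs[sim.actuals == 1]
> neg <- sim.probs[sim.actuals == 0]
> mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")) #pairwise estimate
> library(ROCR)
> performance(prediction(sim.probs, sim.actuals), "auc")@y.values[[1]] #should agree with the pairwise estimate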
To create an ROC chart in R, you can use the ROCR package. I think this is a great package and allows you to build a chart in just three lines of code. The package also has an excellent companion website (with examples and a presentation) that can be found at the following link:
http://rocr.bioinf.mpi-sb.mpg.de/.
What I want to show are four different curves on our ROC chart: the full model, the reduced model using BIC to select the features, the MARS model, and a bad model. This so-called bad model will include just one predictive feature and will provide an effective contrast to our other models. Therefore, let's load the ROCR package and build this poorly performing model using only the thick feature, calling it bad.fit for simplicity:
> library(ROCR)
> bad.fit <- glm(class ~ thick, family = binomial, data = test)
> test.bad.probs <- predict(bad.fit, type = "response") #save probabilities
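Before moving on, a quick sanity check on these predictions never hurts; this optional sketch assumes test$class is the two-level factor used earlier in the chapter:
> summary(test.bad.probs) #fitted probabilities should lie between 0 and 1
> table(test.bad.probs > 0.5, test$class) #rough confusion table at a 0.5 cutoff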
It is now possible to build the ROC chart with three lines of code per model using the test dataset. We will first create an object that saves the predicted probabilities with the actual classification. Next, we will use this object to create another object with the calculated TPR and FPR. Then, we will build the chart with the plot() function. Let's get started with the model using all of the features or, as I call it, the full model. This was the initial one that we built back in the Logistic regression model section of this chapter:
> pred.full <- prediction(test.probs, test$class)
The following is the performance object with the TPR and FPR:
> perf.full <- performance(pred.full, "tpr", "fpr")
The following plot command with the title of ROC and col=1 will color the line black:
> plot(perf.full, main = "ROC", col = 1)
The output of the preceding command is as follows:
As stated previously, the curve represents TPR on the y-axis and FPR on the x-axis. If you had a perfect classifier with no false positives, the curve would run straight up the y-axis at an FPR of 0.0 and then across the top of the chart. If a model is no better than chance, the line will run diagonally from the lower left corner to the upper right one (an optional way to overlay this chance line is sketched after the next block of code). As a reminder, the full model misclassified five observations: three false positives and two false negatives. We can now add the other models for comparison using similar code, starting with the model built using BIC (refer to the Logistic regression with cross-validation section of this chapter), as follows:
> pred.bic <- prediction(test.bic.probs, test$class)
> perf.bic <- performance(pred.bic, "tpr", "fpr")
> plot(perf.bic, col = 2, add = TRUE)
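If you would like the chance line mentioned above drawn explicitly on the chart, base R's abline() can overlay it as a dashed diagonal; this is an optional touch rather than part of the chapter's original code:
> abline(a = 0, b = 1, lty = 2) #no-better-than-chance reference line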
The add = TRUE argument in the plot command adds the line to the existing chart. Finally, we will add the poorly performing model and the MARS model, and include a legend, as follows:
> pred.bad <- prediction(test.bad.probs, test$class)
> perf.bad <- performance(pred.bad, "tpr", "fpr")
> plot(perf.bad, col = 3, add = TRUE)
> pred.earth <- prediction(test.earth.probs, test$class)
> perf.earth <- performance(pred.earth, "tpr", "fpr")
> plot(perf.earth, col = 4, add = TRUE)
> legend(0.6, 0.6, c("FULL", "BIC", "BAD", "EARTH"), 1:4)
The following is the output of the preceding code snippet:
We can see that the FULL, BIC, and MARS models are nearly superimposed. It is also quite clear that the BAD model performed as poorly as expected.
The final thing that we can do here is compute the AUC. This is again done in the ROCR package with the creation of a performance object, except that you have to substitute auc for tpr and fpr. The code and output are as follows:
> performance(pred.full, "auc")@y.values
[[1]]
[1] 0.9972672
> performance(pred.bic, "auc")@y.values
[[1]]
[1] 0.9944293
> performance(pred.bad, "auc")@y.values
[[1]]
[1] 0.8962056
> performance(pred.earth, "auc")@y.values
[[1]]
[1] 0.9952701
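To compare the four values side by side, you can pull them into one named vector; the helper function auc.of() below is defined here purely for convenience and is not part of ROCR:
> auc.of <- function(pred) performance(pred, "auc")@y.values[[1]]
> round(c(FULL = auc.of(pred.full), BIC = auc.of(pred.bic), BAD = auc.of(pred.bad), EARTH = auc.of(pred.earth)), 4)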
The highest AUC is for the full model at 0.997. We also see 99.4 percent for the BIC model, 89.6 percent for the bad model, and 99.5 percent for the MARS model. So, to all intents and purposes, with the exception of the bad model, we have no difference in predictive power between them. What are we to do? A simple solution would be to re-randomize the train and test sets and try this analysis again, perhaps using a 60/40 split and a different randomization seed (a sketch of such a re-split closes out this section). But if we end up with a similar result, then what? I think a statistical purist would recommend selecting the most parsimonious model, while others may be more inclined to include all the variables. It comes down to trade-offs, that is, model accuracy versus interpretability, simplicity, and scalability. In this instance, it seems safe to default to the simpler model, which has the same accuracy. It goes without saying that we won't always get this level of predictability with just GLMs or discriminant analysis. We will tackle these problems in upcoming chapters with more complex techniques and hopefully improve our predictive ability. The beauty of machine learning is that there are several ways to skin the proverbial cat.
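If you do decide to re-randomize, the split can be redone in a few lines; in this sketch, full.data stands in for the prepared data frame built earlier in the chapter, and both that name and the seed are assumptions for illustration:
> set.seed(2016) #a different seed than before
> ind <- sample(2, nrow(full.data), replace = TRUE, prob = c(0.6, 0.4))
> train2 <- full.data[ind == 1, ]
> test2 <- full.data[ind == 2, ]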