KNN modeling
As stated previously, we'll begin with feature selection, and the caret package helps out in this matter. In recursive feature elimination (RFE), a model is built using all features and an importance value is assigned to each feature. The features are then recursively pruned, and an optimal number of features is selected based on a performance metric such as accuracy. In short, it's a type of backward feature elimination.
To do this, we'll need to set the random seed, specify the cross-validation method in caret's rfeControl() function, perform the recursive feature selection with the rfe() function, and then test how the model performs on the test set. In rfeControl(), you'll need to specify the set of helper functions that matches the model being used. There are several different sets available; here we'll need lrFuncs. To see a list of the available functions, your best bet is to explore the documentation with ?rfeControl and ?caretFuncs. The metric we'll use is Cohen's Kappa statistic, which we used and explained in a prior chapter.
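If you'd rather not dig through the help pages, the following sketch (which assumes caret is installed and attached with library(caret)) lists the helper function sets that ship with the package and shows that lrFuncs is simply a list of functions used by rfe():
# assumes the caret package is installed and attached
> library(caret)
# list the exported helper function sets (lrFuncs, rfFuncs, caretFuncs, and so on)
> ls("package:caret", pattern = "Funcs$")
# lrFuncs itself is just a list of the functions that rfe() calls internally
> names(caret::lrFuncs)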
To recap, the Kappa statistic is commonly used to provide a measure of how well two evaluators can classify an observation correctly. It does this by adjusting the raw accuracy score to account for the evaluators being correct by mere chance. The formula for the statistic is: Kappa = (percent of agreement - percent of chance agreement) / (1 - percent of chance agreement).
The percent of agreement is the rate at which the evaluators agreed on the class (that is, the accuracy), and the percent of chance agreement is the rate at which they would be expected to agree at random. The higher the statistic, the better they performed, with the maximum agreement being one.
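As a quick worked example with made-up numbers (purely for illustration), suppose the evaluators agree on 80 percent of observations and chance alone would produce agreement on 55 percent of them:
# hypothetical rates, just to illustrate the formula
> percent_agreement <- 0.80
> percent_chance_agreement <- 0.55
> (percent_agreement - percent_chance_agreement) / (1 - percent_chance_agreement)
[1] 0.5555556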
Altman (1991) provides a heuristic to assist us in the interpretation of the statistic, which is shown in the following table:
Value of Kappa    Strength of Agreement
< 0.20            Poor
0.21 - 0.40       Fair
0.41 - 0.60       Moderate
0.61 - 0.80       Good
0.81 - 1.00       Very good
The following code gets our control function established:
> ctrl <- caret::rfeControl(
    functions = caret::lrFuncs,
    method = "cv",
    number = 10,
    verbose = TRUE
  )
I now specify the feature subset sizes for consideration, which will be between 25 and 35 features. After setting the random seed, we can run the RFE using the KNN algorithm. With verbose = TRUE, the status of training is displayed in the console; setting it to FALSE will hide it:
> subsets <- c(25:35)
> set.seed(1863)
> knnProfile <- caret::rfe(
    train_df,
    train_y,
    sizes = subsets,
    rfeControl = ctrl,
    method = "knn",
    metric = "Kappa"
  )
Calling the knnProfile object tells us what we need to know:
> knnProfile
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
25 0.8377 0.5265 0.01524 0.05107
26 0.8383 0.5276 0.01594 0.05359 *
27 0.8377 0.5271 0.01616 0.05462
28 0.8375 0.5257 0.01612 0.05416
29 0.8370 0.5247 0.01668 0.05503
30 0.8370 0.5241 0.01654 0.05464
31 0.8381 0.5272 0.01649 0.05465
32 0.8368 0.5233 0.01727 0.05623
33 0.8361 0.5212 0.01623 0.05393
34 0.8366 0.5231 0.01676 0.05525
35 0.8361 0.5218 0.01644 0.05487
39 0.8361 0.5217 0.01705 0.05660
The top 5 variables (out of 26):
V74, V35, V22, V78, V20
The results show that 26 features provide the highest Kappa statistic of 0.5276 (moderate strength), along with the highest accuracy rate of 83.83%. The output also gives us the top five features based on importance score. If you want, you can plot the results by putting them into a dataframe and passing it to ggplot:
> knn_results <- knnProfile$results
> ggplot2::ggplot(knn_results, ggplot2::aes(Variables, Kappa)) +
    ggplot2::geom_line(color = 'darkred', size = 2) +
    ggthemes::theme_economist()
The output of the preceding code is as follows:
Let's put those 26 features into a new dataframe and then add the response, train_y. This will get our data ready for training the KNN model:
> vars <- knnProfile$optVariables
> x_selected <- train_df[, (colnames(train_df) %in% vars)]
> knn_df <- cbind(x_selected, train_y)
What I like to do is use the train.kknn() function from the kknn package. We use cross-validation again within train.kknn() to select the best parameters: the optimal number of neighbors, k, and the kernel function.
The kernel argument allows you to specify an unweighted k-nearest neighbors algorithm, in which every neighbor counts equally, as well as weighting functions that weight each neighbor by its distance.
Many different methods are available for weighting the distances. For our purpose, the kknn package offers ten different weighting schemas, including unweighted: rectangular (unweighted), triangular, Epanechnikov, biweight, triweight, cosine, inverse, Gaussian, rank, and optimal. A full discussion of these weighting techniques is available in Hechenbichler, K. and Schliep, K.P. (2004).
For simplicity, let's focus on just two of them: triangular and Epanechnikov. Before the weights are assigned, the algorithm standardizes all of the distances so that they fall between zero and one. With triangular weighting, each neighbor receives a weight of one minus its standardized distance; with Epanechnikov, the weight is ¾ times (one minus the squared distance). For our problem, we'll incorporate these weighting methods along with the standard unweighted version for comparison purposes.
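To make the weighting concrete, here's a small standalone sketch of the arithmetic (not the kknn internals) that shows how a standardized distance between zero and one translates into a neighbor's weight under each of the three schemes we'll use:
# standardized distances for five hypothetical neighbors (0 = identical, 1 = farthest)
> d <- c(0.1, 0.3, 0.5, 0.7, 0.9)
# rectangular (unweighted): every neighbor gets the same weight
> w_rectangular <- rep(1, length(d))
# triangular: weight is one minus the distance
> w_triangular <- 1 - d
# Epanechnikov: weight is 3/4 times (one minus the squared distance)
> w_epanechnikov <- 0.75 * (1 - d^2)
> round(data.frame(d, w_rectangular, w_triangular, w_epanechnikov), 3)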
After specifying a random seed, we'll create the trained model object with train.kknn(). This function asks for the maximum number of k-nearest neighbors to evaluate (kmax), the Minkowski distance parameter (one is the absolute, or Manhattan, distance and two is the Euclidean distance), kcv for the number of cross-validation folds, and the kernel. For this model, kmax will be set to 25 and distance will be 1:
> knn_fit <-
    kknn::train.kknn(
      train_y ~ .,
      data = knn_df,
      distance = 1,
      kmax = 25,
      kcv = 10,
      kernel = c("rectangular", "triangular", "epanechnikov")
    )
A nice feature of the package is the ability to plot and compare the results, as follows:
> plot(knn_fit)
The following is the output of the preceding command:
This plot shows k on the x axis and the percentage of misclassified observations for each kernel on the y axis. The weighted (triangular) version with k = 17 performs the best. You can also call the object to see the classification error and the best parameters in the following way:
> knn_fit
Call:
kknn::train.kknn(formula = train_y ~ ., data = knn_df, kmax = 25, distance = 1, kernel = c("rectangular", "triangular", "epanechnikov"), kcv = 10)
Type of response variable: nominal
Minimal misclassification: 0.154754
Best kernel: triangular
Best k: 17
With the model object created, it's time to see how it performs, starting with the predicted probabilities on the training data:
> knn_pred_train <-
data.frame(predict(knn_fit, newdata = knn_df, type = "prob"))
> classifierplots::density_plot(train_y, knn_pred_train$X1)
The output of the preceding code is as follows:
The plot shows quality separation between the probability densities for events versus non-events. This should have a high area under the curve value:
> Metrics::auc(train_y, knn_pred_train$X1)
[1] 0.9460519
Almost 0.95! That's quite good, but I suspect we've overfitted and that this low bias on the training data will turn into a miss on the test set. Let's have a look, but first let's also determine the probability cut point that minimizes misclassification error:
> InformationValue::optimalCutoff(train_y, knn_pred_train$X1)
[1] 0.48
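If you want to see what this optimization is doing, the following sketch sweeps a grid of candidate cut points on the training data and keeps the one with the lowest misclassification rate. It may not match the package's search grid exactly, but it should land at or very near the same value:
# candidate probability cut points
> cuts <- seq(0.01, 0.99, by = 0.01)
# training misclassification rate at each cut point
> misclass <- sapply(cuts, function(cut) {
    pred <- ifelse(knn_pred_train$X1 >= cut, "1", "0")
    mean(pred != as.character(train_y))
  })
# the cut point with the lowest training error
> cuts[which.min(misclass)]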
So, 0.48 minimizes error on the training data. This will help us produce a confusion matrix, but first, here's the density plot and AUC for test data:
> knn_pred_test <-
data.frame(predict(knn_fit, newdata = test, type = "prob"))
> classifierplots::density_plot(test_y, knn_pred_test$X1)
The output of the preceding code is as follows:
Given the different skews in the density plots from before, it sure does look like we lost some predictive power on the test data:
> Metrics::auc(test_y, knn_pred_test$X1)
[1] 0.8592589
Indeed, our area under the curve has fallen from 0.95 to 0.86. We can drill down further into this model's performance with a confusion matrix and its associated statistics. We'll use the confusionMatrix() function from the caret package. This version provides a considerable amount of detail, and it produces all of the statistics that we need to evaluate and select the best model. You need to supply your predictions as a factor rather than probabilities, and the actual values also need to be structured as a factor. I also recommend that you specify the positive class, in other words, our events:
> pred_class <- as.factor(ifelse(knn_pred_test$X1 >= 0.48, "1", "0"))
> caret::confusionMatrix(data = pred_class, reference = test_y, positive = "1")
Confusion Matrix and Statistics
          Reference
Prediction    0    1
         0 1262  178
         1  191  292
Accuracy : 0.8081
95% CI : (0.7898, 0.8255)
No Information Rate : 0.7556
P-Value [Acc > NIR] : 0.00000002214
Kappa : 0.4853
Mcnemar's Test P-Value : 0.5322
Sensitivity : 0.6213
Specificity : 0.8685
Pos Pred Value : 0.6046
Neg Pred Value : 0.8764
Prevalence : 0.2444
Detection Rate : 0.1518
Detection Prevalence : 0.2512
Balanced Accuracy : 0.7449
'Positive' Class : 1
The function produces some items that we already covered such as Accuracy and Kappa. Here are the other statistics that it provides:
- No Information Rate is the proportion of the largest class; here, roughly 76% of observations are non-events.
- P-Value [Acc > NIR] tests the hypothesis that the accuracy is actually better than the No Information Rate.
- We'll not concern ourselves with McNemar's test, which is used for the analysis of matched pairs, primarily in epidemiology studies.
- Sensitivity is the true positive rate.
- Specificity is the true negative rate.
- The positive predictive value (Pos Pred Value) is the probability that an observation classified as an event truly is an event. The following formula is used: Pos Pred Value = (sensitivity * prevalence) / ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence))).
- The negative predictive value (Neg Pred Value) is the probability that an observation classified as a non-event truly is not an event. The formula for this is as follows: Neg Pred Value = (specificity * (1 - prevalence)) / (((1 - sensitivity) * prevalence) + (specificity * (1 - prevalence))).
- Prevalence is the estimated population prevalence of events, calculated here as the total of the second column (the 1 column) divided by the total observations.
- Detection Rate is the rate of true positives identified out of all observations; in our case, the correctly predicted events divided by the total observations.
- Detection Prevalence is the predicted prevalence rate or, in our case, the bottom row divided by the total observations.
- Balanced Accuracy is the average accuracy obtained from either class. This measure accounts for a potential bias in the classifier towards over-predicting the most frequent class. It's simply (Sensitivity + Specificity) / 2; see the verification sketch after this list.
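To tie these definitions back to the output, here's a short hand check that recomputes several of the statistics directly from the four cell counts of the confusion matrix above. It's only a verification of the formulas just listed, not a substitute for caret::confusionMatrix():
# cell counts from the confusion matrix above (positive class = 1)
> TN <- 1262; FN <- 178   # predicted 0
> FP <- 191;  TP <- 292   # predicted 1
> n <- TN + FN + FP + TP
> sensitivity <- TP / (TP + FN)        # 0.6213
> specificity <- TN / (TN + FP)        # 0.8685
> prevalence <- (TP + FN) / n          # 0.2444
> (sensitivity * prevalence) /
    ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))  # Pos Pred Value, 0.6046
> (specificity * (1 - prevalence)) /
    (((1 - sensitivity) * prevalence) + (specificity * (1 - prevalence))) # Neg Pred Value, 0.8764
> TP / n                               # Detection Rate, 0.1518
> (TP + FP) / n                        # Detection Prevalence, 0.2512
> (sensitivity + specificity) / 2      # Balanced Accuracy, 0.7449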
You can discern some model weakness in the Sensitivity and the positive predictive value. Feel free to try different distance weighting options on your own and see whether you can improve performance. Otherwise, let's proceed to SVM and compare its performance alongside what we just completed.