
Generating model combinations
The first thing we need to do is develop a way of getting the combinations of regressors we want to test. Since this is a combinatorial problem, the number of combinations grows exponentially with the number of available options. In our case, with the 19 available variables, the number of possible models is the sum of the number of models we can create with one regressor, plus the number of models we can create with two regressors, and so on, until we add the number of models we can create with all 19 regressors. This is what the sum is:

\sum_{k=1}^{19} \binom{19}{k} = 2^{19} - 1 = 524,287
Of course, computing so many models, although easy for a computer, may take a while, so we want to limit the minimum and maximum number of regressors allowed in the combinations. To do so, we specify the minimum and maximum percentage of regressors that will be included in the min_percentage and max_percentage parameters, respectively. In our case, if we specify min_percentage = 0.9 and max_percentage = 1.0, we're asking for all combinations that contain between 17 and 19 of the regressors, which adds up to 191 models. Imagine the time it would take you to generate 191 model specifications manually! Hopefully thinking about that will make you realize the power of this technique.
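We can verify both of these numbers directly in R with the choose() function (a quick check, not part of the chapter's code): the full sum of combinations and the restricted range we will actually use.

# All non-empty combinations of the 19 regressors
sum(choose(19, 1:19))
#> [1] 524287

# Combinations using between 17 and 19 regressors (90% to 100% of 19)
sum(choose(19, 17:19))
#> [1] 191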
To start, we create the generate_combinations_unvectorized() function that will output a list with all the possible combinations given the variables and the min_percentage and max_percentage parameters mentioned earlier. The first thing we do is remove the Proportion variable by specifying it as FALSE in the variables vector (the variables object here corresponds to the numerical_variables object, but we have adjusted its name within this function to make it more readable). The other unwanted variables (NVotes, Leave, Vote, and RegionName) were removed in the get_numerical_variable_names() function at the beginning of the chapter. Next, we get the actual names of the variables with TRUE values so that we can work with strings rather than Booleans. After that, we compute the total number of variables as n, and the actual numbers of variables we will include in the combinations by taking the percentage parameters, multiplying them by the number of variables, and taking either the floor or the ceiling of that number to make sure that we include the extremes. Then, we initialize the all_combinations object that will contain the list of combinations we want. The next part is the progress bar object, which we won't explain as we have used it before.
The actual work is done inside the for loop. Notice that it goes from the minimum to the maximum number of variables we want inside our combinations. In each iteration, we compute the combinations with the combn() function, which returns a matrix where each column represents a different combination and each row contains one of the variables in that particular combination. This means that we need to add each of those columns to our total list of combinations (all_combinations), which is what we do inside the nested for loop. Finally, since we have nested lists, we use the unlist() function to bring them to the same level, but we don't want to do it recursively because we would just end up with a single long list and we wouldn't be able to differentiate one combination from another.
I encourage you to change the return statement to avoid using the recursive = FALSE parameter, as well as to avoid using the unlist() function altogether. Doing so will quickly show you what effect they have on the function's output, and why we need them (a small toy example after the function's code illustrates the difference).
library(progress)

generate_combinations_unvectorized <- function(variables, min_percentage, max_percentage) {
    variables[["Proportion"]] <- FALSE
    variables <- names(variables[variables == TRUE])
    n <- length(variables)
    n_min <- floor(n * min_percentage)
    n_max <- ceiling(n * max_percentage)
    all_combinations <- NULL
    progress_bar <- progress_bar$new(
        format = "Progress [:bar] :percent ETA: :eta",
        total = length(n_min:n_max)
    )
    for (k in n_min:n_max) {
        progress_bar$tick()
        combinations <- combn(variables, k)
        for (column in 1:ncol(combinations)) {
            new_list <- list(combinations[, column])
            all_combinations <- c(all_combinations, list(new_list))
        }
    }
    return(unlist(all_combinations, recursive = FALSE))
}
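To see why we flatten only one level, consider the following toy example (not part of the chapter's code), which mimics the nested structure that all_combinations ends up with:

# Each element of the outer list is itself a list holding one combination
nested <- list(list(c("a", "b")), list(c("a", "c")))

# One level of flattening keeps the combinations separate (what we want)
unlist(nested, recursive = FALSE)
#> [[1]]
#> [1] "a" "b"
#>
#> [[2]]
#> [1] "a" "c"

# Full (recursive) flattening collapses everything into one long vector,
# and we can no longer tell where one combination ends and the next begins
unlist(nested)
#> [1] "a" "b" "a" "c"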
A sample of the output that the generate_combinations_unvectorized() function produces is shown next. As you can see, it's a list where each element is a vector of type character. The first combination contains only 17 variables, which is the minimum number of variables used when the total number of variables is 19 and the minimum percentage requested is 90%. The last combination (combination number 191) contains all 19 variables and corresponds to the model we built manually earlier in this chapter:
combinations <- generate_combinations_unvectorized(
    numerical_variables,
    0.9,
    1.0
)

combinations

[[1]]
 [1] "Residents"     "Households"    "White"         "Owned"
 [5] "OwnedOutright" "SocialRent"    "PrivateRent"   "Students"
 [9] "Unemp"         "UnempRate_EA"  "HigherOccup"   "Density"
[13] "Deprived"      "MultiDepriv"   "Age_18to44"    "Age_45plus"
[17] "NonWhite"

...

[[191]]
 [1] "Residents"          "Households"         "White"
 [4] "Owned"              "OwnedOutright"      "SocialRent"
 [7] "PrivateRent"        "Students"           "Unemp"
[10] "UnempRate_EA"       "HigherOccup"        "Density"
[13] "Deprived"           "MultiDepriv"        "Age_18to44"
[16] "Age_45plus"         "NonWhite"           "HighEducationLevel"
[19] "LowEducationLevel"
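A quick sanity check (not part of the chapter's code) confirms that we indeed got the 191 combinations we expected:

length(combinations)
#> [1] 191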
Getting only those combinations that contain between 90% and 100% of the variables may seem a bit restrictive. What if we want to generate all possible combinations? In that case, we would change the min_percentage argument to 0, but the function may not finish in a practical amount of time. The reason is that our generate_combinations_unvectorized() function, as the name implies, is not vectorized, and, even worse, has nested for loops. This is a huge bottleneck in this particular case, and it's something you want to look out for in your own code. One possible solution is to make a vectorized version of the function. For those of you interested, we have included a file named vectorized_vs_unvectorized.R in this book's code repository (https://github.com/PacktPublishing/R-Programming-By-Example) that shows such an implementation. We also include some tests that will show you just how much faster the vectorized implementation is. Just to give you a spoiler, it can be hundreds of times faster! For those cases where vectorizing and other approaches that depend only on R itself are not good enough, you can try delegating the task to a faster (compiled) language. We will see how to do that in Chapter 9, Implementing an Efficient Simple Moving Average.
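To give you a rough idea of what such an alternative looks like, here is a minimal sketch (with a hypothetical name; the actual implementation in vectorized_vs_unvectorized.R may differ) that removes the explicit nested for loops by asking combn() to return each combination as a list element via simplify = FALSE:

# Hypothetical sketch; not necessarily the repository's implementation
generate_combinations_sketch <- function(variables, min_percentage, max_percentage) {
    variables[["Proportion"]] <- FALSE
    variables <- names(variables[variables == TRUE])
    n <- length(variables)
    n_min <- floor(n * min_percentage)
    n_max <- ceiling(n * max_percentage)
    # combn(..., simplify = FALSE) already returns a list of character vectors,
    # so we only need to flatten one level of the resulting list of lists
    return(unlist(
        lapply(n_min:n_max, function(k) combn(variables, k, simplify = FALSE)),
        recursive = FALSE
    ))
}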
Going back to our example, the next thing to do is to create the find_best_fit() function, which will go through each of the combinations generated, use the data_train data to train a model with the corresponding combination, test its accuracy with the selected measure (either Proportion (numerical) or Vote (categorical)), and save the corresponding score in a scores vector. Then, it will find the index of the optimal score by finding either the minimum or the maximum score, depending on the measure we're using (Proportion requires us to minimize, while Vote requires us to maximize), and finally it will recreate the optimal model, print its information, and return the model to the user. The compute_model_and_fit(), compute_score(), and print_best_model_info() functions will be developed next, as we're following a top-down approach:
find_best_fit <- function(measure, data_train, data_test, combinations) {
    n_cases <- length(combinations)
    progress_bar <- progress_bar$new(
        format = "Progress [:bar] :percent ETA: :eta",
        total = n_cases
    )
    scores <- lapply(1:n_cases, function(i) {
        progress_bar$tick()
        results <- compute_model_and_fit(combinations[[i]], data_train)
        score <- compute_score(measure, results[["fit"]], data_test)
        return(score)
    })
    i <- ifelse(measure == "Proportion", which.min(scores), which.max(scores))
    best_results <- compute_model_and_fit(combinations[[i]], data_train)
    best_score <- compute_score(measure, best_results[["fit"]], data_test)
    print_best_model_info(i, best_results[["model"]], best_score, measure)
    return(best_results[["fit"]])
}
Next, we create the compute_model_and_fit() function, which simply generates the formula for the selected combination and uses it within the lm() function. As you can see in the combinations object we returned previously from the generate_combinations_unvectorized() function, each combination is a character vector, not a formula we can pass to the lm() function; this is why we need the generate_model() function, which takes one of these vectors, concatenates its elements into a single string with the plus (+) sign between them by using the paste() function with the collapse = " + " argument, and prepends the Proportion ~ string to it. This gives us back a formula object specified by a string like Proportion ~ Residents + ... + NonWhite, which contains, instead of the dots, all the variables in the first combination shown in the preceding code. This formula is then used inside the lm() function to execute our regression, and both model and fit are returned within a list to be used in the following steps:
compute_model_and_fit <- function(combination, data_train) {
    model <- generate_model(combination)
    return(list(model = model, fit = lm(model, data_train)))
}

generate_model <- function(combination) {
    sum <- paste(combination, collapse = " + ")
    return(formula(paste("Proportion", "~", sum)))
}
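For instance, applying it to a couple of the regressor names (a quick illustration, not part of the chapter's code) produces exactly the kind of formula that lm() expects:

# deparse() shows the text of the resulting formula object
deparse(generate_model(c("Residents", "Households")))
#> [1] "Proportion ~ Residents + Households"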
As can be seen from the score <- compute_score(measure, results[["fit"]], data_test) line, the compute_score() function receives a measure object, a fit object (which comes from the results list), and the data used for testing. It computes the score using the strategy pattern mentioned earlier for the plots used to check the normality assumption. Basically, depending on the value of the measure string (the chosen strategy), it will choose one of two functions that share the same signature, and that function will be used to compute the final score from the predictions. We send the se.fit = TRUE parameter to the predict() function we have seen before because we want the standard errors to be returned as well, in case we use the numerical score, which requires them. The score_proportions() and score_votes() functions were defined previously in this chapter:
compute_score <- function(measure, fit, data_test) {
    if (measure == "Proportion") {
        score <- score_proportions
    } else {
        score <- score_votes
    }
    predictions <- predict(fit, data_test, se.fit = TRUE)
    return(score(data_test, predictions))
}
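To make the shared-signature idea concrete, here are two simplified stand-ins (hypothetical sketches with made-up names; the chapter's actual score_proportions() and score_votes() definitions appeared earlier and may differ): both take the same (data_test, predictions) arguments, one produces a numerical error to minimize and the other a count of correct classifications to maximize.

# Hypothetical simplified stand-ins; both share the signature (data_test, predictions)
score_proportions_sketch <- function(data_test, predictions) {
    # Numerical measure to minimize: squared errors scaled by the standard
    # errors, which is why predict() is called with se.fit = TRUE
    return(sum(((predictions$fit - data_test$Proportion) / predictions$se.fit)^2))
}

score_votes_sketch <- function(data_test, predictions) {
    # Categorical measure to maximize: number of correctly classified votes,
    # assuming Vote holds "Leave"/"Remain" labels and a predicted proportion
    # above 0.5 means "Leave"
    predicted <- ifelse(predictions$fit > 0.5, "Leave", "Remain")
    return(sum(predicted == data_test$Vote))
}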
Finally, we create a little convenience function called print_best_model_info() that will print the results for the best model found. It simply takes the index of the best model, the model formula, its score, and the measure type, and prints all of that for the user. As you can see, since the model object is not a simple string but a formula object, we need to work with it a little to get the results we want by converting it into a string and splitting it on the plus sign (+) we know is included; otherwise, it would be printed as a single very long string:
print_best_model_info <- function(i, model, best_score, measure) {
    print("*************************************")
    print(paste("Best model number:", i))
    print(paste("Best score: ", best_score))
    print(paste("Score measure: ", measure))
    print("Best model:")
    print(strsplit(toString(model), "\\+"))
    print("*************************************")
}
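As a side note (a quick illustration, not part of the chapter's code), the reason the printed output further down starts with "~, Proportion, ..." is that toString() collapses the three parts of a formula object (the ~ operator, the response, and the right-hand side) with commas before we split on the plus signs:

toString(Proportion ~ Residents + Households)
#> [1] "~, Proportion, Residents + Households"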
We can find the best model according to the Proportion measure by calling the following:
best_lm_fit_by_proportions <- find_best_fit(
    measure = "Proportion",
    data_train = data_train,
    data_test = data_test,
    combinations = combinations
)
#> [1] "*************************************"
#> [1] "Best model number: 3"
#> [1] "Best score: 10.2362983528259"
#> [1] "Score measure: Proportion"
#> [1] "Best model:"
#> [[1]]
#> [1] "~, Proportion, Residents " " Households "
#> [3] " White " " Owned "
#> [5] " OwnedOutright " " SocialRent "
#> [7] " PrivateRent " " Students "
#> [9] " Unemp " " UnempRate_EA "
#> [11] " HigherOccup " " Density "
#> [13] " Deprived " " MultiDepriv "
#> [15] " Age_18to44 " " Age_45plus "
#> [17] " LowEducationLevel"
#> [1] "*************************************"
As we can see, the best model was the third one out of the 191 models, and it had a score of 10.23. We can also see the regressors used in the model. NonWhite and HighEducationLevel were left out by the optimization procedure, probably because their counterparts (White and LowEducationLevel) already contain all the information necessary for their respective groups. It's no coincidence that those are among the most representative variables in the data.
To find the best model according to the Vote measure, we use the following code. Note that given the good techniques we used to create this function, all we have to do is change the value of the measure parameter to optimize our search using a different approach:
best_lm_fit_by_votes <- find_best_fit(
    measure = "Vote",
    data_train = data_train,
    data_test = data_test,
    combinations = combinations
)
#> [1] "*************************************"
#> [1] "Best model number: 7"
#> [1] "Best score: 220"
#> [1] "Score measure: Vote"
#> [1] "Best model:"
#> [[1]]
#> [1] "~, Proportion, Residents " " Households "
#> [3] " White " " Owned "
#> [5] " OwnedOutright " " SocialRent "
#> [7] " PrivateRent " " Students "
#> [9] " Unemp " " UnempRate_EA "
#> [11] " HigherOccup " " Density "
#> [13] " Deprived " " MultiDepriv "
#> [15] " Age_45plus " " NonWhite "104
#> [17] " HighEducationLevel"
#> [1] "*************************************"
In this case, the best model was the seventh one out of the 191 models, with 220 out of 241 correct predictions, which gives us an accuracy of 91%, an improvement over the accuracy we had computed earlier in the chapter. This time, LowEducationLevel and Age_18to44 were left out. Again, it's no coincidence that these are among the most important variables in the data.