
Training and testing datasets
To measure the predictive accuracy of our models, we need to set aside some observations to validate our results. This means that our data will be split into three different groups:
- Training data
- Testing data
- Predicting data
The predicting data is the data for which we don't have complete cases; specifically, these are wards whose Vote and Proportion variables have NA values. Our final objective is to predict the Proportion and Vote variables for these wards using what we can learn from the wards for which we do have data for these variables, and we'll do so toward the end of the chapter.
The data that has complete cases will be split into two parts: training data and testing data. Training data is used to extract knowledge and learn the relationships among variables. Testing data is treated as if its Proportion and Vote values were NA, and we produce predictions for them. These predictions are then compared to the real values in the corresponding observations, which gives us an objective measure of how good our predictions are, since those observations are never seen by the trained models.
We created the predicting data in the previous section, and we called it data_incomplete. To create the training and testing data, we use the sample() function. It takes as input a vector of numbers from which it picks a certain number of values (size). The vector of numbers goes from 1 to the total number of observations available in the data with complete cases. We specify that around 70% of the available observations should be picked for the training data, and use the replace = FALSE argument so that no observation is picked more than once (that is, we sample without replacement).
The testing data is composed of the remaining 30% of the observations. Note that sample is not a Boolean vector; it's a vector containing the row indices that were picked for the training data. Prepending a minus sign (-) to it uses R's negative indexing, which selects every row except those whose indices appear in the vector, effectively giving us the complement of the training data. To understand this, let's look at the following code:
set.seed(12345)
n <- nrow(data)
sample <- sample(1:n, size = round(0.7 * n), replace = FALSE)
data_train <- data[sample, ]
data_test <- data[-sample, ]
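To see that this really partitions the data into non-overlapping 70%/30% parts, here is a minimal sketch using a toy data frame standing in for our complete cases (the data frame below is illustrative, not the book's actual dataset):

```r
set.seed(12345)

# Toy stand-in for the data with complete cases (hypothetical values)
data <- data.frame(Proportion = runif(100), Vote = runif(100))

n <- nrow(data)
sample <- sample(1:n, size = round(0.7 * n), replace = FALSE)
data_train <- data[sample, ]   # rows whose indices are in 'sample'
data_test <- data[-sample, ]   # all remaining rows (negative indexing)

nrow(data_train)  # 70
nrow(data_test)   # 30

# No observation appears in both parts
any(rownames(data_train) %in% rownames(data_test))  # FALSE
```

Because sample holds 70 distinct row indices, negative indexing with -sample selects exactly the other 30 rows, so the two parts together reconstruct the original data with no duplicates.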
If we repeated this process several times, we would get different samples for the training and testing sets each time, and this may confuse us about our results. This is because the sample() function is stochastic, meaning that it uses a pseudo-random number generator to make the selection for us (computers cannot generate true randomness; they simulate numbers that appear to be random even though they are not, which is why it's called pseudo-random). If we want our process to be reproducible, meaning that the exact same samples are selected every time we run it, then we must specify an initial seed before applying this process to precondition the pseudo-random number generator. To do so, we pass an integer to the set.seed() function, as we do at the beginning of the code snippet. The seed must stay fixed to reproduce the same samples, and with it in place, every time we generate a random sample, we will get the same sample, so our results are reproducible.
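The effect of the seed can be checked directly. This short sketch draws the same sample twice with the seed reset in between, and once more without resetting it (the vector 1:10 and the seed value are arbitrary choices for illustration):

```r
# Fixing the seed before each call makes sample() reproducible
set.seed(12345)
s1 <- sample(1:10, size = 5)

set.seed(12345)
s2 <- sample(1:10, size = 5)

identical(s1, s2)  # TRUE: same seed, same sample

# Without resetting the seed, the generator continues from its
# current state, so a further call gives a different sample
s3 <- sample(1:10, size = 5)
identical(s1, s3)
```

This is exactly why set.seed(12345) appears at the top of our splitting code: anyone re-running the snippet gets the same training and testing partition we did.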