Advanced Machine Learning with R

Data understanding and preparation

To start, we will install the necessary packages and load the ones we need into the environment. The data is in the MASS package:

> library(magrittr)

> install.packages("caret")

> install.packages("MASS")

> library(MASS)

> install.packages("neuralnet")

> install.packages("vtreat")

The neuralnet package will be used for building the model, caret for partitioning the data, and vtreat for data preparation. Let's load the data and examine its structure:

> data(shuttle)

> str(shuttle)

The data consists of 256 observations of 7 variables: six features and the response. Notice that all of the features are categorical, and the response is use, with two levels, auto and noauto. The features are as follows:

  • stability: This is stable positioning or not (stab/xstab)
  • error: This is the size of the error (MM / SS / LX / XL)
  • sign: This is the sign of the error, positive or negative (pp/nn)
  • wind: This is the wind sign (head / tail)
  • magn: This is the wind strength (Light / Medium / Strong / Out of Range)
  • vis: This is the visibility (yes / no)

Here, we will look at a table of the response/outcome:

> table(shuttle$use)
auto noauto
145 111
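
The same counts can also be expressed as proportions. The following is a minimal base R sketch that rebuilds the response from the counts shown above instead of loading the data, so the vector name use is only a stand-in for shuttle$use:

```r
# Stand-in for shuttle$use, rebuilt from the counts in the table above
use <- factor(rep(c("auto", "noauto"), times = c(145, 111)))

# Class counts and the share of each class
counts <- table(use)
props <- round(prop.table(counts), 3)
print(props)
```

With the actual data loaded, the equivalent call would be prop.table(table(shuttle$use)).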

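Almost 57% of the time, the decision is to use the autolander. We'll now set up our training and testing data for modeling, putting 60% of the observations in the training set and dropping the response (column 7) from the feature data: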

> set.seed(1942)

> trainIndex <-
caret::createDataPartition(shuttle$use, p = .6, list = FALSE)

> shuttleTrain <- shuttle[trainIndex, -7]

> shuttleTest <- shuttle[-trainIndex, -7]
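
To illustrate what a class-stratified split does, here is a rough base R sketch. It is not caret's implementation: stratified_index() is a hypothetical helper, and the toy response y simply reuses the class counts from the table above. The idea is to sample about 60% of the rows within each class so that both sets keep a similar class balance:

```r
set.seed(1942)

# Toy response with the same class counts as shuttle$use
y <- factor(rep(c("auto", "noauto"), times = c(145, 111)))

# Hypothetical helper: sample a proportion p of the row indices within each class
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(y, p = 0.6)
train_y <- y[train_idx]
test_y <- y[-train_idx]

# The class shares should be close to each other in both sets
print(round(prop.table(table(train_y)), 3))
print(round(prop.table(table(test_y)), 3))
```

Because the sampling happens within each class, neither set can end up with a badly skewed share of auto decisions, which matters with only 256 observations.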

We are going to treat the data to create numeric features, and also drop the catP features that the function creates. We covered the idea of treating a dataframe in Chapter 1, Preparing and Understanding Data:

> treatShuttle <- vtreat::designTreatmentsZ(shuttleTrain, colnames(shuttleTrain))

> train_treated <- vtreat::prepare(treatShuttle, shuttleTrain)

> train_treated <- train_treated[, c(-1, -2)]

> test_treated <- vtreat::prepare(treatShuttle, shuttleTest)

> test_treated <- test_treated[, c(-1, -2)]
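
For intuition about what turning categorical features into numeric ones means, the following base R sketch builds one 0/1 indicator column per factor level. This is only an illustration on a made-up four-row dataframe, not what vtreat does internally, and the column names it produces are an assumption of this sketch:

```r
# Made-up data with two categorical features
df <- data.frame(
  wind = factor(c("head", "tail", "head", "tail")),
  vis  = factor(c("yes", "no", "no", "yes"))
)

# For each factor, build one 0/1 indicator column per level
indicators <- do.call(cbind, lapply(names(df), function(v) {
  m <- model.matrix(~ x - 1, data = data.frame(x = df[[v]]))
  colnames(m) <- paste(v, levels(df[[v]]), sep = "_")
  m
}))

numeric_df <- as.data.frame(indicators)
print(numeric_df)
```

A treatment plan adds value over this manual approach because it is designed once on the training data and then applied unchanged to the test data, as the prepare() calls above do.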

I find the next couple of portions of code awkward. Because neuralnet() requires a formula and the data in a dataframe, we have to turn the response into a numeric vector and then add it to our treated train and test data:

> shuttle_trainY <- shuttle[trainIndex, 7]

> train_treated$y <- ifelse(shuttle_trainY == "auto", 1, 0)

> shuttle_testY <- shuttle[-trainIndex, 7]

> test_treated$y <- ifelse(shuttle_testY == "auto", 1, 0)
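
The recoding step can be checked on a small made-up factor. This sketch assumes nothing beyond base R, and y_raw is a hypothetical stand-in for the response column:

```r
# Hypothetical stand-in for the response column
y_raw <- factor(c("auto", "noauto", "auto", "auto"))

# 1 when the autolander is used, 0 otherwise
y_num <- ifelse(y_raw == "auto", 1, 0)

# An equivalent alternative that converts the logical comparison directly
y_alt <- as.numeric(y_raw == "auto")

print(y_num)
```

Both forms give the same numeric vector; the comparison is made against the level label, so the result does not depend on the order of the factor levels.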

The function in neuralnet calls for a formula of the kind we have used elsewhere, such as y ~ x1 + x2 + x3 + x4, data = df. In the past, we used y ~ . to specify all the other variables in the data as inputs. However, neuralnet does not accommodate this at the time of writing. The way around this limitation is to use the as.formula() function. After first creating an object of the variable names, we will use this as an input to paste the variables properly on the right-hand side of the equation:

> n <- names(train_treated)

> form <- as.formula(paste("y ~", paste(n[!n %in% "y"], collapse = " + ")))

The object form gives us what we need to build our model.
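
To see what this construction produces, here is a small sketch on hypothetical column names (the real names come from the treated dataframe and will differ). It also shows reformulate(), a base R function that builds the same formula without pasting strings by hand:

```r
# Hypothetical column names standing in for names(train_treated)
n <- c("stability_stab", "wind_head", "vis_yes", "y")

# Everything except the response goes on the right-hand side
predictors <- n[!n %in% "y"]
form <- as.formula(paste("y ~", paste(predictors, collapse = " + ")))

# Alternative: let reformulate() assemble the formula
form_alt <- reformulate(predictors, response = "y")

print(form)
print(form_alt)
```

Either approach yields an explicit formula with every predictor spelled out, which is the form neuralnet() expects.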