Building neural network models
The code is in the Chapter2 folder of the code for this book. If you have not already downloaded and unzipped the code, go back to Chapter 1, Getting Started with Deep Learning, for the download link. Unzip the code into a folder on your machine and you will see folders for the different chapters. The code we will be following is in Chapter2\chapter2.R.
We will use the MNIST dataset to build some neural network models. The first few lines in the script check whether the data file (train.csv) is in the data directory. If the file already exists, the script proceeds; if it doesn't, the script downloads a ZIP file from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/R/data/mnist_csv.zip and unzips it into the data folder. This check means that you don't have to download the data manually and the program only downloads the file once. Here is the code to download the data:
dataDirectory <- "../data"
# only download and unzip the data if train.csv is not already present
if (!file.exists(paste(dataDirectory,'/train.csv',sep="")))
{
  link <- 'https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/R/data/mnist_csv.zip'
  if (!file.exists(paste(dataDirectory,'/mnist_csv.zip',sep="")))
    download.file(link, destfile = paste(dataDirectory,'/mnist_csv.zip',sep=""))
  unzip(paste(dataDirectory,'/mnist_csv.zip',sep=""), exdir = dataDirectory)
  # the ZIP file also contains test.csv, which we do not use
  if (file.exists(paste(dataDirectory,'/test.csv',sep="")))
    file.remove(paste(dataDirectory,'/test.csv',sep=""))
}
As an alternative, the MNIST data is also available through the keras package, so we can download it from there and save it as a CSV file:
if (!file.exists(paste(dataDirectory,'/train.csv',sep="")))
{
  library(keras)
  # download MNIST via keras and unpack it into train and test arrays
  mnist <- dataset_mnist()
  c(c(x_train,y_train),c(x_test,y_test)) %<-% mnist
  # flatten each 28x28 image into a row of 784 pixel values
  x_train <- array_reshape(x_train,c(dim(x_train)[1],dim(x_train)[2]*dim(x_train)[3]))
  y_train <- array_reshape(y_train,c(length(y_train),1))
  data_mnist <- as.data.frame(cbind(y_train,x_train))
  colnames(data_mnist)[1] <- "label"
  # name the pixel columns pixel0 to pixel783 to match the downloaded CSV
  colnames(data_mnist)[2:ncol(data_mnist)] <- paste("pixel",seq(0,783),sep="")
  write.csv(data_mnist,paste(dataDirectory,'/train.csv',sep=""),row.names = FALSE)
}
When you load any new dataset for the first time, the first thing you should do is a quick check to ensure that the number of rows and columns is as expected, as shown in the following code:
digits <- read.csv("../data/train.csv")
dim(digits)
[1] 42000 785
head(colnames(digits), 4)
[1] "label" "pixel0" "pixel1" "pixel2"
tail(colnames(digits), 4)
[1] "pixel780" "pixel781" "pixel782" "pixel783"
head(digits[, 1:4])
label pixel0 pixel1 pixel2
1 1 0 0 0
2 0 0 0 0
3 1 0 0 0
4 4 0 0 0
5 0 0 0 0
6 0 0 0 0
The data looks OK: we have 42,000 rows and 785 columns, the header was imported correctly, and the values are numeric. Now that we have loaded the data and performed some validation checks on it, we can move on to modeling. Our first model will use the neuralnet library because it allows us to visualize the neural network. We will select only the rows where the label is either 5 or 6 and create a binary classification task to differentiate between them. Of course, you can pick any digits you choose, but 5 and 6 are a good choice because they are graphically similar, so our model will have to work harder than if we picked two digits that are easy to tell apart, for example, 1 and 8. We rename the labels as 0 and 1 for modeling and then split the data into a train set and a test set.
We then perform dimensionality reduction using principal component analysis (PCA) on the training data. We use PCA because we want to reduce the number of predictor variables to a number that is reasonable for plotting. PCA requires that we first remove columns with zero variance, that is, columns that have the same value for every instance; in our image data, these are the border pixels, which are zero in every image. Note that we find the zero-variance columns using only the data that will be used to train the model; it would be incorrect to apply this check first and then split the data for modeling.
Dimensionality reduction produces a dataset with the same number of rows but fewer columns. Crucially, though, these fewer columns can explain most of the signal in the input dataset. PCA is one such dimensionality-reduction algorithm. We use it here because we want a dataset with a small number of columns so that we can plot the network, while still allowing the algorithm to produce good results.
The following code selects the rows where the label is either 5 or 6, creates a train/test split, and removes the zero-variance columns:
# keep only the rows where the digit is a 5 or a 6
digits_nn <- digits[(digits$label==5) | (digits$label==6),]
digits_nn$y <- digits_nn$label
digits_nn$label <- NULL
table(digits_nn$y)
   5    6
3795 4137
# recode the labels to 0/1 for binary classification
digits_nn$y <- ifelse(digits_nn$y==5, 0, 1)
table(digits_nn$y)
   0    1
3795 4137
# 80/20 train/test split
set.seed(42)
sample <- sample(nrow(digits_nn), nrow(digits_nn)*0.8)
test <- setdiff(seq_len(nrow(digits_nn)), sample)
# keep only the predictor (pixel) columns whose variance in the training rows is non-zero
digits.X2 <- digits_nn[,apply(digits_nn[sample,1:(ncol(digits_nn)-1)], 2, var, na.rm=TRUE) != 0]
length(digits.X2)
[1] 624
We have reduced the number of columns from 784 to 624; that is, 160 columns had the same value for all rows in the training data. Now, we perform PCA on the training data and plot the cumulative sum of the explained variance:
# fit PCA on the training rows only, centering and scaling each column
df.pca <- prcomp(digits.X2[sample,],center = TRUE,scale. = TRUE)
s<-summary(df.pca)
# cumulative proportion of variance explained by the components
cumprop<-s$importance[3, ]
plot(cumprop, type = "l",main="Cumulative sum",xlab="PCA component")
The cumulative sum of the explained variance shows how many principal components are needed to explain a given proportion of the variance in the input data. In layman's terms, this plot shows that we can use the first 100 or so variables (the principal components) and account for over 80% of the variance in the original data:
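Rather than reading the cut-off from the plot by eye, we can query the cumprop vector computed previously; the following is a quick check for the 80% figure mentioned above:
# smallest number of principal components whose cumulative
# explained variance exceeds 80%
min(which(cumprop > 0.8))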
The next code block selects the principal components that account for 50% of the variance and uses those variables to create a neural network:
# smallest number of components that explain at least 50% of the variance
num_cols <- min(which(cumprop>0.5))
cumprop[num_cols]
   PC23
0.50275

library(neuralnet)
# data frame containing the selected principal components plus the label
newdat<-data.frame(df.pca$x[,1:num_cols])
newdat$y<-digits_nn[sample,"y"]
# build the formula y ~ PC1 + PC2 + ... programmatically
col_names <- names(newdat)
f <- as.formula(paste("y ~", paste(col_names[!col_names %in% "y"],collapse="+")))
# two hidden layers, with 4 and 2 nodes
nn <- neuralnet(f,data=newdat,hidden=c(4,2),linear.output = FALSE)
We can see that 50% of the variance in the original data can be accounted for by only 23 principal components. Next, we plot the neural network by calling the plot function:
plot(nn)
This produces a plot similar to the following screenshot. We can see the input variables (PC1 to PC23), the hidden layers and biases, and even the network weights:
We selected 23 principal components as predictors for our neural network. We chose two hidden layers, the first with four nodes and the second with two nodes. The plot also shows the network weights, although they are not all decipherable from the plot; there are functions to access them if required.
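If you want to inspect the fitted weights numerically, they are stored on the nn object; the following is a minimal sketch using the nn object created above:
# list of weight matrices for the first repetition:
# input -> hidden layer 1, hidden layer 1 -> hidden layer 2, hidden layer 2 -> output
str(nn$weights[[1]])
# summary matrix containing the error, the number of steps, and all the weights
head(nn$result.matrix)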
Next, we will create predictions on a holdout, or test, dataset that was not used to build either the dimensionality-reduction step or the neural network model. We first call the predict function on the df.pca object created earlier, passing in the test rows, to get the principal components for the test data. We then pass this data (filtering the columns to the first 23 principal components) into the neural network prediction and show the confusion matrix and overall accuracy:
# project the test rows onto the principal components fit on the training data
test.data <- predict(df.pca, newdata=digits_nn[test,colnames(digits.X2)])
test.data <- as.data.frame(test.data)
# score the neural network on the first 23 principal components
preds <- compute(nn,test.data[,1:num_cols])
preds <- ifelse(preds$net.result > 0.5, "1", "0")
t<-table(digits_nn[test,"y"], preds,dnn=c("Actual", "Predicted"))
acc<-round(100.0*sum(diag(t))/sum(t),2)
print(t)
      Predicted
Actual   0   1
     0 740  17
     1  17 813
print(sprintf(" accuracy = %1.2f%%",acc))
[1] " accuracy = 97.86%"
We achieved 97.86% accuracy, which is not bad considering we only used 23 principal components in our neural network. It is important to note that these 23 variables are not directly comparable to any columns in the input dataset. In fact, PCA is designed to produce new columns (the principal components) that are uncorrelated with each other while still capturing most of the variance in the original data.
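As a quick sanity check of this property, we can look at the correlations among the first few principal components in our training data (a small sketch using the newdat data frame created earlier); the off-diagonal entries should be approximately zero:
# correlations among the first five principal components;
# the off-diagonal values should be close to zero
round(cor(newdat[, 1:5]), 3)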
Next, we will move on to create models that perform multi-class classification, that is, they classify the digits 0-9. We will convert the labels (the digits 0 to 9) to a factor so R knows that this is a classification problem rather than a regression problem. For real-world problems, you should use all the data available, but if we used all 42,000 rows, it would take a very long time to train using the neural network packages in R. We will select 5,000 rows for training and 1,000 rows for testing. We select the rows at random and ensure that there is no overlap between the rows in our training and test datasets. We also separate the data into the features or predictors (digits.X) and the outcome (digits.y). We use all the columns except the label as predictors:
# draw 6,000 rows at random: 5,000 for training and 1,000 for testing
sample <- sample(nrow(digits), 6000)
train <- sample[1:5000]
test <- sample[5001:6000]
digits.X <- digits[train, -1]
# keep a numeric copy of the training labels before the factor conversion
digits.y_n <- digits[train, 1]
# convert the label to a factor so that this is treated as classification
digits$label <- factor(digits$label, levels = 0:9)
digits.y <- digits[train, 1]
digits.test.X <- digits[test, -1]
digits.test.y <- digits[test, 1]
rm(sample,train,test)
Finally, before we get started building our neural network, let's quickly check the distribution of the digits. This can be important because, for example, if one digit occurs very rarely, we may need to adjust our modeling approach to ensure it is given enough weight in the performance evaluation, assuming we care about accurately predicting that specific digit. The following code snippet creates a bar plot showing the frequency of each digit label:
barplot(table(digits.y),main="Distribution of y values (train)")
We can see from the plot that the categories are fairly evenly distributed, so there is no need to increase the weight or importance given to any particular one:
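If you prefer a numeric check to a plot, the class proportions can also be printed directly; here is a small sketch using the digits.y vector created above:
# proportion of each digit in the training labels
round(prop.table(table(digits.y)), 3)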
Now let's build and train our first neural network using the nnet package through the caret package wrapper. First, we use the set.seed() function with a specific seed so that the results are reproducible. The exact seed is not important; what matters is that the same seed is used each time you run the script. The train() function takes the feature or predictor data (x) and then the outcome variable (y) as arguments. It can work with a variety of models, determined via the method argument. Although many aspects of machine learning models are learned automatically, some parameters have to be set. These vary by method; for example, in neural networks, one parameter is the number of hidden units. The train() function provides an easy way to try a variety of these tuning parameters by passing a named data frame to the tuneGrid argument. It returns the performance measures for each set of tuning parameters and returns the best trained model. We will start with just five hidden neurons in our model and a modest decay rate. The learning rate controls how much each iteration or step can influence the current weights. The decay rate is the regularization hyper-parameter, which is used to prevent the model from overfitting. Another argument, trControl, controls additional aspects of train() and, when a variety of tuning parameters are being evaluated, tells the caret package how to validate them and pick the best set.
For this example, we will set the method for training control to none as we only have one set of tuning parameters being used here. Finally, at the end, we can specify additional, named arguments that are passed on to the actual nnet() function (or whatever algorithm is specified). Because of the number of predictors (784), we increase the maximum number of weights to 10,000 and specify a maximum of 100 iterations. Due to the relatively small amount of data, and the paucity of hidden neurons, this first model does not take too long to run:
library(caret)

set.seed(42)
tic <- proc.time()
# single hidden layer with 5 neurons and a weight decay of 0.1
digits.m1 <- caret::train(digits.X, digits.y,
           method = "nnet",
           tuneGrid = expand.grid(
             .size = c(5),
             .decay = 0.1),
           trControl = trainControl(method = "none"),
           # 784 inputs create many weights, so raise nnet's default limit
           MaxNWts = 10000,
           maxit = 100)
print(proc.time() - tic)
   user  system elapsed
  54.47    0.06   54.85
The predict() function generates a set of predictions for new data. We will use the test dataset to evaluate the model; it contains records that were not used to train the model. We then examine the distribution of the predicted digits with a bar plot:
digits.yhat1 <- predict(digits.m1,newdata=digits.test.X)
accuracy <- 100.0*sum(digits.yhat1==digits.test.y)/length(digits.test.y)
print(sprintf(" accuracy = %1.2f%%",accuracy))
[1] " accuracy = 54.80%"
barplot(table(digits.yhat1),main="Distribution of y values (model 1)")
It is clear that this is not a good model because the distribution of the predicted values is very different from the distribution of the actual values:
The bar plot is a simple check of the predictions and shows that our model is not very accurate. We can also calculate the accuracy as the percentage of predictions that match the actual values; for this model it is 54.8%, which is not good. A more formal evaluation of the model's performance is possible using the confusionMatrix() function in the caret package. Because there is a function with the same name in the RSNNS package, one may mask the other, so we call it as caret::confusionMatrix to ensure the version from the caret package is used. The following code shows the confusion matrix and performance metrics on the test set:
caret::confusionMatrix(xtabs(~digits.yhat1 + digits.test.y))
Confusion Matrix and Statistics
digits.test.y
digits.yhat1 0 1 2 3 4 5 6 7 8 9
0 61 1 0 1 0 2 0 0 0 1
1 1 104 0 2 0 4 3 9 12 8
2 6 2 91 56 4 20 68 1 41 1
3 0 0 0 0 0 0 0 0 0 0
4 2 0 4 1 67 1 22 4 2 21
5 39 0 6 45 4 46 0 5 30 16
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 6 9 0 0 91 2 75
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 3 0 0
Overall Statistics
Accuracy : 0.46
95% CI : (0.4288, 0.4915)
No Information Rate : 0.122
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4019
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity 0.5596 0.9720 0.9010 0.000 0.7976 0.6301 0.000
Specificity 0.9944 0.9563 0.7786 1.000 0.9378 0.8436 1.000
Pos Pred Value 0.9242 0.7273 0.3138 NaN 0.5403 0.2408 NaN
Neg Pred Value 0.9486 0.9965 0.9859 0.889 0.9806 0.9666 0.907
Prevalence 0.1090 0.1070 0.1010 0.111 0.0840 0.0730 0.093
Detection Rate 0.0610 0.1040 0.0910 0.000 0.0670 0.0460 0.000
Detection Prevalence 0.0660 0.1430 0.2900 0.000 0.1240 0.1910 0.000
Balanced Accuracy 0.7770 0.9641 0.8398 0.500 0.8677 0.7369 0.500
Class: 7 Class: 8 Class: 9
Sensitivity 0.8053 0.000 0.0000
Specificity 0.8963 1.000 0.9966
Pos Pred Value 0.4973 NaN 0.0000
Neg Pred Value 0.9731 0.913 0.8776
Prevalence 0.1130 0.087 0.1220
Detection Rate 0.0910 0.000 0.0000
Detection Prevalence 0.1830 0.000 0.0030
Balanced Accuracy 0.8508 0.500 0.4983
Because we had multiple digits, there are three main sections to the performance output. First, the actual frequency cross tab is shown. Correct predictions are on the diagonal, with various frequencies of misclassification on the off diagonals. Next are the overall statistics, which refer to the model's performance across all classes. Accuracy is simply the proportion of cases correctly classified, along with a 95% confidence interval, which can be useful, especially for smaller datasets where there may be considerable uncertainty in the estimate.
No Information Rate refers to the accuracy that could be expected without any information, by merely guessing the most frequent class; in this case, that is the digit 9, which occurred 12.2% of the time in the test set. The p-value tests whether the observed accuracy (46%) is significantly different from the No Information Rate (12.2%). Although statistically significant, this is not very meaningful for digit classification, where we would expect to do far better than simply guessing the most frequent digit! Finally, individual performance metrics for each digit are shown. These are based on comparing that digit against every other digit, so each is a binary comparison.
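To make these metrics concrete, the overall accuracy and a per-class sensitivity can be recomputed directly from the cross tab; the following is a small illustrative sketch using the digits.yhat1 and digits.test.y objects from earlier (the tab variable is just for this example):
# cross tab of predictions versus actual labels
tab <- table(digits.yhat1, digits.test.y)
# overall accuracy: the proportion of cases on the diagonal
sum(diag(tab)) / sum(tab)
# sensitivity for the digit 7, treated as a one-versus-rest comparison:
# of all the actual 7s, what proportion did we predict to be 7?
tab["7", "7"] / sum(tab[, "7"])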
Now that we have some basic understanding of how to set up, train, and evaluate model performance, we will try increasing the number of hidden neurons, which is one key way to improve model performance, at the cost of greatly increasing the model complexity. Recall from Chapter 1, Getting Started with Deep Learning, that every predictor or feature connects to each hidden neuron, and each hidden neuron connects to each outcome or output. With 784 features, each additional hidden neuron adds a substantial number of parameters, which also results in longer run times; the short sketch below gives a rough idea of the scale. Depending on your computer, be prepared to wait a number of minutes for the next models to finish.
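To see roughly where values such as MaxNWts = 10000 come from, here is a back-of-the-envelope count of the weights in a single-hidden-layer network with 784 inputs and 10 output classes (a sketch only; the n_weights helper is just for this illustration and approximates the bias terms as one extra weight per neuron):
# approximate number of weights in a single-hidden-layer network:
# (inputs + bias) * hidden + (hidden + bias) * outputs
n_weights <- function(inputs, hidden, outputs) {
  (inputs + 1) * hidden + (hidden + 1) * outputs
}
n_weights(784, 5, 10)   # roughly 3,985 weights for 5 hidden neurons
n_weights(784, 10, 10)  # roughly 7,960 weights for 10 hidden neurons
n_weights(784, 40, 10)  # roughly 31,810 weights for 40 hidden neurons
With that in mind, the following code trains the next model, this time with 10 hidden neurons: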
set.seed(42)
tic <- proc.time()
# same setup as model 1, but with 10 hidden neurons and a higher weight limit
digits.m2 <- caret::train(digits.X, digits.y,
           method = "nnet",
           tuneGrid = expand.grid(
             .size = c(10),
             .decay = 0.1),
           trControl = trainControl(method = "none"),
           MaxNWts = 50000,
           maxit = 100)
print(proc.time() - tic)
   user  system elapsed
 154.49    0.09  155.33
digits.yhat2 <- predict(digits.m2,newdata=digits.test.X)
accuracy <- 100.0*sum(digits.yhat2==digits.test.y)/length(digits.test.y)
print(sprintf(" accuracy = %1.2f%%",accuracy))
[1] " accuracy = 66.30%"
barplot(table(digits.yhat2),main="Distribution of y values (model 2)")
This model is better than the previous model but the distribution of the predicted values is still uneven:
Increasing the number of hidden neurons from 5 to 10 improved our test-set accuracy from 54.8% to 66.3%, but this is still quite some way from ideal (imagine character-recognition software that mixed up over 30% of all characters!). We increase the number again, this time to 40 hidden neurons, and wait even longer for the model to finish training:
set.seed(42)
tic <- proc.time()
# increase to 40 hidden neurons; expect a much longer training time
digits.m3 <- caret::train(digits.X, digits.y,
           method = "nnet",
           tuneGrid = expand.grid(
             .size = c(40),
             .decay = 0.1),
           trControl = trainControl(method = "none"),
           MaxNWts = 50000,
           maxit = 100)
print(proc.time() - tic)
    user  system elapsed
 2450.16    0.96 2457.55
digits.yhat3 <- predict(digits.m3,newdata=digits.test.X)
accuracy <- 100.0*sum(digits.yhat3==digits.test.y)/length(digits.test.y)
print(sprintf(" accuracy = %1.2f%%",accuracy))
[1] " accuracy = 82.20%"
barplot(table(digits.yhat3),main="Distribution of y values (model 3)")
The distribution of the predicted values is even in this model, which is what we are looking for. However, the accuracy is still only 82.2%, which is quite low:
Using 40 hidden neurons has improved accuracy to 82.2% overall and it took over 40 minutes to run on an i5 computer. Model performance for some digits is still not great. If this were a real research or business problem, we might continue trying additional neurons, tuning the decay rate, or modifying features in order to try to boost model performance further, but for now we will move on.
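If you do want to experiment further with caret before moving on, one option is to pass a grid of candidate sizes and decay rates together with cross-validation; the following is only a sketch (the grid values and the digits.m.tuned name are illustrative, and evaluating every combination with 5-fold cross-validation will take a very long time to run):
# illustrative tuning grid: several sizes and decay rates,
# evaluated with 5-fold cross-validation (very slow to run)
grid <- expand.grid(.size = c(10, 20, 40),
                    .decay = c(0.1, 0.5))
digits.m.tuned <- caret::train(digits.X, digits.y,
                   method = "nnet",
                   tuneGrid = grid,
                   trControl = trainControl(method = "cv", number = 5),
                   MaxNWts = 50000,
                   maxit = 100)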
Next, we will take a look at how to train neural networks using the RSNNS package. This package provides an interface to a variety of models using the Stuttgart Neural Network Simulator (SNNS) code; however, for a basic, single hidden-layer, feed-forward neural network, we can use the mlp() convenience wrapper function, which stands for multi-layer perceptron. The RSNNS package is a bit trickier to use than nnet through the caret package, but it is far more flexible: it allows many other types of neural network architecture to be trained, including recurrent neural networks, and it offers a greater variety of training strategies.
One difference between the nnet and RSNNS packages is that, for multi-class outcomes (such as digits), RSNNS requires a dummy encoding (that is, one-hot encoding), where each possible class is represented as a column coded as 0/1. This is facilitated using the decodeClassLabels() function, as shown in the following code snippet:
library(RSNNS)

head(decodeClassLabels(digits.y))
0 1 2 3 4 5 6 7 8 9
[1,] 0 0 0 0 0 0 0 0 0 1
[2,] 0 0 0 0 1 0 0 0 0 0
[3,] 1 0 0 0 0 0 0 0 0 0
[4,] 0 0 0 0 0 1 0 0 0 0
[5,] 0 0 0 0 1 0 0 0 0 0
[6,] 0 0 0 1 0 0 0 0 0 0
Since we had reasonably good success with 40 hidden neurons, we will use the same size here. Rather than standard propagation as the learning function, we will use resilient propagation, based on the work of Riedmiller, M., and Braun, H. (1993). Resilient back-propagation is an optimization of standard back-propagation that applies faster weight-update mechanisms. One of the problems that occurs as neural networks increase in complexity is that they take a long time to train. We will discuss this in depth in subsequent chapters, but for now, you just need to know that this training is faster because the algorithm keeps track of past derivatives and takes bigger steps if the gradients keep pointing in the same direction during back-propagation. Note also that, because a matrix of outcomes is passed, although the predicted probability will not exceed 1 for any single digit, the sum of the predicted probabilities across all digits may exceed 1 and may also be less than 1 (that is, for some cases, the model may not predict that they are very likely to represent any of the digits). The predict function returns a matrix where each column represents a single digit, so we use the encodeClassLabels() function to convert it back into a single vector of digit labels to plot and evaluate the model's performance:
set.seed(42)
tic <- proc.time()
# single hidden layer with 40 neurons, trained with resilient propagation
digits.m4 <- mlp(as.matrix(digits.X),
          decodeClassLabels(digits.y),
          size = 40,
          learnFunc = "Rprop",
          shufflePatterns = FALSE,
          maxit = 80)
print(proc.time() - tic)
   user  system elapsed
 179.71    0.08  180.99
digits.yhat4 <- predict(digits.m4,newdata=digits.test.X)
# encodeClassLabels() returns the index (1 to 10) of the winning column
digits.yhat4 <- encodeClassLabels(digits.yhat4)
# subtract 1 to map the indices 1-10 back to the digit labels 0-9
accuracy <- 100.0*sum(I(digits.yhat4 - 1)==digits.test.y)/length(digits.test.y)
print(sprintf(" accuracy = %1.2f%%",accuracy))
[1] " accuracy = 81.70%"
barplot(table(digits.yhat4),main="Distribution of y values (model 4)")
The following bar plot shows that the predicted values are relatively evenly distributed among the categories. This matches the distribution of the actual category values:
The accuracy is 81.70% and the model trained in 3 minutes on my computer. This is only slightly lower than nnet with 40 hidden nodes, which took over 40 minutes on the same machine! This demonstrates the importance of the optimization algorithm used, a topic we will return to in subsequent chapters.
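Finally, recall the earlier point that the predicted values from this model do not have to sum to exactly 1 across the ten digits. We can check this on the raw prediction matrix before it is collapsed by encodeClassLabels(); this is a small sketch using the fitted digits.m4 model from above:
# raw predictions, one column per digit
raw_preds <- predict(digits.m4, as.matrix(digits.test.X))
# the row sums are typically close to 1 but are not constrained to equal it
summary(rowSums(raw_preds))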