Deep Learning with R for Beginners

Neural network code

While the web application is useful for seeing the output of the neural network, we can also run the neural network code directly to see how it works. The code in Chapter3/nnet.R allows us to do just that: it uses the same hyper-parameters as the web application, but lets you run the neural network from the RStudio IDE. The following code loads the data and sets the initial hyper-parameters for the neural network:

source("nnet_functions.R")
data_sel <- "bulls_eye"

........

####################### neural network ######################
hidden <- 3
epochs <- 3000
lr <- 0.5
activation_ftn <- "sigmoid"

df <- getData(data_sel) # from nnet_functions
X <- as.matrix(df[,1:2])
Y <- as.matrix(df$Y)
n_x=ncol(X)
n_h=hidden
n_y=1
m <- nrow(X)

This code should not be too difficult to understand: it loads a dataset and sets some variables. The data is created by the getData function in the Chapter3/nnet_functions.R file, which in turn uses functions from the clusterSim package. The Chapter3/nnet_functions.R file also contains the core functionality of our neural network, which we will look at here. Once we have loaded our data, the next step is to initialize our weights and biases. The hidden variable controls the number of nodes in the hidden layer; we set it to 3. We need two sets of weights and biases: one for the hidden layer and one for the output layer:

# initialise weights to small random values and biases to zero
set.seed(42)
weights1 <- matrix(0.01*runif(n_h*n_x) - 0.005, ncol=n_x, nrow=n_h)  # hidden layer weights (n_h x n_x)
weights2 <- matrix(0.01*runif(n_y*n_h) - 0.005, ncol=n_h, nrow=n_y)  # output layer weights (n_y x n_h)
bias1 <- matrix(rep(0,n_h), nrow=n_h, ncol=1)                        # hidden layer biases (n_h x 1)
bias2 <- matrix(rep(0,n_y), nrow=n_y, ncol=1)                        # output layer biases (n_y x 1)

This creates the weight and bias matrices for the hidden layer (weights1, bias1) and the output layer (weights2, bias2). We need to ensure our matrices have the correct dimensions; for example, the weights1 matrix must have the same number of columns as there are nodes in the input layer and the same number of rows as there are nodes in the hidden layer. Now we move on to the actual processing loop of the neural network:

for (i in 0:epochs)
{
  # forward pass, then cost, then backward pass to get the gradients
  activation2 <- forward_prop(t(X), activation_ftn, weights1, bias1, weights2, bias2)
  cost <- cost_f(activation2, t(Y))
  backward_prop(t(X), t(Y), activation_ftn, weights1, weights2, activation1, activation2)

  # update the weights and biases using the gradients and the learning rate
  weights1 <- weights1 - (lr * dweights1)
  bias1 <- bias1 - (lr * dbias1)
  weights2 <- weights2 - (lr * dweights2)
  bias2 <- bias2 - (lr * dbias2)

  if ((i %% 500) == 0)
    print (paste(" Cost after",i,"epochs =",cost))
}
[1] " Cost after 0 epochs = 0.693147158995952"
[1] " Cost after 500 epochs = 0.69314587328381"
[1] " Cost after 1000 epochs = 0.693116915341439"
[1] " Cost after 1500 epochs = 0.692486724429629"
[1] " Cost after 2000 epochs = 0.687107068792801"
[1] " Cost after 2500 epochs = 0.660418522655335"
[1] " Cost after 3000 epochs = 0.579832913091798"

We first run the forward-propagation function and then calculate a cost. We then call a backward-propagation step that calculates our derivatives (dweights1, dbias1, dweights2, dbias2). Then we update the weights and biases (weights1, bias1, weights2, bias2) using our learning rate (lr). We run this loop for the number of epochs (3000) and print out a diagnostic message every 500 epochs. This describes how every neural network and deep learning model works: call forward-propagation, calculate the cost and the derivative values, use those to update the weights through back-propagation, and repeat.

Now let's look at some of the functions in the nnet_functions.R file. The following is the forward-propagation function:

forward_prop <- function(X,activation_ftn,weights1,bias1,weights2,bias2)
{
  # broadcast hack: repeat the bias column vector once per training instance
  # so that it can be added to the matrix product
  bias1a <- bias1
  for (i in 2:ncol(X))
    bias1a <- cbind(bias1a,bias1)

  Z1 <<- weights1 %*% X + bias1a
  activation1 <<- activation_function(activation_ftn,Z1)

  bias2a <- bias2
  for (i in 2:ncol(activation1))
    bias2a <- cbind(bias2a,bias2)
  Z2 <<- weights2 %*% activation1 + bias2a
  activation2 <<- sigmoid(Z2)
  return (activation2)
}
If you looked at the code carefully, you may have noticed that the assignments to the activation1, activation2, Z1, and Z2 variables use <<- rather than <-. This assigns the variables in the enclosing (here, global) environment, because we also want to use these values during back-propagation. Using global variables is generally frowned upon, and I could have returned a list instead, but it is acceptable here because this application is for learning purposes.
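
The following is a minimal standalone sketch (not part of the book code) of the difference between the two assignment operators:

counter <- 0
f_local <- function() { counter <- counter + 1 }   # creates a local copy; the global is untouched
f_global <- function() { counter <<- counter + 1 } # assigns in the enclosing (global) environment
f_local()
counter    # still 0
f_global()
counter    # now 1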

The two for loops expand the bias vectors into matrices by repeating each vector once per training instance, so they can be added to the matrix products. The interesting code starts with the Z1 assignment: Z1 is a matrix multiplication followed by an addition. We call the activation_function function on that value to get activation1, and then perform a similar operation to calculate Z2. Finally, we apply a sigmoid activation to our output layer because our problem is binary classification.
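
As an aside, the same bias broadcasting can be done without building a repeated matrix by using R's sweep function. This is only an alternative sketch, not what nnet_functions.R uses; it assumes X, weights1, and bias1 from the code above are in the workspace:

# add bias1[i] to every element of row i of the product,
# which is equivalent to the repeated-column broadcast hack above
Z1_alt <- sweep(weights1 %*% t(X), 1, as.numeric(bias1), "+")

sweep avoids both the explicit loop and the extra memory needed for the repeated bias matrix.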

The following is the code for the activation function; the first parameter decides which function to use (sigmoid, tanh, or relu). The second parameter is the value to be used as input:

activation_function <- function(activation_ftn,v)
{
  if (activation_ftn == "sigmoid")
    res <- sigmoid(v)
  else if (activation_ftn == "tanh")
    res <- tanh(v)
  else if (activation_ftn == "relu")
  {
    v[v<0] <- 0   # relu sets negative values to zero
    res <- v
  }
  else
    res <- sigmoid(v)   # default to sigmoid for unrecognised names
  return (res)
}
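
For example, calling it with relu and a small vector zeroes out the negative entries (the output is shown as a comment):

activation_function("relu", c(-2, -0.5, 0, 1.5))
# [1] 0.0 0.0 0.0 1.5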

The following is the cost function:

cost_f <- function(activation2,Y)
{
  # binary cross-entropy, averaged over all instances
  cost <- -mean((log(activation2) * Y) + (log(1-activation2) * (1-Y)))
  return(cost)
}

As a reminder, the output of the cost function is what we are trying to minimize. There are many types of cost functions; in this application we are using binary cross-entropy. The formula for binary cross-entropy is -1/m ∑ (yi * log(ȳi) + (1-yi) * log(1-ȳi)). Our target values (yi) are always either 1 or 0, so for instances where yi = 1, the term for that row reduces to -log(ȳi). If we have two rows where yi = 1 and our model predicts 1.0 for the first row and 0.0001 for the second row, then the costs for those rows are -log(1) = 0 and -log(0.0001) ≈ 9.21, respectively. We can see that the closer the prediction is to 1 for these rows, the lower the cost value. Similarly, for rows where yi = 0, the term reduces to -log(1-ȳi), so the closer the prediction is to 0 for these rows, the lower the cost value.
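
To make this concrete, here is a small worked example that calls cost_f with made-up predictions for three rows (the variable names here are only for illustration); the first two rows have a target of 1 and the third a target of 0:

yhat_example <- matrix(c(0.99, 0.0001, 0.9), nrow=1)   # hypothetical model outputs
y_example <- matrix(c(1, 1, 0), nrow=1)                # targets: 1, 1, 0
cost_f(yhat_example, y_example)   # roughly 3.84, dominated by the very wrong second prediction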

If we are trying to maximize accuracy, why don't we just use accuracy as the cost during model training? Binary cross-entropy is a better cost function because our model does not just output 0 or 1; it outputs continuous values from 0.0 to 1.0. For example, if two input rows had a target value of 1 (that is, y=1) and our model gave probabilities of 0.51 and 0.99, then binary cross-entropy would give them costs of 0.67 and 0.01, respectively. It assigns a higher cost to the first row because the model is unsure about it (the probability is close to 0.5). If instead we just looked at accuracy, we would decide that both rows have the same cost because they are classified correctly (assuming we assign class=0 where predicted values < 0.5, and class=1 where predicted values >= 0.5).
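
The two cost values quoted above come straight from the per-row formula for y=1:

-log(0.51)   # 0.67: classified correctly, but the model is unsure, so the cost is higher
-log(0.99)   # 0.01: classified correctly and the model is confident, so the cost is low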

The following is the code for the backward-propagation function:

backward_prop <- function(X,Y,activation_ftn,weights1,weights2,activation1,activation2)
{
  m <- ncol(Y)                                     # number of training instances
  # output layer: error, then gradients for weights2 and bias2
  derivative2 <- activation2 - Y
  dweights2 <<- (derivative2 %*% t(activation1)) / m
  dbias2 <<- rowSums(derivative2) / m
  # hidden layer: propagate the error back through the activation function
  upd <- derivative_function(activation_ftn,activation1)
  derivative1 <- t(weights2) %*% derivative2 * upd
  dweights1 <<- (derivative1 %*% t(X)) / m
  dbias1 <<- rowSums(derivative1) / m
}

Backward propagation processes the network in reverse, starting at the output layer and working back towards the input layer. In our case, we only have one hidden layer, so it first calculates the error at the output layer and uses it to calculate dweights2 and dbias2. It then needs the derivative of the activation1 value, which was calculated during the forward-propagation step. The derivative function is similar to the activation function, but instead of applying an activation function, it calculates the derivative of that function. For example, the derivative of sigmoid(x) is sigmoid(x) * (1 - sigmoid(x)). The derivatives of simple functions can be found in any calculus reference or online:

derivative_function <- function(activation_ftn,v)
{
  # v is the activation output, so the derivatives are expressed in terms of it
  if (activation_ftn == "sigmoid")
    upd <- (v * (1 - v))
  else if (activation_ftn == "tanh")
    upd <- (1 - (v^2))
  else if (activation_ftn == "relu")
    upd <- ifelse(v > 0.0,1,0)
  else
    upd <- (v * (1 - v))   # default to the sigmoid derivative
  return (upd)
}
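
A quick way to sanity-check these formulas is to compare them against a numerical derivative. The following sketch is not part of the book code and assumes sigmoid is defined in nnet_functions.R as 1 / (1 + exp(-z)); note that derivative_function expects the activation output, so we pass sigmoid(z) rather than z:

sigmoid <- function(z) 1 / (1 + exp(-z))   # assumed definition, matching nnet_functions.R
z <- 0.7
a <- sigmoid(z)
analytic <- derivative_function("sigmoid", a)                 # a * (1 - a)
numerical <- (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # central difference
analytic - numerical   # very close to zero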

That's it! A working neural network using basic R code. It can fit complex functions and performs better than logistic regression. You might not get all the parts at once, and that's OK. The following is a quick recap of the steps, followed by a short sketch showing how to get predictions from the trained network:

  1. Run a forward-propagation step, which involves multiplying the input to each layer by the weights, adding the biases, applying the activation function, and passing the output to the next layer.
  2. Evaluate the output from the final layer using the cost function.
  3. Based on that cost, use back-propagation to make small adjustments to the weights in each layer. The learning rate controls how large an adjustment we make each time.
  4. Repeat steps 1-3, maybe thousands of times, until the cost function begins to plateau, which indicates that our model is trained.
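
As a final illustration (not from the book code), once the training loop has finished you can reuse forward_prop to get predictions for the training data and calculate a simple accuracy figure, assuming the variables created above are still in the workspace:

probs <- forward_prop(t(X), activation_ftn, weights1, bias1, weights2, bias2)   # predicted probabilities
preds <- ifelse(probs >= 0.5, 1, 0)   # threshold at 0.5 to assign a class
mean(preds == t(Y))                   # proportion of rows classified correctly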