A note on the input and output layers of a neural network
It is important to understand what can be given as input to a neural network. Do we feed raw images or raw text directly, or does the data have to be prepared in some other way? In this section, we will learn how a computer actually interprets an image, which will show us exactly what a neural network receives as input when it deals with images (yes, neural networks are pretty great at image processing). We will also see what it takes to feed a neural network raw text data. But before that, we need a clear understanding of how a regular tabular dataset is given as input to a neural network, because tabular datasets are everywhere, in the form of SQL tables, server logs, and so on.
We will take the following toy dataset for this purpose:
Take note of the following points regarding this toy dataset:
- It has two predictor variables, x1 and x2; such predictors are generally called input features, and together they form the input feature vector.
- It is common to assign x1 and x2 to a vector, X (more on this later).
- The response variable is y.
- We have 10 instances (containing x1, x2, and y attributes) that are categorized into two classes, 0 and 1.
- Given x1 and x2, our (neural network's) task is to predict y, which essentially makes this a classification task.
When we say that the neural network predicts something, we mean that it is supposed to learn the underlying representations of the input data that best approximate a certain function (we saw what plotting a function looks like a while ago).
Let's now see how this data is given as input to a neural network. As our data has two predictor variables (that is, two input features), the input layer of the neural network has to contain two neurons. We will use the following neural network architecture for this classification task:
The architecture is almost identical to the one that we saw a while ago; the only difference is the additional input feature. The rest is exactly the same.
To keep it simple, we are not considering the data preprocessing that might be needed before we feed the data to the network. Now, let's see how the data is combined with the weights and the bias term, and how the activation function is applied to them.
In this case, the feature vectors and the response variable, y, are interpreted separately by the neural network; the response variable is used at a later stage of the network's training process. Most importantly, it is used for evaluating how well the neural network is performing. The input data is organized in matrix form, like the following:
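As a minimal sketch in NumPy, the inputs can be held in a 10 × 2 matrix, X, with y kept as a separate vector. The values below are hypothetical placeholders standing in for the toy dataset; only the shapes matter here:

```python
import numpy as np

# Hypothetical values standing in for the toy dataset; only the shapes
# matter here: 10 instances, 2 features (x1, x2), and binary labels.
X = np.array([
    [0.2, 0.7], [0.4, 0.1], [0.9, 0.8], [0.5, 0.3], [0.1, 0.9],
    [0.8, 0.2], [0.3, 0.6], [0.7, 0.4], [0.6, 0.5], [0.2, 0.1],
])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

print(X.shape)  # (10, 2): one row per instance, one column per feature
```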
The kind of NN architecture that we are using here is a fully connected architecture, which means that every neuron in a particular layer is connected to every neuron in the next layer.
The weight matrix is defined as follows:

$$W^{(1)} = \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \end{bmatrix}$$

For now, let's not bother about the weight values. The dimensions of the weight matrix are interpreted as follows:
- The number of rows equals the number of input features (x1 and x2, in our case).
- The number of columns equals the number of neurons in the first hidden layer.
There are some subscripts and superscripts associated with each of the weight values in the matrix. If we take the general form of a weight as $w_{jk}^{(l)}$, then it should be interpreted as follows:
- $l$ denotes the layer from which the weight originates. In this case, the weight matrix that we just saw is associated with the input layer.
- $j$ denotes the position of the neuron in layer $l$, whereas $k$ denotes the position of the neuron in the next layer that the value is propagated to.
The weights are generally randomly initialized, which adds a stochastic character to the neural network. Let's randomly initialize a weight matrix for the input layer:
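A minimal sketch of this random initialization in NumPy (the seed and the uniform range are arbitrary choices, used here only so the example is reproducible):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only for reproducibility

# 2 rows (one per input feature) x 3 columns (one per hidden neuron).
W1 = rng.uniform(low=-1.0, high=1.0, size=(2, 3))
print(W1.shape)  # (2, 3)
```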
Now we calculate the values that are to be given to the first hidden layer of the NN. This is computed as follows:

$$z^{(1)} = X W^{(1)}$$
The first matrix contains all the instances from the training set (without the response variable, y), and the second matrix is the weight matrix that we just defined. The result of this multiplication is stored in a variable, $z^{(1)}$ (this variable can be named anything; the superscript denotes that it relates to the first hidden layer of the network).
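Continuing the sketch (reusing the X and W1 arrays defined in the earlier snippets), the whole computation is a single matrix product:

```python
# X is (10, 2) and W1 is (2, 3), so z1 is (10, 3):
# one row per training instance, one column per hidden neuron.
z1 = X @ W1
print(z1.shape)  # (10, 3)
```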
We are still left with one more step before we send these results to the neurons in the next layer, where the activation functions will be applied. With the sigmoid activation function, the final output from the input layer looks like the following:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad a^{(1)} = \sigma(z^{(1)})$$
Here, $a^{(1)}$ is our final output for the next layer of neurons. Note that the sigmoid function is applied to each and every element of the matrix. The final matrix will have dimensions of 10 × 3, where each row corresponds to an instance from the training set and each column to a neuron of the first hidden layer.
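In the running sketch, this element-wise activation can be written as follows:

```python
def sigmoid(x):
    """Element-wise sigmoid: squashes every entry into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

a1 = sigmoid(z1)   # applied to each element of z1
print(a1.shape)    # still (10, 3)
```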
The whole calculation that we saw is without the bias term, b, that we initially talked about. Well, that is just a matter of adding one more term to the picture. In that case, before we apply the sigmoid function to each of the elements of the matrix, the matrix itself changes to the following:

$$z^{(1)} = X W^{(1)} + b^{(1)}$$
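Extending the sketch once more (reusing X, W1, rng, and sigmoid from above), the bias is one value per hidden neuron, broadcast across all ten instances:

```python
b1 = rng.uniform(low=-1.0, high=1.0, size=(3,))  # one bias per hidden neuron

# Broadcasting adds b1 to every row of the (10, 3) product.
z1 = X @ W1 + b1
a1 = sigmoid(z1)
```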
After this matrix multiplication, the sigmoid function is applied and the output is sent to the neurons in the next layer, and this whole process repeats for each hidden layer and the output layer of the NN. As we proceed, we are supposed to get the predicted value, $\hat{y}$, from the output layer.
The sigmoid activation function outputs values in the range 0 to 1, but we are dealing with a binary classification problem, and we only want 0 or 1 as the final output from the NN. We can get this with a little tweak: we define a threshold at the output layer of the NN, so that values less than 0.5 are identified as class 0 and values greater than or equal to 0.5 are identified as class 1. Note that this whole process is called a forward pass, or forward propagation.
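Putting the pieces together, here is a minimal sketch of one full forward pass for this toy network, assuming the 2-input, 3-hidden-neuron, 1-output architecture described above; every parameter value is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(X, W1, b1, W2, b2):
    """One forward pass: input -> hidden layer -> output, then threshold."""
    a1 = sigmoid(X @ W1 + b1)          # (10, 3) hidden activations
    y_hat = sigmoid(a1 @ W2 + b2)      # (10, 1) sigmoid outputs in (0, 1)
    return (y_hat >= 0.5).astype(int)  # threshold at 0.5 -> class 0 or class 1

# Random placeholder parameters: 2 inputs -> 3 hidden neurons -> 1 output.
W1, b1 = rng.uniform(-1, 1, (2, 3)), rng.uniform(-1, 1, (3,))
W2, b2 = rng.uniform(-1, 1, (3, 1)), rng.uniform(-1, 1, (1,))

X = rng.uniform(0, 1, (10, 2))  # stand-in for the 10-instance toy dataset
print(forward_pass(X, W1, b1, W2, b2).ravel())
```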