Applying activation functions and the final output
When it comes to RNNs, tanh is a good choice as an activation function. So, after applying tanh, the matrix looks like the following:
We have got the result of ht. This ht acts as ht-1 for the next time-step. We will now calculate the value of yt using equation (2). We will require another weight matrix (of shape 4 x 3) that is randomly initialized:
After applying the second equation, the value of yt becomes a 4 x 1 matrix:
Now, in order to predict the next letter that comes after w (remember, we started all our calculations with the letter w and we are still on the first pass of the RNN) so as to form a suitable word from the given vocabulary, we will apply the softmax function to yt. This will output a probability for each of the letters in the vocabulary:
So, the RNN tells us that the next letter after w is more likely to be an h. With this, we finish the initial pass of the RNN. As an exercise, you can play around with the ht value we got from this pass and apply it (along with the next letter, h) to the next pass of the RNN to see what happens.
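To make the whole pass concrete, here is a minimal NumPy sketch of the single forward step we just walked through. It assumes the standard RNN update behind equations (1) and (2), that is, ht = tanh(Wxh·xt + Whh·ht-1) and yt = Why·ht, and it uses freshly randomized weights, so the actual numbers (and the predicted letter) will not match the matrices shown above; only the shapes and the order of operations will.

```python
import numpy as np

# Vocabulary used in this example and the one-hot encoding of the first letter
vocab = ['w', 'h', 'a', 't']
x_t = np.array([[1.0], [0.0], [0.0], [0.0]])      # one-hot column vector for 'w' (4 x 1)

hidden_size = 3
np.random.seed(42)  # random weights: your numbers will differ from the ones in the text

W_xh = np.random.randn(hidden_size, len(vocab))   # input-to-hidden weights (3 x 4)
W_hh = np.random.randn(hidden_size, hidden_size)  # hidden-to-hidden weights (3 x 3)
W_hy = np.random.randn(len(vocab), hidden_size)   # hidden-to-output weights (4 x 3)

h_prev = np.zeros((hidden_size, 1))               # initial hidden state (3 x 1)

# Equation (1): hidden state with the tanh activation
h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)         # 3 x 1

# Equation (2): output scores for each letter
y_t = W_hy @ h_t                                  # 4 x 1

# Softmax turns the scores into a probability for each letter in the vocabulary
probs = np.exp(y_t) / np.sum(np.exp(y_t))
print({ch: float(p) for ch, p in zip(vocab, probs.ravel())})
print('Predicted next letter:', vocab[int(np.argmax(probs))])
```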
Now, let's get to the most important question: what is the network learning? Again, weights and biases! You might have guessed that already. These weights are further optimized using backpropagation. This backpropagation, however, is a little different from what we have seen earlier; it is referred to as backpropagation through time. We won't be covering it here. Before finishing off this section, let's summarize the steps (after one-hot encoding of the vocabulary) that were performed during the forward pass of the RNN (a short code sketch follows the list):
- Initialize the weight matrices randomly.
- Calculate ht using equation (1).
- Calculate yt using equation (2).
- Apply the softmax function to yt to get the probabilities of each of the letters in the vocabulary.
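Putting those steps together, the following sketch (again with hypothetical, untrained random weights) repeats the same forward pass over a short sequence of letters, carrying ht from one time-step to the next, which is essentially the exercise suggested earlier:

```python
import numpy as np

vocab = ['w', 'h', 'a', 't']
one_hot = {ch: np.eye(len(vocab))[:, [i]] for i, ch in enumerate(vocab)}  # 4 x 1 columns

hidden_size = 3
np.random.seed(7)  # untrained random weights: the predictions are not meaningful yet

# Step 1: initialize the weight matrices randomly
W_xh = np.random.randn(hidden_size, len(vocab))
W_hh = np.random.randn(hidden_size, hidden_size)
W_hy = np.random.randn(len(vocab), hidden_size)

h_t = np.zeros((hidden_size, 1))                # initial hidden state (assumed to start at zero)

for ch in ['w', 'h', 'a']:                      # feed the letters one time-step at a time
    x_t = one_hot[ch]
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_t)      # step 2, equation (1): h_t is reused next step
    y_t = W_hy @ h_t                            # step 3, equation (2)
    probs = np.exp(y_t) / np.sum(np.exp(y_t))   # step 4: softmax over the vocabulary
    print(ch, '->', vocab[int(np.argmax(probs))])
```

Because the weights here are untrained, the predicted letters are arbitrary; it is backpropagation through time that would gradually push them towards spelling a sensible word from the vocabulary.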
It is good to know that apart from CNNs and RNNs, there are other types of neural networks, such as auto-encoders, generative adversarial networks, capsule networks, and so on. In the previous two sections, we learned about two of the most powerful types of neural network in detail. But when we talk about cutting-edge deep-learning applications, are these networks good enough to be used? Or do we need more enhancements on top of these? It turns out that although these architectures perform well, they fail to scale, hence the need for more sophisticated architectures. We will get to some of these specialized architectures in the next chapters.
We have covered a good amount of theory since Chapter 1, Demystifying Artificial Intelligence and Fundamentals of Machine Learning. In the next few sections, we will be diving into some hands-on examples.