Applied Deep Learning and Computer Vision for Self-Driving Cars

Optimizers

Optimizers define how a neural network learns: they determine how the parameter values are updated during training so that the loss function reaches its lowest value.

Gradient descent is an optimization algorithm for finding the minimum of a function, in our case the minimum value of the cost function. This is useful to us because we want to minimize the cost function. To find a local minimum, we take steps proportional to the negative of the gradient.
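As a minimal sketch (not code from the book), a single gradient descent step can be written as follows; the function names and the learning rate value are illustrative placeholders:

```
import numpy as np

def gradient_descent_step(weights, grad_fn, learning_rate=0.1):
    """Take one gradient descent step on a vector of parameters.

    weights:       current parameter values (NumPy array)
    grad_fn:       function returning the gradient of the cost at `weights`
    learning_rate: size of the step taken against the gradient
    """
    # Step in the direction of the negative gradient to reduce the cost
    return weights - learning_rate * grad_fn(weights)
```

The learning rate controls how large each step is; we will return to it when discussing hyperparameters.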

Let's go through a very simple example in one dimension, shown in the following plot:

Fig 2.17: Gradient descent

On the y-axis, we have the cost (the result of the cost function), and on the x-axis, we have the particular weight we are trying to choose (we start with a random weight). The weight that minimizes the cost function sits at the bottom of the parabola, and our goal is to bring the cost down to that minimum value. Finding the minimum is really simple in one dimension, but in our case, we have a lot more parameters and we can't do this visually. Instead, we use linear algebra and a deep learning library to find the parameters that minimize the cost function.
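To make this concrete, here is a minimal one-dimensional sketch, assuming an illustrative quadratic cost (w - 3)**2 whose minimum sits at w = 3, like the bottom of the parabola in the preceding figure:

```
# Illustrative 1-D cost: C(w) = (w - 3)**2, minimized at w = 3
def cost(w):
    return (w - 3) ** 2

def cost_gradient(w):
    # dC/dw = 2 * (w - 3)
    return 2 * (w - 3)

w = 10.0             # start from an arbitrary (random) weight
learning_rate = 0.1

for step in range(50):
    # Take a step proportional to the negative of the gradient
    w = w - learning_rate * cost_gradient(w)

print(w, cost(w))    # w ends up very close to 3, and the cost close to 0
```

Each iteration moves the weight a little further down the slope of the parabola until it settles near the minimum.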

Now, let's see how we can efficiently adjust the parameters or weights across our entire network toward their optimal values. This is where we need backpropagation.

Backpropagation is used to calculate the error contribution of each neuron after a batch of data is processed. It relies heavily on the chain rule to work backward through the network and calculate these errors. Backpropagation calculates the error at the output and then propagates it back through the network layers to update the weights. It requires a known desired output for each input value.
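The following is a minimal sketch of the idea for a single sigmoid neuron trained on one labeled example; the variable names and values are illustrative, not taken from the book:

```
import numpy as np

# A single sigmoid neuron: prediction = sigmoid(w * x + b)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = 1.5, 1.0          # one input and its known desired output
w, b = 0.2, 0.0               # initial weight and bias
learning_rate = 0.5

for _ in range(100):
    # Forward pass
    z = w * x + b
    y_pred = sigmoid(z)
    loss = 0.5 * (y_pred - y_true) ** 2

    # Backward pass: apply the chain rule from the output back to the weights
    dloss_dy = y_pred - y_true            # d(loss)/d(y_pred)
    dy_dz = y_pred * (1 - y_pred)         # derivative of the sigmoid
    dz_dw, dz_db = x, 1.0                 # d(z)/d(w) and d(z)/d(b)

    dloss_dw = dloss_dy * dy_dz * dz_dw   # chain rule: error w.r.t. the weight
    dloss_db = dloss_dy * dy_dz * dz_db   # chain rule: error w.r.t. the bias

    # Update the parameters using the backpropagated gradients
    w -= learning_rate * dloss_dw
    b -= learning_rate * dloss_db
```

The local derivatives are computed in the forward pass and multiplied together in the backward pass; this chain-rule bookkeeping is what deep learning libraries automate for full networks.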

One of the problems with gradient descent is that the weights are only updated after seeing the entire dataset, so the updates are infrequent and expensive to compute, and reaching the minimum of the loss can take a long time. One solution is to update the parameters more frequently, as in the case of another optimizer called stochastic gradient descent. Stochastic gradient descent updates the weights after seeing each data point instead of the whole dataset. However, it can be noisy, as every single sample influences the update. Due to this, we use mini-batch gradient descent, which updates the parameters after seeing only a few samples. You can read more about optimizers in the paper An Overview of Gradient Descent Optimization Algorithms (https://arxiv.org/pdf/1609.04747.pdf).

Another way of decreasing the noise of stochastic gradient descent is to use the Adam optimizer. Adam is one of the most popular optimizers; it is an adaptive learning rate method that computes individual learning rates for different parameters. You can check out the paper Adam: A Method for Stochastic Optimization (https://arxiv.org/abs/1412.6980).
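As an illustration of how these choices appear in practice, here is a minimal sketch assuming the Keras API in TensorFlow (tf.keras); the toy model and the numbers shown are placeholders, not the book's architecture:

```
import tensorflow as tf

# Placeholder model; in practice this would be the network being trained
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1)
])

# Plain stochastic / mini-batch gradient descent with a fixed learning rate
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

# Adam: an adaptive method that keeps a per-parameter learning rate
adam = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(optimizer=adam, loss='mse')

# The batch_size argument controls how many samples are seen per update,
# which is what makes this mini-batch gradient descent:
# model.fit(x_train, y_train, batch_size=32, epochs=10)
```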

In the next section, we will learn about hyperparameters, which help tweak neural networks so that they can learn features more effectively.