The optimizer parameter
Our implementation of neural networks used gradient descent. When researchers started building more complicated multilayer neural network models, they found that these models took an extraordinarily long time to train. This is because the basic gradient-descent algorithm with no optimization is not very efficient: it takes small steps towards its goal in each epoch, regardless of what occurred in previous epochs. We can compare it to a guessing game: one person has to guess a number in a range, and for each guess they are told to go higher or lower (assuming they do not guess the correct number!). The higher/lower instruction is similar to the derivative value: it indicates the direction we must travel. Now let's say that the range of possible numbers is 1 to 1,000,000 and the first guess is 1,000. The person is told to go higher. Which of the following should they do:
- Try 1,001.
- Take the difference between the current guess and the maximum value, divide it by 2, and add the result to the previous guess.
The second option is much better and should mean the person gets to the right answer in 20 guesses or fewer. If you have a background in computer science, you may recognize this as the binary-search algorithm. The first option, guessing 1,001, 1,002, ..., 1,000,000, is a terrible choice and will probably fail because one party will give up! But this is similar to how gradient descent works: it moves incrementally towards the target. If you try to overcome this problem by increasing the learning rate, you can overshoot the target and the model will fail to converge.
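To make the comparison concrete, here is a minimal R sketch of the halving strategy; the hidden target of 700,000 and the starting guess of 1,000 are assumptions chosen purely for illustration. The halving strategy is simply binary search on the remaining range and reaches the answer in roughly 20 guesses, whereas counting up by 1 would take hundreds of thousands:

```r
# A sketch of the guessing game, assuming a hidden target of 700,000 in the
# range 1 to 1,000,000. The halving strategy is just binary search.
target <- 700000
lo <- 1
hi <- 1000000
guess <- 1000
guesses <- 1
while (guess != target) {
  if (guess < target) {
    lo <- guess + 1              # we are told to go higher
  } else {
    hi <- guess - 1              # we are told to go lower
  }
  guess <- floor((lo + hi) / 2)  # jump to the midpoint of the remaining range
  guesses <- guesses + 1
}
guesses  # roughly 20, versus ~699,000 guesses if we had counted up by 1
```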
Researchers came up with some clever optimizations to speed up training. One of the first optimizers was called momentum, and it does exactly what its name suggests: it looks at the size and direction of previous derivatives and takes bigger steps in each epoch if the previous steps were all in the same direction. This means the model should train much more quickly. Other algorithms, such as RMSprop and Adam, are enhancements of this idea. You don't usually need to know exactly how they work, just that when you change the optimizer, you may also have to adjust other hyper-parameters, such as the learning rate. In general, look for previous examples done by others and copy their hyper-parameters.
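As a rough illustration of the momentum update (the toy quadratic loss, learning rate of 0.01, and momentum coefficient of 0.9 below are assumptions, not values taken from any model in this book), a velocity term accumulates a decayed sum of past derivatives, so consecutive steps in the same direction get progressively larger:

```r
# A minimal sketch of gradient descent with momentum on a toy one-parameter
# loss f(w) = w^2, whose derivative is 2 * w and whose minimum is at w = 0.
grad <- function(w) 2 * w   # derivative of the toy loss
lr   <- 0.01                # learning rate (assumed value)
beta <- 0.9                 # momentum coefficient (assumed value)
w <- 10                     # starting weight
v <- 0                      # velocity: a decayed running sum of past gradients
for (epoch in 1:100) {
  v <- beta * v + grad(w)   # accumulate the current gradient into the velocity
  w <- w - lr * v           # step along the accumulated direction
}
w  # ends much closer to 0 than plain gradient descent (w <- w - lr * grad(w))
```

With plain gradient descent and the same learning rate, each epoch shrinks the weight by only 2 percent, so it is still above 1 after 100 epochs; the accumulated velocity carries it far closer to the minimum in the same number of epochs.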
We actually used one of these optimizers in an example in the previous chapter. There, we had two models with a similar architecture (40 hidden nodes). The first model used the nnet library and took 40 minutes to train. The second model (digits.m3) used resilient backpropagation and took 3 minutes to train. This shows the benefit of using an optimizer in neural networks and deep learning.