The optimizer and its hyperparameters
As mentioned before, the job of the optimizer is to update the network weights in a way that minimizes the training loss. In practice, deep learning libraries such as TensorFlow all rely on a single family of optimizers: the gradient descent family.
The most basic of these is simply called gradient descent (sometimes called vanilla gradient descent), but more complex ones that try to improve on it have been developed. Some popular ones are:
- Gradient descent with momentum
- RMSProp
- Adam
All of TensorFlow’s different optimizers can be found in the tf.train module. For example, the Adam optimizer can be used by calling tf.train.AdamOptimizer().
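To make this concrete, here is a minimal sketch assuming TensorFlow 1.x, where the optimizers live under tf.train. It builds each of the optimizers listed above and wires one of them up to a toy loss; the loss and the learning rate values are purely illustrative:

```python
import tensorflow as tf  # assuming TensorFlow 1.x, where the optimizers live in tf.train

# Toy model for illustration: a single weight fitted towards a constant target
w = tf.Variable(0.0)
loss = tf.square(5.0 - w)

# The optimizers discussed above, each taking a learning rate
sgd = tf.train.GradientDescentOptimizer(learning_rate=0.01)   # vanilla gradient descent
momentum = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
rmsprop = tf.train.RMSPropOptimizer(learning_rate=0.001)
adam = tf.train.AdamOptimizer(learning_rate=0.001)

# They all expose the same minimize() interface, which returns a training op
train_op = adam.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)  # each run performs one weight update
```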
As you may suspect, they all have configurable parameters that control how they work, but usually the most important one to pay attention to and tune is the following:
- Learning rate: Controls how quickly your optimizer tries to minimize the loss function. Set it too high and you will have problems converging to a minimum; set it too low and training will take forever to converge, or get trapped in a bad local minimum.
The following image shows the problems that a badly chosen learning rate can have:
Another important aspect of the learning rate is that, as your training progresses and the error drops, the learning rate value you chose at the beginning of training might become too large, so you may start to overshoot the minimum.
To solve this issue, you can schedule a learning rate decay that reduces the learning rate from time to time as you train. This process is called learning rate scheduling, and there are several popular approaches that we will discuss in detail in the next chapter.
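As a minimal sketch of one common approach (exponential decay, using the same assumed TensorFlow 1.x tf.train API as above; the decay numbers are purely illustrative):

```python
import tensorflow as tf  # assuming TensorFlow 1.x

w = tf.Variable(0.0)
loss = tf.square(5.0 - w)  # toy loss, standing in for your real training loss

# global_step counts how many training steps have been run so far
global_step = tf.train.get_or_create_global_step()

# Illustrative schedule: start at 0.1 and multiply by 0.96 every 10,000 steps
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,
    global_step=global_step,
    decay_steps=10000,
    decay_rate=0.96,
    staircase=True)

# Passing global_step to minimize() makes the optimizer increment it,
# which in turn drives the decay schedule
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)
```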
An alternative solution is to use one of the adaptive optimizers, such as Adam or RMSProp. These optimizers are designed to automatically adjust and decay the learning rates for all of your model parameters as you train, which means that, in theory, you shouldn’t have to worry about scheduling your own learning rate decay.
Ultimately, you want to choose the optimizer that trains your network fastest and to the best accuracy. The following image shows how the choice of optimizer can affect the speed at which your network converges. There can be quite a gap between different optimizers, and this might change for different problems, so ideally, if you can, you should try them all and find what works best for your problem.
However, if you don’t have time to do this, then the next best approach is to try Adam first, as it generally works very well with little tuning. Then, if you have time, try SGD with momentum; this one will take a bit more tuning of parameters such as the learning rate, but it generally produces very good results when well tuned.
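As a rough sketch of that recommendation (same assumed TensorFlow 1.x API as above; the learning rate and momentum values are just common starting points, not prescriptions):

```python
import tensorflow as tf  # assuming TensorFlow 1.x

w = tf.Variable(0.0)
loss = tf.square(5.0 - w)  # toy loss for illustration

# First try: Adam with its default learning rate often works well out of the box
train_op = tf.train.AdamOptimizer().minimize(loss)

# If you have time to tune: SGD with momentum; the learning rate (and ideally a
# decay schedule, as shown earlier) will need more careful tuning
train_op = tf.train.MomentumOptimizer(
    learning_rate=0.01, momentum=0.9).minimize(loss)
```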