Feature scaling
To make the optimization algorithm's job easier, there are some techniques that can, and should, be applied to your data as a preprocessing step before training and testing.
If the values on different dimensions of your input vector are out of scale with each other, the loss surface becomes stretched, elongated along the dimensions with larger values. This makes it harder for gradient descent to converge, or at least slower.
This normally happens when the features of your dataset have very different scales. For example, a dataset about houses might have "number of rooms" as one feature, with values between 1 and 4, and "house area" as another, with values between 1,000 and 10,000. These two features are hugely out of scale with each other, which can make learning difficult.
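As a minimal sketch, this is how such a dataset could be standardized with NumPy so that both features end up on a comparable scale; the array values below are made up purely for illustration:

import numpy as np

# Hypothetical house dataset: column 0 is "number of rooms" (1-4),
# column 1 is "house area" (1,000-10,000). The two columns differ in
# scale by several orders of magnitude.
X = np.array([
    [1, 1200.0],
    [2, 3400.0],
    [3, 7800.0],
    [4, 9600.0],
])

# Standardize each feature: subtract its mean and divide by its
# standard deviation so every column has zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]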
The following picture shows a simple example of what our loss function might look like when the input features are not in scale with each other, compared with what it looks like when they are properly scaled. Gradient descent has a harder time reaching the minimum of the loss function when the data is badly scaled.
Normally, you would standardize the data, for example by subtracting the mean and dividing by the standard deviation of your dataset before using it. In the case of RGB images, it is usually enough to subtract 128 from each pixel value to center the data around zero. A better approach, however, is to calculate the mean pixel value of each image channel across your dataset; this gives you three values, one per channel, which you then subtract from your input images. We don't really need to worry about scaling when dealing with images, as all features already share the same range (0-255).
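As a rough sketch of both options, assuming a hypothetical batch of RGB images stored as a NumPy array of shape (N, H, W, 3) with pixel values in [0, 255]:

import numpy as np

# Hypothetical batch of 100 RGB images, 32x32 pixels each.
images = np.random.randint(0, 256, size=(100, 32, 32, 3)).astype(np.float32)

# Simple option: centre every pixel around zero with a fixed offset.
centred = images - 128.0

# Better option: compute the mean of each colour channel over the whole
# dataset (three values) and subtract it from the corresponding channel.
channel_means = images.mean(axis=(0, 1, 2))   # shape (3,)
centred_per_channel = images - channel_means  # broadcasts over the channel axis

In practice the per-channel means would be computed once on the training set and reused when preprocessing validation and test images.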