Getting ready
The coefficients (W and b) are updated in proportion to the negative of the gradient of the loss function. There are three variations of gradient descent, depending on how much of the training data is used to compute the gradient for each update:
- Vanilla gradient descent: In vanilla gradient descent (also sometimes called batch gradient descent), the gradient of the loss function is calculated over the entire training set at each epoch. This process can be slow and intractable for very large datasets. It is guaranteed to converge to the global minimum for a convex loss function, but for a non-convex loss function, it might settle in a local minimum.
- Stochastic gradient descent: In stochastic gradient descent, one training sample is presented at a time, the weights and biases are updated in the direction that decreases the loss, and then we move on to the next training sample. The whole process is repeated for a number of epochs. As it performs one update per sample, it is faster than vanilla gradient descent, but at the same time, the frequent updates can cause high variance in the loss function.
- Mini-batch gradient descent: This combines the best qualities of the previous two; here, the parameters are updated using a small batch of training samples at a time, as shown in the sketch after this list.
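To make the difference between the three variants concrete, here is a minimal NumPy sketch (not the recipe's code) that applies the update rule W ← W − η·∂L/∂W to a simple linear regression with a mean squared error loss. The function names, the quadratic loss, the learning rate, and the synthetic data are illustrative assumptions; only the batching strategy changes between the three functions.

```python
# Minimal sketch of the three gradient descent variants for linear regression
# with an MSE loss. All names and hyperparameters are illustrative.
import numpy as np

def gradients(X, y, W, b):
    # Gradients of the MSE loss (1/n) * sum((XW + b - y)^2) w.r.t. W and b.
    n = len(y)
    error = X.dot(W) + b - y
    return 2.0 / n * X.T.dot(error), 2.0 / n * error.sum()

def vanilla_gd(X, y, lr=0.01, epochs=100):
    # Batch (vanilla) gradient descent: one update per epoch, full training set.
    W, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        dW, db = gradients(X, y, W, b)
        W -= lr * dW
        b -= lr * db
    return W, b

def sgd(X, y, lr=0.01, epochs=100):
    # Stochastic gradient descent: one update per training sample.
    W, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            dW, db = gradients(X[i:i + 1], y[i:i + 1], W, b)
            W -= lr * dW
            b -= lr * db
    return W, b

def minibatch_gd(X, y, lr=0.01, epochs=100, batch_size=32):
    # Mini-batch gradient descent: one update per batch of training samples.
    W, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            dW, db = gradients(X[batch], y[batch], W, b)
            W -= lr * dW
            b -= lr * db
    return W, b

# Example usage on synthetic data generated from y = 3x + 2 plus noise.
X = np.random.rand(200, 1)
y = 3 * X[:, 0] + 2 + 0.1 * np.random.randn(200)
print(minibatch_gd(X, y, lr=0.1, epochs=200, batch_size=32))
```

In practice, mini-batch gradient descent is the usual default, since batches large enough to smooth out the variance of stochastic updates still allow many more updates per epoch than the full-batch version.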