Activation functions
To allow ANN models to tackle more complex problems, we need to add a nonlinear block just after the neuron dot product. Cascading these nonlinear layers allows the network to compose different concepts together, making complex problems easier to solve.
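As a minimal sketch of this idea (in NumPy, with made-up weights and inputs purely for illustration), a neuron's output is just a nonlinear function applied to the result of its dot product:

```python
import numpy as np

def neuron(x, w, b, activation):
    # Dot product of the inputs and weights, plus a bias term
    z = np.dot(w, x) + b
    # Nonlinear block applied just after the dot product
    return activation(z)

# Arbitrary example values, using a tanh nonlinearity
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias
print(neuron(x, w, b, np.tanh))
```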
The use of nonlinear activations in our neurons is very important. If we didn't use nonlinear activation functions, then no matter how many layers we cascaded, we would only ever have something that behaves like a linear model. This is because any composition of linear functions collapses down to a single linear function.
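A quick numerical check makes this concrete: for any weights W1, W2 and biases b1, b2, we have W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2), so two stacked linear layers are equivalent to a single one. The sketch below, using arbitrary random values, verifies this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Two linear layers with no activation in between
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```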
There is a wide variety of activation functions that can be used in our neurons, and some are shown here; the only strict requirement is that the function is nonlinear. Each activation function has its own advantages and disadvantages.
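For reference, a few of the most common activation functions can each be written in a line or two. The sketch below uses NumPy and is not tied to any particular library's implementation:

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs to the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Passes positive inputs through unchanged, zeroes out negatives
    return np.maximum(x, 0.0)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```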
Historically, Sigmoid and TanH were the activation functions of choice for neural networks. However, these functions turned out to be poor choices for reliably training neural networks, as they have the undesirable property that their values saturate at either end. This causes the gradients to become nearly zero at these points which, as we will find out later, is not a good thing when training a neural network.
As a result, one of the more popular activation functions is the ReLU, or Rectified Linear Unit. ReLU is simply the maximum of its input and 0: max(x, 0). It has the desirable property that its gradient (at least for positive inputs) does not become zero, which greatly helps the speed of convergence when training neural networks.
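The difference is easy to see by evaluating the derivatives directly. The sketch below uses the standard closed-form derivatives of Sigmoid and TanH, and the usual convention of a zero gradient for ReLU on negative inputs:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # nearly 0 for large |x|: the gradient saturates

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # also nearly 0 for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # stays at 1 for any positive input

x = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(x))  # ~[0.000045, 0.197, 0.197, 0.000045]
print(tanh_grad(x))     # ~[0.0000000082, 0.42, 0.42, 0.0000000082]
print(relu_grad(x))     # [0., 0., 1., 1.]
```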
This activation function gained popularity after it was used to help train deep CNNs. Its simplicity and effectiveness generally make it the go-to activation function.