Implementing Loss Functions
Loss functions are very important to machine learning algorithms. They measure the distance between the model outputs and the target (truth) values. In this recipe, we show various loss function implementations in TensorFlow.
Getting ready
In order to optimize our machine learning algorithms, we will need to evaluate the outcomes. Evaluating outcomes in TensorFlow depends on specifying a loss function. A loss function tells TensorFlow how good or bad the predictions are compared to the desired result. In most cases, we will have a set of data and a target on which to train our algorithm. The loss function compares the target to the prediction and gives a numerical distance between the two.
For this recipe, we will cover the main loss functions that we can implement in TensorFlow.
To see how the different loss functions operate, we will plot them in this recipe. We will first start a computational graph and load matplotlib, a Python plotting library, as follows:
import matplotlib.pyplot as plt
import tensorflow as tf
sess = tf.Session()  # session used by the sess.run calls in this recipe
How to do it…
First we will talk about loss functions for regression, that is, predicting a continuous dependent variable. To start, we will create a sequence of our predictions and a target as a tensor. We will output the results across 500 x-values between -1 and 1. See the next section for a plot of the outputs. Use the following code:
x_vals = tf.linspace(-1., 1., 500)
target = tf.constant(0.)
- The L2 norm loss is also known as the Euclidean loss function. It is just the square of the distance to the target. Here we will compute the loss function as if the target is zero. The L2 norm is a great loss function because it is very curved near the target, and algorithms can use this fact to converge to the target more slowly the closer they get. It appears as follows:
l2_y_vals = tf.square(target - x_vals)
l2_y_out = sess.run(l2_y_vals)
- The L1 norm loss is also known as the absolute loss function. Instead of squaring the difference, we take the absolute value. The L1 norm is more robust to outliers than the L2 norm because it is not as steep for larger values. One issue to be aware of is that the L1 norm is not smooth at the target, and this can result in algorithms not converging well. It appears as follows:
l1_y_vals = tf.abs(target - x_vals)
l1_y_out = sess.run(l1_y_vals)
- Pseudo-Huber loss is a continuous and smooth approximation to the Huber loss function. This loss function attempts to take the best of the L1 and L2 norms by being convex near the target and less steep for extreme values. The form depends on an extra parameter, delta, which dictates how steep it will be. We will plot two forms, delta1 = 0.25 and delta2 = 5, to show the difference, as follows:
delta1 = tf.constant(0.25)
phuber1_y_vals = tf.mul(tf.square(delta1), tf.sqrt(1. + tf.square((target - x_vals)/delta1)) - 1.)
phuber1_y_out = sess.run(phuber1_y_vals)
delta2 = tf.constant(5.)
phuber2_y_vals = tf.mul(tf.square(delta2), tf.sqrt(1. + tf.square((target - x_vals)/delta2)) - 1.)
phuber2_y_out = sess.run(phuber2_y_vals)
- Classification loss functions are used to evaluate loss when predicting categorical outcomes.
- We will need to redefine our predictions (x_vals) and target. We will save the outputs and plot them in the next section. Use the following:
x_vals = tf.linspace(-3., 5., 500)
target = tf.constant(1.)
targets = tf.fill([500,], 1.)
- Hinge loss is mostly used for support vector machines, but can be used in neural networks as well. It is meant to compute a loss between two target classes, 1 and -1. In the following code, we are using the target value 1, so the closer our predictions are to 1, the lower the loss value, and the loss is zero for predictions at or above 1:
hinge_y_vals = tf.maximum(0., 1. - tf.mul(target, x_vals))
hinge_y_out = sess.run(hinge_y_vals)
- Cross-entropy loss for a binary case is also sometimes referred to as the logistic loss function. It comes about when we are predicting the two classes 0 or 1. We wish to measure a distance from the actual class (0 or 1) to the predicted value, which is usually a real number between 0 and 1. To measure this distance, we can use the cross entropy formula from information theory. Note that the logarithms are only defined for predictions strictly between 0 and 1, so the x-values outside that range yield NaN and do not show up in the plot. The code is as follows:
xentropy_y_vals = - tf.mul(target, tf.log(x_vals)) - tf.mul((1. - target), tf.log(1. - x_vals))
xentropy_y_out = sess.run(xentropy_y_vals)
- Sigmoid cross entropy loss is very similar to the previous loss function, except we transform the x-values by the sigmoid function before we put them in the cross entropy loss, as follows:
xentropy_sigmoid_y_vals = tf.nn.sigmoid_cross_entropy_with_logits(x_vals, targets)
xentropy_sigmoid_y_out = sess.run(xentropy_sigmoid_y_vals)
- Weighted cross entropy loss is a weighted version of the sigmoid cross entropy loss. We provide a weight on the positive target. For example, we will weight the positive target by 0.5, as follows:
weight = tf.constant(0.5)
xentropy_weighted_y_vals = tf.nn.weighted_cross_entropy_with_logits(x_vals, targets, weight)
xentropy_weighted_y_out = sess.run(xentropy_weighted_y_vals)
- Softmax cross-entropy loss operates on non-normalized outputs. This function is used to measure a loss when there is only one target category instead of multiple. Because of this, the function transforms the outputs into a probability distribution via the softmax function and then computes the loss function from a true probability distribution, as follows:
unscaled_logits = tf.constant([[1., -3., 10.]])
target_dist = tf.constant([[0.1, 0.02, 0.88]])
softmax_xentropy = tf.nn.softmax_cross_entropy_with_logits(unscaled_logits, target_dist)
print(sess.run(softmax_xentropy))
[ 1.16012561]
- Sparse softmax cross-entropy loss is the same as the previous one, except that instead of the target being a probability distribution, it is an index of which category is true. Instead of a sparse all-zero target vector with one value of one, we just pass in the index of the true category, as follows:
unscaled_logits = tf.constant([[1., -3., 10.]])
sparse_target_dist = tf.constant([2])
sparse_xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(unscaled_logits, sparse_target_dist)
print(sess.run(sparse_xentropy))
[ 0.00012564]
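The two softmax results above can be checked by hand. The following is a minimal plain-Python sketch, not TensorFlow's implementation: the helper names (log_softmax, softmax_xentropy, sparse_softmax_xentropy) are our own, and we assume the usual definition of cross entropy as the negative sum of target probability times log-softmax.

```python
import math

def log_softmax(logits):
    """Log of the softmax, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def softmax_xentropy(logits, target_dist):
    """Cross entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target_dist, log_softmax(logits)))

def sparse_softmax_xentropy(logits, target_index):
    """Same loss when the target is given as a single class index."""
    return -log_softmax(logits)[target_index]

logits = [1.0, -3.0, 10.0]
dense_loss = softmax_xentropy(logits, [0.1, 0.02, 0.88])
sparse_loss = sparse_softmax_xentropy(logits, 2)
# A one-hot target distribution gives the same value as the sparse form.
one_hot_loss = softmax_xentropy(logits, [0.0, 0.0, 1.0])

print(round(dense_loss, 4))   # 1.1601
print(round(sparse_loss, 6))  # 0.000126
```

The sparse form is simply the dense form with a one-hot target, which is why passing the index 2 is enough.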
How it works…
Here is how to use matplotlib to plot the regression loss functions:
x_array = sess.run(x_vals)
plt.plot(x_array, l2_y_out, 'b-', label='L2 Loss')
plt.plot(x_array, l1_y_out, 'r--', label='L1 Loss')
plt.plot(x_array, phuber1_y_out, 'k-.', label='P-Huber Loss (0.25)')
plt.plot(x_array, phuber2_y_out, 'g:', label='P-Huber Loss (5.0)')
plt.ylim(-0.2, 0.4)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()
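To make the shapes of the regression curves concrete, here is a small plain-Python sketch that evaluates the same three formulas at a few points. The function names and the default target of zero are our own choices for illustration; the formulas mirror the TensorFlow expressions used in this recipe.

```python
import math

def l2_loss(pred, target=0.0):
    """Squared distance to the target."""
    return (target - pred) ** 2

def l1_loss(pred, target=0.0):
    """Absolute distance to the target."""
    return abs(target - pred)

def pseudo_huber_loss(pred, target=0.0, delta=1.0):
    """Smooth approximation to the Huber loss; delta controls steepness."""
    return delta ** 2 * (math.sqrt(1.0 + ((target - pred) / delta) ** 2) - 1.0)

# Print each loss for a handful of predictions with the target at zero.
for pred in (0.0, 0.1, 0.5, 1.0):
    print(pred,
          round(l2_loss(pred), 4),
          round(l1_loss(pred), 4),
          round(pseudo_huber_loss(pred, delta=0.25), 4),
          round(pseudo_huber_loss(pred, delta=5.0), 4))
```

At a prediction of 1.0, the Pseudo-Huber loss with delta = 0.25 is about 0.195, well below the L2 value of 1.0, while with delta = 5 it is about 0.495, close to half the L2 value. A small delta therefore behaves more like a scaled L1 norm away from the target, and a large delta more like a scaled L2 norm.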
And here is how to use matplotlib to plot the various classification loss functions:
x_array = sess.run(x_vals)
plt.plot(x_array, hinge_y_out, 'b-', label='Hinge Loss')
plt.plot(x_array, xentropy_y_out, 'r--', label='Cross Entropy Loss')
plt.plot(x_array, xentropy_sigmoid_y_out, 'k-.', label='Cross Entropy Sigmoid Loss')
plt.plot(x_array, xentropy_weighted_y_out, 'g:', label='Weighted Cross Entropy Loss (x0.5)')
plt.ylim(-1.5, 3)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()
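The classification curves can also be spot-checked without TensorFlow. The sketch below is a plain-Python illustration under our own naming. It assumes the weighted loss multiplies only the positive-target term by the weight, which is how we read the description of the weighted cross entropy loss in this recipe.

```python
import math

def hinge_loss(pred, target=1.0):
    """Hinge loss for targets in {-1, 1}."""
    return max(0.0, 1.0 - target * pred)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_xentropy(prob, target=1.0):
    """Binary cross entropy; only defined for 0 < prob < 1."""
    return -target * math.log(prob) - (1.0 - target) * math.log(1.0 - prob)

def sigmoid_xentropy(logit, target=1.0):
    """Cross entropy applied after squashing the logit with a sigmoid."""
    return binary_xentropy(sigmoid(logit), target)

def weighted_sigmoid_xentropy(logit, target=1.0, pos_weight=0.5):
    """Sigmoid cross entropy with the positive-target term scaled."""
    p = sigmoid(logit)
    return (-pos_weight * target * math.log(p)
            - (1.0 - target) * math.log(1.0 - p))

# Evaluate the logit-based losses at a few x-values with the target at 1.
for x in (-3.0, 0.0, 1.0, 5.0):
    print(x,
          round(hinge_loss(x), 4),
          round(sigmoid_xentropy(x), 4),
          round(weighted_sigmoid_xentropy(x), 4))
```

With the target fixed at 1 and a weight of 0.5, the weighted curve is exactly half the sigmoid cross entropy curve, which matches what the plot shows.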
There's more…
Here is a table summarizing the different loss functions that we have described:
The remaining classification loss functions are all variants of cross-entropy loss. The cross-entropy sigmoid loss function is for use on unscaled logits and is preferred over computing the sigmoid and then the cross entropy, because TensorFlow has better built-in ways to handle numerical edge cases. The same goes for softmax cross entropy and sparse softmax cross entropy.
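To illustrate the kind of numerical edge case involved, here is a plain-Python sketch comparing a naive sigmoid-then-cross-entropy computation with one well-known stable rearrangement, max(x, 0) - x * z + log(1 + exp(-|x|)), which is the form given in TensorFlow's documentation for tf.nn.sigmoid_cross_entropy_with_logits. The function names are our own, and this is an illustration rather than TensorFlow's actual code.

```python
import math

def naive_sigmoid_xentropy(logit, target):
    """Sigmoid first, then cross entropy; breaks for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)

def stable_sigmoid_xentropy(logit, target):
    """Algebraically equivalent form that avoids log(0) and overflow."""
    return (max(logit, 0.0) - logit * target
            + math.log1p(math.exp(-abs(logit))))

for logit, target in [(2.0, 1.0), (40.0, 0.0)]:
    try:
        naive = naive_sigmoid_xentropy(logit, target)
    except (ValueError, OverflowError) as err:
        # For logit = 40 the sigmoid rounds to exactly 1.0, so log(1 - p) fails.
        naive = "fails: " + str(err)
    print(logit, target, naive, stable_sigmoid_xentropy(logit, target))
```

For moderate logits the two versions agree, but once the sigmoid saturates in floating point the naive version fails while the rearranged one still returns a finite loss.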
There are also many other metrics to look at when evaluating a model. Here is a list of some more to consider: