Creating predictive models with Keras
Our data now contains the following columns:
amount, oldBalanceOrig, newBalanceOrig, oldBalanceDest, newBalanceDest, isFraud, isFlaggedFraud, type_CASH_OUT, type_TRANSFER, isNight
With these columns in place, our data is prepared, and we can use it to create a model.
Extracting the target
To train the model, a neural network needs a target. In our case, isFraud is the target, so we have to separate it from the rest of the data. We can do this by running:
y_df = df['isFraud']
x_df = df.drop('isFraud', axis=1)
The first step returns only the isFraud column and assigns it to y_df. The second step returns all columns except isFraud and assigns them to x_df.
We also need to convert our data from a pandas DataFrame to NumPy arrays. The pandas DataFrame is built on top of NumPy arrays but comes with lots of extra bells and whistles that make all the preprocessing we did earlier possible. To train a neural network, however, we just need the underlying data, which we can get by simply running the following:
y = y_df.values
X = x_df.values
Creating a test set
When we train our model, we run the risk of overfitting. Overfitting means that our model memorizes the x and y mapping in our training dataset but does not find the function that describes the true relationship between x and y. This is problematic because once we run our model out of sample (that is, on data not in our training set), it might do very poorly. To guard against this, we're going to create a so-called test set.
A test set is a holdout dataset, which we only use to evaluate our model once we think it is doing fairly well in order to see how well it performs on data it has not seen yet. A test set is usually randomly sampled from the complete data. Scikit-learn offers a convenient function to do this, as we can see in the following code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
The train_test_split function will randomly assign rows to either the train or test set. You can specify test_size, the share of data that goes into the test set (which in our case is 33%), as well as a random state. Assigning random_state makes sure that although the process is pseudo-random, it will always return the same split, which makes our work more reproducible. Note that the actual choice of number (for example, 42) does not really matter. What matters is that the same number is used in all experiments.
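As a quick sanity check (optional, and not part of the original workflow), you can confirm that a fixed random_state yields identical splits across runs:
import numpy as np
from sklearn.model_selection import train_test_split

# Two splits with the same random_state should be element-for-element identical
X_a, _, y_a, _ = train_test_split(X, y, test_size=0.33, random_state=42)
X_b, _, y_b, _ = train_test_split(X, y, test_size=0.33, random_state=42)
print(np.array_equal(X_a, X_b), np.array_equal(y_a, y_b))  # prints: True True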
Creating a validation set
Now you might be tempted to just try out a lot of different models until you get a really high performance on the test set. However, ask yourself this: how would you know that you have not selected a model that by chance works well on the test set but does not work in real life?
The answer is that every time you evaluate on the test set, you incur a bit of "information leakage," that is, information from the test set leaks into your model by influencing your choice of model. Gradually, the test set becomes less valuable. A validation set is a sort of "dirty test set" that you can use to frequently measure your model's out-of-sample performance without worrying too much about this leakage; the actual test set is then touched only rarely.
To this end, we'll create a "validation set," also known as a development set.
We can do this the same way we created the test set, by just splitting the training data again, as we can see in the following code:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
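To confirm the three splits look as expected, you can print their shapes (a quick, optional check that is not required for training):
# Roughly 60% of the data ends up in training, about 7% in validation, and 33% in the test set
print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.shape, y_val.shape, y_test.shape)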
Oversampling the training data
Remember that in our dataset, only a tiny fraction of transactions were fraudulent, and that a model that is always classifying transactions as genuine would have a very high level of accuracy. To make sure we train our model on true relationships, we can oversample the training data.
This means that we add fraudulent samples to our training data until it contains as many fraudulent transactions as genuine ones.
Note
Note: A useful library for this kind of task is imblearn, which includes a SMOTE function. See http://contrib.scikit-learn.org/imbalanced-learn/.
Synthetic Minority Over-sampling Technique (SMOTE) is a clever way of oversampling. This method tries to create new samples while maintaining the same decision boundaries for the classes. We can oversample with SMOTE by simply running:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)  # in newer imblearn versions, fit_sample is named fit_resample
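A quick way to see the effect of the oversampling (an optional check, not part of the pipeline above) is to compare the class counts before and after resampling:
import numpy as np

# Before resampling, genuine transactions vastly outnumber fraudulent ones
print(np.bincount(y_train.astype(int)))
# After SMOTE, both classes appear in equal numbers
print(np.bincount(y_train_res.astype(int)))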
Building the model
We've successfully addressed several key learning points, and so it's now finally time to build a neural network! As in Chapter 1, Neural Networks and Gradient-Based Optimization, we need to import the required Keras modules using the following code:
from keras.models import Sequential
from keras.layers import Dense, Activation
In practice, many structured data problems require very low learning rates. To set the learning rate for the gradient descent optimizer, we also need to import the optimizer. We can do this by running:
from keras.optimizers import SGD
Creating a simple baseline
Before we dive into more advanced models, it is wise to start with a simple logistic regression baseline. This is to make sure that our model can actually train successfully.
To create a simple baseline, we need to run the following code:
model = Sequential()
model.add(Dense(1, input_dim=9))
model.add(Activation('sigmoid'))
What we have here is a logistic regressor, which is the same as a one-layer neural network. Next, we compile it:
model.compile(loss='binary_crossentropy', optimizer=SGD(lr=1e-5), metrics=['acc'])
Instead of just passing SGD to specify the optimizer for stochastic gradient descent, we create a custom instance of SGD in which we set the learning rate to 0.00001. In this example, tracking accuracy is not strictly needed, since we evaluate our models using the F1 score, but it still reveals some interesting behavior, as you can see in the following code:
model.fit(X_train_res,y_train_res, epochs=5, batch_size=256, validation_data=(X_val,y_val))
Notice how we have passed the validation data into Keras by creating a tuple in which we store data and labels. We will train this model for 5 epochs:
Train on 3331258 samples, validate on 185618 samples
Epoch 1/5
3331258/3331258 [==============================] - 20s 6us/step - loss: 3.3568 - acc: 0.7900 - val_loss: 3.4959 - val_acc: 0.7807
Epoch 2/5
3331258/3331258 [==============================] - 20s 6us/step - loss: 3.0356 - acc: 0.8103 - val_loss: 2.9473 - val_acc: 0.8151
Epoch 3/5
3331258/3331258 [==============================] - 20s 6us/step - loss: 2.4450 - acc: 0.8475 - val_loss: 0.9431 - val_acc: 0.9408
Epoch 4/5
3331258/3331258 [==============================] - 20s 6us/step - loss: 2.3416 - acc: 0.8541 - val_loss: 1.0552 - val_acc: 0.9338
Epoch 5/5
3331258/3331258 [==============================] - 20s 6us/step - loss: 2.3336 - acc: 0.8546 - val_loss: 0.8829 - val_acc: 0.9446
Notice a few things here: first, we have trained on about 3.3 million samples, which is more data than we initially had. The sudden increase comes from the oversampling that we did earlier on in this chapter. Secondly, the training set's accuracy is significantly lower than the validation set's accuracy. This is because the training set is balanced, while the validation set is not.
We oversampled the data by adding more fraud cases to the training set than there are in real life, which as we discussed, helped our model detect fraud better. If we did not oversample, our model would be inclined to classify all transactions as genuine since the vast majority of samples in the training set are genuine.
By adding fraud cases, we are forcing the model to learn what distinguishes a fraud case. Yet, we want to validate our model on realistic data. Therefore, our validation set does not artificially contain many fraud cases.
A model classifying everything as genuine would have over 99% accuracy on the validation set, but just 50% accuracy on the training set. Accuracy is therefore a flawed metric for such imbalanced datasets; it is only a half-decent proxy, but it is more interpretable than the loss alone, which is why we keep track of it in Keras.
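You can verify this yourself with an optional check: the accuracy of a model that always predicts "genuine" is simply the share of genuine transactions in each set:
# Accuracy of an "always genuine" classifier on each set
print('balanced training set:', 1 - y_train_res.mean())   # roughly 0.5
print('imbalanced validation set:', 1 - y_val.mean())     # close to 1.0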
To evaluate our model, we should use the F1 score that we discussed at the beginning of this chapter. However, Keras is unable to directly track the F1 score in training since the calculation of an F1 score is somewhat slow and would end up slowing down the training of our model.
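If you do want a per-epoch F1 readout anyway, one option is a custom Keras callback that scores the validation set at the end of every epoch. The following is only a sketch; the F1Logger class is a hypothetical helper, not part of the book's code, and it will slow training down somewhat for the reason just mentioned:
from keras.callbacks import Callback
from sklearn.metrics import f1_score

class F1Logger(Callback):
    # Hypothetical helper: computes the F1 score on a held-out set after each epoch
    def __init__(self, X_val, y_val):
        super(F1Logger, self).__init__()
        self.X_val = X_val
        self.y_val = y_val

    def on_epoch_end(self, epoch, logs=None):
        # Turn the predicted probabilities into hard 0/1 labels at a 0.5 threshold
        preds = (self.model.predict(self.X_val).ravel() > 0.5).astype(int)
        print('Epoch %d validation F1: %.4f' % (epoch + 1, f1_score(self.y_val, preds)))

# Usage: pass the callback to model.fit, for example:
# model.fit(X_train_res, y_train_res, epochs=5, batch_size=256,
#           validation_data=(X_val, y_val), callbacks=[F1Logger(X_val, y_val)])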
Note
Note: Remember that accuracy on an imbalanced dataset can be very high, even if the model is performing poorly.
A higher accuracy on an imbalanced validation set than on a balanced training set therefore says little about how well the model is actually performing.
You should compare the training set's performance against previous training runs, and likewise the validation set's performance against previous validation runs. Be careful, however, when comparing the training set's performance directly with the validation set's performance on highly imbalanced data. If your data is equally balanced, comparing the validation set against the training set is a good way to gauge overfitting.
We are now in a position to make predictions on our test set in order to evaluate the baseline. We start by using model.predict to obtain the predicted fraud probabilities:
y_pred = model.predict(X_test)
Before evaluating our baseline, we need to turn the probabilities given by our model into absolute predictions. In our example, we'll classify everything that has a fraud probability above 50% as fraud. To do this, we need to run the following code:
y_pred[y_pred > 0.5] = 1
y_pred[y_pred < 0.5] = 0
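Note that a prediction of exactly 0.5 is left unchanged by the two masks above. An equivalent, slightly more compact way to binarize the probabilities (an alternative sketch, not the code used in the rest of this section) would be:
# One-step thresholding: everything strictly above 0.5 becomes 1, everything else 0
y_pred_alt = (model.predict(X_test) > 0.5).astype(int)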
Our F1 score is already significantly better than that of the heuristic model, which, if you go back, you'll see achieved a score of only 0.013131315551742895:
f1_score(y_pred=y_pred,y_true=y_test)
out: 0.054384286716408395
By plotting the confusion matrix, we're able to see that our feature-based model has indeed improved on the heuristic model:
cm = confusion_matrix(y_pred=y_pred, y_true=y_test)
plot_confusion_matrix(cm, ['Genuine', 'Fraud'], normalize=False)
This code should produce the following confusion matrix:
A confusion matrix for a simple Keras model
But what if we wanted to build more complex models that can express more subtle relationships than the one we've just built? Let's do that now!
Building more complex models
After we have created a simple baseline, we can go on to more complex models. The following code is an example of a two-layer network:
model = Sequential()
model.add(Dense(16, input_dim=9))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer=SGD(lr=1e-5), metrics=['acc'])

model.fit(X_train_res, y_train_res, epochs=5, batch_size=256, validation_data=(X_val, y_val))

y_pred = model.predict(X_test)
y_pred[y_pred > 0.5] = 1
y_pred[y_pred < 0.5] = 0
After running that code, we again benchmark with the F1 score:
f1_score(y_pred=y_pred,y_true=y_test)
out: 0.087220701988752675
In this case, the more complex model does better than the simple baseline created earlier. It seems as though the function mapping transaction data to fraud is complex and can be approximated better by a deeper network.
In this section we have built and evaluated both simple and complex neural network models for fraud detection. We have been careful to use the validation set to gauge the initial out-of-sample performance.
With all of that, we can build much more complex neural networks (and we will). But first we will have a look at the workhorse of modern enterprise-ready machine learning: tree-based methods.