Preparing the data for the Keras library
In Chapter 1, Neural Networks and Gradient-Based Optimization, we saw that neural networks only take numbers as inputs. The issue with our dataset is that not all of the information in our table consists of numbers; some of it is given as characters.
Therefore, in this section, we're going to work on preparing the data for Keras so that we can meaningfully work with it.
Before we start, let's look at the three types of data, Nominal, Ordinal, and Numerical:
- Nominal data: This comes in discrete categories that cannot be ordered. In our case, the type of transfer is a nominal variable. There are four discrete types, but it does not make sense to put them in any order. For instance, TRANSFER cannot be more than CASH_OUT, so instead, they are just separate categories.
- Ordinal data: This also comes in discrete categories, but unlike nominal data, it can be ordered. For example, if coffee comes in large, medium, and small sizes, those are distinct categories that can nevertheless be compared: the large size contains more coffee than the small size.
- Numerical data: This can be ordered, but we can also perform mathematical operations on it. An example in our data is the amount of funds, as we can not only compare amounts but also add or subtract them.
Both nominal and ordinal data are categorical data, as they describe discrete categories. While numerical data works with neural networks out of the box, categorical data needs special treatment.
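A quick way to see which columns are numeric and which are categorical is to inspect the column dtypes. This is just an optional sanity check and assumes the transaction data has already been loaded into a data frame called df earlier in the chapter:

# Columns with dtype 'object' hold strings and are the categorical ones;
# the remaining columns are numeric
print(df.dtypes)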
One-hot encoding
The most commonly used method to encode categorical data is called one-hot encoding. In one-hot encoding, we create a new variable, a so-called dummy variable for each category. We then set the dummy variable to 1 if the transaction is a member of a certain category and to zero otherwise.
An example of how we could apply this to our dataset is as follows. Before one-hot encoding, the categorical type column contains one transaction type per row, such as TRANSFER or CASH_OUT. After one-hot encoding, each type has its own column, containing a 1 for transactions of that type and a 0 for all others.
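As a minimal sketch of this transformation (the rows below are made up purely for illustration; we will apply the same idea to the real data in a moment), pandas' get_dummies() function turns a single categorical column into one indicator column per category:

import pandas as pd

# A toy frame containing only the categorical column (illustrative values)
toy = pd.DataFrame({'type': ['TRANSFER', 'CASH_OUT', 'PAYMENT']})

# Each category becomes its own indicator column
print(pd.get_dummies(toy['type']))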
The Pandas software library offers a function that allows us to create dummy variables out of the box. Before doing so, however, it makes sense to add Type_ in front of all the actual transaction types. The dummy variables will be named after the category, so by adding Type_ to the beginning, we know that these dummy variables indicate the type.
The following line of code does three things. Firstly, df['type'].astype(str) converts all the entries in the Type column to strings. Secondly, the Type_ prefix is added by concatenating it with those strings. Thirdly, the new column of combined strings replaces the original Type column:
df['type'] = 'Type_' + df['type'].astype(str)
We can now get the dummy variables by running the following code:
dummies = pd.get_dummies(df['type'])
We should note that the get_dummies() function creates a new data frame. Next, we attach this data frame to the main data frame, which can be done by running the following:
df = pd.concat([df,dummies],axis=1)
The concat() function, as seen in the preceding code, concatenates two data frames. We concatenate along axis 1 to add the dummies as new columns. Now that the dummy variables are in our main data frame, we can remove the original column by running this:
del df['type']
And, voilà! We have turned our categorical variable into something a neural network will be able to work with.
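As a quick, optional sanity check (the column names here rely on the Type_ prefix we added above), we can confirm that the dummy columns are in place and that the original column is gone:

# List the newly created dummy columns
print([c for c in df.columns if c.startswith('Type_')])

# The original 'type' column should no longer be present
print('type' in df.columns)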
Entity embeddings
In this section, we're going to walk through using both embeddings and the Keras functional API, showing you the general workflow. Both of these topics are introduced and explored fully in Chapter 5, Parsing Textual Data with Natural Language Processing, where we go beyond the general ideas presented here and discuss their implementation in more detail.
It's fine if you do not understand everything that is going on just now; this is an advanced section after all. If you want to use both of these techniques, you will be well prepared after reading this book, as we explain different elements of both methods throughout the book.
In this section, we will be creating embedding vectors for the categorical data. Before we start, we need to understand that an embedding vector is a dense vector representing a categorical value, and that we use these vectors as inputs for neural networks. We train the embeddings together with the neural network so that, over time, we obtain more useful embeddings. Embeddings are an extremely useful tool to have at our disposal.
Why are embeddings so useful? Not only do embeddings need fewer dimensions than one-hot encoding, and thus decrease memory usage, but they also reduce sparsity in input activations, which helps reduce overfitting, and they can encode semantic meanings as vectors. The same advantages that make embeddings useful for text (see Chapter 5, Parsing Textual Data with Natural Language Processing) also make them useful for categorical data.
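To make the idea concrete, here is a minimal, self-contained sketch, not yet our actual model: an embedding layer simply maps an integer token to a trainable dense vector that the rest of the network can learn to use. The sizes below (five categories, three dimensions) are chosen purely for illustration:

import numpy as np
from keras.layers import Input, Embedding
from keras.models import Model

# An embedding layer mapping each of 5 possible tokens to a 3-dimensional vector
token_in = Input(shape=(1,))
vec = Embedding(input_dim=5, output_dim=3)(token_in)
lookup = Model(token_in, vec)

# Token 2 is looked up and returns its (randomly initialized, trainable) vector
print(lookup.predict(np.array([[2]])))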
Tokenizing categories
Just as with text, we have to tokenize the inputs before feeding them into the embedding layer. To do this, we have to create a mapping dictionary that maps each category to a token. Note that this step operates on the original type column, before the one-hot encoding transformations from the previous section. We can achieve this by running the following:
map_dict = {}
for token, value in enumerate(df['type'].unique()):
    map_dict[value] = token
This code loops over all the unique type categories while counting upward. The first category gets token 0, the second 1, and so on. Our map_dict looks like this:
{'CASH_IN': 4, 'CASH_OUT': 2, 'DEBIT': 3, 'PAYMENT': 0, 'TRANSFER': 1}
We can now apply this mapping to our data frame:
df["type"].replace(map_dict, inplace=True)
As a result, all types will now be replaced by their tokens.
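If we want to verify the result, we can inspect the unique values of the column, which should now be the integer tokens 0 to 4 (the exact assignment depends on the order in which the categories appear in the data):

# The type column now contains integer tokens instead of strings
print(df['type'].unique())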
We have to deal with the non-categorical values in our data frame separately. We can create a list of columns that are not the type and not the target like this:
other_cols = [c for c in df.columns if ((c != 'type') and (c != 'isFraud'))]
Creating input models
The model we are creating will have two inputs: one for the types, which goes through an embedding layer, and one for all the other, non-categorical variables. To combine them more easily later on, we're going to keep track of their inputs and outputs with two lists:
inputs = []
outputs = []
The model that acts as an input for the type receives a one-dimensional input and passes it through an embedding layer. The outputs of the embedding layer are then reshaped into flat arrays, as we can see in this code:
# These imports may already have been made earlier in the chapter
from keras.layers import Input, Embedding, Reshape, Dense, Activation, Concatenate
from keras.models import Model

num_types = len(df['type'].unique())
type_embedding_dim = 3

type_in = Input(shape=(1,))
type_embedding = Embedding(num_types, type_embedding_dim, input_length=1)(type_in)
type_out = Reshape(target_shape=(type_embedding_dim,))(type_embedding)

type_model = Model(type_in, type_out)

inputs.append(type_in)
outputs.append(type_out)
The type embeddings have three dimensions here. This is an arbitrary choice, and experimenting with different numbers of dimensions could improve the results.
For all the other inputs, we create another input that has as many dimensions as there are non-categorical variables, followed by a single dense layer with no activation function. The dense layer is optional; the inputs could also be passed directly into the head model, and more layers could be added as well:
num_rest = len(other_cols)

rest_in = Input(shape=(num_rest,))
rest_out = Dense(16)(rest_in)

rest_model = Model(rest_in, rest_out)

inputs.append(rest_in)
outputs.append(rest_out)
Now that we have created the two input models, we can concatenate them. On top of the two concatenated inputs, we will also build our head model. To begin this process, we must first run the following:
concatenated = Concatenate()(outputs)
Then, by running the following code, we can build and compile the overall model:
x = Dense(16)(concatenated)
x = Activation('sigmoid')(x)
x = Dense(1)(x)
model_out = Activation('sigmoid')(x)

merged_model = Model(inputs, model_out)
merged_model.compile(loss='binary_crossentropy',
                     optimizer='adam',
                     metrics=['accuracy'])
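Before training, it can be useful to inspect the merged architecture with Keras' built-in summary, which confirms that both the embedding branch and the dense branch feed into the head model (the exact layer names printed will vary):

# Print layer shapes and parameter counts of the merged model
merged_model.summary()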
Training the model
In this section, we're going to train a model with multiple inputs. To do this, we need to provide a separate array of X values for each input. So, first, we must split up our data frame. We can do this by running the following code:
types = df['type']
rest = df[other_cols]
target = df['isFraud']
Then, we can train the model by providing a list of the two inputs and the target, as we can see in the following code:
history = merged_model.fit([types.values, rest.values], target.values,
                           epochs=1,
                           batch_size=128)
out:
Epoch 1/1
6362620/6362620 [==============================] - 78s 12us/step - loss: 0.0208 - acc: 0.9987
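Once trained, the model scores transactions using the same two-input format. As a minimal sketch (reusing the training arrays here purely for illustration; in practice, we would score held-out data instead):

# Predicted fraud probabilities for the first five transactions
predictions = merged_model.predict([types.values[:5], rest.values[:5]])
print(predictions)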