Machine Learning for Finance

The data

The dataset we will work with is a synthetic dataset of transactions generated by a payment simulator. The goal of this case study and the focus of this chapter is to find fraudulent transactions within a dataset, a classic machine learning problem many financial institutions deal with.

Note

Before we go further, a digital copy of the code and an interactive notebook for this chapter are both available online, via the following two links:

An interactive notebook containing the code for this chapter can be found under https://www.kaggle.com/jannesklaas/structured-data-code

The code can also be found on GitHub, in this book's repository: https://github.com/PacktPublishing/Machine-Learning-for-Finance

The dataset we're using comes from the paper "PaySim: A financial mobile money simulator for fraud detection", by E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. The dataset can be found on Kaggle at this URL: https://www.kaggle.com/ntnu-testimon/paysim1.

Before we break it down, let's take a minute to look at the dataset that we'll be using in this chapter. Remember, you can download the data from the preceding link.

As seen in the first row, the dataset has 11 columns. Let's explain what each one represents before we move on:

  • step: Maps time, with each step corresponding to one hour.
  • type: The type of the transaction, which can be CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER.
  • amount: The amount of the transaction.
  • nameOrig: The origin account that started the transaction. Names starting with C are customer accounts, while names starting with M are merchant accounts.
  • oldbalanceOrig: The old balance of the origin account.
  • newbalanceOrig: The new balance of the origin account after the transaction has been applied.
  • nameDest: The destination account.
  • oldbalanceDest: The old balance of the destination account. This information is not available for merchant accounts whose names start with M.
  • newbalanceDest: The new balance of the destination account. This information is not available for merchant accounts.
  • isFraud: Whether the transaction was fraudulent.
  • isFlaggedFraud: Whether the old system has flagged the transaction as fraud.
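To make the column descriptions concrete, here is a minimal sketch that parses two made-up rows with the 11 columns listed above, using only the Python standard library. The rows, the account names, and the helper functions `account_kind` and `elapsed_days_hours` are hypothetical and exist only for illustration. The sketch also assumes that the CSV header uses the column names exactly as they are spelled in this list, which may not hold for the downloaded file.

```python
import csv
import io

# Column names exactly as listed above; the header of the real CSV file
# may spell some of them differently, so treat this list as an assumption.
EXPECTED_COLUMNS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrig", "newbalanceOrig",
    "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud",
]

# Two made-up rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = "\n".join([
    ",".join(EXPECTED_COLUMNS),
    "1,PAYMENT,100.0,C0001,500.0,400.0,M0001,0.0,0.0,0,0",
    "26,TRANSFER,250.0,C0002,250.0,0.0,C0003,0.0,250.0,1,0",
])


def account_kind(name):
    """Classify an account by its name prefix: C = customer, M = merchant."""
    if name.startswith("C"):
        return "customer"
    if name.startswith("M"):
        return "merchant"
    return "unknown"


def elapsed_days_hours(step):
    """Split a step count into (days, hours), using one step = one hour."""
    return divmod(step, 24)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    days, hours = elapsed_days_hours(int(row["step"]))
    print(
        row["type"],
        account_kind(row["nameOrig"]), "->", account_kind(row["nameDest"]),
        f"after {days} day(s) and {hours} hour(s)",
    )
```

Note that `csv.DictReader` returns every field as a string, so numeric columns such as step and amount have to be converted before doing any arithmetic with them.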

In the preceding table, we can see 10 rows of data. It's worth noting that there are about 6.3 million transactions in the full dataset, so what we've seen is only a tiny fraction of it. Because the fraud we're looking at occurs only in transactions of type TRANSFER or CASH_OUT, all other transactions can be dropped, leaving us with around 2.8 million examples to work with.
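The filtering step described above can be sketched in a few lines of plain Python. The records below are made up for illustration, and the names `FRAUD_PRONE_TYPES`, `keep_fraud_prone`, and `fraud_share` are hypothetical helpers rather than part of the chapter's code. On the real data, the same idea would be applied to all 6.3 million rows.

```python
# Transaction types in which fraud occurs, according to the text above.
FRAUD_PRONE_TYPES = {"TRANSFER", "CASH_OUT"}


def keep_fraud_prone(transactions):
    """Return only the transactions whose type can contain fraud."""
    return [t for t in transactions if t["type"] in FRAUD_PRONE_TYPES]


def fraud_share(transactions):
    """Fraction of transactions labeled as fraud (0.0 for an empty list)."""
    if not transactions:
        return 0.0
    return sum(t["isFraud"] for t in transactions) / len(transactions)


# Made-up records for illustration only; they are NOT taken from the dataset.
sample = [
    {"type": "PAYMENT", "isFraud": 0},
    {"type": "TRANSFER", "isFraud": 1},
    {"type": "CASH_IN", "isFraud": 0},
    {"type": "CASH_OUT", "isFraud": 0},
    {"type": "DEBIT", "isFraud": 0},
]

filtered = keep_fraud_prone(sample)
print(len(sample), "transactions before filtering,", len(filtered), "after")
```

Dropping the other transaction types removes no fraud cases, so the share of fraudulent examples among the remaining rows goes up, which is worth keeping in mind when comparing metrics before and after filtering.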