Machine Learning for Finance

Heuristic, feature-based, and E2E models

Before we dive into developing models to detect fraud, let's pause for a moment to consider the different kinds of models we could build.

  • A heuristic-based model is a simple "rule of thumb" developed purely by humans. Usually, the heuristic stems from expert knowledge of the problem.
  • A feature-based model relies heavily on humans modifying the data to create new and meaningful features, which are then fed into a (simple) machine learning algorithm. This approach mixes expert knowledge with learning from data.
  • An end-to-end (E2E) model learns purely from raw data. No human expertise is used, and the model learns everything directly from observations.

In our case, a heuristic-based model could be created to mark all transactions with the TRANSFER transaction type and an amount over $200,000 as fraudulent. Heuristic-based models have the advantage that they are both fast to develop and easy to implement. However, this comes with a trade-off: their performance is often poor, and fraudsters can easily game the system. If we went with the preceding heuristic-based model, fraudsters transferring only $199,999, just under the limit, would evade detection.
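The rule described above can be written in a few lines. This is a minimal sketch, not a production check: the `Transaction` class and its field names are assumptions made for illustration, and only the TRANSFER type and the $200,000 threshold come from the text.

```python
from dataclasses import dataclass

# Threshold taken from the text above.
AMOUNT_LIMIT = 200_000


@dataclass
class Transaction:
    # Hypothetical record layout, used only for this illustration.
    type: str      # e.g. "TRANSFER" or "PAYMENT"
    amount: float  # transaction amount in dollars


def is_fraud_heuristic(tx: Transaction) -> bool:
    """Flag TRANSFER transactions with an amount over the limit."""
    return tx.type == "TRANSFER" and tx.amount > AMOUNT_LIMIT


# A large transfer is flagged ...
flagged = is_fraud_heuristic(Transaction("TRANSFER", 250_000))
# ... but a transfer just under the limit slips through.
evaded = is_fraud_heuristic(Transaction("TRANSFER", 199_999))
```

Here `flagged` is `True` and `evaded` is `False`, which shows how easily a fixed threshold can be sidestepped.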

An important heuristic in the field of trading is the momentum strategy. Momentum strategies involve betting that a stock that is on the rise will continue to rise, so traders buy that stock. While this strategy sounds too simple to be any good, it is in fact a reasonably successful strategy that many high-frequency trading and quantitative firms use today.
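To make the momentum idea concrete, here is a deliberately simplified sketch. It assumes a plain list of closing prices and a hypothetical `lookback` parameter; real momentum strategies are considerably more involved, so treat this only as an illustration of the rule of thumb.

```python
def should_buy(prices, lookback=3):
    """Return True if the latest price is higher than the price
    `lookback` periods earlier, i.e. the stock is "on the rise".

    `prices` is assumed to be a list of closing prices, oldest first.
    """
    if len(prices) <= lookback:
        return False  # not enough history to judge momentum
    return prices[-1] > prices[-1 - lookback]


rising = should_buy([100, 101, 103, 106])    # price went from 100 to 106
falling = should_buy([106, 104, 101, 100])   # price went from 106 to 100
```

With these inputs, `rising` is `True` and `falling` is `False`.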

To create features, experts craft indicators that can distinguish fraudulent transactions from genuine ones. This is often done using statistical data analysis. Compared to the heuristic-based model that we proposed earlier, this approach takes longer but delivers better results.

Feature engineering-based models are a middle ground between learning purely from data and having humans shape the rules: human knowledge and creativity are used to craft good features, and data and machine learning are used to create a model from those features.
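As a rough illustration of what crafting features can look like, the sketch below turns one raw transaction into a few numeric indicators. The field names (`sender_balance`, `hour`) and the three features themselves are hypothetical choices made for this example, not features prescribed by the text; in practice, they would come out of statistical analysis of the data.

```python
def engineer_features(tx):
    """Turn a raw transaction dict into a list of numeric features.

    All keys and features here are illustrative assumptions.
    """
    balance = tx["sender_balance"]
    return [
        # Is this a transfer?
        1.0 if tx["type"] == "TRANSFER" else 0.0,
        # Share of the sender's balance moved by this transaction.
        tx["amount"] / balance if balance > 0 else 0.0,
        # Was the transaction made at night (before 6 a.m.)?
        1.0 if tx["hour"] < 6 else 0.0,
    ]


raw = {"type": "TRANSFER", "amount": 900.0, "sender_balance": 1000.0, "hour": 3}
features = engineer_features(raw)
```

A list such as `features` would then be fed, together with the fraud labels, into a simple machine learning algorithm, which learns how much weight each indicator deserves.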

E2E models learn purely from collected data without using expert knowledge. As discussed before, this often yields much better results, but the models take a lot of time to build. There are additional costs to consider as well. For instance, collecting the large amount of data needed is expensive, as humans have to label millions of records.

Still, many people in the industry currently take the view that shipping a poor model is often better than not shipping anything at all. After all, having some protection against fraud is better than having none.

Using a heuristic approach that lets through half of all fraudulent transactions is better than having no fraud detection at all. The following graph shows the performance of the three approaches introduced earlier against the time taken to implement them.

Figure: The methods used and the performance of the system during development

The best method is to use a combination of all three. If a heuristic model meets the basic requirements of the task it sets out to achieve, it can be shipped. The heuristic then becomes the baseline that any other approach has to beat. Once your heuristic model is deployed, direct your efforts toward building a feature-based model. As soon as it beats the deployed heuristic model, the feature-based model can be deployed while you continue to refine it.
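The "baseline to beat" idea can be sketched as a small comparison on labeled data. The text does not say which metric to use; the F1 score below is an assumption chosen for illustration, and `pick_model_to_deploy` as well as the stand-in challenger rule are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Compute the F1 score for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_model_to_deploy(baseline, challenger, inputs, labels):
    """Return the challenger only if it beats the deployed baseline."""
    baseline_score = f1_score(labels, [baseline(x) for x in inputs])
    challenger_score = f1_score(labels, [challenger(x) for x in inputs])
    return challenger if challenger_score > baseline_score else baseline


# Toy data: transfer amounts and whether each one was fraudulent.
amounts = [50_000, 180_000, 199_999, 250_000]
is_fraud = [False, True, True, True]


def heuristic(amount):
    # The deployed rule of thumb from earlier in the text.
    return amount > 200_000


def challenger(amount):
    # Hypothetical stand-in for a trained feature-based model.
    return amount > 150_000


deployed = pick_model_to_deploy(heuristic, challenger, amounts, is_fraud)
```

On this toy data the heuristic catches only one of the three fraudulent transfers, so `deployed` ends up being `challenger`. The same comparison applies later when an E2E model challenges the feature-based model.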

As we've discussed before, feature-based models often deliver decent performance on structured data tasks. This gives companies the time to undertake the lengthy and expensive task of building an E2E model, which can be shipped once it beats the feature-based model. Now that we understand the types of models we're going to build, let's look at the software we need to build them.