A brief primer on tree-based methods
No chapter on structured data would be complete without mentioning tree-based methods, such as random forests or XGBoost.
It is worth knowing about them because, in the realm of predictive modeling for structured data, tree-based methods are very successful. However, they do not perform as well on more advanced tasks, such as image recognition or sequence-to-sequence modeling. This is the reason why the rest of the book does not deal with tree-based methods.
Note
For a deeper dive into XGBoost, check out the tutorials on the XGBoost documentation page: http://xgboost.readthedocs.io. There is a nice explanation of how tree-based methods and gradient boosting work in theory and practice under the Tutorials section of the website.
A simple decision tree
The basic idea behind tree-based methods is the decision tree. A decision tree splits the data on the feature and threshold that create the greatest difference in outcomes between the resulting groups.
Let's assume for a second that our isNight feature is the greatest predictor of fraud. A decision tree would split our dataset according to whether the transactions happened at night or not. It would then look at all the night-time transactions, searching for the next best predictor of fraud, and it would do the same for all the day-time transactions.
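To make "the greatest difference in outcomes" concrete, the quality of a split is usually measured by how much it reduces an impurity measure such as Gini impurity. The following is a minimal sketch scoring a hypothetical isNight split; the fraud counts are made up purely for illustration:

import numpy as np

def gini(labels):
    # Gini impurity of a set of binary fraud labels
    p = np.mean(labels)
    return 2 * p * (1 - p)

# Made-up counts: 1 = fraud, 0 = genuine
night = np.array([1] * 40 + [0] * 60)    # 40% fraud at night
day = np.array([1] * 5 + [0] * 195)      # 2.5% fraud during the day
parent = np.concatenate([night, day])

# Impurity before the split versus the weighted impurity after it
before = gini(parent)
after = (len(night) * gini(night) + len(day) * gini(day)) / len(parent)
print('Impurity reduction from splitting on isNight:', before - after)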
Scikit-learn has a handy decision tree module. We can create one for our data by simply running the following code:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
The resulting tree will look like this:
A decision tree for fraud detection
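If you want to render a plot like this yourself, one option is scikit-learn's plot_tree function (available from scikit-learn 0.21 onwards); the figure above may have been produced differently. This sketch assumes X_train is a DataFrame whose columns hold the feature names and that class 1 means fraud:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Limit the plotted depth for readability; deep trees quickly become unreadable
plt.figure(figsize=(16, 8))
plot_tree(dtree, feature_names=list(X_train.columns),
          class_names=['genuine', 'fraud'], filled=True, max_depth=3)
plt.show()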
Simple decision trees, like the one we've produced, can give a lot of insight into data. For example, in our decision tree, the most important feature seems to be the old balance of the origin account, given that it is the first node in the tree.
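We can check this reading of the tree against the model's impurity-based feature importances. A quick sketch, again assuming X_train is a DataFrame:

import pandas as pd

# Impurity-based importance of each feature in the fitted tree
importances = pd.Series(dtree.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())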
A random forest
A more advanced version of a simple decision tree is a random forest, which is a collection of decision trees. A forest is trained by taking subsets of the training data and training decision trees on those subsets.
Often, those subsets include only some of the features of the training data as well. By doing it this way, the different decision trees fit different aspects of the data and capture more information in aggregate. After a number of trees have been created, their predictions are averaged to create the final prediction.
The idea is that the errors made by the individual trees are not correlated with each other, so averaging over many trees cancels much of the error out. You can create and train a random forest classifier like this:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10, n_jobs=-1)
rf.fit(X_train_res, y_train_res)
You'll notice from the code we've just written that random forests have far fewer knobs to tune than neural networks. In this case, we just specify n_estimators, that is, the number of trees we would like our forest to have. The n_jobs argument tells the random forest how many trees to train in parallel; -1 stands for "as many as there are CPU cores". We can then evaluate the trained forest on the test set:
y_pred = rf.predict(X_test)
f1_score(y_pred=y_pred, y_true=y_test)
out: 0.8749502190362406
The random forest does an order of magnitude better than the neural network as its F1 score is close to 1, which is the maximum score. Its confusion plot, seen as follows, shows that the random forest significantly reduced the number of false positives:
A confusion matrix for the random forest
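A confusion matrix like the one above can be computed directly from the predictions. The exact plotting code behind the figure may differ, but a minimal sketch with scikit-learn (ConfusionMatrixDisplay needs scikit-learn 0.22 or later) looks like this:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
ConfusionMatrixDisplay(cm, display_labels=['genuine', 'fraud']).plot()
plt.show()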
A shallow learning approach, such as a random forest, often does better than deep learning on relatively simple problems. The reason is that simple relationships in low-dimensional data can be hard for a deep learning model to learn, because it has to fit many parameters just right in order to match a simple function.
As we will see in later chapters of this book, as soon as relationships do get more complex, deep learning gets to shine.
XGBoost
XGBoost stands for eXtreme Gradient Boosting. The idea behind gradient boosting is to train a decision tree, and then to train a second decision tree on the errors that the first decision tree made.
Through this method, many decision trees can be added in sequence, each one slowly reducing the model's remaining error. XGBoost is a popular library that implements gradient boosting very efficiently.
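To make the "train on the errors of the previous tree" idea concrete, here is a minimal hand-rolled sketch of the regression form of boosting with two shallow scikit-learn trees. XGBoost itself is far more sophisticated, adding regularization, second-order gradient information, and a clever tree-construction algorithm:

from sklearn.tree import DecisionTreeRegressor

# Stage 1: fit a small tree to the fraud labels treated as a numeric target
tree1 = DecisionTreeRegressor(max_depth=3)
tree1.fit(X_train, y_train)
residuals = y_train - tree1.predict(X_train)

# Stage 2: fit a second tree to the errors (residuals) of the first
tree2 = DecisionTreeRegressor(max_depth=3)
tree2.fit(X_train, residuals)

# The boosted prediction adds the second tree's correction, scaled by a
# learning rate so that each new tree only nudges the model a little
learning_rate = 0.1
boosted_pred = tree1.predict(X_test) + learning_rate * tree2.predict(X_test)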
Note
XGBoost is installed on Kaggle kernels by default. If you are running these examples locally, see the XGBoost manual for installation instructions and more information: http://xgboost.readthedocs.io/.
Gradient boosting classifiers can be created and trained just like random forests from sklearn, as can be seen in the following code:
import xgboost as xgb

booster = xgb.XGBClassifier(n_jobs=-1)
booster = booster.fit(X_train, y_train)
y_pred = booster.predict(X_test)
f1_score(y_pred=y_pred, y_true=y_test)
out: 0.85572959604286891
The gradient booster performs at almost the same level as the random forest on this task. A common approach is to take both a random forest and a gradient booster and average their predictions in order to get an even better model.
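A simple way to implement this kind of averaging is to blend the predicted fraud probabilities of the two models we have already trained and then threshold the blend; the 0.5 threshold below is just an illustrative choice:

from sklearn.metrics import f1_score

# Average the predicted fraud probabilities of the two models
rf_proba = rf.predict_proba(X_test)[:, 1]
xgb_proba = booster.predict_proba(X_test)[:, 1]
blend_proba = (rf_proba + xgb_proba) / 2

# Threshold the blended probability to get hard class predictions
blend_pred = (blend_proba > 0.5).astype(int)
print(f1_score(y_pred=blend_pred, y_true=y_test))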
The bulk of machine learning jobs in business today are done on relatively simple structured data. The methods we have just learned, random forests and gradient boosting, are therefore the standard tools that most practitioners use in the real world.
In most enterprise machine learning applications, value creation does not come from carefully tweaking a model or coming up with cool architectures, but from massaging data and creating good features. However, as tasks get more complex and more semantic understanding of unstructured data is needed, these tools begin to fail.