Random forests
Now, let's move from a single decision tree to a random forest. If you wanted to guess who the next President will be, how would you go about it? Consider the different kinds of questions we might ask:
- How many candidates are there? Who are they?
- Who is the current President?
- How are they performing?
- Which party do they belong to?
- Is there any current movement against that party?
- In how many states does the party have a realistic chance of winning?
- Is the candidate the incumbent President?
- What are the major voting issues?
Many such questions come to mind, and we attach different weights or levels of importance to each of them.
Each person's answers to the preceding questions may differ. There are many factors to take into account, so each person's guess is likely to be different: everyone approaches these questions with a different background and level of knowledge, and may interpret them differently.
So there is a good chance of high variance in the answers. If we collect the predictions made by many different individuals and then average them out, we get the idea behind a random forest.
A random forest combines many decision trees into a single model. Individually, the predictions made by decision trees (or humans) may not be accurate, but when combined, the predictions will be closer to the mark, on average.
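We can see this averaging effect with a minimal simulation. The sketch below assumes each "tree" is an independent voter that is individually right only 65% of the time; the numbers (101 trees, 65% accuracy) are illustrative, not taken from any real dataset:

```python
import random

random.seed(0)

def majority_vote_accuracy(n_trees, p_correct, n_trials=10_000):
    """Fraction of trials in which a majority of independent 'trees',
    each individually right with probability p_correct, votes correctly."""
    wins = 0
    for _ in range(n_trials):
        correct_votes = sum(random.random() < p_correct for _ in range(n_trees))
        if correct_votes > n_trees / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(1, 0.65)    # one mediocre predictor
forest = majority_vote_accuracy(101, 0.65)  # majority vote of 101 of them
print(f"single tree: {single:.2f}, 101-tree vote: {forest:.2f}")
```

Even though every individual voter is only modestly accurate, the majority vote is right almost every time. The caveat, which motivates the rest of this section, is that the voters must be reasonably independent of one another.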
The following diagram will help us understand the voting prediction using the random forest algorithm:
The following diagram gives a flowchart view of the previous diagram:
Let's look at why a random forest is better than a decision tree:
- A random forest is a combination of many decision trees, so many different viewpoints contribute to the final prediction.
- A single decision tree bases its prediction on limited information. In a random forest, with many trees involved, more information is considered, and that information is more diverse.
- A random forest is less prone to the bias a single decision tree can exhibit, since it does not depend on a single source.
Why the name random forest? Just as people rely on different sources to make a prediction, each decision tree in the forest considers a random subset of the features when forming its questions, and only has access to a random sample of the training data points. This increases diversity in the forest, leading to more robust overall predictions, and hence the name random forest.
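The two sources of randomness can be sketched in a few lines. This is a toy illustration, not a full implementation: the dataset and feature values are invented, and a real library would draw a fresh feature subset at every split inside every tree:

```python
import random

random.seed(1)

# Hypothetical training set: each row is (feature_vector, label).
data = [([x, x % 3, x % 5], x % 2) for x in range(20)]
n_features = 3

def bootstrap_sample(rows):
    """Draw len(rows) rows with replacement: each tree trains on a
    different random sample of the data (some rows repeat, some are left out)."""
    return [random.choice(rows) for _ in rows]

def random_feature_subset(n_features, k):
    """Pick k feature indices at random: each split only gets to
    ask questions about this subset."""
    return random.sample(range(n_features), k)

sample = bootstrap_sample(data)               # data seen by one tree
features = random_feature_subset(n_features, 2)  # features one split may use
print(len(sample), sorted(features))
```

Because every tree sees a different sample and a different slice of the features, the trees make different mistakes, and averaging their votes cancels much of the error out.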