Tree-based Methods (II): Bagging, Random Forests, and Boosting
Reference: An Introduction to Statistical Learning
All pictures shared in this post are from the reference.
An ensemble method aggregates a collection of sub-models, mainly to reduce the variance of the predictions. The tree-based ensemble methods covered here are bagging, random forests, and boosting (including XGBoost).
The basics of tree-based methods are presented in Tree-based Methods (I): The Basics of Decision Trees.
Bagging
Bagging is short for bootstrap aggregation. The bootstrap is a resampling technique that generates many training sets by sampling with replacement from a single sample, and bagging uses it to reduce the variance of a model.
We generate B bootstrapped training sets, train a separate model on each, and then aggregate the B predictions.
When bagging regression trees, we simply average the B predicted values.
When bagging classification trees, we take the majority vote across the B trees, so the most commonly predicted class becomes the outcome.
Note that the number of bootstrapped training sets/trees, B, is not a tuning parameter that causes overfitting: a very large B will not lead to overfitting, and B = 100 is often sufficient for good performance.
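As a minimal sketch of bagged regression trees (my own illustration using scikit-learn's BaggingRegressor and a synthetic dataset, not code from the book):
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Synthetic data purely for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# B = 100 bootstrapped trees; the base estimator defaults to a decision tree,
# and the B predictions are averaged for regression
bagger = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=0)
bagger.fit(X, y)
print(bagger.predict(X[:5]))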
Random Forests
Unlike bagging, where trees differ only in the bootstrapped “data/observations” they are grown on, random forests also vary the “predictors”: at each split, only a random subset of the predictors is considered.
Typically, each split considers m predictors out of the full set of p predictors, with m ≈ √p being a common choice.
The difference between bagging and random forests is clearest when a data set contains one very strong predictor. With bagging, most bootstrapped trees will use this predictor at the top node, so the trees are highly correlated, and averaging highly correlated trees does not give a substantial reduction in variance. Random forests avoid this by forcing each split to consider only a subset of the predictors, many of which do not include the strong predictor, and this decorrelates the trees.
Random forests derive predictions for regression and classification in the same way as bagging: averaging and majority vote, respectively.
In Figure 2, the random forest with m = √p outperforms the other two choices of m.
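As a rough sketch of how different choices of m could be compared (the specific values of m and the synthetic data below are my own illustrative choices, not the book's example), m corresponds to max_features in scikit-learn:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=25, n_informative=5, random_state=0)

# Compare m = p (equivalent to bagging), m = p/2, and m = sqrt(p)
for label, m in [("m = p", None), ("m = p/2", 0.5), ("m = sqrt(p)", "sqrt")]:
    clf = RandomForestClassifier(n_estimators=100, max_features=m, random_state=0)
    print(label, cross_val_score(clf, X, y, cv=5).mean())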
Take classification trees as an example: with the Python sklearn package, feature importances can be obtained using the following code.
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and inspect how much each predictor contributes
clf = RandomForestClassifier(criterion='gini')
clf.fit(X, y)
print(clf.feature_importances_)
Boosting
Boosting grows trees sequentially: each new tree is fit using information from the trees grown before it. Because each tree adds only a small improvement to the current model, boosting learns slowly; when the trees are fit to the gradient of a loss function (the residuals, in the case of squared-error loss), the approach is called gradient boosting.
Take regression trees as an example. We first fit a decision tree to the original dataset and compute its residuals (the loss of this tree). A second tree is then fit to those residuals rather than to the outcome, that is, to a modified version of the original dataset that encodes what the first tree missed, and the process repeats.
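A minimal sketch of this sequential residual-fitting (my own illustration for squared-error loss; the shrinkage value and tree depth are arbitrary choices):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data purely for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

B = 100          # number of trees
shrinkage = 0.1  # each tree contributes only a small, shrunken update

prediction = np.zeros_like(y, dtype=float)
residuals = y.astype(float)

for _ in range(B):
    # Fit a small tree to the current residuals rather than to the outcome y
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    update = shrinkage * tree.predict(X)
    prediction += update   # add the shrunken tree to the running model
    residuals -= update    # the next tree focuses on what is still unexplained

print("Training MSE:", np.mean((y - prediction) ** 2))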
In a boosting method, the tuning parameters include B (the number of trees; unlike in bagging, a B that is too large can overfit), the shrinkage parameter that controls how slowly the model learns, and the number of nodes (splits) in each tree.
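In scikit-learn these tuning parameters map onto GradientBoostingRegressor roughly as follows (a sketch with arbitrary values, not recommendations):
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

boost = GradientBoostingRegressor(
    n_estimators=100,    # B: the number of trees
    learning_rate=0.1,   # shrinkage parameter controlling how slowly boosting learns
    max_depth=2,         # limits the size (number of splits) of each tree
    random_state=0,
)
boost.fit(X, y)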
eXtreme Gradient Boosting (XGBoost)
XGBoost is a boosting approach that adds regularization when fitting each tree, which makes it less prone to overfitting. In the XGBoost package (Python package see here), the lambda value of the L2 regularization term is set to 1 by default, so when you use XGBoost you already penalize the leaf weights out of the box.
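For example (a sketch; everything other than the default reg_lambda=1 is an arbitrary choice of mine, and the xgboost package must be installed separately):
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

xgb = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=2,
    reg_lambda=1.0,   # L2 regularization on the leaf weights; 1.0 is already the default
    random_state=0,
)
xgb.fit(X, y)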
This is a summary of part of Chapter 8 of the book “An Introduction to Statistical Learning”. I hope you enjoy the article and the book.