Tree-based Methods (I): The Basics of Decision Trees

YaLinChen (Amber)
8 min read · Jan 2, 2022


Reference: An Introduction to Statistical Learning
All figures shared in this post are from the reference.


Decision trees are non-parametric methods that can be applied to both classification and regression problems. A tree can be visualized, together with its feature importances, to help us understand the decision process.

A simulated example of hypertension prediction

Figure 1. An example decision tree with classification outcomes

Let’s say we want to predict hypertension risk for individuals. The tree starts from a node (Obesity) that splits the data into YES and NO. The YES branch terminates because these people are predicted to have hypertension. People in the NO branch are further split on whether they have stress. Those who have stress are predicted YES, and those who don’t are split again on whether they are over 50 years of age. Finally, those over 50 are predicted YES and those who are not are predicted NO.
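To make the decision process concrete, here is a minimal sketch of the same logic written as plain Python conditionals; the predictor names and the age threshold simply mirror the toy tree in Figure 1 and are not a fitted model.

def predict_hypertension(obesity: bool, stress: bool, age: int) -> str:
    # mirrors the toy tree in Figure 1 (illustrative only, not a fitted model)
    if obesity:       # first split: Obesity
        return "YES"
    if stress:        # second split: Stress
        return "YES"
    if age > 50:      # third split: over 50 years of age
        return "YES"
    return "NO"

print(predict_hypertension(obesity=False, stress=False, age=62))  # YES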

The above example is not scientifically valid, but it is very intuitive. It works the way we make decisions every day: we contemplate our situation step by step and eventually reach a conclusion through a series of conditions (like the nodes in a tree).

Tree analogy

A tree has internal nodes that split the data on their way down to terminal nodes, or leaves (the regions Rm). The segments within the tree, such as the one that takes the NO side of Obesity and is further split by Stress, are called branches.

Classification Trees

Prediction

In a classification tree, each terminal node predicts the most commonly occurring class among the training observations in its region. The performance of a node can therefore be judged from the class proportions of the observations in that region.
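As a quick sketch of this rule, the label of a terminal node is simply the majority class of the training observations that fall into it; the labels below are made up for illustration.

from collections import Counter

# hypothetical training labels that fall into one terminal node (region)
leaf_labels = ["YES", "YES", "NO", "YES", "NO", "YES", "YES"]

counts = Counter(leaf_labels)              # class frequencies in the region
prediction = counts.most_common(1)[0][0]   # the most commonly occurring class
proportions = {c: n / len(leaf_labels) for c, n in counts.items()}

print(prediction)   # YES
print(proportions)  # roughly {'YES': 0.71, 'NO': 0.29}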

Loss calculation

The error rate of a classification tree is called “classification error rate”, meaning the “fraction of the observations in the region that do not belong to the most common class”.

Figure 2. Classification error rate: E = 1 - max_k(pmk), where pmk represents the proportion of training observations in the mth region that are from the kth class, and the maximum is taken over the K classes.
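A minimal sketch of this formula, assuming the class proportions pmk for one region are already computed:

def classification_error(p_mk):
    # E = 1 - max_k(p_mk): the fraction of observations not in the most common class
    return 1.0 - max(p_mk)

# e.g. a region where 75% of observations are YES and 25% are NO
print(classification_error([0.75, 0.25]))  # 0.25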

It turns out that the classification error rate is not sensitive enough for tree-growing. Two other measures are in fact preferable.

Gini index

Figure 3. Gini index: G = Σ_k pmk(1 - pmk), summed over the K classes

The Gini index is a measure of total variance across the K classes. Each term of the sum is comparable to the variance of a binomial distribution, which equals np(1-p), where n is the number of observations and p is the probability of a predefined outcome; membership in class k is itself a binary (yes/no) outcome, so the binomial analogy applies to each class.

From formula 8.6, the Gini index takes on a small value when all the pmk are close to 0 or 1 (when pmk ~ 0, the term is ~0; when pmk ~ 1, 1-pmk ~ 0, so the term is again ~0), meaning that the observations in the node belong almost entirely to a single class; such a node is called a “pure node”.
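A minimal sketch of the Gini index for one region, again assuming the class proportions pmk are already computed:

def gini_index(p_mk):
    # G = sum_k p_mk * (1 - p_mk): small when the node is nearly pure
    return sum(p * (1.0 - p) for p in p_mk)

print(gini_index([0.5, 0.5]))    # 0.5 (maximally impure for two classes)
print(gini_index([0.99, 0.01]))  # ~0.02 (a nearly pure node)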

Entropy

Figure 4. Entropy: D = -Σ_k pmk log(pmk), summed over the K classes

An alternative to the Gini index is entropy. Similarly, entropy approaches a small value when all the pmk are close to 0 or 1.
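And a matching sketch for entropy; the natural log is used here, and a different base (2 is also common) would only rescale the values.

from math import log

def entropy(p_mk):
    # D = -sum_k p_mk * log(p_mk); terms with p_mk = 0 contribute nothing
    return -sum(p * log(p) for p in p_mk if p > 0)

print(entropy([0.5, 0.5]))    # ~0.69 (maximally impure for two classes)
print(entropy([0.99, 0.01]))  # ~0.06 (a nearly pure node)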

The Gini index and entropy are referred to as measures of “node impurity”. They are better than the classification error rate at capturing node purity because they take all of the class proportions into account, whereas the classification error rate only looks at max(pmk), the single most common class, when scoring a node.

Figure 5. An example of node purity

For the RestECG<1 node, both splits lead to a YES prediction: 7/11 observations in the left leaf are Yes and 9/9 in the right leaf are Yes, and each leaf is labeled Yes because Yes is the majority class in it. The split therefore does not reduce the classification error rate at all; the parent node (16 Yes out of 20) misclassifies 4 observations, and so do the two leaves combined (4 in the left leaf, 0 in the right). The Gini index and entropy do decrease, however, because the right leaf is perfectly pure, and this increased node purity is exactly what these measures reward.
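A quick numeric check of this example, using the counts quoted above (7/11 Yes in the left leaf, 9/9 Yes in the right leaf) and the impurity functions sketched earlier, repeated here so the snippet is self-contained:

def classification_error(p_mk):
    return 1.0 - max(p_mk)

def gini_index(p_mk):
    return sum(p * (1.0 - p) for p in p_mk)

# parent node: 16 Yes / 4 No out of 20 observations
parent = [16 / 20, 4 / 20]
# children after the RestECG<1 split: left leaf 7 Yes / 4 No, right leaf 9 Yes / 0 No
left, right = [7 / 11, 4 / 11], [9 / 9, 0 / 9]

# weight each leaf by the number of observations it holds (11 and 9)
err_after = (11 * classification_error(left) + 9 * classification_error(right)) / 20
gini_after = (11 * gini_index(left) + 9 * gini_index(right)) / 20

print(classification_error(parent), err_after)  # ~0.2 vs ~0.2: no reduction in error rate
print(gini_index(parent), gini_after)           # ~0.32 vs ~0.25: impurity drops, nodes get purer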

(See my previous post Sensitivity, specificity, type I error, type II error, and? for a typical machine learning binary outcome evaluation)

Regression Trees

Prediction

Like a classification tree, a regression tree assigns predicted outcomes to observations. But instead of assigning a yes/no label based on the class proportions of the leaf/region, it assigns the mean response of the region. For the Hitters data in Figure 6, Years and Hits are the two splitting variables that produce three prediction regions for Salary.

Figure 6. A regression tree
Figure 7. The three-region partition for the tree in Figure 6

The process of developing such a tree involves two steps:

  1. Define a set of possible split values (e.g. Hits < 117.5) that divide the training observations into distinct, non-overlapping regions.
  2. Make the prediction for each region (here, Salary) using the mean response of the training data in that region.

The “possible values” are the cutpoints we use to partition the data. As in a classification tree, we choose each partition according to how much model loss (error) the split removes.
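Here is a minimal sketch of those two steps for a Hitters-like toy dataset; the cutpoints (Years < 4.5, Hits < 117.5) come from the tree in Figure 6, but the data rows themselves are invented for illustration.

# toy (Years, Hits, Salary) rows, invented for illustration
data = [(2, 80, 200), (3, 150, 300), (6, 90, 500), (10, 140, 900), (7, 170, 1100)]

def region(years, hits):
    # step 1: assign an observation to one of the three regions in Figure 7
    if years < 4.5:
        return "R1"
    return "R2" if hits < 117.5 else "R3"

# step 2: the prediction for each region is the mean Salary of its training points
groups = {}
for years, hits, salary in data:
    groups.setdefault(region(years, hits), []).append(salary)
predictions = {r: sum(s) / len(s) for r, s in groups.items()}

print(predictions)                  # {'R1': 250.0, 'R2': 500.0, 'R3': 1000.0}
print(predictions[region(8, 160)])  # a new player with 8 years and 160 hits falls in R3 -> 1000.0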

Loss calculation

A regression tree minimizes the model loss measured by the residual sum of squares (RSS), a classical concept from linear regression (see my previous post Linear model selection methods — subset selection for details about RSS).

The goal of a regression tree is to find the combination of regions that minimizes the RSS in formula 8.1.

Figure 8. RSS of a regression tree: RSS = Σ_j Σ_{i∈Rj} (yi - ŷRj)², where ŷRj is the mean response of the training observations in region Rj
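Continuing the toy example above, the RSS simply adds up, region by region, the squared differences between each training response and its region mean; a sketch:

def rss(salaries_by_region):
    # RSS = sum over regions of sum_i (y_i - regional mean)^2
    total = 0.0
    for salaries in salaries_by_region.values():
        mean = sum(salaries) / len(salaries)
        total += sum((y - mean) ** 2 for y in salaries)
    return total

# salaries grouped by region, as produced in the previous sketch
print(rss({"R1": [200, 300], "R2": [500], "R3": [900, 1100]}))  # 25000.0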

How does a regression tree differ from linear regression?

Linear regression is a parametric method that comes with several assumptions, such as normally distributed errors. In exchange for that rigidity, it has lower variance than a tree. A regression tree, on the other hand, is a non-parametric approach: no distributional assumptions are needed and it is easy to visualize, yet it suffers from higher variance. Which method to use really depends on the true decision boundary of the data.

Figure 9. When the true decision boundary is linear (top row), linear regression is preferred over a regression tree. When the boundary is non-linear (bottom row), a regression tree is better than linear regression.

Ways to grow a tree

Recursive binary splitting

Figure 1, Figure 5, and Figure 6 illustrate a top-down, greedy method of growing a tree, known as recursive binary splitting. Top-down means the tree begins at the top and successively splits the data afterwards; it is greedy because the model makes the best split at the current node rather than considering how this split would contribute to the splits that follow.
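A minimal sketch of what one greedy step looks like: at the current node, scan every feature and every candidate cutpoint and keep the split with the lowest weighted impurity (the Gini index here); the tiny dataset is invented for illustration.

def gini(labels):
    # Gini index of a set of class labels
    counts = {c: labels.count(c) for c in set(labels)}
    return sum((n / len(labels)) * (1 - n / len(labels)) for n in counts.values())

def best_split(X, y):
    # one greedy step: the (feature, cutpoint) pair with the lowest weighted Gini
    best = None
    for j in range(len(X[0])):                     # every feature
        for cut in sorted({row[j] for row in X}):  # every candidate cutpoint
            left = [y[i] for i, row in enumerate(X) if row[j] < cut]
            right = [y[i] for i, row in enumerate(X) if row[j] >= cut]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, cut)
    return best  # (weighted Gini, feature index, cutpoint)

# invented toy data: two features, binary outcome
X = [[1, 10], [2, 20], [3, 30], [4, 40]]
y = ["NO", "NO", "YES", "YES"]
print(best_split(X, y))  # (0.0, 0, 3): split on feature 0 at cutpoint 3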

As a result, recursive binary splitting can fit the training data very well through a complex structure and hence overfit, performing worse on test data.

A smaller tree by a top-down greedy approach

So you may think: we could simply grow a smaller tree, for example by limiting its depth, to avoid the overly complex structure. This option is worth considering, trading lower variance for higher bias (compared to the true response).
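In sklearn this amounts to capping parameters such as max_depth or min_samples_leaf; a minimal sketch on a built-in dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# a deliberately small tree: lower variance, potentially higher bias
small_clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
small_clf.fit(X, y)
print(small_clf.get_depth(), small_clf.get_n_leaves())  # depth and leaf count of the fitted tree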

Pruning

Probably a better way is to grow a very large tree by recursive binary splitting and then prune it back to obtain, as the final model, the subtree that gives the lowest test error. Test error is typically estimated via validation or cross-validation. However, it would be computationally expensive to estimate the test error of every possible subtree!

Cost complexity pruning

Instead of considering every possible subtree, we use a tuning parameter (⍺) that penalizes the number of terminal nodes (|T|) of the tree.

Figure 10. Cost complexity pruning: for each ⍺, minimize Σ_m Σ_{i: xi∈Rm} (yi - ŷRm)² + ⍺|T| over subtrees T, where the outer sum runs over the |T| terminal nodes

For each value of ⍺, a subtree T is determined: ⍺ controls the tradeoff between how well the subtree fits the training data (RSS) and the complexity of the tree (|T|). The idea is similar to model regularization.

By default in Python sklearn, only the top-down greedy approach is used and no pruning is applied. To use cost complexity pruning (sklearn reference), one first computes the pruning path and then refits with a chosen ccp_alpha:

clf = DecisionTreeClassifier(criterion='gini')
path = clf.cost_complexity_pruning_path(X, y)  # effective alphas and the total leaf impurities they produce
# refit with one of the effective alphas to obtain the pruned tree
pruned = DecisionTreeClassifier(criterion='gini', ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
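The path only lists the candidate values of ⍺; which one to keep is usually decided by cross-validation, for example with a grid search over path.ccp_alphas. A self-contained sketch on a built-in dataset standing in for X and y:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the training data

# candidate alphas from the pruning path, then pick one by 5-fold cross-validation
ccp_alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"ccp_alpha": list(ccp_alphas)},
                      cv=5)
search.fit(X, y)
print(search.best_params_)  # the alpha with the best cross-validated accuracy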

Feature importance

People are usually interested in feature importances because they tell us the priority of the decision steps (which features are considered first). In a top-down approach, each split is chosen to give the lowest model loss (e.g. Gini index for a classification tree and RSS for a regression tree). Nodes closer to the top are more influential because they split more of the data in two, whereas lower nodes only split a subset of the data; higher nodes are therefore more important.

Recall the Hitters data and the tree in Figure 6. We can interpret the tree as follows: Years is the most important feature for predicting the outcome Salary. For those who have ≥ 4.5 years of experience, Hits is the second most important feature for determining Salary. On the other hand, for those whose experience is < 4.5 years, the number of Hits does not matter much for the Salary prediction, so the branch ends in a terminal node.

In the Python package sklearn, DecisionTreeClassifier and DecisionTreeRegressor can be used. In DecisionTreeClassifier, the default split criterion is 'gini', and another option is 'entropy'. In DecisionTreeRegressor, the default criterion is 'squared_error', and other options include 'friedman_mse', 'absolute_error', and 'poisson'.

# X: training input, y: training output
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)
print(clf.feature_importances_)

# DecisionTreeRegressor
reg = DecisionTreeRegressor(criterion='squared_error')
reg.fit(X, y)
print(reg.feature_importances_)
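To connect the importances back to feature names, and to draw the fitted tree itself, something along these lines works (a sketch on a built-in dataset; plot_tree needs matplotlib installed):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

data = load_iris()
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0).fit(data.data, data.target)

# pair each importance with its feature name
print(dict(zip(data.feature_names, clf.feature_importances_)))

# draw the fitted tree with its splits, class counts, and Gini at each node
plot_tree(clf, feature_names=data.feature_names, class_names=list(data.target_names), filled=True)
plt.show()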

Pros and cons of decision trees

Pros

Trees mimic how humans think. Therefore, they are easy to visualize and easy to explain to people.

From a machine learning perspective, trees can handle qualitative predictors without the need to create dummy variables (which a linear model requires).

Cons

Unfortunately, trees usually do not perform as well as other nonlinear prediction models, because a single tree suffers from high variance: a small change in the data or in a split can yield very different predictions.

The disadvantages of a single decision tree can typically be overcome by aggregating trees, as in bagging, random forests, and boosting. These ensemble methods are covered in the next post.

This post summarizes part of Chapter 8 of “An Introduction to Statistical Learning”. I hope you enjoy the article and the book.
