Decision Tree
A decision tree is a non-parametric model that predicts an outcome by recursively partitioning the predictor space.
Description
A tree is composed of a root node, usually several split nodes (also called internal nodes), and at least two terminal nodes (also called leaf nodes).
Each terminal node is associated with a single outcome or prediction that applies to every observation reaching it.
Every root and split node has a split. Splits can be formulated for
- binary decisions
- discrete decisions (e.g., x1 in {1, 2, 3})
- continuous decisions (e.g., x1 <= s)
Each split can be thought of as partitioning the space into two disjoint regions, R1 and R2.
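As a minimal sketch (the class and field names are illustrative, not from any particular library), a tree can be represented as nodes that either carry a terminal prediction or a split rule routing observations to R1 or R2:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    # Terminal (leaf) nodes carry a prediction and no children.
    prediction: Optional[Any] = None
    # Split nodes carry a rule that routes observations to R1 or R2.
    feature: Optional[int] = None      # index of the predictor used in the split
    threshold: Optional[float] = None  # continuous rule: x[feature] <= threshold goes to R1
    left: Optional["Node"] = None      # R1
    right: Optional["Node"] = None     # R2

def predict_one(node: Node, x) -> Any:
    # Route a single observation down the tree until a terminal node is reached.
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction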
Estimation
To estimate a tree model, a loss function must be specified so that candidate splits can be evaluated. Given this minimization goal, the full set of possible splits can be evaluated exhaustively to arrive at the best single split.
This can be iterated: split R1 and R2 independently.
Note that this is a greedy algorithm: it does not look ahead to deeper splits, but only optimizes the current split.
The choice of loss function is primarily guided by the type of outcome, as discussed for regression and classification trees below.
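As a minimal sketch of the exhaustive search, assuming a single continuous predictor and a pluggable loss function (the function names are illustrative):

import numpy as np

def squared_error(y):
    # Squared-error loss of a region around its own mean prediction.
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y, loss):
    # Evaluate every candidate threshold on one predictor and keep the one
    # that minimizes the combined loss of the two regions it creates.
    best_s, best_loss = None, np.inf
    for s in np.unique(x)[:-1]:            # the largest value would leave R2 empty
        mask = x <= s                      # R1: observations at or below the threshold
        total = loss(y[mask]) + loss(y[~mask])
        if total < best_loss:
            best_s, best_loss = s, total
    return best_s, best_loss

For example, best_split(x, y, squared_error) returns the single best threshold; applying the same search to R1 and R2 in turn iterates the procedure greedily.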
Number of Possible Splits
Binary predictors have only 1 possible split.
Categorical predictors have 2^(C-1) - 1 possible splits given C classes. Note that this holds true for the binary case as well.
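A short enumeration of the count, assuming a split sends one subset of categories to R1 and the complement to R2 (illustrative code; fixing one category on the R1 side avoids double-counting mirror-image splits):

from itertools import combinations

def categorical_splits(categories):
    # Enumerate the 2^(C-1) - 1 distinct two-way partitions of C categories.
    first, rest = categories[0], categories[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            if len(left) < len(categories):   # exclude the split that leaves R2 empty
                splits.append(left)
    return splits

print(len(categorical_splits(["A", "B"])))        # 1 == 2**(2-1) - 1 (binary case)
print(len(categorical_splits(["A", "B", "C"])))   # 3 == 2**(3-1) - 1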
The introduction of a loss function has an important consequence for continuous space splits. Theoretically there are infinitely many possible splits, but loss is calculated from observations, so it can only be sensitive to the observed values. This reduces the effective number of possible splits to N.
Furthermore, the 'first' such split (which places all observations into R2, leaving R1 empty) is disregarded because it is fundamentally not informative.
By convention, continuous space splits are placed at the midpoints (averages) between adjacent observed values.
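A sketch of candidate-threshold generation under this convention (illustrative, not a specific library's API):

import numpy as np

def candidate_thresholds(x):
    # Midpoints between adjacent distinct observed values: at most N - 1 candidates.
    v = np.unique(x)                 # sorted distinct values
    return (v[:-1] + v[1:]) / 2.0    # averages of neighbouring values

print(candidate_thresholds(np.array([1.0, 2.0, 2.0, 5.0])))  # [1.5 3.5]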
Depth
If allowed to split fully, the tree will stop only when all observations are uniquely classified and the training loss is trivially 0. This is almost necessarily overfitting. Limits are generally given as a maximum tree depth.
In certain contexts, especially random forests, deep trees are considered allowable because overfitting is mitigated by other controls, such as averaging predictions across many trees.
Alternatively, the CART method applies backward pruning: after building a deep tree, the bottom splits that contribute least to the minimization of loss are pruned iteratively.
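A sketch of how a maximum-depth limit enters the recursion, for a single continuous predictor and a continuous outcome (this reuses the Node and best_split sketches above and is illustrative only):

import numpy as np

def grow(x, y, loss, depth=0, max_depth=3):
    # Recursively split R1 and R2, stopping at the depth limit or when the node is pure.
    if depth == max_depth or len(np.unique(y)) == 1:
        return Node(prediction=y.mean())               # terminal node
    s, _ = best_split(x, y, loss)
    if s is None:                                      # no informative split is available
        return Node(prediction=y.mean())
    mask = x <= s
    return Node(feature=0, threshold=s,
                left=grow(x[mask], y[mask], loss, depth + 1, max_depth),
                right=grow(x[~mask], y[~mask], loss, depth + 1, max_depth))

Raising max_depth lets the tree grow deeper, as a random forest would allow; a pruning approach such as CART's instead grows deep and then removes the least useful bottom splits.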
Regression Trees
Regression trees deal with continuous predictions. A terminal node carries a prediction p̂ equal to the average of the observed outcomes in that node.
Loss functions such as squared error are appropriate.
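A small illustration of the node prediction and its squared-error loss (the numbers are made up):

import numpy as np

y = np.array([2.0, 3.0, 7.0])        # outcomes falling in one terminal node
p_hat = y.mean()                     # node prediction: 4.0
sse = np.sum((y - p_hat) ** 2)       # squared-error loss of the node: 14.0
print(p_hat, sse)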
Classification Trees
Classification trees deal with discrete class predictions; the true values are called labels. Classification trees can produce either hard or soft predictions.
The simplest case is a binary class prediction. The soft prediction is p̂, the proportion of observed 1s in the terminal node. The hard prediction applies a threshold to p̂ to select either 0 or 1. The most intuitive threshold is 0.5, a simple majority, but any threshold is possible.
This scales neatly for multiple classes. The soft prediction is the probability distribution across classes (such as: (p̂A = 0.6, p̂B = 0.3, p̂C = 0.1) for three classes: A, B, or C). The hard prediction is the most probable class using the distribution.
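A sketch of soft and hard predictions from the class counts in a terminal node (the counts are illustrative and chosen to match the example above):

import numpy as np

labels = np.array(["A"] * 6 + ["B"] * 3 + ["C"] * 1)   # labels observed in one terminal node
classes, counts = np.unique(labels, return_counts=True)
soft = dict(zip(classes, counts / counts.sum()))       # {'A': 0.6, 'B': 0.3, 'C': 0.1}
hard = classes[np.argmax(counts)]                      # 'A', the most probable class
print(soft, hard)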
Note that, given a binary class prediction where 1 is the correct label, the error rate is 1 - p̂. This can be abstracted as a summation across C classes and N observations as:

error rate = 1 - (1/N) * Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c}

where y_{i,c} is an indicator function for whether the class c was the correct label and, if so, whether the label was correctly predicted for observation i.
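A small check of this abstraction with hard predictions (the data are made up; the double sum collapses to counting correctly predicted observations):

import numpy as np

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])
# Summing y_{i,c} over classes and observations counts the correct predictions.
correct = np.sum(y_true == y_pred)
error_rate = 1 - correct / len(y_true)   # 0.25
print(error_rate)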
Given a classification problem, splits are evaluated on the purity of the nodes they create. Take a trivial example of two candidate classification trees with equivalent error rates where the second holds a clear predictive advantage: both candidates have an overall error rate of 25%, but the second produces a pure R2 (a node containing only one class).
Loss functions such as log loss or entropy are appropriate. These are applied to each node separately; a weighted average (weighted by the number of observations in each node) then gives the overall score for the split.
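A sketch of this scoring applied to the purity example above, with made-up counts chosen so that both candidate splits have a 25% error rate but only the second produces a pure R2 (the specific counts are an assumption for illustration):

import numpy as np

def entropy(counts):
    # Entropy of a node given its class counts.
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log2(p)))

def weighted_loss(nodes, node_loss):
    # Average the node losses, weighted by the number of observations in each node.
    sizes = np.array([sum(c) for c in nodes], dtype=float)
    return float(np.sum(sizes / sizes.sum() * np.array([node_loss(c) for c in nodes])))

# 800 observations, 400 per class; each tuple is (count of class 0, count of class 1).
candidate_1 = [(300, 100), (100, 300)]   # both nodes impure; error rate 200/800 = 0.25
candidate_2 = [(200, 400), (200, 0)]     # R2 is pure;        error rate 200/800 = 0.25
print(weighted_loss(candidate_1, entropy))   # ~0.81
print(weighted_loss(candidate_2, entropy))   # ~0.69, so the second split is preferred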