Decision Tree

A decision tree is a non-parametric model for regression or classification.


Description

A tree is composed of a root node, usually several split nodes or internal nodes, and at least two terminal nodes or leaf nodes.

Each terminal node is associated with a uniform outcome or prediction.

Every root and split node has a split. A split is formulated over a single dimension (variable) at a chosen level.

Each split can be thought of as partitioning the space into two disjoint regions, R1 and R2.
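
A minimal sketch of a single split, assuming a small NumPy feature matrix (the data and the choice of dimension and level are hypothetical):

    import numpy as np

    # Hypothetical data: n observations across k dimensions.
    X = np.array([[1.0, 5.0],
                  [2.0, 3.0],
                  [4.0, 8.0],
                  [7.0, 1.0]])

    j, s = 0, 3.0          # split on dimension j at level s
    mask = X[:, j] <= s    # the split as a predicate

    R1 = X[mask]           # region where the dimension is at or below the level
    R2 = X[~mask]          # region where the dimension is above the level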


Estimation

To estimate a tree model, a loss function must be given to guide the choice of splits. That is, which dimension (variable) should be split on, and where should the split be placed?

The choice of loss function is primarily guided by the type of outcome.

Regression Trees

Regression trees deal with continuous predictions.

Loss functions such as squared error are appropriate.

Given a continuous dimension, a split can be theoretically placed anywhere. However, the loss function is only sensitive at n thresholds: the observations themselves. Furthermore, the 'first' split (which places all observations into R2, leaving R1 empty) can be disregarded because it is fundamentally not informative.

In other words, given n observations and k dimensions, to derive the best single-split regression tree model, calculate the loss for (n-1)k possibilities and select the minimum.

By convention, because the actual split level only matters in relation to the observations, splits are placed halfway between adjacent observed values.
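
A minimal sketch of this exhaustive search, assuming NumPy arrays X (n observations by k dimensions) and y (continuous outcomes); the function and variable names are illustrative rather than a reference implementation:

    import numpy as np

    def squared_error(y):
        """Squared error of a region around its mean prediction."""
        return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

    def best_single_split(X, y):
        """Evaluate all (n-1)k candidate splits and return the best."""
        n, k = X.shape
        best = (np.inf, None, None)          # (loss, dimension, level)
        for j in range(k):
            order = np.argsort(X[:, j])
            xj, yj = X[order, j], y[order]
            for i in range(n - 1):           # skips the uninformative 'first' split
                if xj[i] == xj[i + 1]:
                    continue                 # no threshold between tied values
                s = (xj[i] + xj[i + 1]) / 2  # halfway level by convention
                loss = squared_error(yj[:i + 1]) + squared_error(yj[i + 1:])
                if loss < best[0]:
                    best = (loss, j, s)
        return best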

This can be iterated: derive the best single split within R1 to serve as the first split node on the left, and do the same within R2 to serve as the first split node on the right.

If allowed to split indefinitely, all observations will eventually be uniquely classified and the loss will be trivially reduced to 0. This is almost necessarily overfitting. Limits are generally given as maximum tree depth.
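
Continuing the sketch above, recursive growth with a maximum-depth limit might look like this (best_single_split is the illustrative function sketched earlier):

    def grow_tree(X, y, depth=0, max_depth=3):
        """Recursively grow a regression tree up to max_depth."""
        if depth >= max_depth or len(y) <= 1:
            return {"prediction": y.mean()}      # terminal (leaf) node
        loss, j, s = best_single_split(X, y)
        if j is None:                            # no informative split remains
            return {"prediction": y.mean()}
        mask = X[:, j] <= s
        return {"dimension": j, "level": s,
                "left":  grow_tree(X[mask], y[mask], depth + 1, max_depth),
                "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth)}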

Alternatively, the CART method applies backward pruning. After building an extremely deep decision tree, the least meaningful splits (again as measured by the loss function) are pruned iteratively.
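
As a point of reference, scikit-learn's CART implementation exposes this style of pruning as minimal cost-complexity pruning; a minimal usage sketch on placeholder data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))              # placeholder features
    y = X[:, 0] + rng.normal(size=100)         # placeholder continuous outcome

    # Grow an extremely deep (unrestricted) tree first.
    deep_tree = DecisionTreeRegressor(max_depth=None).fit(X, y)

    # Candidate pruning strengths, from the full tree down to the bare root.
    path = deep_tree.cost_complexity_pruning_path(X, y)

    # A larger ccp_alpha prunes away more of the least meaningful splits.
    pruned_tree = DecisionTreeRegressor(ccp_alpha=path.ccp_alphas[-2]).fit(X, y)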

Classification Trees

Classification trees deal with discrete class predictions. They can support either hard or soft predictions.

Loss functions such as log loss or entropy are appropriate.

The simplest case is a binary prediction. In this case, the hard prediction would be 0 or 1, while the soft prediction would be some probability between 0 and 1.

The framework easily extends to multiple classes. The soft prediction is the probability distribution across classes (such as: (p̂A = 0.6, p̂B = 0.3, p̂C = 0.1) for three classes: A, B, or C). The hard prediction is the most probable class in the distribution (in this example, class A, because p̂A = 0.6 is the maximum).
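
A leaf's soft prediction can be read off as the class proportions among its observations, and the hard prediction as the most probable class; a minimal sketch using the three-class example above (the leaf contents are illustrative):

    from collections import Counter

    # Illustrative leaf holding 10 observations, matching the example above.
    leaf_labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1

    counts = Counter(leaf_labels)
    soft = {cls: n / len(leaf_labels) for cls, n in counts.items()}  # soft prediction
    hard = max(soft, key=soft.get)                                   # hard prediction

    print(soft)  # {'A': 0.6, 'B': 0.3, 'C': 0.1}
    print(hard)  # A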

Given a classification problem, splits are evaluated on the purity of their descendant nodes, not by overall loss. Consider a trivial example of two classification trees with equivalent overall loss, where the second tree holds a clear predictive advantage:

candidates.svg

Both candidates have an overall error rate of 25%, but the second candidate has a pure R2.
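
The figure shows the actual candidates; as a stand-in, the sketch below uses hypothetical class counts with the same properties (both candidates misclassify 2 of 8 observations, but only the second yields a pure R2) and scores them by size-weighted entropy:

    import numpy as np

    def entropy(counts):
        """Entropy of a node given its class counts."""
        p = np.array(counts, dtype=float)
        p = p[p > 0] / p.sum()
        return -np.sum(p * np.log2(p))

    def weighted_entropy(r1_counts, r2_counts):
        """Impurity of a split: entropy of each region, weighted by its size."""
        n1, n2 = sum(r1_counts), sum(r2_counts)
        return (n1 * entropy(r1_counts) + n2 * entropy(r2_counts)) / (n1 + n2)

    # Hypothetical candidates, each with a 25% overall error rate.
    candidate_1 = ([3, 1], [3, 1])   # both regions are mixed
    candidate_2 = ([4, 2], [0, 2])   # R2 is pure

    print(weighted_entropy(*candidate_1))   # ~0.811
    print(weighted_entropy(*candidate_2))   # ~0.689, lower impurity, preferred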


CategoryRicottone