= Decision Tree =

A '''decision tree''' is a non-parametric model.

----

== Description ==

A tree is composed of a '''root node''', usually several '''split nodes''' or '''internal nodes''', and at least two '''terminal nodes''' or '''leaf nodes'''. Each terminal node is associated with a uniform outcome or prediction.

Every root and split node has a '''split'''. Splits can be formulated for

 * binary decisions
 * discrete decisions (e.g., ''x,,1,, in (1,2,3)'')
 * continuous decisions (e.g., ''x,,1,, <= s'')

Each split can be thought of as partitioning the space into two disjoint regions, ''R,,1,,'' and ''R,,2,,''.

----

== Estimation ==

To estimate a tree model, a '''loss function''' must also be specified to guide the calculation of splits. Given this minimization goal, the full set of possible splits can be evaluated to arrive at the best single split. This can be iterated: ''R,,1,,'' and ''R,,2,,'' are split independently.

Note that this is a greedy algorithm: it does not look ahead to deeper splits, it only optimizes the single split at hand.

The choice of loss function is primarily guided by the outcome.

=== Number of Possible Splits ===

Binary predictors have only 1 possible split. Categorical predictors have ''2^C-1^-1'' possible splits given ''C'' classes. Note that this holds true for the binary case as well.

The introduction of a loss function has an important consequence for continuous space splits. ''Theoretically'' there are infinitely many possible splits, but loss is calculated from observations, so it can only be sensitive to the observed values. This reduces the effective number of possible splits to ''N''. Furthermore, the 'first' such split (which places all observations into ''R,,2,,'', leaving ''R,,1,,'' empty) is disregarded because it is fundamentally uninformative, leaving ''N-1'' effective splits. By convention, continuous space splits are placed at the midpoints between adjacent observed values.

=== Depth ===

If allowed to split fully, the tree will only terminate once all observations are uniquely classified, at which point loss is trivially 0. This is almost necessarily overfitting. Limits are generally given as a maximum tree depth.

In certain contexts, especially [[Statistics/RandomForest|random forests]], deep trees are considered allowable because overfitting is mitigated by other controls.

Alternatively, the '''CART method''' applies backward pruning. After building a deep tree, the bottom splits that contribute least to the minimization of loss are iteratively pruned.

=== Regression Trees ===

'''Regression trees''' deal with continuous predictions. A node carries a prediction ''p̂'': the average of the outcomes observed in that node. Loss functions such as squared error are appropriate.

=== Classification Trees ===

'''Classification trees''' deal with discrete class predictions. True values are called '''labels'''. They can support either hard or soft predictions.

The simplest case is a binary class prediction. The soft prediction is ''p̂'', the proportion of observed 1s in the terminal node. The hard prediction applies a threshold to ''p̂'' to select either 0 or 1. The most intuitive threshold is 0.5, a simple majority, but any threshold is possible.

This scales neatly to multiple classes. The soft prediction is the probability distribution across classes (such as ''(p̂,,A,, = 0.6, p̂,,B,, = 0.3, p̂,,C,, = 0.1)'' for three classes: A, B, or C). The hard prediction is the most probable class in the distribution.
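As a small illustration, here is a minimal Python sketch of deriving the soft and hard predictions from the labels observed in a single terminal node. The labels and function names are hypothetical, chosen to mirror the three-class example above.

{{{
from collections import Counter

def soft_prediction(labels):
    # Class proportions among the labels observed in a terminal node
    counts = Counter(labels)
    n = len(labels)
    return {cls: count / n for cls, count in counts.items()}

def hard_prediction(labels):
    # The most probable class according to the soft prediction
    probs = soft_prediction(labels)
    return max(probs, key=probs.get)

# A hypothetical terminal node holding ten observations across classes A, B, and C
leaf = ["A"] * 6 + ["B"] * 3 + ["C"]

probs = soft_prediction(leaf)    # {'A': 0.6, 'B': 0.3, 'C': 0.1}
label = hard_prediction(leaf)    # 'A'

# Within this node, the hard prediction misclassifies 1 - 0.6 = 40% of the observations
error = 1 - probs[label]
}}}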
Note that, given a binary class prediction where 1 is the correct label, the error rate is ''1-p̂''. This can be abstracted as a summation across ''C'' classes and ''N'' observations as:

{{attachment:errorrate.svg}}

where ''y,,i,c,,'' is an indicator function for whether class ''c'' was the correct label for observation ''i'' and, if so, whether that label was correctly predicted.

Given a classification problem, splits are evaluated on the '''purity''' of the created nodes. Take a trivial example of two classification trees with equivalent error rates, where the second tree nonetheless holds a clear predictive advantage:

{{attachment:candidates.svg||width=400}}

Both candidates have an overall error rate of 25%, but the second candidate has a pure ''R,,2,,''.

Loss functions such as log loss or entropy are appropriate. These are applied to each node separately; a weighted average (weighted by the number of observations in each node) is then used for overall scoring.

----

CategoryRicottone