How a Classification Tree Creates Rectangles in Predictor Space

- Criterion for choosing which predictor to split on
- A predictor can be continuous, binary, or categorical
- The analysis in this post applies to the first two; categorical predictors are handled in a separate blog post
- For a continuous predictor we sort its values first and take the midpoints between consecutive values as candidate split points
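
The midpoint rule above can be sketched in a few lines; the function name is illustrative, not from any library:

```python
def candidate_splits(values):
    """Candidate split points for a continuous predictor:
    midpoints between consecutive sorted unique values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

# Four distinct observed values yield three candidate midpoints.
print(candidate_splits([4.0, 1.0, 2.0, 3.0]))  # [1.5, 2.5, 3.5]
```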

- Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest child nodes
- Two popular criteria are Gini and entropy
- Note that these criteria are not directly used to select the predictor to split on
- Rather, we take the difference between their values before and after splitting
- While computing Gini/entropy after splitting, we weight each child node by its share of the samples

## Entropy

- Formula:
- H = ∑ -p * log(p)

- About:
- Measures randomness
- Measures the information carried
- Highest when p = 0.5 (for two classes)
- Higher entropy => higher randomness => carries more information
- After splitting we want entropy to reduce

- Initial:
- n0 positive and m0 negative samples
- g0 = (n0 + m0) total samples

- After Splitting
- Group 1:
- n1 positive
- m1 negative
- g1 = (n1 + m1) total

- Group 2:
- n2 positive
- m2 negative
- g2 = (n2 + m2) total

- Before entropy:
- H_before = -(n0/g0) log(n0/g0) - (m0/g0) log(m0/g0)

- After entropy:
- H1 = -(n1/g1) log(n1/g1) - (m1/g1) log(m1/g1)
- H2 = -(n2/g2) log(n2/g2) - (m2/g2) log(m2/g2)
- H_after = (g1/g0) * H1 + (g2/g0) * H2

- diff = H_before - H_after
- Select the predictor for which diff is highest and split on it
- This means we are selecting the predictor that reduces randomness the most
- Equivalently, we are selecting the predictor whose split yields the highest information gain
- So we are selecting the variable that carries the most information
- i.e., the more important feature
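
The entropy calculation above, in the same n/m/g notation, can be sketched as follows (function names are illustrative):

```python
from math import log2

def entropy(n, m):
    """Binary entropy of a node with n positive and m negative samples."""
    total = n + m
    h = 0.0
    for count in (n, m):
        p = count / total
        if p > 0:  # by convention 0 * log(0) = 0
            h -= p * log2(p)
    return h

def information_gain(n0, m0, n1, m1, n2, m2):
    """H_before minus the weighted H_after of the two child groups."""
    g0, g1, g2 = n0 + m0, n1 + m1, n2 + m2
    h_after = (g1 / g0) * entropy(n1, m1) + (g2 / g0) * entropy(n2, m2)
    return entropy(n0, m0) - h_after

# A perfectly separating split removes all randomness: gain equals H_before.
print(information_gain(5, 5, 5, 0, 0, 5))  # 1.0
```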

## Gini

- Gini for 2 classes:
- G = p1 * (1 - p1) + p2 * (1 - p2)
- Using p1 + p2 = 1 we can derive that G = 2 * p1 * p2

- Gini in general:
- G = ∑ p * (1 - p) = 1 - ∑ p²

- Like entropy, Gini is maximum when p1 = p2 = 0.5
- In this case the node is least pure

- Gini is a measure of impurity, which we want to reduce by splitting on a predictor
- diff = G_before - G_after
- We select the predictor for which diff is highest
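
The Gini computation mirrors the entropy one; here is a minimal sketch in the same n/m/g notation (function names are illustrative):

```python
def gini(n, m):
    """Gini impurity of a node with n positive and m negative samples."""
    total = n + m
    p1, p2 = n / total, m / total
    return 1.0 - (p1 ** 2 + p2 ** 2)  # equals 2 * p1 * p2 for two classes

def gini_gain(n0, m0, n1, m1, n2, m2):
    """G_before minus the weighted G_after of the two child groups."""
    g0, g1, g2 = n0 + m0, n1 + m1, n2 + m2
    g_after = (g1 / g0) * gini(n1, m1) + (g2 / g0) * gini(n2, m2)
    return gini(n0, m0) - g_after

print(gini(5, 5))                   # 0.5, the maximum for two classes
print(gini_gain(5, 5, 5, 0, 0, 5))  # 0.5, a pure split removes all impurity
```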

## ROC Curve

- An ROC curve is easy to construct for a classifier that outputs a ranking or probability score, since we can vary the decision threshold.

- A decision tree, in contrast, outputs a label.
- However, to get an ROC curve we can use a workaround: each leaf node contains some positive and negative samples, so we can use the leaf's fraction of positives (# positive / # total) as a probability score.
- Then we can vary the threshold over these scores and plot the ROC curve.
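
The workaround above can be sketched as a threshold sweep over per-sample leaf probabilities; this is a minimal illustration, not an efficient ROC implementation:

```python
def roc_points(scores, labels):
    """Sweep thresholds over the scores and return (FPR, TPR) pairs.
    scores: each sample's leaf probability (# positive / # total in its leaf).
    labels: 1 for positive, 0 for negative."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Two positives scored high and two negatives scored low trace out
# a curve that reaches (0, 1), i.e. perfect separation.
print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```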

## Regularization

Regularization in decision trees will be explained in detail in a separate blog post, but here is a list of candidate techniques.

- Limit the max. depth of the tree
- Cost-complexity pruning
- Ensembles / bag more than just one tree
- Set stricter stopping criteria for when to split a node further (e.g. minimum gain, minimum number of samples, etc.)
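
The stopping-criteria idea from the list above can be sketched as a pre-pruning check; the function and its threshold defaults are hypothetical, not from any library:

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=5, min_samples_split=20, min_gain=1e-3):
    """Hypothetical pre-pruning check: stop splitting a node when any
    stopping criterion is met. Threshold values are illustrative."""
    return (depth >= max_depth
            or n_samples < min_samples_split
            or best_gain < min_gain)

print(should_stop(depth=5, n_samples=100, best_gain=0.2))  # True: max depth hit
print(should_stop(depth=2, n_samples=100, best_gain=0.2))  # False: keep splitting
```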

## Reference

*Applied Predictive Modeling* by Max Kuhn and Kjell Johnson