## Classification Trees

How classification Tree creates Rectangle in Predictor Space

• Criterion to choose which predictor to Split on
• Predictor can be continuous, binary, categorical
• Analysis in this post applies to first two, categorical predictor is handled in separate blog post
• In case of continuous predictor we sort it first and choose the midpoint to split on
• Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest child nodes
• Two Popular Criterion are Gini and Entropy
• Note this criterion are not directed used to select predictor to split on
• Rather we take a difference of them before and after splitting
• While computing gini/entropy after splitting we apply weight to both splits

## Entropy

•  Formula:
• H = ∑ -p * log(p)
• Randomness
• Information Carries
• Highest when p=0.5
• Higher Entropy => Higher Randomness => Carries more information
• After splitting we want entropy to reduce
• Initial :
• n0 positive and m0 negative sample
• g0 = (n+m) total samples
• After Splitting
• Group 1:
• n1 positive
• m1 negative
• g1 = (n1 + m1) total
• Group 2:
• n2 positive
• m2 negative
• g2 = (n2 + m2) total
• Before Entropy :
• H_before =  -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
• After Entropy :
• H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
• H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
• H_after = (g1/g0) * H1 + (g2/g0)*H2
• diff = H_before – H_after
• Select a predictor for which diff is highest and split on it
• Which means we are selecting a predictor which reduces randomness more
• Which also mean we are selecting a predictor which reduces information more
• So we are selecting a variable which carries more information
• More important feature

## Gini

• Gini for 2 class :
• G = p1 * (1-p1) + p2 * (1 – p2)
• Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
• Gini in general:
• G = ∑ p * (1 – p) = 1 – ∑ p²
• Like entropy gini is maximum when p1 = p2 = 0.5
• In this case nodes are least pure
• Gini is a measure of impurity which we want to reduce by splitting on predictor
• diff = G_before – G_after
• We will select a predictor for which diff is highest

## ROC Curve

• ROC curve can be constructed easily for classifier which outputs ranking.
• For example decision tree outputs probability where we can change the threshold.
• On the contrary decision tree outputs label
• However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
• Then we can change the threshold and plot ROC.

## Regularization

Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

1. limit max. depth of trees
2. Cost complexity pruning
3. ensembles / bag more than just 1 tree
4. set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)

## Reference

Applied predictive modeling by Max Kuhn and Kjell Johnson

## Classification – One vs Rest and One vs One

In the blog post on Cost Function And Hypothesis for LR we noted that LR (Logistic Regression) inherently models binary classification. Here we will describe two approaches used to extend it for multiclass classification.

One vs Rest approach takes one class as positive and rest all as negative and trains the classifier. So for the data having n-classes it trains n classifiers. Now in the classification phase the n-classifier predicts probability of particular class and class with highest probability is selected.

One vs One considers each binary pair of classes and trains classifier on subset of data containing those classes. So it trains total n*(n-1)/2 classes. During the classification phases each classifier predicts one class. (This is contrast to one vs rest where each classifier predicts probability). And the class which has been predicted most is the answer.

### Example

For example consider four class problem having classes A, B, C, and D.

One vs Rest

• Models classifiers_A, classifier_B, classifier_C and classifier_D
• During prediction here is the probability we get:
• classifier_A = 40%
• classifier_B = 30%
• classifier_C = 60%
• classifier_D = 50%
• We assign it class B

One vs One

• We train total six classifier with subset of data containing classes involved
• classifier_AB
• classifier_AC
• classifier_BC
• classifier_BD
• classifier_CD
• And during classification
• classifier_AB assigns class A
• classifier_AC assigns class A
• classifier_AD assigns class A
• classifier_BC assigns class B
• classifier_BD assigns class D
• classifier_CD assigns class C
• We assign it to class A

### More notes

• One vs rest trains less no of classifier and hence is faster overall and hence is usually prefered
• Single classifier in one vs one uses subset of data, so single classifier is faster for one vs one
• One vs one is less prone to imbalance in dataset (dominance of particular classes)

### Inconsistency

• What if two class gets equal vote in the case of one vs one case
• What if probability are almost close to equal in case of one vs rest
• We will discuss this issue in further blog posts