Maximum margin classifiers
 Also known as the optimal separating hyperplane
 The margin is the distance between the hyperplane and the closest training data point
 We want to select the hyperplane for which this distance is maximum
 Once we identify the optimal separating hyperplane, there can be several equidistant training points at the shortest distance from the hyperplane
 Such points are called support vectors
 These points support the hyperplane in the sense that if they are moved slightly, the optimal hyperplane will also move
Training
 Equation 9.10 ensures that the left-hand side of equation 9.11 gives the perpendicular distance from the hyperplane (both equations are sketched below)
 The formulation assumes y takes only two values, (+1) and (-1)
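 For reference, a sketch of the maximal margin optimization problem as formulated in ISL (equations 9.9-9.11), where M denotes the margin:

    \underset{\beta_0, \beta_1, \ldots, \beta_p, M}{\text{maximize}} \; M
    \text{subject to} \;\; \sum_{j=1}^{p} \beta_j^2 = 1,
    y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M \quad \text{for all } i = 1, \ldots, n

 The normalization constraint is what makes the left-hand side of the last inequality equal to the perpendicular distance of point i from the hyperplane.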
Support vector classifier
 The maximum margin classifier does not work when a separating hyperplane does not exist
 The support vector classifier relaxes the optimization problem so that it still works in that case
 Unlike the maximum margin classifier, it is also less prone to overfitting
 The former is very sensitive to a change in a single observation
 Also known as the soft margin classifier
 The slack variable \epsilon_i allows the i-th training point to be on the wrong side of the margin (see the formulation sketched after this list)
 If \epsilon_i > 0 the point is on the wrong side of the margin
 If \epsilon_i > 1 the point is on the wrong side of the hyperplane
 The parameter C is a budget that constrains the total amount of slack, and hence how many points are allowed on the wrong side of the margin or hyperplane
 C is selected with cross-validation and controls the bias-variance trade-off
 Points that lie directly on the margin, or on the wrong side of the margin for their class, are called support vectors
 Only these points affect the choice of hyperplane
 This is the property that makes the classifier robust to outliers far from the hyperplane
 LDA, by contrast, depends on the mean of all the observations in each class, so it is sensitive to outliers
 Logistic regression, however, is also less sensitive to outliers, similar to the support vector classifier
 Computational note – when we solve the above optimization problem with Lagrange multipliers, the solution turns out to depend on the training samples only through their dot (inner) products
 This will be very important when we discuss the support vector machine in the next section
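 For reference, a sketch of the support vector classifier optimization problem as formulated in ISL, showing how the slack variables \epsilon_i and the budget C enter:

    \underset{\beta_0, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M}{\text{maximize}} \; M
    \text{subject to} \;\; \sum_{j=1}^{p} \beta_j^2 = 1,
    y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M (1 - \epsilon_i),
    \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C

 A larger budget C tolerates more (or larger) margin violations, giving a wider margin with more support vectors and lower variance but higher bias.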
Support vector machine
 The above two classifiers do not work well when the desired decision boundary is not linear
 One solution is to create polynomial features (as we generally do for LR)
 But the fundamental problem with this approach is deciding how many and which terms to create
 Creating a large number of features also raises computational problems

 For the SVM, the fact that the optimization involves only dot products of the observations allows us to perform the kernel trick
 A kernel acts as a similarity function between two observations
 The kernel forms listed below make it clear that we are not explicitly calculating (or storing) the higher-order polynomial features, yet we still get their benefit
 The first is the linear kernel, the second is the polynomial kernel, and the last is the radial kernel (see the sketch after this list)
 This video shows a visualization of the kernel trick
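 For reference, the three kernels referred to above, as described in ISL: the linear kernel, the polynomial kernel of degree d, and the radial kernel with parameter \gamma:

    K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}
    K(x_i, x_{i'}) = \left( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \right)^d
    K(x_i, x_{i'}) = \exp \left( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)

 Replacing every dot product in the solution with K gives a nonlinear decision boundary without ever constructing the enlarged feature space.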
Multiclass SVM
 One-versus-one and one-versus-all (also called one-versus-rest) approaches are used for multiclass classification with SVMs (see the sketch below)
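 Below is a minimal, illustrative sketch (not from the original notes) using scikit-learn: an SVM with a radial kernel on the 3-class iris data, with C and gamma chosen by cross-validation. The dataset, parameter grid, and train/test split are assumptions for demonstration; scikit-learn's SVC handles more than two classes with a one-versus-one scheme internally.

    # Illustrative sketch: RBF-kernel SVM on a 3-class problem, with the
    # budget parameter C and kernel parameter gamma tuned by cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    # Candidate values for C (slack budget) and gamma (radial kernel width);
    # the grid itself is an arbitrary choice for this example.
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

    # SVC uses one-versus-one internally for multiclass data,
    # so no extra wrapper is needed here.
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)

    print("best parameters:", search.best_params_)
    print("test accuracy:", search.best_estimator_.score(X_test, y_test))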
References
 An Introduction to Statistical Learning – http://wwwbcf.usc.edu/~gareth/ISL/