Models

# Support Vector Machines

## Maximum margin classifiers

• Also known as the optimal separating hyperplane
• The margin is the distance between the hyperplane and the closest training data point
• We want to select the hyperplane for which this distance is maximal
• Once the optimal separating hyperplane is identified, several training points may be equidistant from it at this shortest distance
• Such points are called support vectors
• These points support the hyperplane in the sense that if they were moved slightly, the optimal hyperplane would move as well
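
The definitions above can be sketched numerically. The snippet below is a minimal illustration, not a training procedure: the six 2-D points and the hyperplane x1 + x2 - 3 = 0 are made up for this example, and the helper names `signed_distance` and `margin_and_support_vectors` are hypothetical. For this toy data the chosen hyperplane does attain the largest possible margin, because a margin can never exceed half the distance between the closest pair of opposite-class points.

```python
import math

# Hypothetical toy data: (x1, x2) points with labels +1 / -1.
points = [(2.0, 2.0), (3.0, 3.0), (4.0, 1.0), (1.0, 1.0), (0.0, 0.0), (0.0, 2.0)]
labels = [1, 1, 1, -1, -1, -1]

# Hand-picked hyperplane w.x + b = 0 (an assumption, not a fitted model).
w = (1.0, 1.0)
b = -3.0


def signed_distance(x, w, b):
    """Perpendicular distance from x to the hyperplane; the sign gives the side."""
    norm = math.hypot(*w)
    return (w[0] * x[0] + w[1] * x[1] + b) / norm


def margin_and_support_vectors(points, labels, w, b, tol=1e-9):
    """Return the margin (smallest distance) and the points that attain it."""
    # y * signed distance is positive when the point is on the correct side.
    dists = [y * signed_distance(x, w, b) for x, y in zip(points, labels)]
    margin = min(dists)
    support = [x for x, d in zip(points, dists) if abs(d - margin) <= tol]
    return margin, support


margin, support = margin_and_support_vectors(points, labels, w, b)
print(f"margin = {margin:.4f}")
print("support vectors:", support)
```

Three points tie for the shortest distance here, one from the positive class and two from the negative class. Moving any of them toward the hyperplane would shrink the margin, while moving the other points slightly would change nothing.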

### Training

• Equation 9.10 ensures that the left-hand side of equation 9.11 gives the perpendicular distance from the hyperplane
• The equations assume y takes two values, +1 and -1
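
The equations referenced above are not reproduced in these notes. Assuming they follow the standard maximal margin formulation, with coefficients \beta_0, ..., \beta_p, margin M, and n training observations x_i with labels y_i, the problem would read as below. All symbols, and the mapping to equation numbers, are assumptions.

```latex
\begin{aligned}
& \underset{\beta_0, \beta_1, \ldots, \beta_p,\, M}{\text{maximize}} \quad M \\
& \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, && \text{(assumed 9.10)} \\
& y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M
  \quad \text{for all } i = 1, \ldots, n. && \text{(assumed 9.11)}
\end{aligned}
```

With the coefficients normalized to unit length, the left-hand side of the last constraint is the perpendicular distance from x_i to the hyperplane whenever the point is on the correct side, so the constraint says every training point is at least M away from the hyperplane.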

## Support Vector classifier

• The maximum margin classifier does not work when a separating hyperplane does not exist
• The support vector classifier relaxes the optimization constraints so that a solution still exists
• Unlike the maximum margin classifier, it is also less prone to overfitting
• The former is very sensitive to a change in a single observation
• Also known as the soft margin classifier
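
The relaxed problem is not written out in these notes. A sketch of the usual soft margin formulation follows, assuming the same unit-norm coefficients \beta_0, ..., \beta_p and margin M as in the maximum margin problem, plus one slack variable \epsilon_i per observation and a nonnegative budget C. The notation is an assumption.

```latex
\begin{aligned}
& \underset{\beta_0, \ldots, \beta_p,\, \epsilon_1, \ldots, \epsilon_n,\, M}{\text{maximize}} \quad M \\
& \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \\
& y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M (1 - \epsilon_i)
  \quad \text{for all } i = 1, \ldots, n, \\
& \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.
\end{aligned}
```

Setting C = 0 forces every \epsilon_i to zero and recovers the maximum margin problem.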

• The slack variable \epsilon allows a training point to be on the wrong side of the margin
• If \epsilon > 0, the point is on the wrong side of the margin
• If \epsilon > 1, the point is on the wrong side of the hyperplane
• The parameter C is a budget that constrains the total slack, and therefore how many points may be on the wrong side of the hyperplane
• C is selected by cross-validation and controls the bias-variance trade-off
• Points that lie directly on the margin, or on the wrong side of the margin for their class, are called support vectors
• Only these points affect the choice of hyperplane
• This property makes the classifier robust to observations that lie far from the hyperplane
• LDA, by contrast, depends on the mean of all the observations in each class
• Logistic regression (LR), however, is less sensitive to outliers
• Computational note: when the optimization problem above is solved with Lagrange multipliers, the solution turns out to depend on the training samples only through their dot products
• This becomes very important for the support vector machine in the next section
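
The slack-variable rules above can be sketched as follows. This is a minimal illustration assuming the constraint y_i f(x_i) >= M(1 - \epsilon_i) with unit-norm coefficients. The hyperplane, margin width `M`, budget `C`, data, and the helper names `slack` and `describe` are all made up for the example rather than fitted.

```python
import math

# Hypothetical unit-norm hyperplane f(x) = w.x + b and margin width M.
w = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
b = -3.0 / math.sqrt(2.0)
M = 1.0
C = 2.5  # hypothetical budget on the total slack

points = [(4.0, 3.0), (2.5, 1.5), (1.0, 1.0), (0.0, 0.0)]
labels = [1, 1, 1, -1]


def slack(x, y, w, b, M):
    """Smallest epsilon >= 0 satisfying y * f(x) >= M * (1 - epsilon)."""
    f = w[0] * x[0] + w[1] * x[1] + b
    return max(0.0, 1.0 - y * f / M)


def describe(eps):
    """Translate a slack value into the position of the point."""
    if eps > 1:
        return "wrong side of hyperplane"
    if eps > 0:
        return "wrong side of margin"
    return "correct side of margin"


slacks = [slack(x, y, w, b, M) for x, y in zip(points, labels)]
for x, y, eps in zip(points, labels, slacks):
    print(x, y, f"epsilon = {eps:.3f}", describe(eps))

print("total slack within budget:", sum(slacks) <= C)
```

Note that in this formulation a larger C tolerates more violations. Some libraries, scikit-learn's `SVC` among them, use a parameter also called C as a penalty on violations, so there a larger value tolerates fewer.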

## Support vector machine

• The two classifiers above do not work when the desired decision boundary is not linear
• One solution is to create polynomial features (as we generally do for LR)
• The fundamental problem with this approach is deciding how many terms, and which ones, to create
• Creating a large number of features also raises computational problems
• For the SVM, the fact that the solution involves only dot products of observations allows us to perform the kernel trick
• A kernel acts as a similarity function
• The equation above makes it clear that we do not compute (or store) the higher-order polynomial features, yet we still take advantage of them
• The second one is the polynomial kernel and the last one is the radial kernel
• This video shows a visualization of the kernel trick
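
The kernel formulas these notes refer to are not reproduced here. As a hedged sketch, the snippet below assumes the common forms: a linear kernel <x, z>, a polynomial kernel (1 + <x, z>)^d, and a radial kernel exp(-gamma * ||x - z||^2). It checks, for d = 2 in two dimensions, that the polynomial kernel equals the dot product of explicitly expanded features, which is the point of the kernel trick. The function names, the default `gamma`, and the feature map `phi` are illustrative choices, not taken from the source.

```python
import math


def dot(x, z):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, d=2):
    return (1.0 + dot(x, z)) ** d


def radial_kernel(x, z, gamma=0.5):
    # Similarity decays with squared Euclidean distance between x and z.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def phi(x):
    """Explicit degree-2 feature map for 2-D input matching (1 + <x, z>)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


x = (1.0, 2.0)
z = (3.0, -1.0)

implicit = polynomial_kernel(x, z, d=2)  # computed in the original 2-D space
explicit = dot(phi(x), phi(z))           # computed in the expanded 6-D space

print("kernel value:", implicit)
print("explicit feature dot product:", explicit)
print("radial similarity:", radial_kernel(x, z))
```

The kernel evaluation needs only one 2-D dot product, while the explicit route builds six features per point first. The gap widens quickly as the degree or the input dimension grows, and for the radial kernel the implicit feature space is infinite-dimensional, so only the kernel form is practical.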