Support Vector Machines

Maximum margin classifiers

  • Also known as optimal separating hyperplane
  • Margin is the distance between hyperplane and closest training data point
  • We want to select a hyperplane for which this distance is maximum
  • Once we identify optimal separating hyper plane there can be many equidistance training points with the shortest distance from hyperplane
  • Such point are called support vectors
    • These points support the hyperplane in a sense that if they are moved slightly optimal hyperplane will also move



  • Equation 9.10 ensures that left hand side of equation 9.11 gives perpendicular distance from the hyperplane
  • Equation assumes y has two values (+1) and (-1)


Support Vector classifier

  • Maximum margin classifier does not work when supporting hyperplane does not exist.
  • Support vector classifier relaxes optimization objective to get that work
  • Unlike maximum margin classifier this one is less prone to overfit as well
    • The formal one is very sensitive to change in single observation
  • Also know as soft margin classifier


  • \epsilon variable allows training point to be on wrong side of margin
    • If \epsilon > 0 it is on the wrong side of margin
    • If \epsilon > 1 it is on the wrong side of hyperplane
  • Parameter C is the budget that constraints how many points are allowed on wrong side of hyperplane
  • C is selected with cross validation and controls bias variance trade off
  • Point that lies directly on the margin or on the wrong side of margin for their class are called support vectors
    • Because these points affects the choice of hyperplane
    • And this is the property which makes it robust to outliers
      • LDA calculates mean of all the observation
      • However LR is less sensitivity to outliers
  • Computation note – when we try to solve above optimization problem with lagrange multiplier we found that it depends on dot product of training samples
    • This will be very important when we discuss support vector machine in next section



Support vector machine

  • Above two classifier does not work when desired decision boundary is not linear
  • One solution is to create polynomial features (as we generally do for LR)
  • But fundamental problem with this approach is that how many and which terms you should create
  • Also creating large number of feature raises computational problem
    • For the case of SVM, that fact that it involved only dot product of observation allows us to perform kernel trick.svm4svm5svm6svm7
    • Kernel acts as similarity function
    • Above equation makes it clear that we are not calculating(and storing) higher order polynomial still taking the advantage of it
    • Second one is polynomial kernel and last one is radial kernel
    • This video shows visualization of kernel trick


Multiclass SVM