Parametric and Nonparametric tests

Standard statistics books rarely mention nonparametric tests. However, there are some scenarios where they should be used instead of parametric tests. [1] has a very good blog post about this; what follows is just a summary of it.


Different Tests

The table below displays various tests. I have verified that all of these tests are available in the Python stats package.


When to Use Parametric Tests

  • Parametric tests can perform well with skewed and nonnormal distributions
    • This holds only when the sample size follows the guidance shown in the table below
  • Parametric tests can perform well when the spread/variance of each group is different
  • Parametric tests usually have more statistical power than nonparametric tests


Reasons to Use Nonparametric Tests

  • Your area of study is better represented by the median
    • Income distribution is skewed, so the median is more useful than the mean
    • A few billionaires can boost the mean significantly
  • You have a very small sample size
    • Even smaller than what is mentioned in the table above
  • You have ordinal data, ranked data, or outliers that you can’t remove




[1] :


Lagrange Multiplier and Constrained Optimization

Lagrange multipliers help us solve constrained optimization problems.

An example would be to maximize f(x, y) subject to the constraint g(x, y) = 0.

The geometric intuition is that at the points on the constraint curve g(x, y) = 0 where f attains a maximum or minimum, the gradients of f and g are parallel:

∇ f(x, y)  =  λ ∇  g(x, y)

We need three equations to solve for x, y and λ.

The gradient equation above gives two equations (one for the x component and one for the y component), and the third is g(x, y) = 0.

Solving them gives the candidate points where f is either a maximum or a minimum; we then evaluate f at each candidate to find the point of interest.

The Lagrangian is a function that wraps the above into a single equation.

L(x, y, λ) = f(x, y) – λ g(x, y)

And then we solve ∇ L(x, y, λ) = 0
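As a small worked sketch, take the assumed example f(x, y) = x·y with the constraint g(x, y) = x + y – 10 = 0 (this example is not from the text above). Setting ∇ L = 0 gives three linear equations, which the illustrative helper `solve_linear` below solves exactly.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with row swaps (exact arithmetic)."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot and swap it up.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot becomes 1.
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]

# L(x, y, lam) = x*y - lam*(x + y - 10); unknowns are ordered (x, y, lam).
coefficients = [
    [0, 1, -1],  # dL/dx   = y - lam = 0
    [1, 0, -1],  # dL/dy   = x - lam = 0
    [1, 1, 0],   # dL/dlam = 0  <=>  x + y = 10
]
rhs = [0, 0, 10]

x, y, lam = solve_linear(coefficients, rhs)
f_value = x * y
print(x, y, lam, f_value)
```

Here the only stationary point is x = y = 5 with λ = 5, where f = 25.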


Budget Intuition

Let f represent the revenue R and g represent cost C.

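Also let the constraint be g(x, y) = b, where b is the maximum cost that we can afford. The Lagrangian then becomes L(x, y, λ) = f(x, y) – λ (g(x, y) – b).

Let M* be the optimal value obtained by solving the Lagrangian, and let λ* be the corresponding value of λ. Then

dM*/db = λ*

So λ* is the rate at which the maximum revenue increases per unit change in b.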
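A quick numerical check of this interpretation, under an assumed toy problem (revenue R(x, y) = x·y, cost C(x, y) = x + y = b; neither is from the text above). For this problem the stationarity conditions give x = y = λ = b/2, so we can compare λ* with a finite-difference estimate of dM*/db.

```python
def solve_budget_problem(b):
    """Maximize R(x, y) = x * y subject to C(x, y) = x + y = b.

    Stationarity of the Lagrangian gives y = lam and x = lam,
    and the constraint then gives lam = b / 2.
    """
    lam = b / 2
    x = y = lam
    return x * y, lam  # (optimal revenue M*, multiplier lam*)

b = 10.0
h = 1e-6  # small change in the budget

m_star, lam_star = solve_budget_problem(b)
m_plus, _ = solve_budget_problem(b + h)
m_minus, _ = solve_budget_problem(b - h)

# Central finite difference for dM*/db.
dM_db = (m_plus - m_minus) / (2 * h)
print(lam_star, dM_db)
```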


Inequality Constraint


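Here is the problem of finding the maximum of f(x) subject to the inequality constraint g(x) ≥ 0 [2]. We maximize the Lagrangian

L(x, λ) = f(x) + λ g(x)

subject to the conditions

g(x) ≥ 0
λ ≥ 0
λ g(x) = 0

The above conditions are known as the Karush-Kuhn-Tucker (KKT) conditions.

First – primal feasibility

Second – dual feasibility

Third – complementary slackness

Unlike the equality case, the sign of λ matters here. For the problem of finding the minimum of f(x), we minimize the following with the same KKT conditions:

L(x, λ) = f(x) – λ g(x)
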

This can be extended to arbitrary number of constraints.
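A minimal sketch of checking the KKT conditions on a one-dimensional toy problem. It assumes the formulation in Bishop [2]: maximize f(x) subject to g(x) ≥ 0 with L(x, λ) = f(x) + λ g(x). The functions f(x) = –(x – 2)² and g(x) = 1 – x are made up for illustration.

```python
def f_prime(x):
    # Derivative of the assumed objective f(x) = -(x - 2)**2.
    return -2.0 * (x - 2.0)

def g(x):
    # Assumed constraint g(x) = 1 - x >= 0, i.e. x <= 1.
    return 1.0 - x

def g_prime(x):
    return -1.0

def satisfies_kkt(x, lam, tol=1e-9):
    stationary = abs(f_prime(x) + lam * g_prime(x)) <= tol  # dL/dx = 0
    primal = g(x) >= -tol                                   # primal feasibility
    dual = lam >= -tol                                      # dual feasibility
    slack = abs(lam * g(x)) <= tol                          # complementary slackness
    return stationary and primal and dual and slack

# Two candidates from the two cases of complementary slackness.
candidates = {
    "inactive constraint (lam = 0)": (2.0, 0.0),  # unconstrained maximum of f
    "active constraint (g = 0)": (1.0, 2.0),      # x on the boundary
}
results = {name: satisfies_kkt(x, lam) for name, (x, lam) in candidates.items()}
print(results)
```

The unconstrained maximum x = 2 violates g(x) ≥ 0, so the solution sits on the boundary x = 1 with λ = 2.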




  • This relates to solving the dual of a problem for convex functions, as we had in Boyd’s book
  • For unconstrained optimization, computing the Hessian of the function is one way of solving it, provided its second derivatives are continuous [1]






[2] Pattern Recognition and machine learning by Bishop


On multivariate Gaussian


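Formula for multivariate gaussian distribution

p(x; μ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( –(1/2) (x – μ)ᵀ Σ⁻¹ (x – μ) )

Formula of univariate gaussian distribution

p(x; μ, σ²) = 1 / (√(2π) σ) · exp( –(x – μ)² / (2σ²) )

  • There is a normalization constant in both equations
  • Σ being positive definite ensures that the quadratic bowl in the exponent opens downwards
  • σ² being positive likewise ensures that the parabola in the exponent opens downwards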


On Covariance Matrix

Definition of covariance between two random variables:

Cov[X, Y] = E[(X – E[X]) (Y – E[Y])]

When we have more than two variables we present the covariances in matrix form. So the covariance matrix will look like

Σ = E[(X – μ)(X – μ)ᵀ], with entries Σ_ij = Cov[X_i, X_j]

  • The formula of the multivariate gaussian distribution demands Σ to be non-singular (invertible) and symmetric positive semidefinite, which in turn means Σ will be symmetric positive definite.
  • For some data these demands might not be met



The following derivations are available at [0]:

  • We can prove [0] that when the covariance matrix is diagonal (i.e. the variables are independent) the multivariate gaussian density is simply the product of the univariate gaussian densities of each variable.
  • It was derived that the shape of the isocontours (figure 1) is elliptical and that each axis length is proportional to the standard deviation of the corresponding variable
  • The isocontours remain ellipses (rotated) when the covariance matrix is not diagonal, and become ellipsoids for dimension n>2
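The diagonal-covariance claim can be checked numerically. This is a small sketch assuming the standard density formulas for the univariate and bivariate gaussian; the function names and the test point are illustrative.

```python
import math

def univariate_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bivariate_pdf(x, mu, cov):
    """Density of a 2-D gaussian with mean mu and 2x2 covariance cov at x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of a 2x2 matrix
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    norm = (2 * math.pi) * math.sqrt(det)  # (2*pi)^(n/2) * |cov|^(1/2) with n = 2
    return math.exp(-0.5 * quad) / norm

x = (0.5, -1.0)
mu = (0.0, 1.0)
diagonal_cov = [[2.0, 0.0], [0.0, 0.5]]

joint_density = bivariate_pdf(x, mu, diagonal_cov)
product_density = univariate_pdf(x[0], mu[0], 2.0) * univariate_pdf(x[1], mu[1], 0.5)
print(joint_density, product_density)
```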


Linear Transformation Interpretation


This was proved in two steps [0]:

Step-1 : Factorizing covariance matrix


Step-2 : Change of variables, which we apply to density function









Quick Note on Probability Rules

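p(X, Y) is the joint distribution

p(X | Y) is the conditional distribution

p(X) is the marginal distribution (Y is marginalized out).


You cannot get the conditional distribution from the joint distribution by integration alone. Integration only gives a marginal; to get the conditional you also have to divide the joint by that marginal: p(X | Y) = p(X, Y) / p(Y).

There are just two rules of probability, the sum rule and the product rule. Bayes' theorem then follows from them.

Sum rule: p(X) = Σ_Y p(X, Y)

Product rule: p(X, Y) = p(Y | X) p(X)

Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)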




We might want to look at a table like the one below, calculate the joint and conditional distributions, and marginalize out one of the variables. [1]
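Such a table can be worked through in code. The joint probabilities below are hypothetical (they are not the table from [1]); the sketch computes both marginals with the sum rule, a conditional by dividing the joint by a marginal, and then checks Bayes' theorem.

```python
# Hypothetical joint distribution p(X, Y); the numbers are made up for illustration.
joint = {
    ("rain", "late"): 0.20, ("rain", "on time"): 0.10,
    ("dry", "late"): 0.07, ("dry", "on time"): 0.63,
}

def marginal(joint, axis):
    """Sum rule: add up the joint over the other variable."""
    totals = {}
    for key, p in joint.items():
        totals[key[axis]] = totals.get(key[axis], 0.0) + p
    return totals

p_x = marginal(joint, 0)  # p(X), with Y marginalized out
p_y = marginal(joint, 1)  # p(Y), with X marginalized out

def x_given_y(x, y):
    """Conditional p(X | Y) = p(X, Y) / p(Y)."""
    return joint[(x, y)] / p_y[y]

def y_given_x(y, x):
    """Conditional p(Y | X) = p(X, Y) / p(X)."""
    return joint[(x, y)] / p_x[x]

# Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)
via_bayes = x_given_y("rain", "late") * p_y["late"] / p_x["rain"]
direct = y_given_x("late", "rain")
print(p_x, p_y, via_bayes, direct)
```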



Further reading : 

[0] :



Support Vector Machines

Maximum margin classifiers

  • Also known as the optimal separating hyperplane
  • Margin is the distance between the hyperplane and the closest training data point
  • We want to select the hyperplane for which this distance is maximum
  • Once we identify the optimal separating hyperplane, there can be several training points that are equidistant from it at the shortest distance
  • Such points are called support vectors
    • These points support the hyperplane in the sense that if they are moved slightly, the optimal hyperplane will also move



  • Equation 9.10 ensures that the left hand side of equation 9.11 gives the perpendicular distance from the hyperplane
  • The equations assume y takes two values, (+1) and (-1)
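A minimal sketch of the distance computation, assuming equations 9.10 and 9.11 are the unit-norm constraint Σ β_j² = 1 and the margin constraint y_i (β_0 + β·x_i) ≥ M from An Introduction to Statistical Learning. The hyperplane and points are made up, and the hyperplane is only a candidate, not necessarily the optimal one.

```python
import math

def signed_distances(beta0, beta, points, labels):
    """y_i * (beta0 + beta . x_i) after rescaling so that ||beta|| = 1.

    With unit-norm beta this is the perpendicular distance of each point
    from the hyperplane, positive when the point is on the correct side.
    """
    norm = math.sqrt(sum(b * b for b in beta))
    b0 = beta0 / norm
    unit = [b / norm for b in beta]
    return [
        y * (b0 + sum(bj * xj for bj, xj in zip(unit, x)))
        for x, y in zip(points, labels)
    ]

# Made-up 2-D training points with labels in {+1, -1}.
points = [(3.0, 3.0), (5.0, 4.0), (0.0, 0.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]

# Candidate hyperplane x1 + x2 - 3.5 = 0 (not normalized).
distances = signed_distances(-3.5, [1.0, 1.0], points, labels)
margin = min(distances)
closest = [p for p, d in zip(points, distances) if abs(d - margin) < 1e-12]
print(margin, closest)
```

The points in `closest` are the ones that would act as support vectors for this hyperplane.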


Support Vector classifier

  • The maximum margin classifier does not work when a separating hyperplane does not exist.
  • The support vector classifier relaxes the optimization objective to make it work
  • Unlike the maximum margin classifier, it is also less prone to overfitting
    • The former is very sensitive to a change in a single observation
  • Also known as the soft margin classifier


  • The \epsilon variables allow a training point to be on the wrong side of the margin
    • If \epsilon > 0 it is on the wrong side of the margin
    • If \epsilon > 1 it is on the wrong side of the hyperplane
  • Parameter C is the budget that constrains how many points are allowed on the wrong side of the margin or hyperplane (it bounds the sum of the \epsilon values)
  • C is selected with cross validation and controls the bias variance trade off
  • Points that lie directly on the margin or on the wrong side of the margin for their class are called support vectors
    • Because only these points affect the choice of hyperplane
    • And this is the property which makes it robust to outliers that lie far from the hyperplane
      • LDA, in contrast, depends on the mean of all the observations
      • LR, like the support vector classifier, has low sensitivity to observations far from the decision boundary
  • Computation note – when we solve the above optimization problem with Lagrange multipliers we find that the solution depends only on dot products of training samples
    • This will be very important when we discuss the support vector machine in the next section
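The role of \epsilon can be sketched as follows, assuming the constraint has the form y_i f(x_i) ≥ M (1 – \epsilon_i) with unit-norm coefficients, so that the smallest feasible slack is max(0, 1 – distance / M). The distances and the margin M below are made-up values.

```python
def slack(signed_distance, margin):
    """Smallest epsilon satisfying signed_distance >= margin * (1 - epsilon)."""
    return max(0.0, 1.0 - signed_distance / margin)

def describe(eps):
    if eps == 0.0:
        return "correct side of margin"
    if eps <= 1.0:
        return "wrong side of margin"
    return "wrong side of hyperplane"

margin = 2.0
# Signed perpendicular distances y_i * f(x_i) of four hypothetical points.
distances = [3.0, 2.0, 1.0, -0.5]

slacks = [slack(d, margin) for d in distances]
descriptions = [describe(e) for e in slacks]

# Support vectors: points on the margin or violating it.
support_vector_indices = [
    i for i, (d, e) in enumerate(zip(distances, slacks))
    if e > 0.0 or abs(d - margin) < 1e-12
]
total_slack = sum(slacks)  # this sum must stay within the budget C
print(slacks, descriptions, support_vector_indices, total_slack)
```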



Support vector machine

  • The above two classifiers do not work when the desired decision boundary is not linear
  • One solution is to create polynomial features (as we generally do for LR)
  • But the fundamental problem with this approach is deciding how many and which terms to create
  • Also, creating a large number of features raises computational problems
    • For the case of SVM, the fact that it involves only dot products of observations allows us to perform the kernel trick
    • Kernel acts as a similarity function
    • The kernel form makes it clear that we do not calculate (or store) the higher order polynomial terms, yet still take advantage of them
    • Common kernels, for observations x and x' with components x_j and x'_j:
      • Linear kernel: K(x, x') = Σ_j x_j x'_j
      • Polynomial kernel of degree d: K(x, x') = (1 + Σ_j x_j x'_j)^d
      • Radial kernel: K(x, x') = exp(–γ Σ_j (x_j – x'_j)²)
    • This video shows a visualization of the kernel trick
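A small sketch of why the kernel trick avoids storing higher-order features. It assumes the degree-2 polynomial kernel K(x, z) = (1 + x·z)² on 2-D inputs, whose explicit feature map is (1, √2 x1, √2 x2, x1², √2 x1 x2, x2²). The radial kernel is included for comparison; the value of gamma is arbitrary.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, z, degree=2):
    """K(x, z) = (1 + x . z)^degree, computed in the original input space."""
    return (1.0 + dot(x, z)) ** degree

def explicit_features(x):
    """Degree-2 feature map for 2-D input that the kernel above corresponds to."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2]

def radial_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); largest when x and z coincide."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x = (1.0, 2.0)
z = (3.0, -1.0)

via_kernel = polynomial_kernel(x, z)                            # 2 inputs per point
via_features = dot(explicit_features(x), explicit_features(z))  # 6 features per point
print(via_kernel, via_features, radial_kernel(x, z))
```

Both routes give the same number, but the kernel never builds the six-dimensional feature vectors.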


Multiclass SVM



Comparing LDA and LR


Few Points:

  • LR models the probability with the logistic function
  • LDA models the probability with the multivariate gaussian function
  • LR finds the maximum likelihood solution
  • LDA finds the maximum a posteriori class using Bayes’ formula

When classes are well separated

When the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable. Coefficients may go to infinity. LDA doesn’t suffer from this problem.

LR gets unstable in the case of perfect separation

If there are covariate values that can predict the binary outcome perfectly then the algorithm of logistic regression, i.e. Fisher scoring, does not even converge.

If you are using R or SAS you will get a warning that probabilities of zero and one were computed and that the algorithm has crashed.

This is the extreme case of perfect separation but even if the data are only separated to a great degree and not perfectly, the maximum likelihood estimator might not exist and even if it does exist, the estimates are not reliable.

The resulting fit is not good at all.

Math behind LR


For example, suppose y = 0 for x = 0 and y = 1 for x = 1. To maximize the likelihood of the observed data, the “S”-shaped logistic regression curve has to model h(theta) as exactly 0 and 1 at these points. This drives \beta towards infinity, which causes the instability, because the logistic function lies between 0 and 1 and reaches those values only asymptotically.


Few terms:

Complete Separation – when x completely predicts both zero and 1

Quasi-Complete separation – when x completely predicts either 0 or 1
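The divergence can be seen with a few lines of gradient ascent. This is a sketch on made-up, completely separated data with a single slope coefficient and no intercept; the step size and iteration counts are arbitrary choices.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_slope(xs, ys, steps, learning_rate=0.5):
    """Gradient ascent on the log-likelihood of p(y = 1 | x) = sigmoid(beta * x)."""
    beta = 0.0
    for _ in range(steps):
        gradient = sum((y - sigmoid(beta * x)) * x for x, y in zip(xs, ys))
        beta += learning_rate * gradient
    return beta

# Completely separated data: every negative x has y = 0, every positive x has y = 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

betas = {steps: fit_slope(xs, ys, steps) for steps in (10, 100, 1000, 10000)}
for steps, beta in betas.items():
    print(steps, round(beta, 3))
```

The estimate keeps growing with the number of iterations instead of settling, because the likelihood can always be increased by making the curve steeper.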

When can LDA fail

It can fail if either the between or within covariance matrix(Sigma) is singular but that is a rather rare instance.


In fact, if there is complete or quasi-complete separation then all the better, because the discriminant is more likely to be successful.


LDA is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

In the post on LDA, QDA we said that LDA is a generalization of Fisher’s discriminant analysis (which involves projecting the data onto a lower dimension that achieves maximum separation).


LDA may result in information loss

The low-dimensional representation has the problem that it can result in loss of information. This is less of a problem when the data are linearly separable, but if they are not, the loss of information might be substantial and the classifier will perform poorly.

Another assumption of LDA is that all classes share the same covariance matrix; when that does not hold we might go for QDA. The blog post on LDA, QDA lists more considerations about this.





Linear Discriminant Analysis (LDA)

In LR, we estimate the posterior probability directly. In LDA we estimate the likelihood and then use Bayes' theorem. Calculating the posterior using Bayes' theorem is easy in the case of classification because the hypothesis space is limited.


Equation 4 is derived from equation 3. Probability(k) is highest for the class for which Delta(k) is highest.

LDA estimates the means and the variance from the data and uses equation 4 for classification.


Assumptions made:

  • f(x) is normal
  • Variance(sigma) is same for all classes
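A minimal one-predictor sketch of this procedure. It assumes Delta(k) is the usual discriminant δ_k(x) = x μ_k / σ² – μ_k² / (2σ²) + log π_k, with a pooled variance shared by all classes; the data are made up.

```python
import math

def fit_lda_1d(xs, labels):
    """Estimate class means, class priors and the pooled (shared) variance."""
    classes = sorted(set(labels))
    n = len(xs)
    means, priors = {}, {}
    pooled_sum_of_squares = 0.0
    for k in classes:
        xk = [x for x, y in zip(xs, labels) if y == k]
        means[k] = sum(xk) / len(xk)
        priors[k] = len(xk) / n
        pooled_sum_of_squares += sum((x - means[k]) ** 2 for x in xk)
    variance = pooled_sum_of_squares / (n - len(classes))
    return means, priors, variance

def discriminant(x, mean, prior, variance):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x * mean / variance - mean ** 2 / (2 * variance) + math.log(prior)

def predict(x, means, priors, variance):
    """Pick the class with the largest discriminant value."""
    return max(means, key=lambda k: discriminant(x, means[k], priors[k], variance))

xs = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
labels = [0, 0, 0, 1, 1, 1]

means, priors, variance = fit_lda_1d(xs, labels)
print(means, priors, variance)
print(predict(2.0, means, priors, variance), predict(4.2, means, priors, variance))
```

With equal priors the decision boundary falls midway between the two class means.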


When there is more than one predictor, we use the multivariate gaussian



Quadratic Discriminant Analysis (QDA)

Unlike LDA, QDA assumes that each class has its own covariance matrix. It is called quadratic because the function below is quadratic in x.


When to use LDA, QDA

  • This is related to the bias variance trade-off
  • For p predictors and k classes
    • LDA estimates k*p parameters
    • QDA estimates an additional k*p*(p+1)/2 parameters
  • So LDA has much lower variance, but the classifier built can suffer from high bias
  • LDA should be used when the number of training samples is small, because we want to avoid the high variance problem
  • QDA has high variance, so it should be used when the number of training samples is large
    • Another scenario would be the case when a common covariance matrix among the K classes is untenable


A note on Fisher’s Linear Discriminant Analysis

  • It is simply LDA in case of two classes.
  • We can derive this similarity mathematically.
  • In the literature it is usually presented from the perspective that it projects the data onto a line which achieves maximum separation
  • More generally, LDA also provides a low-dimensional view of the data



  • We want to project 2-D data onto a line which
    • maximizes the difference between the projected means
    • minimizes the within-class variance
  • Such a direction (w) can be found by maximizing the Fisher criterion (J)
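A small sketch of this projection on made-up 2-D data. It assumes the usual form of the criterion, J(w) = (wᵀ(m2 – m1))² / (wᵀ S_W w), where S_W is the within-class scatter matrix, and the known maximizer w ∝ S_W⁻¹ (m2 – m1).

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """Sum of outer products of deviations from the class mean (2x2 matrix)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_criterion(w, m1, m2, sw):
    between = (w[0] * (m2[0] - m1[0]) + w[1] * (m2[1] - m1[1])) ** 2
    within = sum(w[i] * sw[i][j] * w[j] for i in range(2) for j in range(2))
    return between / within

class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class2 = [(5.0, 4.0), (6.0, 6.0), (7.0, 5.0), (6.0, 4.0)]

m1, m2 = mean(class1), mean(class2)
s1, s2 = scatter(class1, m1), scatter(class2, m2)
sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]

# w = S_W^{-1} (m2 - m1), using the closed-form inverse of a 2x2 matrix.
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
diff = [m2[0] - m1[0], m2[1] - m1[1]]
w_fisher = [
    (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
    (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
]

j_fisher = fisher_criterion(w_fisher, m1, m2, sw)
j_others = {
    "x axis": fisher_criterion([1.0, 0.0], m1, m2, sw),
    "y axis": fisher_criterion([0.0, 1.0], m1, m2, sw),
    "mean difference": fisher_criterion(diff, m1, m2, sw),
}
print(w_fisher, j_fisher, j_others)
```

Note that simply projecting onto the line joining the two means scores lower than the Fisher direction, because it ignores the within-class variance.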






Classification – One vs Rest and One vs One


In the blog post on Cost Function And Hypothesis for LR we noted that LR (Logistic Regression) inherently models binary classification. Here we will describe two approaches used to extend it for multiclass classification.

The One vs Rest approach takes one class as positive and all the rest as negative, and trains a classifier. So for data having n classes it trains n classifiers. In the classification phase each of the n classifiers predicts the probability of its class, and the class with the highest probability is selected.

One vs One considers each pair of classes and trains a classifier on the subset of data containing those two classes. So it trains a total of n*(n-1)/2 classifiers. During the classification phase each classifier predicts one class. (This is in contrast to one vs rest, where each classifier predicts a probability.) The class which has been predicted most often is the answer.


For example consider four class problem having classes A, B, C, and D.

One vs Rest

  • Trains classifier_A, classifier_B, classifier_C and classifier_D
  • During prediction here are the probabilities we get:
    • classifier_A = 40%
    • classifier_B = 30%
    • classifier_C = 60%
    • classifier_D = 50%
  • We assign it class C, which has the highest probability

One vs One

  • We train a total of six classifiers, each on the subset of data containing the two classes involved
    • classifier_AB
    • classifier_AC
    • classifier_AD
    • classifier_BC
    • classifier_BD
    • classifier_CD
  • And during classification
    • classifier_AB assigns class A
    • classifier_AC assigns class A
    • classifier_AD assigns class A
    • classifier_BC assigns class B
    • classifier_BD assigns class D
    • classifier_CD assigns class C
  • We assign it to class A (three votes, against one each for B, C and D)
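The two decision rules can be sketched with the numbers from this example. The trained classifiers themselves are not modelled here; the dictionaries simply hold the outputs listed above.

```python
from collections import Counter
from itertools import combinations

classes = ["A", "B", "C", "D"]

# One vs rest: one probability per class, pick the largest.
ovr_probabilities = {"A": 0.40, "B": 0.30, "C": 0.60, "D": 0.50}
ovr_prediction = max(ovr_probabilities, key=ovr_probabilities.get)

# One vs one: one classifier per pair of classes, n * (n - 1) / 2 in total.
pairs = list(combinations(classes, 2))
ovo_winners = {
    ("A", "B"): "A", ("A", "C"): "A", ("A", "D"): "A",
    ("B", "C"): "B", ("B", "D"): "D", ("C", "D"): "C",
}

# Each pairwise classifier casts one vote; the most voted class wins.
votes = Counter(ovo_winners[pair] for pair in pairs)
ovo_prediction, ovo_votes = votes.most_common(1)[0]

print(len(pairs), ovr_prediction, ovo_prediction, dict(votes))
```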


More notes

  • One vs rest trains fewer classifiers and hence is faster overall, so it is usually preferred
  • A single classifier in one vs one uses only a subset of the data, so each individual classifier is faster to train
  • One vs one is less prone to imbalance in the dataset (dominance of particular classes)



  • What if two classes get an equal number of votes in the one vs one case?
  • What if the probabilities are almost equal in the one vs rest case?
  • We will discuss these issues in further blog posts

On Classification Accuracy

Some Scenarios

  • In finance, default is the failure to meet the legal obligations of a loan. Given some data we want to classify whether a person will be a defaulter or not.
    • Suppose our training data-set is imbalanced. Out of 10k samples only 300 are defaulters. (3%)
    • The classifier in the following table is good at classifying non-defaulters but not good at classifying defaulters (which is more important for a credit card company)
    • Consider what happens if, out of 300 defaulters, 250 are classified as non-defaulters and given the credit card
  • Doctors want to conduct a test to determine whether a patient has cancer or not.
    • Popular terms in the medical field are sensitivity and specificity
    • Instead of trying to classify a person as a defaulter, here we classify whether a patient has cancer.
    • Sensitivity = 81/333 = 24 %
    • Specificity = 9644/9667 = 99.8 %
    • Every medical test strives to achieve 100% in both sensitivity and specificity.
  • In information retrieval we want to know what percentage of relevant pages we were able to retrieve.
    • TP = 81
    • FP = 23
    • TN = 9644
    • FN = 252
    • Precision = 81/104 = 78%
    • Recall = 81/333 = sensitivity = 24%



  • Formulas:
    • Precision = TP/(TP+FP)
    • Recall = TP/(TP+FN)
    • Sensitivity = TP/(TP+FN)
    • Specificity = TN/(TN+FP)
    • Recall and sensitivity are same
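The formulas can be applied directly to the counts used above (TP = 81, FP = 23, TN = 9644, FN = 252). The function name is illustrative.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the metrics above from the four confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # same as sensitivity
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

metrics = confusion_metrics(tp=81, fp=23, tn=9644, fn=252)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Overall accuracy is above 97% even though only about a quarter of the positives are found, which is why accuracy alone is misleading on imbalanced data.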

Solution is to change the threshold

  • Earlier we were assigning a person to default if the probability was more than 50%
  • Now we want to assign more people as defaulters
  • So we will assign them to defaulter when the probability is more than 20%
  • This will incorrectly classify some non-defaulters as defaulters, but that is less of a concern than classifying a defaulter as a non-defaulter
    • This will also increase the overall error rate, which is still okay


  • We can always increase sensitivity by classifying all samples as positive
  • We can increase specificity by classifying all samples as negative
  • ROC plots (sensitivity) vs (1-specificity)
    • That is, true positive rate vs false positive rate
  • ROC = Receiver operating characteristic
  • It is good to have the ROC curve close to the top left corner
    • Better classifier
    • Accurate test
  • An ROC curve close to the 45 degree line represents a less accurate test
  • AUC = Area Under Curve
    • Area under ROC curve
  • Ideal value for AUC is 1
  • AUC of 0.5 represents a random classifier
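A minimal sketch of building an ROC curve and its AUC by sweeping the threshold over predicted scores. The scores and labels are made up, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) for each threshold, highest first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)

# A classifier that ranks every positive above every negative reaches AUC = 1.
perfect_area = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
print(curve, area, perfect_area)
```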


Threshold selection

  • Unless there is a special business requirement (as with credit card defaulters), we want to select a threshold which maximizes TP while minimizing FP
  • There are two methods to do that:
    • Point which is closest to (0, 1) in ROC curve
    • Youden Index
      • Point which maximizes vertical distance from line of equality (45 degree line)
      • We can derive that this is the point which maximizes (sensitivity + specificity)
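Both selection methods can be sketched on a small made-up set of scores. The Youden index is taken here as sensitivity + specificity – 1, and the candidate thresholds are simply the observed scores.

```python
import math

# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def sensitivity_specificity(threshold):
    """Predict positive when score >= threshold, then compute both rates."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

def youden_index(threshold):
    sens, spec = sensitivity_specificity(threshold)
    return sens + spec - 1.0

def distance_to_corner(threshold):
    """Distance from the ROC point (1 - spec, sens) to the ideal corner (0, 1)."""
    sens, spec = sensitivity_specificity(threshold)
    return math.hypot(1.0 - spec, 1.0 - sens)

thresholds = sorted(set(scores))
best_by_youden = max(thresholds, key=youden_index)
best_by_distance = min(thresholds, key=distance_to_corner)
print(best_by_youden, best_by_distance)
```

On this data the two methods agree, but in general they can pick different thresholds.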


AUC vs overall accuracy as comparison metric

  • AUC helps us understand how far our classifier is from a random guess, which accuracy cannot tell
  • Accuracy is measured at a particular threshold, while AUC requires moving the threshold from 0 to 1


F score

  • We know that recall and sensitivity are the same, but precision and specificity are not
  • While the medical field is more concerned about specificity, information retrieval is more concerned about precision
  • So they came up with the F score, which is the harmonic mean of precision and recall
  • AUC helps us maximize sensitivity and specificity simultaneously, while the F score helps us maximize precision and recall simultaneously
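A short sketch of the F score using the precision and recall from the earlier counts (TP = 81, FP = 23, FN = 252). The `beta` parameter is the usual generalization and is an addition here; beta = 1 gives the plain harmonic mean.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; with beta = 1 this is the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision = 81 / (81 + 23)   # TP / (TP + FP)
recall = 81 / (81 + 252)     # TP / (TP + FN)

f1 = f_score(precision, recall)
arithmetic_mean = (precision + recall) / 2
print(f"F1 = {f1:.3f}, arithmetic mean = {arithmetic_mean:.3f}")
```

The harmonic mean is pulled towards the smaller of the two values, so a classifier cannot get a high F score with good precision alone.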



An Introduction to Statistical Learning –