On multivariate Gaussian


Formula for multivariate gaussian distribution


Formula of univariate gaussian distribution



  • There is a normalization constant in both equations
  • Σ being positive definite ensures that the quadratic form in the exponent is a downward-opening bowl
  • σ² being positive likewise ensures that the parabola opens downward
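
For reference, the two densities compared above can be written in their standard textbook form. The symbols (x, μ, Σ, σ², dimension n) follow the usual convention, as in [0], and are an assumption about what the original formulas showed.

```latex
p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)

p(x;\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}\right)
```

In both cases the fraction in front is the normalization constant and the exponent is a negative quadratic in x.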


On Covariance Matrix

Definition of covariance between two vectors:


When we have more than two variables, we present the covariances in matrix form. The covariance matrix then looks like


  • The formula of the multivariate Gaussian distribution requires Σ to be non-singular (invertible) and symmetric positive semidefinite, which in turn means Σ must be symmetric positive definite.
  • For some data these requirements might not be met



The following derivations are available at [0]:

  • We can prove [0] that when the covariance matrix is diagonal (i.e. the variables are independent), the multivariate Gaussian density is simply the product of the univariate Gaussian densities of each variable.
  • It is derived that the isocontours (figure 1) are ellipses, and that each axis length is proportional to the standard deviation of the corresponding variable
  • The contours remain elliptical even when the covariance matrix is not diagonal (the ellipses are then rotated), and for dimension n > 2 they become ellipsoids
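
The diagonal-covariance claim can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `gaussian_pdf` and `mvn_pdf` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def mvn_pdf(x, mu, cov):
    """Multivariate Gaussian density for a symmetric positive definite cov."""
    n = len(mu)
    diff = x - mu
    # Solve cov @ y = diff instead of forming the inverse explicitly.
    quad = diff @ np.linalg.solve(cov, diff)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# Hypothetical example values.
mu = np.array([1.0, -2.0])
variances = np.array([0.5, 3.0])
cov = np.diag(variances)          # diagonal covariance matrix
x = np.array([0.3, -1.0])

joint = mvn_pdf(x, mu, cov)
product = np.prod(gaussian_pdf(x, mu, variances))
print(joint, product)             # the two values should agree
```

With a non-diagonal `cov` the two numbers would generally differ, because the variables are then correlated.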


Linear Transformation Interpretation


This was proved in two steps [0]:

Step-1 : Factorizing covariance matrix


Step-2 : Change of variables, which we apply to density function
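
The two steps can be sketched numerically as follows. This is an illustration, not the proof from [0]: it uses a Cholesky factor for convenience, whereas [0] builds the factor from the eigendecomposition. The mean, covariance and sample size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean and covariance.
mu = np.array([1.0, -1.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

# Step 1: factorize the covariance matrix as cov = B @ B.T.
B = np.linalg.cholesky(cov)

# Step 2: change of variables. If z ~ N(0, I) then x = B z + mu ~ N(mu, cov),
# and conversely z = B^{-1} (x - mu) is standard normal.
z = rng.standard_normal(size=(100_000, 2))
x = z @ B.T + mu
z_back = np.linalg.solve(B, (x - mu).T).T

sample_cov = np.cov(x, rowvar=False)
print(np.round(sample_cov, 2))    # should be close to cov
```

So any multivariate Gaussian can be seen as a standard normal vector that has been linearly transformed and shifted.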




[0] http://cs229.stanford.edu/section/gaussians.pdf





Quick Note on Probability Rules

p(X, Y) is joint distribution

p(X|Y) is conditional distribution

p(X) is marginal distribution (Y is marginalized out).


You cannot get a conditional distribution from a joint distribution by integration alone. Integration only gives a marginal; the conditional also requires dividing the joint by that marginal.

There are just two rules of probability: the sum rule and the product rule. Bayes' theorem then follows from them.




We might want to look at a table like the one below, calculate the joint and conditional distributions, and marginalize out one of the variables. [1]
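
As a small worked example, here is a sketch with a hypothetical 2x2 joint table (the numbers are made up and are not the table from [1]), assuming NumPy is available. It applies the sum rule, the product rule and Bayes' theorem in turn.

```python
import numpy as np

# Hypothetical joint distribution p(X, Y); rows index X, columns index Y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Sum rule: marginalize out the other variable.
p_x = joint.sum(axis=1)    # p(X) = sum over Y of p(X, Y)
p_y = joint.sum(axis=0)    # p(Y) = sum over X of p(X, Y)

# Product rule rearranged: p(X | Y) = p(X, Y) / p(Y).
p_x_given_y = joint / p_y            # each column sums to 1
# Likewise p(Y | X) = p(X, Y) / p(X).
p_y_given_x = joint / p_x[:, None]   # each row sums to 1

# Bayes' theorem: p(X | Y) = p(Y | X) p(X) / p(Y).
bayes = p_y_given_x * p_x[:, None] / p_y
```

Note that the conditional needs both the joint and a marginal: summing (or integrating) alone only ever produces `p_x` or `p_y`.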



Further reading : 

[1] http://www.utm.utoronto.ca/~w3act/act245h5f/pr4.pdf



Support Vector Machines

Maximum margin classifiers

  • Also known as optimal separating hyperplane
  • Margin is the distance between hyperplane and closest training data point
  • We want to select a hyperplane for which this distance is maximum
  • Once we identify the optimal separating hyperplane, there can be several equidistant training points at the shortest distance from the hyperplane
  • Such points are called support vectors
    • These points support the hyperplane in the sense that if they are moved slightly, the optimal hyperplane will also move



  • Equation 9.10 ensures that the left hand side of equation 9.11 gives the perpendicular distance from the hyperplane
  • The equations assume y takes two values, +1 and -1
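
The equations referred to above are not reproduced in these notes. Assuming the numbering follows chapter 9 of ISLR, the maximal margin problem reads roughly as follows, where the norm constraint plays the role of 9.10 and the margin constraint the role of 9.11 (M is the margin, n the number of training points, p the number of predictors):

```latex
\max_{\beta_0,\beta_1,\dots,\beta_p,\,M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^{2} = 1, \qquad
y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M
\quad \forall\, i = 1,\dots,n
```

Because the coefficients have unit norm, the left hand side of the second constraint is the signed perpendicular distance of point i from the hyperplane.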


Support Vector classifier

  • Maximum margin classifier does not work when a separating hyperplane does not exist.
  • Support vector classifier relaxes the optimization objective so that a solution still exists
  • Unlike the maximum margin classifier, it is also less prone to overfitting
    • The former is very sensitive to a change in a single observation
  • Also known as soft margin classifier


  • The slack variable \epsilon allows a training point to be on the wrong side of the margin
    • If \epsilon > 0 it is on the wrong side of the margin
    • If \epsilon > 1 it is on the wrong side of the hyperplane
  • Parameter C is the budget that constrains how many points are allowed on the wrong side of the margin or hyperplane, and how severely
  • C is selected with cross validation and controls the bias variance trade-off
  • Points that lie directly on the margin or on the wrong side of the margin for their class are called support vectors
    • Because only these points affect the choice of hyperplane
    • And this is the property which makes it robust to observations far from the hyperplane
      • LDA, in contrast, depends on the mean of all the observations
      • LR, however, also has low sensitivity to observations far from the decision boundary
  • Computation note – when we solve the above optimization problem with Lagrange multipliers, we find that the solution depends only on dot products of training samples
    • This will be very important when we discuss support vector machine in next section
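
For reference, the soft margin problem that these bullets describe can be written as below. This is the form I recall from ISLR chapter 9, so treat the exact notation as an assumption; \epsilon_i is the slack of point i and C is the budget.

```latex
\max_{\beta_0,\dots,\beta_p,\,\epsilon_1,\dots,\epsilon_n,\,M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^{2} = 1, \qquad
y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M(1-\epsilon_i), \qquad
\epsilon_i \ge 0, \qquad
\sum_{i=1}^{n} \epsilon_i \le C
```

Setting C = 0 forces every \epsilon_i to zero and recovers the maximum margin classifier.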



Support vector machine

  • The above two classifiers do not work when the desired decision boundary is not linear
  • One solution is to create polynomial features (as we generally do for LR)
  • But the fundamental problem with this approach is deciding how many and which terms to create
  • Also, creating a large number of features raises computational problems
    • For the case of SVM, the fact that it involves only dot products of observations allows us to perform the kernel trick.
    • Kernel acts as a similarity function
    • The above equation makes it clear that we are not calculating (and storing) the higher order polynomial terms, yet we still take advantage of them
    • Second one is polynomial kernel and last one is radial kernel
    • This video shows a visualization of the kernel trick
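
To make the kernel trick concrete, here is a minimal sketch assuming NumPy. It checks that a degree-2 polynomial kernel on 2-D inputs equals a dot product in an explicit 6-D feature space, without ever building that space. The function names and the feature map `explicit_features` are illustrative choices, and the kernel forms follow the usual polynomial and radial definitions.

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (1 + <x, z>)^degree."""
    return (1.0 + x @ z) ** degree

def explicit_features(x):
    """Explicit degree-2 feature map for a 2-D input.

    Chosen so that explicit_features(x) @ explicit_features(z)
    equals (1 + <x, z>)^2.
    """
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def rbf_kernel(x, z, gamma=1.0):
    """Radial kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

via_kernel = poly_kernel(x, z)                              # one dot product in 2-D
via_features = explicit_features(x) @ explicit_features(z)  # dot product in 6-D
print(via_kernel, via_features)
```

For the radial kernel the corresponding feature space is infinite-dimensional, so the explicit route is not even available; only the kernel evaluation is.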


Multiclass SVM



Comparing LDA and LR


Few Points:

  • LR models the probability with the logistic function
  • LDA models the class-conditional probability with the multivariate Gaussian
  • LR finds the maximum likelihood solution
  • LDA finds the maximum a posteriori class using Bayes' formula

When classes are well separated

When the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable. Coefficients may go to infinity. LDA doesn’t suffer from this problem.

LR gets unstable in the case of perfect separation

If there are covariate values that can predict the binary outcome perfectly then the algorithm of logistic regression, i.e. Fisher scoring, does not even converge.

If you are using R or SAS you will get a warning that probabilities of zero and one were computed and that the algorithm has crashed.

This is the extreme case of perfect separation but even if the data are only separated to a great degree and not perfectly, the maximum likelihood estimator might not exist and even if it does exist, the estimates are not reliable.

The resulting fit is not good at all.

Math behind LR


For example, suppose y = 0 for x = 0 and y = 1 for x = 1. To maximize the likelihood of the observed data, the “S”-shaped logistic regression curve has to model h(theta) as exactly 0 and 1 at these points. This drives \beta towards infinity, which causes the instability. The logistic function lies between 0 and 1 and reaches these values only asymptotically.
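
The divergence can be observed directly. Below is a minimal sketch using plain gradient ascent on a slope-only model (no intercept) with hypothetical, perfectly separated data. Real packages use Fisher scoring or similar, so this only illustrates the behaviour, not their algorithm.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_slope(xs, ys, steps, lr=0.5):
    """Gradient ascent on the log-likelihood of p(y=1|x) = sigmoid(beta * x).

    A single slope and no intercept keeps the sketch small.
    """
    beta = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(beta * x)) * x for x, y in zip(xs, ys))
        beta += lr * grad
    return beta

# Hypothetical, perfectly separated data: y = 1 exactly when x > 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

betas = [fit_slope(xs, ys, steps) for steps in (10, 100, 1000, 10000)]
print(betas)   # keeps growing with the number of steps
```

The gradient is positive for every finite beta, so the estimate never settles: more iterations always give a larger coefficient.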


Few terms:

Complete Separation – when x completely predicts both zero and 1

Quasi-Complete separation – when x completely predicts either 0 or 1

When can LDA fail

It can fail if either the between or within covariance matrix(Sigma) is singular but that is a rather rare instance.


In fact, if there is complete or quasi-complete separation then all the better, because the discriminant is more likely to be successful.


LDA is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

In the post on LDA, QDA we had said that LDA is a generalization of Fisher’s discriminant analysis (which involves projecting the data onto a lower dimension that achieves maximum separation).


LDA may result in information loss

The low-dimensional representation has the problem that it can result in loss of information. This is less of a problem when the data are linearly separable, but if they are not, the loss of information might be substantial and the classifier will perform poorly.

Another assumption of LDA is that all classes share an equal covariance matrix. When that assumption does not hold, we might go for QDA. The blog post on LDA, QDA lists more considerations about the same.








Linear Discriminant Analysis (LDA)

In LR, we estimate the posterior probability directly. In LDA we estimate likelihood and then use Bayes theorem. Calculating posterior using bayes theorem is easy in case of classification because hypothesis space is limited.


Equation 4 is derived from equation 3. The posterior probability of class k is highest for the class whose Delta(k) is highest.

LDA estimates mean and variance from data and uses equation 4 for classification.


Assumptions made:

  • f(x) is normal
  • Variance(sigma) is same for all classes
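
A minimal one-predictor sketch of this procedure follows. It assumes the discriminant has the usual form delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(prior_k) and uses a pooled variance estimate; the function names and the data are hypothetical.

```python
import math
from statistics import mean

def fit_lda_1d(xs, labels):
    """Estimate class priors, class means and one pooled variance."""
    classes = sorted(set(labels))
    n = len(xs)
    groups = {k: [x for x, lab in zip(xs, labels) if lab == k] for k in classes}
    priors = {k: len(g) / n for k, g in groups.items()}
    means = {k: mean(g) for k, g in groups.items()}
    # Pooled variance: within-class squared deviations over (n - K).
    ss = sum((x - means[k]) ** 2 for k, g in groups.items() for x in g)
    var = ss / (n - len(classes))
    return priors, means, var

def discriminant(x, prior, mu, var):
    """delta_k(x) = x * mu_k / var - mu_k^2 / (2 var) + log(prior_k)."""
    return x * mu / var - mu ** 2 / (2.0 * var) + math.log(prior)

def predict(x, priors, means, var):
    """Pick the class with the largest discriminant value."""
    return max(priors, key=lambda k: discriminant(x, priors[k], means[k], var))

# Hypothetical training data for two classes.
xs = [1.0, 1.5, 2.0, 2.5, 5.0, 5.5, 6.0, 6.5]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
priors, means, var = fit_lda_1d(xs, labels)
print(predict(2.0, priors, means, var), predict(5.0, priors, means, var))
```

With equal priors and a shared variance, the decision boundary falls at the midpoint of the two class means (3.75 for this data).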


When there is more than one predictor, we use the multivariate Gaussian



Quadratic Discriminant Analysis (QDA)

Unlike LDA, QDA assumes that each class has its own covariance matrix. It is called quadratic because the discriminant function below is quadratic in x.
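
For comparison, the two discriminant functions in their standard multivariate form are shown below. The notation (\mu_k, \Sigma, \Sigma_k, \pi_k) is the conventional one and is an assumption about the missing formula.

```latex
\text{LDA:}\quad \delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k

\text{QDA:}\quad \delta_k(x) = -\tfrac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k
```

In LDA the term x^T \Sigma^{-1} x is common to all classes and cancels, leaving a function linear in x. In QDA each class has its own \Sigma_k, so the quadratic term stays.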


When to use LDA, QDA

  • This is related to the bias variance trade-off
  • For p predictors and k classes
    • LDA has k*p linear coefficients to estimate
    • QDA estimates a separate covariance matrix for each class, which adds k*p*(p+1)/2 parameters
  • So LDA has much lower variance, but the classifier built can suffer from high bias
  • LDA should be used when the number of training samples is small, because we want to avoid the high variance problem
  • QDA has high variance, so it should be used when the number of training samples is large
    • Another scenario is the case when a common covariance matrix among the K classes is untenable


A note on Fisher’s Linear Discriminant Analysis

  • It is simply LDA in case of two classes.
  • We can derive this similarity mathematically.
  • In the literature it is usually presented from the perspective that it projects the data onto a line which achieves maximum separation
  • More generally, LDA also provides a low dimensional view of the data



  • We want to project 2-D data onto a line which
    • maximizes the difference between the projected class means
    • minimizes the within class variance of the projections
  • Such a direction (w) can be found by maximizing the Fisher criterion (J)
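
A minimal numerical sketch of this idea, assuming NumPy and the common two-class form J(w) = (w^T(m2 - m1))^2 / (w^T S_W w), whose maximizer is proportional to S_W^{-1}(m2 - m1). The data are hypothetical samples generated only for illustration.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction w maximizing J(w) = (w^T (m2 - m1))^2 / (w^T S_W w)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # w is proportional to S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """Squared gap between projected means over summed projected scatter."""
    p1, p2 = X1 @ w, X2 @ w
    between = (p2.mean() - p1.mean()) ** 2
    within = np.sum((p1 - p1.mean()) ** 2) + np.sum((p2 - p2.mean()) ** 2)
    return between / within

# Hypothetical 2-D data: two correlated Gaussian clouds.
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.0], [2.0, 2.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([3.0, 1.0], cov, size=200)

w = fisher_direction(X1, X2)
# Compare against a sweep of other unit directions.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
others = [fisher_criterion(np.array([np.cos(a), np.sin(a)]), X1, X2) for a in angles]
print(fisher_criterion(w, X1, X2), max(others))
```

No direction in the sweep scores higher than `w`, which is what maximizing J means.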






Classification – One vs Rest and One vs One


In the blog post on Cost Function And Hypothesis for LR we noted that LR (Logistic Regression) inherently models binary classification. Here we will describe two approaches used to extend it for multiclass classification.

One vs Rest approach takes one class as positive and all the rest as negative and trains the classifier. So for data having n classes it trains n classifiers. In the classification phase each of the n classifiers predicts the probability of its own class, and the class with the highest probability is selected.

One vs One considers each pair of classes and trains a classifier on the subset of data containing those two classes. So it trains a total of n*(n-1)/2 classifiers. During the classification phase each classifier predicts one class. (This is in contrast to one vs rest, where each classifier predicts a probability.) The class which has been predicted most often is the answer.


For example consider four class problem having classes A, B, C, and D.

One vs Rest

  • Models classifier_A, classifier_B, classifier_C and classifier_D
  • During prediction here is the probability we get:
    • classifier_A = 40%
    • classifier_B = 30%
    • classifier_C = 60%
    • classifier_D = 50%
  • We assign it class C, since classifier_C reports the highest probability

One vs One

  • We train a total of six classifiers, each on the subset of data containing the two classes involved
    • classifier_AB
    • classifier_AC
    • classifier_AD
    • classifier_BC
    • classifier_BD
    • classifier_CD
  • And during classification
    • classifier_AB assigns class A
    • classifier_AC assigns class A
    • classifier_AD assigns class A
    • classifier_BC assigns class B
    • classifier_BD assigns class D
    • classifier_CD assigns class C
  • We assign it to class A
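
The two decision rules can be sketched in a few lines, using the numbers from the four-class example above. The dictionaries simply stand in for the outputs of already trained classifiers.

```python
from collections import Counter

# One vs Rest: each classifier reports the probability of its own class.
ovr_probs = {"A": 0.40, "B": 0.30, "C": 0.60, "D": 0.50}
ovr_prediction = max(ovr_probs, key=ovr_probs.get)

# One vs One: each pairwise classifier casts one vote.
ovo_votes = {
    ("A", "B"): "A",
    ("A", "C"): "A",
    ("A", "D"): "A",
    ("B", "C"): "B",
    ("B", "D"): "D",
    ("C", "D"): "C",
}
vote_counts = Counter(ovo_votes.values())
ovo_prediction, n_votes = vote_counts.most_common(1)[0]

print(ovr_prediction)   # C
print(ovo_prediction)   # A
```

One vs rest picks the class with the highest probability (C at 60%), while one vs one picks the class with the most votes (A with three).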


More notes

  • One vs rest trains fewer classifiers, so it is faster overall and is usually preferred
  • A single classifier in one vs one uses only a subset of the data, so each individual classifier is faster to train in one vs one
  • One vs one is less prone to imbalance in the dataset (dominance of particular classes)



  • What if two classes get an equal number of votes in the one vs one case?
  • What if the probabilities are almost equal in the one vs rest case?
  • We will discuss these issues in further blog posts

On Classification Accuracy

Some Scenarios

  • In finance, default is the failure to meet the legal obligations of a loan. Given some data we want to classify whether a person will default or not.
    • Suppose our training data-set is imbalanced. Out of 10k samples only 333 are defaulters (about 3%).
    • The classifier in the following table is good at classifying non-defaulters but not good at classifying defaulters (which is more important for a credit card company)
    • Consider what happens if, out of 333 defaulters, 252 are classified as non-defaulters and given the credit card
  • Doctors want to conduct a test of whether a patient has cancer or not.
    • Popular terms in the medical field are sensitivity and specificity
    • Instead of classifying a person as a defaulter, here we classify whether a patient has cancer.
    • Sensitivity = 81/333 = 24 %
    • Specificity = 9644/9667 = 99.8 %
    • Every medical test strives to achieve 100% in both sensitivity and specificity.
  • In information retrieval we want to know what percentage of relevant pages we were able to retrieve.
    • TP = 81
    • FP = 23
    • TN = 9644
    • FN = 252
    • Precision = 81/104 = 78%
    • Recall = 81/333 = sensitivity = 24%



  • Formulas:
    • Precision = TP/(TP+FP)
    • Recall = TP/(TP+FN)
    • Sensitivity = TP/(TP+FN)
    • Specificity = TN/(TN+FP)
    • Recall and sensitivity are same
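
The formulas translate directly into code. A minimal sketch, using the TP, FP, TN and FN counts quoted above; the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics derived from the four confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # same quantity as sensitivity
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

metrics = classification_metrics(tp=81, fp=23, tn=9644, fn=252)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Overall accuracy here is above 97% even though only about a quarter of the positives are caught, which is why accuracy alone is misleading on imbalanced data.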

Solution is to change the threshold

  • Earlier we were assigning a person to default if the probability was more than 50%
  • Now we want to flag more people as defaulters
  • So we will assign them to defaulter when the probability is more than 20%
  • This will incorrectly classify some non-defaulters as defaulters, but that is less of a concern than classifying a defaulter as a non-defaulter
    • This will also increase the overall error rate, which is still okay


  • We can always increase sensitivity by classifying all samples as positive
  • We can increase specificity by classifying all samples as negative
  • The ROC curve plots sensitivity against (1 - specificity)
    • That is, true positive rate vs false positive rate
  • ROC = Receiver operating characteristic
  • It is good to have the ROC curve near the top left
    • Better classifier
    • Accurate test
  • An ROC curve close to the 45 degree line represents a less accurate test
  • AUC = Area Under Curve
    • Area under the ROC curve
  • Ideal value for AUC is 1
  • AUC of 0.5 represents a random classifier
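
One way to get a feel for AUC is its pairwise interpretation: it equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score. A minimal sketch with hypothetical scores and labels:

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form equals the area under the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 1]                    # hypothetical ground truth
perfect = [0.1, 0.2, 0.9, 0.3, 0.8, 0.7]       # every positive outranks every negative
constant = [0.5] * 6                           # uninformative scores
decent = [0.1, 0.6, 0.9, 0.3, 0.5, 0.7]        # one positive ranked below one negative

print(auc(perfect, labels), auc(constant, labels), auc(decent, labels))
```

The perfect ranking gives 1, the uninformative scores give 0.5, and a mostly correct ranking lands in between.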


Threshold selection

  • Unless there is special business requirement (as in credit card defaulters) we want to select a threshold which maximizes TP while minimizing FP
  • There are two methods to do that:
    • Point which is closest to (0, 1) in ROC curve
    • Youden Index
      • Point which maximizes vertical distance from line of equality (45 degree line)
      • We can derive that this is the point which maximizes (sensitivity + specificity)
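
Both selection rules can be sketched on a small hypothetical set of scores. The threshold grid, scores and labels below are made up for illustration.

```python
import math

def roc_points(scores, labels, thresholds):
    """Return (threshold, sensitivity, specificity) for each threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        points.append((t, tp / n_pos, tn / n_neg))
    return points

# Hypothetical scores and labels.
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
labels = [0,    0,    0,    1,    0,    0,    1,    1,    0,    1]
points = roc_points(scores, labels, thresholds=[i / 10 for i in range(11)])

# Youden index: J = sensitivity + specificity - 1.
youden = max(points, key=lambda p: p[1] + p[2] - 1)
# Point closest to the top-left corner (FPR = 0, TPR = 1).
closest = min(points, key=lambda p: math.hypot(1 - p[2], 1 - p[1]))

print(youden, closest)
```

On this data both rules pick the same threshold, but in general they can disagree, since one measures vertical distance from the diagonal and the other Euclidean distance to the corner.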


AUC vs overall accuracy as comparison metric

  • AUC tells us how far our classifier is from random guessing, which accuracy cannot tell
  • Accuracy is measured at a particular threshold, while AUC requires sweeping the threshold from 0 to 1


F score

  • We know that recall and sensitivity are the same, but precision and specificity are not
  • While the medical field is more concerned about specificity, information retrieval is more concerned about precision
  • So they came up with the F score, which is the harmonic mean of precision and recall
  • AUC helps us maximize sensitivity and specificity simultaneously, while the F score helps us maximize precision and recall simultaneously
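
As a quick worked example, the F score for the counts used earlier (TP = 81, FP = 23, FN = 252) can be computed as below. The function name `f1_score` is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision = 81 / 104   # TP / (TP + FP)
recall = 81 / 333      # TP / (TP + FN)
f1 = f1_score(precision, recall)
print(round(f1, 3))
```

The harmonic mean is pulled towards the smaller of the two values, so the low recall keeps the F score low despite the high precision.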







An Introduction to Statistical Learning – http://www-bcf.usc.edu/~gareth/ISL/



Cost Function And Hypothesis for LR


We want a hypothesis that is bounded between zero and one, whereas the linear regression hypothesis extends beyond these limits. The hypothesis here also represents the probability of observing an outcome.


Hypothesis by ISLR and Andrew Ng:

Odds and log-odds/logit

In regression, beta1 gives the average change in y for a unit change in x. But here a unit increase in x changes the log-odds by beta1. It multiplies the odds by exp(beta1), so the change in probability depends on the current value of the odds and is therefore not linear.
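
This can be verified numerically. A minimal sketch with hypothetical coefficients beta0 = -1.0 and beta1 = 0.8:

```python
import math

def probability(x, beta0, beta1):
    """Logistic hypothesis p(x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    return p / (1.0 - p)

beta0, beta1 = -1.0, 0.8   # hypothetical coefficients

rows = []
for x in (0.0, 1.0, 2.0):
    p_now = probability(x, beta0, beta1)
    p_next = probability(x + 1, beta0, beta1)
    log_odds_change = math.log(odds(p_next)) - math.log(odds(p_now))
    odds_ratio = odds(p_next) / odds(p_now)
    rows.append((x, log_odds_change, odds_ratio, p_next - p_now))

for row in rows:
    print([round(v, 4) for v in row])
```

The log-odds change and the odds ratio are the same in every row, while the change in probability differs from row to row.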

Cost Function

From the ISLR perspective, it is the likelihood that we want to maximize.


Andrew Ng looks at it from the perspective of modifying the cost function of linear regression.


As we can see, minimizing Andrew Ng's cost function is the same as maximizing the log likelihood of ISLR.
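
To spell out the equivalence, here is a sketch of the two formulations in their usual notation. The symbols are assumptions about the missing formulas: p(x) is the ISLR hypothesis, h_\theta(x) is Andrew Ng's, and m is the number of training samples.

```latex
\ell(\beta_0,\beta_1) = \prod_{i:\,y_i=1} p(x_i) \prod_{i':\,y_{i'}=0} \bigl(1 - p(x_{i'})\bigr)

\log \ell = \sum_{i=1}^{m} \Bigl[ y_i \log p(x_i) + (1-y_i)\log\bigl(1-p(x_i)\bigr) \Bigr]

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr) \Bigr]
```

With h_\theta = p, the cost is J = -(1/m) \log \ell, so minimizing J and maximizing the likelihood pick the same parameters.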

Least squares in linear regression is a special case of maximum likelihood: it is what we get from the derivation in which the likelihood is assumed to be Gaussian.



Hypothesis and T-Distribution

  • We calculate the t-score from the hypothesis and the sample data
  • We also get the degrees of freedom from the sample data
  • We supply these values to a function which gives us the p-value: the probability of seeing a t-score at least this extreme if the hypothesis were true.

formula to calculate t for a 1-sample t-test

  • We can see that the t-score is a ratio, something like a signal to noise ratio.
  • Numerator (sample mean minus hypothesized mean) centers it around zero
  • Denominator is the standard error of the mean (SEM) = s/sqrt(n)
    • Where s is the standard deviation of the samples and n is the number of samples
    • Derivation can be found here
  • t-score is a number indicating how many SEMs the sample mean is away from the mean given in the hypothesis
  • If it is far away, the observed sample would be unlikely if the hypothesized mean were true, and we can reject the null hypothesis
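
A minimal sketch of the one-sample t-score using only the standard library. The measurements and the hypothesized mean of 5.0 are hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(samples, hypothesized_mean):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(samples)
    sem = stdev(samples) / math.sqrt(n)      # standard error of the mean
    t = (mean(samples) - hypothesized_mean) / sem
    return t, n - 1

samples = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.2, 5.7]   # hypothetical measurements
t, dof = one_sample_t(samples, hypothesized_mean=5.0)
print(round(t, 2), dof)
```

The t-score and degrees of freedom can then be passed to a t-distribution routine to get a p-value; with SciPy installed, `2 * scipy.stats.t.sf(abs(t), dof)` gives the two-sided value.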


  • In engineering we are used to the mean and standard deviation being given and true. We generally compute the probability of observing the sample
  • But here we are testing whether the given mean is true or not, given a small number of samples
  • To do this we need a distribution that changes itself based on the number of observations
    • It should widen when there are fewer samples
    • The t distribution does this job for us, as it depends on the number of samples

Type of T-tests

  • One sample T-test
    • Discussion so far is for the one sample test
  • Two sample T-Test
    • To compare the means of two independent groups
    • Scores of students who get eight hours of sleep vs four hours of sleep
    • The question we want to answer is: is there any significant difference in their scores?
    • In the one sample test (in the numerator of the t-score) we compare the sample mean with the population mean
    • In the two sample test we compare the means of two independently drawn samples
    • The SEM formula in the denominator is modified as well
  • Paired T-Test
    • Same samples are used in two different conditions
    • 10 people before medication and same 10 people after medication
      • We want to check if medication has any effect
    • Different time points are used for market calculation
    • This is essentially a one sample T-test on the differences between the values in the two conditions
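
The paired case can be sketched by reducing it to the one-sample formula. The before and after readings below are hypothetical.

```python
import math
from statistics import mean, stdev

before = [140, 152, 138, 147, 160, 155, 149, 143, 151, 158]   # hypothetical readings
after = [135, 150, 136, 140, 152, 151, 147, 139, 148, 150]    # same 10 people

# The paired test is a one-sample test on the per-person differences,
# with the hypothesized mean difference set to zero.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
t = (mean(diffs) - 0.0) / (stdev(diffs) / math.sqrt(n))
dof = n - 1
print(round(t, 2), dof)
```

Working on differences removes the person-to-person variation, which is why the paired test is more sensitive than a two sample test on the same data.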

Q – Q Plots

For the past few days I have been coming across Q-Q plots very often, so I thought I would learn more about them.

Many times we want our data to be normal, because normality is an assumption behind many statistical models. So how do we test normality? Wikipedia has an article about this which lists many methods; one of them is the Q-Q plot.

Here is how to create a Q-Q plot manually (these steps show the theory behind it):

  1. Sort your samples (Call it Y)
  2. Take the n points (quantiles) of the standard normal distribution that divide it into (n+1) equal areas
    1. Standard normal distribution is a distribution with mean = 0 and standard deviation = 1
  3. Call above X
  4. Plot Y against X
  5. For normally distributed data it will be approximately a straight line
    1. However, since the data are random, outliers and the number of samples have a role to play


Here is the example code and plots in Python:
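
The original code and plots are not reproduced in these notes. The following is a minimal sketch of the manual steps using only the standard library. The function names, the sample data and the use of a correlation coefficient as a straightness check are illustrative choices, and `plot_qq` additionally assumes matplotlib is installed.

```python
import random
from statistics import NormalDist, mean

def qq_points(samples):
    """Return (theoretical quantiles X, sorted samples Y) for a normal Q-Q plot."""
    y = sorted(samples)                       # step 1
    n = len(y)
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # Steps 2-3: the n points that cut the standard normal into n + 1 equal areas.
    x = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return x, y

def pearson(x, y):
    """Correlation of the Q-Q points; close to 1 means close to a straight line."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def plot_qq(x, y):
    """Steps 4-5: scatter Y against X (needs matplotlib, imported lazily)."""
    import matplotlib.pyplot as plt
    plt.scatter(x, y, s=10)
    plt.xlabel("Theoretical quantiles (standard normal)")
    plt.ylabel("Ordered sample values")
    plt.show()

random.seed(0)
normal_data = [random.gauss(10.0, 2.0) for _ in range(500)]
skewed_data = [random.expovariate(1.0) for _ in range(500)]

r_normal = pearson(*qq_points(normal_data))
r_skewed = pearson(*qq_points(skewed_data))
print(round(r_normal, 4), round(r_skewed, 4))
```

Calling `plot_qq(*qq_points(normal_data))` should show points close to a straight line, while the skewed sample bends away from it at the ends.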


Reference :

  1. YouTube video