IID Assumption

This is an interesting thread on the topic : https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning


Some notes:


1) The appropriate method of cross-validation depends on whether the i.i.d. assumption holds.



Looking at the last three equations, p(D|h) = p(d1, d2, d3 | h) can be reduced to the product p(d1|h) * p(d2|h) * p(d3|h) only when d1, d2, d3 are independent given h.
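As a small illustration of this factorization, here is a sketch that assumes the hypothesis h is a normal distribution with made-up parameters and three made-up observations. Under independence the joint likelihood is just the product of the individual densities, or equivalently the sum of their logs.

```python
import math
from statistics import NormalDist

# Hypothetical hypothesis h: the data come from Normal(mean=5, sd=2)
h = NormalDist(mu=5.0, sigma=2.0)

# Hypothetical observations d1, d2, d3, assumed independent given h
data = [4.1, 5.3, 6.8]

# Under independence: p(D|h) = p(d1|h) * p(d2|h) * p(d3|h)
likelihood = math.prod(h.pdf(d) for d in data)

# The same quantity on the log scale: a sum instead of a product
log_likelihood = sum(math.log(h.pdf(d)) for d in data)

print(likelihood, math.exp(log_likelihood))
```

If the observations were dependent (for example consecutive points of a time series), this product would no longer equal the joint probability.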



Parametric and Nonparametric tests

Standard statistics books rarely mention nonparametric tests. However, there are some scenarios where they should be used instead of parametric tests. [1] is a nice blog post about this; what follows is just a summary of it.


Different Tests

The table below lists various tests. I have verified that all of them are available in the Python stats package.


When to Use Parametric Tests

  • Parametric tests can perform well with skewed and nonnormal distributions
    • Provided you follow the sample size guidance shown in the table below
  • Parametric tests can perform well when the spread/variance of each group is different
  • Parametric tests usually have more statistical power than nonparametric tests


Reasons to Use Nonparametric Tests

  • Your area of study is better represented by the median
    • For example, income distribution is skewed, so the median is more useful than the mean
    • A few billionaires can pull the mean up significantly
  • You have a very small sample size
    • Even smaller than the sizes given in the table above
  • You have ordinal data, ranked data, or outliers that you can’t remove
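To make the billionaire point concrete, here is a tiny sketch with made-up incomes; the numbers are purely illustrative.

```python
from statistics import mean, median

# Made-up yearly incomes; the last person is a billionaire
incomes = [30_000, 40_000, 50_000, 60_000, 70_000, 1_000_000_000]

print("mean:  ", mean(incomes))    # dominated by the single extreme value
print("median:", median(incomes))  # still describes a typical person
```

The median stays at 55,000 while the mean jumps to well over 100 million, which is why the median represents this kind of data better.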




[1] : http://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test

Hypothesis and T-Distribution

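  • We calculate the t-score from the sample data and the hypothesized mean
  • We also get the degrees of freedom from the sample data
  • We supply both values to a function that returns the p-value: the probability of seeing a t-score at least this extreme if the null hypothesis is true (not the probability that the hypothesis itself is true).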

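Formula to calculate t for a 1-sample t-test: t = (x_bar - mu_0) / (s / sqrt(n)), where x_bar is the sample mean, mu_0 is the mean given in the hypothesis, s is the sample standard deviation and n is the number of samples.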

  • We can see that the t-score is a ratio, something like a signal-to-noise ratio.
  • The numerator (sample mean minus hypothesized mean) centers it around zero
  • The denominator is the standard error of the mean (SEM) = s/sqrt(n)
    • Where s is the standard deviation of the samples and n is the number of samples
    • Derivation can be found here
  • The t-score is the number of SEMs by which the sample mean is away from the mean given in the hypothesis
  • If it is far away, a sample mean like this would be unlikely if the hypothesized mean were true, so we can reject the null hypothesis
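The calculation described above can be sketched as follows. The sample values and the hypothesized mean are made up, and the result is cross-checked against SciPy's `ttest_1samp`.

```python
import math
from statistics import mean, stdev
from scipy import stats

# Made-up sample and hypothesized mean
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7]
mu_0 = 5.0

n = len(sample)
sem = stdev(sample) / math.sqrt(n)      # standard error of the mean, s/sqrt(n)
t_score = (mean(sample) - mu_0) / sem   # how many SEMs the sample mean is from mu_0
df = n - 1                              # degrees of freedom

# Two-sided p-value: probability of a |t| at least this large if mu_0 is the true mean
p_value = 2 * stats.t.sf(abs(t_score), df)

# Cross-check against SciPy's built-in one-sample t-test
result = stats.ttest_1samp(sample, popmean=mu_0)
print(t_score, p_value)
print(result.statistic, result.pvalue)
```

Both lines should print the same pair of numbers; a small p-value would lead us to reject the hypothesized mean.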


  • In engineering we are used to the mean and standard deviation being given and true, and we generally compute the probability of observing a sample
  • But here we are testing whether a given mean is plausible, given a small number of samples
  • To do this we need a distribution that changes with the number of observations
    • It should widen when there are fewer samples
    • The t distribution does this job for us, as it depends on the number of samples (through the degrees of freedom)

Type of T-tests

  • One sample T-test
    • The discussion so far is for the one sample test
  • Two sample T-Test
    • Compares the means of two independent groups
    • Example: scores of students who get eight hours of sleep vs four hours of sleep
    • The question we want to answer is: is there any significant difference in their scores?
    • In the one sample test (in the numerator of the t-score) we compare the sample mean with the hypothesized population mean
    • In the two sample test we compare the means of two independently drawn samples
    • The SEM formula in the denominator is modified as well, to combine the variability of both samples
  • Paired T-Test
    • The same subjects are measured under two different conditions
    • 10 people before medication and the same 10 people after medication
      • We want to check if the medication has any effect
    • Or the same quantity measured at two different time points, as in market calculations
    • This is essentially a one sample T-test on the differences between the values under the two conditions
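The last point can be checked directly. The sketch below uses made-up before/after measurements and SciPy's `ttest_rel`, `ttest_1samp` and `ttest_ind`; the paired test and the one sample test on the differences should give identical results.

```python
from scipy import stats

# Made-up measurements for the same 10 people before and after medication
before = [142, 138, 150, 145, 160, 155, 148, 152, 147, 158]
after = [138, 135, 147, 146, 152, 150, 145, 149, 144, 151]

# Paired t-test on the two conditions
paired = stats.ttest_rel(before, after)

# One-sample t-test on the per-person differences against a mean of zero
differences = [b - a for b, a in zip(before, after)]
one_sample = stats.ttest_1samp(differences, popmean=0.0)

# Two-sample test, which (wrongly for this data) treats the groups as independent
independent = stats.ttest_ind(before, after)

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)
print(independent.statistic, independent.pvalue)
```

The two sample test ignores the pairing, so on data like this it is typically far less sensitive than the paired test.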

Q – Q Plots

For the past few days I kept coming across Q-Q plots, so I decided to learn more about them.

We often want our data to be normal, because normality is an assumption behind many statistical models. So how do we test for normality? Wikipedia has an article on this that lists many methods; one of them is the Q-Q plot.

Here is how to create a Q-Q plot manually (these steps show the theory behind it):

  1. Sort your n samples (call the result Y)
  2. Pick n points on the standard normal distribution that divide it into (n+1) equal areas (the theoretical quantiles)
    1. The standard normal distribution is a normal distribution with mean = 0 and standard deviation = 1
  3. Call these points X
  4. Plot Y against X
  5. For normally distributed data the points fall approximately on a straight line
    1. However, random variation, outliers and the number of samples all affect how straight it looks


Here is the example code and plots in Python:
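A minimal sketch of the manual steps, using only the standard library. The data are made-up draws from a normal distribution, and the i/(n+1) rule is one of several common choices for the theoretical quantiles.

```python
import random
from statistics import NormalDist

random.seed(0)  # made-up data: 50 draws from a normal distribution
samples = [random.gauss(10.0, 2.0) for _ in range(50)]
n = len(samples)

# Step 1: sort the samples (Y)
y = sorted(samples)

# Steps 2-3: n points that split the standard normal into (n+1) equal areas (X)
std_normal = NormalDist(mu=0.0, sigma=1.0)
x = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

# Step 4: the (x, y) pairs are the points of the Q-Q plot
qq_points = list(zip(x, y))
for point in qq_points[:3]:
    print(point)
```

To draw the plot, the pairs can be passed to any plotting library, for example `matplotlib.pyplot.scatter(x, y)`. SciPy also offers `scipy.stats.probplot`, which does the quantile computation and the plotting in one call (its plotting positions differ slightly from the i/(n+1) rule used here).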


Reference :

  1. YouTube video



Probability Distribution

We learned about various probability distributions in high school and engineering courses, but over time we forget them. So here I give simple practical scenarios for each distribution, with no theory involved.

Bernoulli Distribution

  • Used when the random variable has just two outcomes
  • Example: the probability that a drug/medicine will be approved by the government is p = 0.65
    • The probability that it will not be approved is 1 - p = 0.35
  • The formulas below assume the probability is known; in real life we estimate it from data:
    • Mean = p
    • Variance (sigma squared) = p*(1-p)

Binomial Distribution

  • Used when you perform the Bernoulli experiment multiple times and want to know how many times a certain outcome appears.
  • For example, you flip a coin (fair or biased) 10 times and want the probability that heads appears x times (x = 0, 1, 2, ..., 10).
  • Another, more practical example:
    • Suppose the oil price can increase by 3 bucks or decrease by 1 buck each day
    • Probability of an increase is p = 0.65, and of a decrease is 0.35
    • What price can we expect after three days?
    • Note that (Increase, Increase, Decrease) and (Increase, Decrease, Increase) give the same price.
  • From another point of view, it counts the number of successes in an experiment:
    • Number of patients responding to a treatment
    • Binary classification problem
  • The formulas below assume the probability is known; in real life we estimate it from data:
    • n = number of times the experiment is performed
    • Mean = n*p
    • Variance (sigma squared) = n*p*(1-p)
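The oil price example can be worked through in a few lines. This sketch assumes the daily moves are independent of each other, which the example implies but does not state.

```python
from math import comb

n, p = 3, 0.65          # three days, probability of an increase each day
up, down = 3.0, -1.0    # price moves from the example above

expected_change = 0.0
for k in range(n + 1):  # k = number of days with an increase
    prob = comb(n, k) * p**k * (1 - p) ** (n - k)   # binomial probability of k increases
    change = k * up + (n - k) * down                # order of the moves does not matter
    expected_change += prob * change
    print(f"{k} increases: change {change:+.0f}, probability {prob:.4f}")

mean_increases = n * p                 # Mean = n*p
var_increases = n * p * (1 - p)        # Variance = n*p*(1-p)
print(expected_change, mean_increases, var_increases)
```

The expected change after three days comes out to 4.8 bucks, which matches the shortcut 3 * (0.65 * 3 - 0.35 * 1).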

Normal Distribution

  • Very popular distribution
  • Observed very often because of the central limit theorem (CLT)
  • Examples:
    • % change in the stock price of Google from the previous day
    • Heights and weights of people
    • Exam scores
  • It is good to remember the empirical numbers for the normal distribution (share of values within a given distance of the mean):
    • 68 % – within one standard deviation
    • 95 % – within two standard deviations
    • 99.7 % – within three standard deviations
  • We use the Z score as the distance from the mean in units of standard deviation
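The empirical numbers and the Z score are easy to verify with the standard library. The exam score figures below are made up.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of values within k standard deviations of the mean
for k in (1, 2, 3):
    share = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sd: {share:.4f}")

# Z score: distance from the mean in units of standard deviation
# (made-up exam scores with mean 70 and standard deviation 10)
mu, sigma, score = 70.0, 10.0, 85.0
z = (score - mu) / sigma
print("z score:", z)
```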

Poisson Distribution

  • mean = variance = lambda (average number of events)
  • For a fixed interval or region, if we know the average number of events, it gives the probability of observing any particular number of events.
  • PMF (probability mass function) is skewed to the right, especially for small lambda
  • There is just one parameter (lambda)
    • While normal has two parameters (mu and sigma)
    • Bernoulli has just one parameter (p)
    • Binomial has two parameters (n and p)
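A small numerical check of the mean = variance = lambda property, assuming a made-up lambda of 3. The helper `poisson_pmf` is written here for illustration and is not part of any library.

```python
import math

lam = 3.0  # made-up average number of events per interval

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

ks = range(0, 60)  # the tail beyond 60 is negligible for lam = 3
probs = [poisson_pmf(k, lam) for k in ks]

mean = sum(k * pr for k, pr in zip(ks, probs))
variance = sum((k - mean) ** 2 * pr for k, pr in zip(ks, probs))
print(mean, variance)
print([round(pr, 3) for pr in probs[:8]])  # rises quickly, then a long right tail
```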


T Distribution

  • It has just one parameter called df (degrees of freedom)
  • mean = 0 (for df > 1)
  • std. deviation = sqrt(df/(df-2)) (for df > 2)
  • As df increases it moves closer and closer to the standard normal curve
  • In general it is wider than the standard bell curve.
    • The reason is that, from the formula above, the std. deviation is always greater than 1
    • For the standard bell curve std. deviation = 1
  • Area under the t distribution is 1
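To see the widening and the convergence toward the normal curve, this sketch compares the standard deviation formula with SciPy's value and prints the 97.5% cutoff for a few arbitrarily chosen df values.

```python
import math
from scipy import stats

for df in (3, 5, 30, 1000):
    formula_sd = math.sqrt(df / (df - 2))      # formula from the list above, valid for df > 2
    scipy_sd = stats.t.std(df)                 # SciPy's value for the same df
    cutoff = stats.t.ppf(0.975, df)            # value that leaves 2.5% in the upper tail
    print(df, round(formula_sd, 4), round(float(scipy_sd), 4), round(float(cutoff), 4))

# The standard normal cutoff that the t cutoffs approach as df grows
print("normal:", round(float(stats.norm.ppf(0.975)), 4))
```

Both the standard deviation and the cutoff shrink toward the standard normal values as df grows.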

Fitting the Distribution?

Fitting a distribution means that we use some distribution as the model and estimate its parameters from data. In the case of Gaussian/Normal we estimate mu and sigma; in the case of Poisson we estimate lambda.
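A minimal sketch of fitting by maximum likelihood, using made-up data. For the normal distribution the estimates are the sample mean and the population-style standard deviation; for Poisson the estimate of lambda is the sample mean.

```python
from statistics import fmean, pstdev

# Made-up heights in cm, modeled as Normal(mu, sigma)
heights = [160.0, 165.0, 170.0, 175.0, 180.0]
mu_hat = fmean(heights)        # maximum-likelihood estimate of mu
sigma_hat = pstdev(heights)    # maximum-likelihood estimate of sigma (divides by n)

# Made-up event counts per hour, modeled as Poisson(lambda)
counts = [2, 3, 1, 4, 2, 0, 3, 5, 2, 3]
lambda_hat = fmean(counts)     # maximum-likelihood estimate of lambda

print(mu_hat, sigma_hat, lambda_hat)
```

For the normal case, `scipy.stats.norm.fit(heights)` should return the same two estimates.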

What are probabilistic models?

Models that propagate the uncertainty of the inputs to the target variables are probabilistic models. Examples are:

  • Regression
  • Probability Trees
  • Monte Carlo Simulations
  • Markov chains



Interpreting Statistical Values

In the previous post we started exploring the statistical domain; today we will dive in more deeply. We will look at what each value in the output of summary(model) in R tells us.

Here is a screenshot of how this summary looks :



Significance of Residuals?

  • We want our residuals to be normally distributed and centered around zero
  • It is like throwing at a dartboard
    • If the throws miss in just one direction, there is scope for improvement
    • If they miss equally in all directions, then we can only try to reduce the standard deviation
    • Irreducible error should show up equally in all directions.
  • The residual quantiles give us a first look at symmetry
  • R also gives an estimate of the standard deviation of the residuals, known as RSE: residual standard error


What is the relationship between t value and p-value in the coefficient section?

  • With these values R tests whether a variable has any relationship with the output
    • This is a preset statistical question (null hypothesis: the coefficient is zero) and you cannot change it.
  • If the coefficient is zero then the variable is not contributing, otherwise it is.
  • So the t value is the number of standard errors by which the estimate is away from zero.
  • The larger the absolute t value, the more significant the variable.
  • All of this is related to probability and sampling.
    • Imagine repeatedly taking samples from a larger population.
    • For each sample there will be a different coefficient.
    • For some samples it can be zero as well.
  • So in the result which R displays we have a mean (Estimate) and a standard deviation (Std. Error).
  • The coefficient is a random variable centered at the mean (Estimate in R summary).
  • The mean is away from zero by t standard errors (t = Estimate / Std. Error).
  • What is the probability of observing a coefficient at least t standard errors away from zero if the true coefficient were zero?
  • This probability is the p-value, shown by R as Pr(>|t|)
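Since the post itself works in R, the following is only a Python sketch of the same arithmetic on made-up data: it computes the slope estimate, its standard error, the t value and the two-sided p-value by hand, then compares them with `scipy.stats.linregress`.

```python
import numpy as np
from scipy import stats

# Made-up data: one predictor x and an output y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.7])
n = len(x)

# Least-squares estimates of slope and intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Standard error of the slope, from the residuals
residuals = y - (intercept + slope * x)
df = n - 2                                         # n minus two estimated coefficients
rse = np.sqrt(np.sum(residuals**2) / df)           # residual standard error
slope_se = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

t_value = slope / slope_se                         # Estimate / Std. Error
p_value = 2 * stats.t.sf(abs(t_value), df)         # two-sided, like R's Pr(>|t|)

check = stats.linregress(x, y)                     # SciPy's own fit, for comparison
print(slope, slope_se, t_value, p_value)
print(check.slope, check.stderr, check.pvalue)
```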

Role of R^2

How to interpret R^2?

  • It shows what fraction of the variance of the response is explained by the model. See formulas for greater understanding.

Why use R^2 over RSE?

  • R^2 has an advantage over RSE because it is always between 0 and 1, whereas RSE is measured in the units of the response

What can be considered as good value of R^2?

  • A good value of R^2 depends on the problem setting. In physics, when we are sure that the data come from a linear model, it is close to 1. In the marketing domain, only a very small proportion of the variance can be explained by the predictors, so R^2 = 0.1 is also realistic.

Difference between absolute and adjusted R^2?

  • R^2 never decreases as the number of variables grows, but adjusted R^2 decreases if the added variable is not significant
  • The formula of adjusted R^2 contains the number of variables, so when a variable is added and the gain is not significant, the result actually decreases.
  • In the same way, RSE sometimes increases while RSS decreases in the formula below
  • Note: this RSE formula is not the formula for adjusted R^2; it is shown only to illustrate how a penalty on the number of variables can work
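Here is a sketch of these quantities on simulated data, using the standard textbook definitions R^2 = 1 - RSS/TSS, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1) and RSE = sqrt(RSS/(n - p - 1)). The data, the helper `fit_stats` and the noise variable are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
noise_var = rng.normal(size=n)             # made-up variable unrelated to y
y = 2.0 + 1.5 * x1 + rng.normal(size=n)    # made-up linear relationship plus noise

def fit_stats(predictors, y):
    """Return (R^2, adjusted R^2, RSE) for an OLS fit with an intercept."""
    n, p = len(y), len(predictors)
    X = np.column_stack([np.ones(n)] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)           # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rse = np.sqrt(rss / (n - p - 1))
    return r2, adj_r2, rse

small = fit_stats([x1], y)
large = fit_stats([x1, noise_var], y)
print("x1 only:      R^2=%.4f adj R^2=%.4f RSE=%.4f" % small)
print("x1 + noise:   R^2=%.4f adj R^2=%.4f RSE=%.4f" % large)
```

R^2 can only stay the same or go up when the noise variable is added, while adjusted R^2 and RSE are free to move in the unfavorable direction because both divide by n - p - 1.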


F Statistics

Significance of F-score?

  • The t test tells us if a single variable is significant, while the F-test tells us if a group of variables is jointly significant.
  • The F-statistic also has a p-value associated with it.
  • The null hypothesis for the F test is H0: the intercept-only model and your model fit equally well (all slope coefficients are zero).
  • While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The latter is given by the F-test.

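The next question is: why do we need the F-statistic when we have p-values for the individual coefficients?

  • It seems that when one of the coefficients is significant (has a small p-value), the overall model will also be significant.
  • However, this breaks down when the number of variables p is very large: at a 0.05 threshold, about 5% of the p-values will fall below it by chance even if no variable is related to the output.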
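A quick simulation makes the problem visible. Everything here is made up: the output and all 100 predictors are pure noise, and as a simplification each predictor is tested in its own simple regression rather than in one big multiple regression.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 200, 100

y = rng.normal(size=n)            # output that depends on nothing
X = rng.normal(size=(n, p))       # 100 made-up predictors, all pure noise

# p-value of each predictor taken on its own (simple regression of y on that column)
p_values = [stats.linregress(X[:, j], y).pvalue for j in range(p)]

n_significant = sum(pv < 0.05 for pv in p_values)
print(f"{n_significant} of {p} noise predictors have p < 0.05")
```

On average about 5 of the 100 noise predictors look significant, so one small individual p-value is not evidence that the model as a whole is useful.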

Good values of F-statistic?

  • It depends on the values of n and p
    • n = number of observations in the training set
    • p = number of independent variables
  • When n is large, an F-value a little greater than 1 can be enough to reject the null hypothesis.
  • But it is better to decide based on the corresponding p-value, which takes both n and p into account
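The F-statistic and its p-value can be sketched by hand with the standard formula F = ((TSS - RSS)/p) / (RSS/(n - p - 1)). The data are made up, and a single predictor is used so that the result can be checked against the t test of the slope.

```python
import numpy as np
from scipy import stats

# Made-up data with a single predictor (p = 1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([1.2, 2.8, 2.9, 4.5, 4.1, 6.3, 6.0, 8.4, 8.1, 9.9])
n, p = len(y), 1

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X @ beta) ** 2)        # error left after fitting your model
tss = np.sum((y - y.mean()) ** 2)        # error of the intercept-only model

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
f_p_value = stats.f.sf(f_stat, p, n - p - 1)   # upper-tail area of the F distribution

# With one predictor, F equals the squared t value of the slope
slope_test = stats.linregress(x, y)
t_value = slope_test.slope / slope_test.stderr
print(f_stat, t_value**2)
print(f_p_value, slope_test.pvalue)
```

With several predictors the two tests no longer coincide, and the F-test is the one that answers whether the variables are jointly significant.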


What are degrees of freedom?

Although not highlighted in the screenshot, the degrees of freedom is the difference between n and the number of estimated coefficients, intercept included.

Significance Score ***  in coefficient section?

R indicates how small each p-value is by showing stars next to it: the more stars, the smaller the p-value.
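The mapping follows the "Signif. codes" legend that R prints under the coefficient table (0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1). The function below is a hypothetical Python re-creation of that legend, and the handling of values exactly on a boundary is an assumption.

```python
def significance_stars(p_value):
    """Map a p-value to the marker R prints next to it."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    if p_value < 0.1:
        return "."
    return ""

for p in (0.0004, 0.003, 0.03, 0.07, 0.4):
    print(p, repr(significance_stars(p)))
```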


Thanks for reading, hope it helps.


Edit : Found the formula for adjusted R2 here :


Statistical Regression

During a recent flight I got a chance to talk to Samarth, who builds models for quantitative finance. He pointed out that behind every machine learning model there are several statistical assumptions, and you need to check whether your data meets them. If not, you need to apply certain transformations.

So I started reading the ISLR book, and here are some notes on simple linear regression.

Earlier I had taken courses on statistics but was never able to figure out how mean and standard deviation relate to linear regression. The book explains it very well: we estimate the same quantities for the model parameters. These parameters also have confidence intervals, built from the standard error, just like the confidence interval of a mean.

Linear regression can also be used to test hypotheses such as whether y depends on x. This leads to the discussion of the t-statistic and p-value. The p-value is an area under the curve of the t-distribution, so the two are related. We can find a confidence interval using either a t-score or a z-score; the t-distribution is used when the sample is small.
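As a sketch of the t-score versus z-score point, here is a 95% confidence interval for the mean of a made-up small sample, computed both ways.

```python
import math
from statistics import NormalDist, mean, stdev
from scipy import stats

# Made-up small sample
sample = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
n = len(sample)
center = mean(sample)
sem = stdev(sample) / math.sqrt(n)       # standard error of the mean

z_crit = NormalDist().inv_cdf(0.975)     # z-score for a 95% interval
t_crit = stats.t.ppf(0.975, n - 1)       # t-score for a 95% interval, df = n - 1

z_interval = (center - z_crit * sem, center + z_crit * sem)
t_interval = (center - t_crit * sem, center + t_crit * sem)
print("z interval:", z_interval)
print("t interval:", t_interval)         # wider, reflecting the small sample
```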

Then there was a discussion of the Residual Standard Error (RSE) and the R^2 statistic, which measure how well the model fits the data. The error can come from the fact that the outcome also depends on some other variables, or that a linear model is not a good fit in this case, etc. There were formulas for both of them. R^2 is independent of the scale of the outcome variable and ranges from 0 to 1. It measures how much of the variance is explained by the model.

Later on it was said that R^2 equals the square of Pearson’s correlation coefficient. So for simple linear regression R^2 does not add anything beyond the correlation, but it plays a greater role in multiple linear regression.
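This equality is easy to check numerically. The data below are made up; R^2 is computed as 1 - RSS/TSS from a least-squares line and compared with the squared correlation.

```python
import numpy as np

# Made-up data for a simple (one-predictor) regression
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 4.2, 5.8, 8.1, 9.7, 12.4])

# R^2 from the least-squares fit: 1 - RSS/TSS
slope, intercept = np.polyfit(x, y, 1)
rss = np.sum((y - (intercept + slope * x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

# Square of Pearson's correlation coefficient between x and y
pearson_r = np.corrcoef(x, y)[0, 1]
print(r_squared, pearson_r**2)
```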

There are also some lab exercises in R; I will present them soon with some practical examples.