Cost Function and Hypothesis for Logistic Regression (LR)


We want a hypothesis that is bounded between zero and one; the hypothesis line of linear regression extends beyond these limits. Here the hypothesis also represents the probability of observing an outcome.


Hypothesis by ISLR and Andrew Ng:

Odds and log-odds/logit

In linear regression, beta1 gives the average change in y for a unit change in x. Here, a unit increase in x changes the log-odds by beta1. Equivalently, it multiplies the odds by exp(beta1), so the resulting change in probability depends on the current value of the odds and is therefore not linear.
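A minimal sketch of this point, using only the standard library. The coefficients beta0 and beta1 are illustrative assumptions, not fitted to any data. For every unit step in x the odds ratio stays at exp(beta1), while the change in probability differs from step to step.

```python
import math

def probability(x, beta0, beta1):
    """Logistic hypothesis: p(x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    """Odds of an event that has probability p."""
    return p / (1.0 - p)

# Illustrative coefficients (assumed, not fitted to any data).
beta0, beta1 = -1.0, 0.5

for x in (0.0, 1.0, 2.0):
    p_now = probability(x, beta0, beta1)
    p_next = probability(x + 1.0, beta0, beta1)
    odds_ratio = odds(p_next) / odds(p_now)
    print(f"x={x}: p={p_now:.3f}, change in p={p_next - p_now:.3f}, "
          f"odds ratio={odds_ratio:.3f}")

print(f"exp(beta1) = {math.exp(beta1):.3f}")
```

The odds ratio printed on every line matches exp(beta1), whereas the "change in p" column varies with x.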

Cost Function

From the ISLR perspective, it is the likelihood that we want to maximize.


Andrew Ng looks at it from the perspective of modifying the cost function of linear regression.


As we can see, minimizing Andrew Ng's cost function is the same as maximizing the log-likelihood of ISLR.

Least squares in linear regression is a special case of maximum likelihood: it is what the derivation gives when we assume the errors to be Gaussian.
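For reference, the two formulations can be written in their standard textbook forms as below. The notation is an assumption on my part: a single predictor x for ISLR, and m training examples with parameter vector theta for Andrew Ng.

```latex
% ISLR: logistic hypothesis and log-odds (single predictor x)
p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x

% ISLR: likelihood to maximize
\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)

% Andrew Ng: hypothesis and cost to minimize over m training examples
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}},
\qquad
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)})
          + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]

% With h_\theta(x) = p(x), taking the log of the likelihood gives
\log \ell = \sum_{i=1}^{m} \Bigl[ y_i \log p(x_i) + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr]
          = -m \, J(\theta)
```

Since the two differ only by the factor -m, the parameters that maximize the log-likelihood are the ones that minimize the cost.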



Hypothesis and T-Distribution

  • We calculate the t-score from the sample data and the hypothesized mean
  • We also get the degrees of freedom from the sample data
  • We supply these values to a function which gives us the probability of observing a t-score at least this extreme if the hypothesis is true (the p-value).

Formula to calculate t for a one-sample t-test: t = (sample mean - hypothesized mean) / (s / sqrt(n))

  • We can see that the t-score is a ratio, something like a signal-to-noise ratio.
  • The numerator centers it around zero
  • The denominator is the standard error of the mean (SEM) = s/sqrt(n)
    • Where s is the standard deviation of the samples and n is the number of samples
    • This is the denominator in the above equation
    • Derivation can be found here
  • The t-score is a number indicating how many SEMs the sample mean is away from the mean given in the hypothesis
  • If it is far away, there is a low probability of observing such a sample when the hypothesized mean is true, and we can reject the null hypothesis
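The ratio described above can be sketched with the standard library alone. The sample values and the hypothesized mean of 5.0 are made up for illustration, and the helper name one_sample_t is hypothetical.

```python
import math
import statistics

def one_sample_t(samples, hypothesized_mean):
    """Return (t-score, degrees of freedom) for a one-sample t-test."""
    n = len(samples)
    sample_mean = statistics.mean(samples)
    s = statistics.stdev(samples)   # sample standard deviation (n - 1 denominator)
    sem = s / math.sqrt(n)          # standard error of the mean
    t_score = (sample_mean - hypothesized_mean) / sem
    return t_score, n - 1

# Made-up measurements; the null hypothesis is that the true mean is 5.0.
samples = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5]
t_score, df = one_sample_t(samples, hypothesized_mean=5.0)
print(f"t = {t_score:.3f} with {df} degrees of freedom")
```

Turning the t-score and degrees of freedom into a p-value needs the CDF of the t-distribution, which the standard library does not provide; scipy.stats.t.sf is one option.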


  • In engineering we are used to the mean and standard deviation being given and true; we generally compute the probability of observing a sample
  • But here we are testing whether the given mean is true or not, given a small number of samples
  • To do this we need a distribution that changes shape based on the number of observations
    • It should widen when there are fewer samples
    • The t-distribution does this job for us, as it depends on the number of samples

Types of T-tests

  • One-sample t-test
    • The discussion so far is for the one-sample test
  • Two-sample t-test
    • Compares the means of two independent groups
    • Example: scores of students who get eight hours of sleep vs. four hours of sleep
    • The question we want to answer is: is there any significant difference in their scores?
    • In the one-sample test (in the numerator of the t-score) we compare the sample mean with the hypothesized population mean
    • In the two-sample test we compare the means of two independently drawn samples
    • The SEM formula in the denominator is modified as well
  • Paired t-test
    • The same subjects are measured under two different conditions
    • Example: 10 people before medication and the same 10 people after medication
      • We want to check whether the medication has any effect
    • Another example: the same quantity measured at two different time points, as in market calculations
    • This is essentially a one-sample t-test on the differences between the values under the two conditions
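The last point can be sketched as follows: a paired t-test computed as a one-sample t-test on the differences, with a hypothesized mean difference of zero. The before/after readings are made-up numbers for 10 hypothetical people.

```python
import math
import statistics

# Made-up readings for the same 10 people before and after medication.
before = [140, 152, 138, 145, 160, 155, 149, 151, 147, 158]
after = [135, 150, 136, 140, 152, 151, 148, 146, 145, 150]

# Reduce the paired data to one sample of differences.
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = statistics.mean(differences)
sem = statistics.stdev(differences) / math.sqrt(n)

# Null hypothesis: the medication has no effect, so the mean difference is 0.
t_score = (mean_diff - 0.0) / sem
df = n - 1
print(f"mean difference = {mean_diff:.2f}, t = {t_score:.3f}, df = {df}")
```

A two-sample test on the same numbers would ignore the pairing and treat the person-to-person variation as noise, which is why the paired form is preferred when the same subjects are measured twice.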

Q – Q Plots

For the past few days I have been coming across Q-Q plots very often, so I thought I would learn more about them.

Many a time we want our data to be normal, because normality is an assumption behind many statistical models. So how do we test for normality? Wikipedia has an article about this which lists many methods; one of them is the Q-Q plot.

Here is how to create a Q-Q plot manually (these steps show the theory behind it):

  1. Sort your samples (call them Y)
  2. Take n points from the standard normal distribution that divide it into (n+1) equal areas
    1. The standard normal distribution is a distribution with mean = 0 and standard deviation = 1
  3. Call these points X
  4. Plot Y against X
  5. For normally distributed data the plot will be approximately a straight line
    1. However, because the data are random, outliers and the number of samples affect how straight the line looks


Here is the example code and plots in Python:
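A minimal sketch of the manual steps, using only the standard library. The data are randomly generated for illustration, and the helper name qq_points is hypothetical. The theoretical quantiles are taken at probabilities i/(n+1), which divides the standard normal curve into (n+1) equal areas.

```python
import random
from statistics import NormalDist

def qq_points(samples):
    """Return (theoretical quantiles X, sorted samples Y) for a normal Q-Q plot."""
    n = len(samples)
    y = sorted(samples)                                  # step 1
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    # Step 2: n points that cut the standard normal into (n + 1) equal areas.
    x = [standard_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return x, y

random.seed(0)
samples = [random.gauss(10.0, 2.0) for _ in range(200)]  # made-up normal data
x, y = qq_points(samples)

# Print a few (X, Y) pairs; for normal data they fall close to a straight line.
for xi, yi in list(zip(x, y))[::40]:
    print(f"theoretical quantile {xi:+.3f} -> sample value {yi:.3f}")
```

To see the plot itself, the pairs can be passed to any plotting library, for example matplotlib's plt.scatter(x, y).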


Reference :

  1. YouTube video




Probability Distribution

We learn various probability distributions during high school and engineering courses, but at times we forget them. So here I am providing simple practical scenarios for each distribution, with no theory involved.

Bernoulli Distribution

  • Used when the random variable has just two outcomes
  • Example: the probability that a drug/medicine will be approved by the government is p = 0.65
    • The probability that it will not be approved is 0.35
  • The formulas below work when the probability is available; in real life we estimate it from data:
    • Mean = p
    • Variance (sigma squared) = p*(1-p)

Binomial Distribution

  • Used when you perform the Bernoulli experiment multiple times and want to see how many times a certain outcome appears.
  • For example, you flip a coin (fair or biased) 10 times and want the probability that heads appears x (0, 1, 2, ..., 10) times.
  • Another, more practical example:
    • Suppose the oil price can increase by 3 bucks or decrease by 1 buck each day
    • Probability of an increase p = 0.65, and of a decrease = 0.35
    • What price can we expect after three days?
    • Note that (Increase, Increase, Decrease) and (Increase, Decrease, Increase) give the same price.
  • From another point of view, it counts the number of successes in an experiment:
    • Number of patients responding to a treatment
    • Binary classification problem
  • The formulas below work when the probability is available; in real life we estimate it from data:
    • n = number of times the experiment is performed
    • Mean = n*p
    • Variance (sigma squared) = n*p*(1-p)
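The oil price example can be worked through with a short sketch. It uses the numbers from the example (up 3 with p = 0.65, down 1 with p = 0.35, three days); the variable names are my own. Because the order of the moves does not matter, only the number of increases is counted, which is exactly what the binomial distribution describes.

```python
from math import comb

p_up, p_down = 0.65, 0.35
up_move, down_move = 3.0, -1.0
days = 3

expected_change = 0.0
expected_ups = 0.0
for ups in range(days + 1):
    downs = days - ups
    # Binomial probability of exactly `ups` increases in `days` days.
    probability = comb(days, ups) * p_up**ups * p_down**downs
    change = ups * up_move + downs * down_move
    expected_change += probability * change
    expected_ups += probability * ups
    print(f"{ups} increases: change {change:+.0f}, probability {probability:.4f}")

print(f"expected number of increases = {expected_ups:.2f} (n*p = {days * p_up:.2f})")
print(f"expected price change after {days} days = {expected_change:+.2f}")
```

The expected number of increases agrees with the Mean = n*p formula, and the expected price change follows from it.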

Normal Distribution

  • A very popular distribution
  • Observed very often because of the central limit theorem (CLT)
  • Examples:
    • % change in Google's stock price from the previous day
    • Heights and weights of people
    • Exam scores
  • It is good to remember the empirical numbers for the normal distribution:
    • 68 % of values lie within one standard deviation of the mean
    • 95 % within two standard deviations
    • 99.7 % within three standard deviations
  • We use the Z-score as the distance from the mean in units of standard deviation
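The empirical numbers and the Z-score can be checked with the standard library's NormalDist. The exam-score figures in the Z-score example (mean 100, standard deviation 15) are assumed for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def z_score(value, mean, std_dev):
    """Distance of value from the mean, in units of standard deviation."""
    return (value - mean) / std_dev

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.4f}")

# Assumed example: a score of 130 when scores have mean 100 and std. deviation 15.
print(f"z-score of 130: {z_score(130, 100, 15):.1f}")
```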

Poisson Distribution

  • mean = variance = lambda (average number of events)
  • For a fixed region (or time interval), if we know the average number of events, it gives the probability of observing any particular number of events.
  • Its PMF (probability mass function) is a skewed curve
  • There is just one parameter (lambda)
    • While the normal has two parameters (mu and sigma)
    • Bernoulli has just one parameter (p)
    • Binomial has two parameters (n and p)
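A small sketch of the Poisson probabilities, checking numerically that mean and variance both equal lambda. The value lambda = 3 is an assumed example, and the sum is truncated at 60 events, beyond which the probabilities are negligible for this lambda.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0          # assumed average number of events per region
ks = range(0, 60)  # truncation point; the tail beyond it is negligible here

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

for k in range(0, 8):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```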


T Distribution

  • It has just one parameter, called df (degrees of freedom)
  • mean = 0 (for df > 1)
  • std. deviation = sqrt(df/(df-2)) (for df > 2)
  • As df increases it moves closer and closer to the standard normal curve
  • In general it is wider than the bell curve.
    • The reason is that, from the above formula, the std. deviation is always greater than 1
    • For the standard bell curve std. deviation = 1
  • The area under the t-distribution is 1
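The standard deviation formula above can be tabulated to see how quickly it approaches 1, the value for the standard normal curve. The helper name t_std_dev is hypothetical.

```python
import math

def t_std_dev(df):
    """Standard deviation of the t-distribution; only finite for df > 2."""
    if df <= 2:
        raise ValueError("standard deviation is only finite for df > 2")
    return math.sqrt(df / (df - 2))

for df in (3, 5, 10, 30, 100, 1000):
    print(f"df = {df:>4}: std. deviation = {t_std_dev(df):.4f}")
```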

Fitting the Distribution?

Fitting the distribution means that we use some distribution as the model and want to estimate its parameters. In the case of the Gaussian/normal we estimate mu and sigma; in the case of the Poisson we estimate lambda.
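A minimal sketch of both cases with the standard library. The two data sets are made up. For the normal, mu and sigma are estimated by the sample mean and sample standard deviation; for the Poisson, lambda is estimated by the sample mean of the counts.

```python
import statistics
from statistics import NormalDist

# Made-up heights in cm, modelled as normal: estimate mu and sigma.
heights_cm = [172.0, 168.5, 181.2, 175.3, 169.9, 178.4, 174.1, 171.6]
fitted_normal = NormalDist.from_samples(heights_cm)
print(f"mu = {fitted_normal.mean:.2f}, sigma = {fitted_normal.stdev:.2f}")

# Made-up event counts per hour, modelled as Poisson: estimate lambda.
calls_per_hour = [2, 4, 3, 5, 1, 3, 4, 2, 3, 3]
fitted_lambda = statistics.fmean(calls_per_hour)
print(f"lambda = {fitted_lambda:.2f}")
```

Once fitted, the model can answer questions about unseen data, for example fitted_normal.cdf(180.0) for the probability of a height below 180 cm.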

What are probabilistic models?

Models that propagate the uncertainty of the inputs to the target variables are probabilistic models. Examples are:

  • Regression
  • Probability Trees
  • Monte Carlo Simulations
  • Markov chains




Interpreting Statistical Values

In the previous post we started exploring the statistical domain, and today we will dive in more deeply. Basically, we will try to see what all the values in summary(model) in R mean.

Here is a screenshot of how this summary looks :



Significance of Residuals?

  • We want our residuals to be normally distributed and centered around zero
  • It is like throwing darts at a dartboard
    • If the throws miss in just one direction, there is scope for improvement
    • If they miss equally in all directions, then we can try to reduce the standard deviation
    • Irreducible error should show up equally in all directions.
  • The residual quantiles give us a first look at symmetry
  • R also gives the standard deviation of the residuals, known as RSE: residual standard error


What is the relationship between the t value and the p-value in the coefficient section?

  • With these values, R is testing whether the variable has any relationship with the output
    • This is a preset statistical question (the null hypothesis: the coefficient is zero) and you cannot change it.
  • If the coefficient is zero then the variable is not contributing; otherwise it is.
  • So the t value is the number of standard errors the estimated coefficient is away from zero.
  • The larger the absolute t value, the more significant the variable.
  • All of this is related to probability and sampling.
    • You keep taking samples from a larger population.
    • For each sample there will be a different coefficient.
    • For some samples it can be zero as well.
  • So in the result that R displays we have a mean (Estimate) and a standard deviation (Std. Error).
  • The coefficient is a random variable centered at the mean (Estimate in the R summary).
  • The mean is away from zero by t standard deviations (t = Estimate / Std. Error).
  • What is the probability of observing a coefficient at least t standard deviations away from zero if the true coefficient were zero?
  • This probability is the p-value (Pr(>|t|) in the R summary)
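The sampling picture above can be simulated directly. This is only a sketch under assumed numbers (true slope 0.5, noise standard deviation 2, 30 points per sample, 2000 repeated samples); note that R does not resample, it estimates the standard error from the single sample it is given.

```python
import random
import statistics

def fit_slope(xs, ys):
    """Least-squares slope of a simple linear regression of ys on xs."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

random.seed(42)
true_slope, noise_sd, n = 0.5, 2.0, 30   # assumed population settings

# Keep taking samples from the population; each gives a different coefficient.
slopes = []
for _ in range(2000):
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.0 + true_slope * x + random.gauss(0.0, noise_sd) for x in xs]
    slopes.append(fit_slope(xs, ys))

estimate = statistics.fmean(slopes)    # centre of the coefficient's distribution
std_error = statistics.stdev(slopes)   # its spread
t_value = estimate / std_error         # distance from zero in standard errors
print(f"estimate = {estimate:.3f}, std. error = {std_error:.3f}, t = {t_value:.2f}")
```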

Role of R^2

How to interpret R^2?

  • It shows what proportion of the variance in the response is explained by the model. See the formulas for a better understanding.

Why use R^2 over RSE?

  • R^2 has an advantage over RSE because it is always between 0 and 1, whereas RSE is measured in the units of the response

What can be considered as good value of R^2?

  • A good value of R^2 depends on the problem setting. In physics, when we are sure that the data come from a linear model, it is close to 1. In the marketing domain, only a very small proportion of the variance can be explained by the predictors, so R^2 = 0.1 is also realistic.

Difference between absolute and adjusted R^2?

  • R^2 never decreases as variables are added, but adjusted R^2 decreases if the added variable is not significant
  • The formula for adjusted R^2 includes the number of variables, so when a variable is added and the gain is not significant, the result actually decreases.
  • Similarly, RSE sometimes increases while RSS decreases in the formula below
  • Note: the RSE formula is shown only to illustrate how the number of variables can enter a formula; it is not the formula for adjusted R^2
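The behaviour described above can be reproduced with the standard definitions: R^2 = 1 - RSS/TSS, adjusted R^2 = 1 - (RSS/(n-p-1)) / (TSS/(n-1)), and RSE = sqrt(RSS/(n-p-1)). The numbers are made up: adding a third variable lowers RSS only slightly, from 100 to 99.9.

```python
import math

def r_squared(rss, tss):
    return 1.0 - rss / tss

def adjusted_r_squared(rss, tss, n, p):
    # Penalizes the number of variables p through the degrees of freedom.
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))

def residual_standard_error(rss, n, p):
    return math.sqrt(rss / (n - p - 1))

n, tss = 20, 400.0                  # made-up sample size and total sum of squares
before = {"p": 2, "rss": 100.0}     # model with 2 variables
after = {"p": 3, "rss": 99.9}       # an insignificant third variable is added

for label, m in (("2 variables", before), ("3 variables", after)):
    print(f"{label}: R^2 = {r_squared(m['rss'], tss):.4f}, "
          f"adjusted R^2 = {adjusted_r_squared(m['rss'], tss, n, m['p']):.4f}, "
          f"RSE = {residual_standard_error(m['rss'], n, m['p']):.4f}")
```

R^2 goes up slightly, while adjusted R^2 goes down and RSE goes up, even though RSS decreased.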


F Statistics

Significance of F-score?

  • The t-test tells us whether a single variable is significant, while the F-test tells us whether a group of variables is jointly significant.
  • The F-statistic also has a p-value associated with it.
  • The null hypothesis for the F-test is H0: the intercept-only model and your model fit equally well (all slope coefficients are zero).
  • While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The latter is given by the F-test.

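The next question is: why do we need the F-statistic when we have p-values for the individual coefficients?

  • It seems that when one of the coefficients is significant (has a small p-value), the overall model will also be significant.
  • However, this breaks down when the number of variables p is very large: with a 0.05 threshold, about 5% of the individual p-values will fall below it by chance even if no variable is related to the response.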

Good values of F-statistic?

  • It depends on the values of n and p
    • n = number of observations in the training set
    • p = number of independent variables
  • When n is large, an F-value only a little greater than 1 is enough to reject the null hypothesis.
  • But it is better to decide based on the corresponding p-value, which takes both n and p into account
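For reference, the F-statistic for the overall model uses the standard formula F = ((TSS - RSS)/p) / (RSS/(n - p - 1)). The sketch below applies it to made-up sums of squares.

```python
def f_statistic(rss, tss, n, p):
    """F-statistic for H0: all p slope coefficients are zero."""
    explained_per_variable = (tss - rss) / p
    unexplained_per_df = rss / (n - p - 1)
    return explained_per_variable / unexplained_per_df

# Made-up values: 20 observations, 2 variables.
f_value = f_statistic(rss=100.0, tss=400.0, n=20, p=2)
print(f"F = {f_value:.2f} on 2 and 17 degrees of freedom")
```

Under the null hypothesis the statistic follows an F-distribution with p and n - p - 1 degrees of freedom, which is how n and p enter the p-value.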


What are degrees of freedom?

Although not highlighted in the screenshot, I just want to share that the residual degrees of freedom are the difference between n and the number of estimated coefficients, intercept included (that is, n - p - 1).

Significance Stars *** in the coefficient section?

R indicates how small the p-value is by showing stars next to it; more stars mean a smaller p-value.


Thanks for reading; hope it helps.


Edit : Found the formula for adjusted R2 here :



Statistical Regression

During a recent flight I got a chance to talk to Samarth, who builds models for quantitative finance. He pointed out that behind every machine learning model there are several statistical assumptions, and you need to check whether your data meets them. If not, you need to apply certain transformations.

So I started reading the ISLR book, and here I am writing some notes about simple linear regression.

Earlier I had taken courses on statistics but was never able to figure out how mean and standard deviation relate to linear regression. The book explains it very well: we find the same quantities for the model parameters. These parameters also have confidence intervals, which are built from the standard error just as for a mean.

Linear regression can also be used to test hypotheses, such as whether y depends on x. This leads to the discussion of the t-statistic and the p-value. The p-value is an area under the curve of the t-distribution, so the two are related. We can find a confidence interval using either the t-score or the z-score; the t-distribution is used when the sample is small.

Then there was a discussion about the Residual Standard Error (RSE) and the R^2 statistic, which measure the error that remains after fitting the model. The reason for this error can be that the outcome also depends on some other variable, or that a linear model is not a good fit in this case, etc. There were formulas for both of them. R^2 is independent of the scale of the outcome variable and ranges from 0 to 1. It is a measure of how much of the variance is explained by the model.

Later on it was said that R^2 equals the square of Pearson's correlation coefficient. So for simple linear regression R^2 does not have a specific role of its own, but it plays a greater role in multiple linear regression.

There are also some lab exercises in R; I will present them soon with some practical examples.
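The equality between R^2 and the squared Pearson correlation in simple linear regression can be checked numerically. The x and y values are made up; the fit uses the usual least-squares formulas.

```python
import math
import statistics

# Made-up data with a roughly linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares fit of y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
tss = syy
r_squared = 1.0 - rss / tss

pearson_r = sxy / math.sqrt(sxx * syy)
print(f"R^2 = {r_squared:.6f}, r^2 = {pearson_r ** 2:.6f}")
```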




Learning Curves

Andrew Ng talks about plotting learning curves to check whether your model is suffering from a high-bias or a high-variance problem.

So I tried to plot these curves. One thing I noticed was that they were very noisy, so I applied a moving average to them.

My data has around 1450 examples, and I am checking in steps of 20 data points.

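import pandas
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test and y_test are assumed to be defined already
testError = []
trainError = []
x = []
for i in range(10, 1450, 20):
    # Train on the first i examples only
    X_trainSubSet = X_train[:i]
    Y_trainSubSet = y_train[:i]

    logistic = LogisticRegression(C=300)
    logistic.fit(X_trainSubSet, Y_trainSubSet)
    # score() returns accuracy, so error = 1 - accuracy
    testError.append(1 - logistic.score(X_test, y_test))
    trainError.append(1 - logistic.score(X_trainSubSet, Y_trainSubSet))
    x.append(i)

# Smooth the noisy curves with a moving average
testErrorSeries = pandas.Series(testError)
testErrorMV = testErrorSeries.rolling(window=20, center=False).mean()
trainErrorSeries = pandas.Series(trainError)
trainErrorMV = trainErrorSeries.rolling(window=20, center=False).mean()
plt.plot(x, testErrorMV, color='r', label="Test Error")
plt.plot(x, trainErrorMV, color='g', label="Train Error")
plt.legend()
plt.show()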



Here is the output of the curve (we have used all 50 features here):



Now I am using just 20 features. I am selecting these features using sklearn's univariate feature selection.


import pandas
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Select the 20 best features once, before the loop
transformer = SelectKBest(f_classif, k=20).fit(X_train, y_train)
X_train_New = pandas.DataFrame(transformer.transform(X_train))
X_test_New = pandas.DataFrame(transformer.transform(X_test))

testError = []
trainError = []
x = []
for i in range(10, 1450, 20):
    # Train on the first i examples only
    X_trainSubSet = X_train_New[:i]
    Y_trainSubSet = y_train[:i]

    logistic = LogisticRegression(C=300)
    logistic.fit(X_trainSubSet, Y_trainSubSet)
    # score() returns accuracy, so error = 1 - accuracy
    testError.append(1 - logistic.score(X_test_New, y_test))
    trainError.append(1 - logistic.score(X_trainSubSet, Y_trainSubSet))
    x.append(i)

# Smooth the noisy curves with a moving average
testErrorSeries = pandas.Series(testError)
testErrorMV = testErrorSeries.rolling(window=20, center=False).mean()
trainErrorSeries = pandas.Series(trainError)
trainErrorMV = trainErrorSeries.rolling(window=20, center=False).mean()
plt.plot(x, testErrorMV, color='r', label="Test Error")
plt.plot(x, trainErrorMV, color='g', label="Train Error")
plt.legend()
plt.show()


Here are the plots for 20 features and then for 35 features.

With 20 Features
With 35 Features


So the conclusion is that these curves give some indication of a bias/variance problem.

What is still not clear is how to fix it, which I will describe in an upcoming post.


Edit :

  1. Just realized that sklearn has a built-in function for learning curves (learning_curve in sklearn.model_selection). Click here for the example.

Cross-validation in sklearn

When should we split the data into three parts (namely train, CV, test) or into just two parts (train and test)?

Ref : http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

When you apply the model to real-world data, you want a number that says how confident you are in it, say 80%.

What if you divide the data into two parts? Train, Test
  • If you don't need to choose an appropriate model from several rival approaches, you can just re-partition your set so that you basically have only a training set and a test set, without performing validation of your trained model. I personally partition them 70/30 then.
What if you divide the data into three parts? Train, CV, Test
  • We use the test data to forecast how the model would perform in the real world
  • We cannot do it with the CV set alone, because
    • We applied multiple models to the CV set and picked the best one
    • So our model would be somewhat biased toward the CV set
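A minimal sketch of a three-way split in plain Python. The 60/20/20 proportions, the seed and the helper name train_cv_test_split are assumptions for illustration; the rows are shuffled once and then cut into three non-overlapping parts.

```python
import random

def train_cv_test_split(rows, cv_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into train, CV and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    n_test = int(len(rows) * test_fraction)
    n_cv = int(len(rows) * cv_fraction)

    test_idx = indices[:n_test]
    cv_idx = indices[n_test:n_test + n_cv]
    train_idx = indices[n_test + n_cv:]

    def pick(idx):
        return [rows[i] for i in idx]

    return pick(train_idx), pick(cv_idx), pick(test_idx)

rows = list(range(100))   # stand-in for 100 data rows
train, cv, test = train_cv_test_split(rows)
print(len(train), len(cv), len(test))
```

The models are fitted on train and compared on cv; test is touched only once, at the end, to get the number we report for the chosen model.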

Here are some sample code examples in sklearn :

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model

# X and Y are assumed to hold the features and the labels
# Hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
logreg = linear_model.LogisticRegression(C=1)
logreg.fit(X_train, y_train)

print(knn.score(X_test, y_test))     # Ans = 0.76791277258566981
print(logreg.score(X_test, y_test))  # Ans = 0.78348909657320875

Dimensionality Reduction – Correlation

Sometimes we get a great many feature variables, say 50 for now (it can be much more in practice). A model is generally better with fewer features, unless the extra ones are necessary. Reducing the number of features is technically known as dimensionality reduction, and it helps keep the model from over-fitting.

Here is a great blog about this. It talks about correlation and mutual information.

I was wondering how to do this in Python, so here I am pasting some code for the Titanic data-set.

The following code produces a heatmap without values and is useful for getting a quick glance when we have many variables.
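As a small illustration of the correlation idea, the sketch below computes Pearson correlations between every pair of features and flags pairs that are almost perfectly correlated, so that one of them could be dropped. The feature values, the feature names and the 0.9 threshold are all made-up assumptions.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

# Made-up features: the first two carry the same information in different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_color_code": [3, 1, 2, 1, 3],
}

threshold = 0.9   # assumed cut-off for "redundant"
redundant_pairs = []
for name_a, name_b in combinations(features, 2):
    r = pearson(features[name_a], features[name_b])
    print(f"corr({name_a}, {name_b}) = {r:+.3f}")
    if abs(r) > threshold:
        redundant_pairs.append((name_a, name_b, r))

print("candidates to drop one of:", [(a, b) for a, b, _ in redundant_pairs])
```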


import seaborn as sns
sampleDF = trainDF.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
cm = sampleDF.corr()  # correlation matrix of the selected columns
sns.heatmap(cm, square=True)



The following code gives a heatmap along with the values.

From the first plot it may seem that there is a significant correlation between Survived and Age. But when we look at the number, it comes out to be just 0.54.

sampleDF = trainDF.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
sns.heatmap(sampleDF.corr(), annot=True, square=True)  # annot=True prints the values in the cells


However, the first plot also has its own use when we have 50-odd variables; here is one such plot.




Exploratory Analysis

Recently I was wondering about a question: what do you do first when you get the data?

I found a really nice article which helped me get some ideas. Then I tried to apply it to the Titanic dataset; this and upcoming posts contain my learnings and thoughts about it.

One thing I learned was about knowing your variables. But how do you get to know them? A safe way to start is plotting a histogram, which gives a rough idea about a variable's distribution, spread, etc. A histogram also helps you see its effect on the target variable (which in the case of our dataset is whether people survived or not). For that we plot two overlapping histograms.

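from matplotlib import pyplot

# Histogram of all passengers, overlaid with the survivors only
pyplot.hist(trainDF.Sex, label='Total')
pyplot.hist(trainDF[trainDF['Survived']==1].Sex, label='Survived')
pyplot.legend(loc='upper center')
pyplot.xlabel('Sex 0=Female 1=Male')
pyplot.ylabel('No of Passengers')
pyplot.show()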

The next image shows the output of the above code. As we can see from the figure below, although the number of female passengers is lower, their survival rate is pretty high.



Another problem I faced was that matplotlib does not understand categorical variables, so we need to map them to integers before plotting histograms. Here is an example of doing that.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
trainDF.Sex = le.fit_transform(trainDF.Sex)  # Learns the categories and converts them to numeric values
trainDF.Sex = le.inverse_transform(trainDF.Sex)  # Converts back to the categorical variable


Also, pyplot's hist does not handle NaN values, but pandas' histogram method does.

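trainDF.Age.hist(label='Total')  # Age is an assumed example of a column that contains NaN values
trainDF[trainDF['Survived']==1].Age.hist(label='Survived')
pyplot.legend(loc='upper center')
pyplot.ylabel('No of Passengers')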