We want a hypothesis that is bounded between zero and one; the linear regression hypothesis line extends beyond these limits. The hypothesis here also represents the probability of observing an outcome.

Hypothesis as given by ISLR and Andrew Ng:

Odds and log-odds/logit

In linear regression, beta1 gives the average change in y for a unit change in x. Here, however, a unit increase in x changes the log-odds by beta1. It multiplies the odds by exp(beta1), so the effect depends on the current value of the odds and is therefore not linear.
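The log-odds point above can be sketched with a couple of made-up coefficients (beta0 and beta1 here are illustrative, not from any fitted model):

```python
import math

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -1.5, 0.8

def log_odds(x):
    # Linear in x: a unit increase in x adds beta1 to the log-odds
    return beta0 + beta1 * x

def odds(x):
    return math.exp(log_odds(x))

def probability(x):
    # Sigmoid: bounded between 0 and 1, unlike the linear hypothesis
    return 1 / (1 + math.exp(-log_odds(x)))

# A unit increase in x multiplies the odds by exp(beta1), whatever the current odds
ratio_low = odds(1) / odds(0)
ratio_high = odds(5) / odds(4)
# Both ratios equal exp(beta1); the change in probability, however, is not constant
```

Note that while the odds ratio per unit of x is constant, the change in probability per unit of x is not, which is the non-linearity the text describes.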

Cost Function

From the ISLR perspective, it is the likelihood that we want to maximize.

Andrew Ng looks at it from the perspective of modifying the cost function of linear regression.

As we can see, Andrew Ng's cost function is the same as maximizing the log-likelihood in ISLR.

Least squares in linear regression is a special case of maximum likelihood. We know that derivation, where we assume the noise to be Gaussian.

The t-score is a number indicating how many SEMs the current sample mean is away from the mean given in the hypothesis.

If it is far away, that means there is a low probability of the hypothesized mean being true, and we can reject the null hypothesis.


In engineering we are used to the fact that the mean and standard deviation are given and true. We generally compute the probability of observing the sample.

But here we are testing whether a given mean is true or not, given a small number of samples.

To do this we need a distribution that changes itself based on the number of observations.

It should widen when there are fewer samples.

The t-distribution does this job for us, as it depends on the number of samples.
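The widening can be seen directly in the critical values; a quick sketch with scipy (degrees of freedom values chosen arbitrarily):

```python
from scipy.stats import t, norm

# 97.5th-percentile critical value: wider for small samples,
# approaching the normal value as the sample grows
crit_small = t.ppf(0.975, df=4)     # few observations -> wider interval
crit_large = t.ppf(0.975, df=1000)  # many observations -> close to normal
z_crit = norm.ppf(0.975)            # the familiar ~1.96
```

With only 4 degrees of freedom the critical value is well above 1.96, which is exactly the "widen when there are fewer samples" behaviour.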

Type of T-tests

One sample T-test

The discussion so far has been about the one-sample test.

Two sample T-Test

Used to compare the means of two independent groups.

For example, scores of students who get eight hours of sleep vs four hours of sleep.

The question we want to answer: is there any significant difference in their scores?

In the one-sample test, the numerator of the t-score compares the sample mean with the population mean.

In the two-sample test it compares the means of two independently drawn samples.

In the denominator as well, the SEM formula is modified.
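A minimal sketch of the sleep example with scipy (the score values below are made up for illustration):

```python
from scipy import stats

# Hypothetical exam scores for two independent groups of students
eight_hours = [78, 82, 75, 80, 77, 85, 79, 81]  # 8 hours of sleep
four_hours = [70, 68, 74, 65, 72, 69, 71, 67]   # 4 hours of sleep

# Two-sample (independent) t-test: compares the two group means
t_stat, p_value = stats.ttest_ind(eight_hours, four_hours)
# A small p-value lets us reject the null hypothesis of equal means
```

Here the group means are clearly different, so the p-value comes out very small and the difference is significant.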

Paired T-Test

The same subjects are used in two different conditions.

For example, 10 people before medication and the same 10 people after medication.

We want to check if the medication has any effect.

Measurements taken at different time points are paired in the same way.

This is essentially a one-sample t-test on the differences between the values in the two conditions.
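The equivalence to a one-sample test on the differences can be checked directly (the readings below are invented for illustration):

```python
from scipy import stats

# Hypothetical measurements for the same 10 people, before and after medication
before = [140, 135, 150, 145, 160, 138, 155, 148, 142, 152]
after = [132, 130, 148, 139, 150, 136, 149, 141, 138, 147]

# Paired t-test
paired = stats.ttest_rel(before, after)

# One-sample t-test on the per-person differences against zero
diffs = [b - a for b, a in zip(before, after)]
one_sample = stats.ttest_1samp(diffs, 0)
# The two give the same t statistic and p-value
```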

For the past few days I have been coming across Q-Q plots very often and thought to learn more about them.

Many a time we want our data to be normal, because normality is an assumption behind many statistical models. Now, how do we test normality? Wikipedia has an article about this which lists many methods; one of them is the Q-Q plot.

Here is how to create a Q-Q plot manually (these steps show the theory behind it):

Sort your samples (Call it Y)

Take the n points of the standard normal distribution that divide it into (n+1) equal areas (the theoretical quantiles)

Standard normal distribution is a distribution with mean = 0 and standard deviation = 1

Call above X

Plot Y against X

For a normal distribution it will be approximately a straight line

However, considering probability, outliers and the number of samples have a role to play
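The manual steps above can be sketched in numpy/scipy (the sample here is simulated normal data, just to show the mechanics):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(10, 2, size=100)  # pretend this is our data

# Step 1: sort the samples (Y)
y = np.sort(sample)

# Step 2: theoretical quantiles splitting the standard normal into n+1 equal areas (X)
n = len(y)
probs = np.arange(1, n + 1) / (n + 1)
x = stats.norm.ppf(probs)

# Step 3: plot y against x; for normal data the points lie near a straight line,
# which we can summarise with the correlation coefficient of the plot
r = np.corrcoef(x, y)[0, 1]
```

For genuinely normal data this correlation sits very close to 1; outliers and small sample sizes pull it down.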

We have learned various probability distributions during high school and engineering courses. However, at times we forget them, so here I am providing simple practical scenarios for each distribution, with no theory involved.

Bernoulli Distribution

When the random variable has just two outcomes

For example, the probability that a drug/medicine will be approved by the government is p = 0.65

The probability that it will not be approved is 0.35

The formulas below work when we have the probability available; in real life we estimate it from data:

Mean = p

Variance (Sigma Square) = p*(1-p)

Binomial Distribution

When you perform the Bernoulli experiment multiple times and want to see how many times a certain outcome appears.

For example, you flip a coin (fair or biased) 10 times and ask for the probability that heads appears x (x = 0, 1, …, 10) times.

Another more practical example :

Suppose the oil price can increase by 3 bucks or decrease by 1 buck each day

The probability of increasing is p = 0.65, and that of decreasing is 0.35

What price can we expect after three days?

Note that (Increase, Increase, Decrease) and (Increase, Decrease, Increase) give the same price.

From another point of view, it counts the number of successes in an experiment:

Number of patients responding to a treatment

Binary classification problems

The formulas below work when we have the probability available; in real life we estimate it from data:

n = number of times the experiment is performed

Mean = n*p

Variance (Sigma Square) = n*p*(1-p)
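The oil-price example can be worked out with the binomial PMF: with k up-days out of n = 3, the price change is 3k − (n − k), and the expectation follows directly.

```python
from scipy.stats import binom

# Each day: up by 3 with p = 0.65, down by 1 with 0.35
n, p = 3, 0.65

# Expected price change after three days: k up-days change the price by 3k - (n - k)
expected_change = sum(binom.pmf(k, n, p) * (3 * k - (n - k)) for k in range(n + 1))

# Mean and variance of the count of up-days match the formulas above
mean_up_days = n * p           # 1.95
var_up_days = n * p * (1 - p)  # 0.6825
```

The expected change works out to 4 * n * p − n = 4.8, so on average we expect the price to be about 4.8 bucks higher after three days.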

Normal Distribution

Very popular distribution

Observed very often because of central limit theorem (CLT)

Example :

% change in Google's stock price from the previous day

Heights and weights of people

Exam scores

It is good to remember the empirical numbers for the normal distribution:

68% – within one standard deviation

95% – within two standard deviations

99.7% – within three standard deviations

We use the Z-score as a distance from the mean in units of standard deviation
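The empirical numbers and the Z-score idea can be checked in one small sketch (the N(100, 15) observation below is a made-up example):

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
def within(k):
    return norm.cdf(k) - norm.cdf(-k)

one_sd = within(1)    # ~0.68
two_sd = within(2)    # ~0.95
three_sd = within(3)  # ~0.997

# Z-score: distance from the mean in units of standard deviation;
# e.g. an observation of 115 from a hypothetical N(100, 15)
z = (115 - 100) / 15
```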

Poisson Distribution

mean = variance = lambda (average no of events)

For a fixed region, if we know the average number of events, it helps formulate the probability of seeing any given number of events.

The PMF (probability mass function) is a skewed curve

There is just one parameter (lambda)

While the normal has two parameters (mu and sigma)

Bernoulli has just one parameter (p)

Binomial has two parameters (n and p)
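The single-parameter, mean-equals-variance property is easy to confirm with scipy (lambda = 4 is an arbitrary choice):

```python
from scipy.stats import poisson

lam = 4.0  # the average number of events in the fixed region
mean, var = poisson.stats(lam, moments='mv')
# For the Poisson, mean and variance both equal lambda, its only parameter
```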

T Distribution

It has just one parameter called df (Degrees of Freedom)

mean = 0

std. deviation = sqrt(df/(df-2)) (defined for df > 2)

As df increases it moves closer and closer to the standard normal curve

In general it is wider than the bell curve.

The reason: from the above formula, the std. deviation is always greater than 1

For standard bell curve std. deviation = 1

Area under t distribution is 1
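The standard-deviation formula can be checked against scipy for an arbitrary df:

```python
from math import sqrt
from scipy.stats import t

df = 10
# std. deviation = sqrt(df / (df - 2)), always greater than 1 for df > 2,
# which is why the t curve is wider than the standard bell curve
formula_std = sqrt(df / (df - 2))
scipy_std = t.std(df)
```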

Fitting the Distribution?

Fitting a distribution means we are using some distribution as the model and we want to estimate its parameters. In the case of the Gaussian/normal we estimate mu and sigma; in the case of the Poisson we estimate lambda.
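For both of these, the maximum-likelihood estimates reduce to simple sample statistics; a sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fitting a Gaussian: the MLEs of mu and sigma are the sample mean
# and the (biased) sample standard deviation
data = rng.normal(5.0, 2.0, size=10_000)
mu_hat = data.mean()
sigma_hat = data.std()

# Fitting a Poisson: the MLE of lambda is just the sample mean
counts = rng.poisson(3.0, size=10_000)
lambda_hat = counts.mean()
```

With 10,000 simulated points the estimates land very close to the true parameters (5, 2, and 3).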

What are probabilistic models?

Models that propagate the uncertainty of the input to the target variables are probabilistic models. Examples are :

In the previous post we started exploring the statistical domain, and today we will dive in more deeply. Basically, we will try to see what all the values in summary(model) in R suggest.

Here is a screenshot of how this summary looks :

Significance of Residuals?

We want our residuals to be normally distributed and centered around zero

It is like throwing darts at a dartboard

If we are missing in just one direction, there is scope for improvement

If we are missing equally in all directions, then we can try to reduce the standard deviation

Irreducible error should always be observed in all directions simultaneously.

The residual quantiles give us a first look at symmetry

R also gives the standard deviation of the residuals, known as RSE: residual standard error

What is the relationship between t value and p-value in the coefficient section?

With these values, what R is trying to test is whether the variable has any relationship with the output.

This is a preset statistical question (the null hypothesis) and you cannot change it.

If the coefficient is zero then the variable is not contributing; otherwise it is.

So the t value is the number of standard errors the estimated coefficient is away from zero.

The larger the t value, the more significant the variable.

Actually, all of this is related to probability and sampling.

You keep taking samples from the larger population.

For each sample there will be a different coefficient.

For some samples it can be zero as well.

So in the result which R displays we have a mean and a standard deviation (the standard error).

The coefficient is a random variable centered at the mean (Estimate in the R summary).

The mean is away from zero by t standard errors.

What is the probability of observing a coefficient beyond t standard errors, if the true value were zero?

This probability is given by the p-value: Probability(coefficient further than t standard errors from zero).
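The relationship between estimate, standard error, t value, and p-value can be reproduced by hand; the numbers below are hypothetical, of the kind R's summary() would report for one coefficient:

```python
from scipy.stats import t

# Hypothetical summary() values for one coefficient
estimate = 2.5   # Estimate
std_error = 0.8  # Std. Error
df = 48          # residual degrees of freedom

t_value = estimate / std_error        # how many standard errors away from zero
p_value = 2 * t.sf(abs(t_value), df)  # two-sided Pr(>|t|)
```

A t value of about 3.1 with 48 degrees of freedom gives a p-value well under 0.05, so this coefficient would earn significance stars.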

Role of R^2

How to interpret R^2?

It shows how much of the variance is explained by the model. See formulas for greater understanding.

Why use R^2 over RSE?

R^2 has an advantage over RSE because it is always between 0 and 1

What can be considered as good value of R^2?

A good value of R^2 depends on the problem setting. In physics, when we are sure the data comes from a linear model, it is close to 1. In the marketing domain, only a very small proportion of the variance can be explained by the predictors, so R^2 = 0.1 is also realistic.

Difference between absolute and adjusted R^2?

R^2 always increases with the number of variables, but adjusted R^2 decreases if the added variable is not significant

The formula for adjusted R^2 contains the number of variables, so when a variable is added and the gain is not significant, the result actually decreases.

Sometimes RSE increases while RSS decreases in the formula below

Note: RSS and RSE are not directly related to R^2; this is just to show a possible formula
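The penalty for extra variables is visible in the adjusted R^2 formula itself; a small sketch with made-up numbers:

```python
def adjusted_r2(r2, n, p):
    # n = number of observations, p = number of predictors;
    # the (n - 1) / (n - p - 1) factor penalises each extra variable
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adding a variable that barely raises R^2 can lower adjusted R^2
base = adjusted_r2(0.750, n=100, p=5)
after_adding = adjusted_r2(0.751, n=100, p=6)
```

Here R^2 goes up by 0.001 but adjusted R^2 goes down, which is exactly the "gain is not significant" case described above.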

F Statistics

Significance of F-score?

The t-test tells us if a single variable is significant, while the F-test tells us if a group of variables is jointly significant.

F-statistics also has a p value associated to it.

The null hypothesis for the F-test is H0: the intercept-only model and your model are equal.

While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The latter is given by the F-test.

The next question is: why do we need the F-statistic when we have p-values for the individual coefficients?

It seems that when one of the coefficients is significant (has a good p-value), the overall model will also be significant.

However, this breaks down when the number of variables p is very large.

Good values of F-statistic?

It depends on the values of n and p

n = number of observations in the training set

p = number of independent variables

When n is large, an F-value only a little greater than 1 is enough to reject the null hypothesis.

But it is better to take the decision based on the corresponding p-value, which takes into account both n and p.

What is degrees of freedom?

Although not highlighted in the screenshot, I just want to share that the degrees of freedom is n minus the number of non-zero coefficients, intercept included.

Significance Score *** in coefficient section?

R indicates whether a p-value is good or bad by showing stars against it.

During my recent flight I got a chance to talk to Samarth, who builds models for quantitative finance. He suggested that behind every machine learning model there are several statistical assumptions, and you need to see if your data meets those assumptions. If not, you need to make certain transformations.

So I started to read the ISLR book, and here I am writing some notes about simple linear regression.

Earlier I had taken courses on statistics but was never able to figure out how mean and standard deviation relate to linear regression. It is very well explained that we find the same things for the model parameters. These parameters also have confidence intervals, which play a role similar to the standard deviation of the mean.

Also, linear regression can be used to test hypotheses such as whether y depends on x. This leads to the t-statistic and p-value discussion. The p-value is an area under the curve of the t-distribution, so both things are related. We can find confidence intervals using both the t-score and the z-score; the t-distribution is used when data is scarce.

Then there was a discussion about the Residual Standard Error (RSE) and the R^2 statistic, which measure the inherent error in selecting the model. The reason for this error can be that the outcome also depends on some other variable, or that our linear model is not a good fit, etc. There were formulas for both of them. R^2 is independent of the scale of the outcome variable and ranges from 0 to 1. It is a measure of how much variance is reduced by the model, or how well the model explains the data.

Later on it was said that R^2 equals the square of Pearson's correlation coefficient. So for simple regression R^2 does not have a specific role, but it will play a greater role in multiple linear regression.

There are also some lab exercises in R; I will be presenting them soon with some practical examples.

When you apply the model to real-world data, you want a number which says how confident you are, say 80%.

What if you divide data in two parts? Train, Test

If you don't need to choose an appropriate model from several rival approaches, you can re-partition your set so that you have only a training set and a test set, without performing validation of your trained model. I personally partition them 70/30 then.

What if you divide the data in three parts? Train, CV, Test

We use the test data to forecast how the model would perform in the real world.

We cannot do this with CV alone, because

we were applying multiple models on CV,

so our model would be somewhat biased toward CV.

Here are some sample code examples in sklearn :

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
X_train.shape, y_train.shape
X_test.shape, y_test.shape

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

from sklearn import linear_model, datasets
logreg = linear_model.LogisticRegression(C=1)
logreg.fit(X_train, y_train)

knn.score(X_test, y_test)     # Ans = 0.76791277258566981
logreg.score(X_test, y_test)  # Ans = 0.78348909657320875

So sometimes we get many feature variables, say 50 for now (it can be many more in practice). And a model is generally better with fewer features, unless more are truly necessary. Technically this is known as dimensionality reduction, and it keeps the model from over-fitting.

Here is a great blog about this; it talks about correlation and mutual information.

I was wondering how to do this with Python, and here I am pasting some code for the Titanic data-set.

The following code produces a heatmap without values, which is useful for a quick glance when we have many variables.
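The original snippet isn't reproduced here; a minimal sketch of such a correlation heatmap, assuming a pandas DataFrame named trainDF (the tiny stand-in below replaces the real Titanic data), might look like:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Titanic trainDF (numeric columns only)
trainDF = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass':   [3, 1, 2, 3, 1, 3],
    'Age':      [22, 38, 26, 35, 54, 2],
    'Fare':     [7.25, 71.28, 7.92, 8.05, 51.86, 21.07],
})

corr = trainDF.corr()
# annot=False omits the cell values, handy for a quick glance at many variables
sns.heatmap(corr, annot=False, cmap='coolwarm')
plt.savefig('corr_heatmap.png')
```

Passing annot=True instead would print each correlation inside its cell, which is readable only for a handful of variables.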

Recently I was wondering about a question: what do you do first when you get the data?

I found a really nice article which helped me get some ideas. Then I tried to apply it to the Titanic dataset, and this and upcoming blogs contain my learnings and thoughts about the same.

So one thing I learned was about knowing your variables. But how do you get to know them? A safe way to start is plotting a histogram, which gives a rough idea about the distribution, spread, etc. A histogram also helps you see a variable's effect on the target variable (which in the case of our dataset is whether people survived or not). For that we plot two overlapping histograms.

The next image shows the output of the above code. As we can see from the figure, although the number of female passengers is smaller, their survival rate is pretty high.

Another problem I faced: matplotlib does not understand categorical variables, and we need to assign them integers for plotting histograms. Here is an example of doing it.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(trainDF.Sex.unique())
trainDF.Sex = le.transform(trainDF.Sex)          # Converts to numeric values
trainDF.Sex = le.inverse_transform(trainDF.Sex)  # Converts back to the categorical variable

Also, pyplot's hist does not handle NaN values, but pandas' histogram method does.

from matplotlib import pyplot

trainDF.Age.hist(label='Total')
trainDF[trainDF['Survived']==1].Age.hist(label='Survived')
pyplot.legend(loc='upper center')
pyplot.xlabel('Age')
pyplot.ylabel('No of Passengers')