We want a hypothesis that is bounded between zero and one; the regression hypothesis line extends beyond these limits. The hypothesis here also represents the probability of observing an outcome.
Hypothesis as given by ISLR and Andrew Ng:
Odds and log-odds/logit
In linear regression, beta1 gives the average change in y for a unit change in x. Here, however, a unit increase in x changes the log-odds by beta1. Equivalently, it multiplies the odds by exp(beta1), so the effect depends on the current value of the odds and is therefore not linear.
From the ISLR perspective it is the likelihood that we want to maximize.
Andrew Ng looks at it from the perspective of modifying the cost function of linear regression.
As we can see, Andrew Ng's cost function is the same as maximizing the log likelihood in ISLR.
Least squares in the case of linear regression is a special case of maximum likelihood. We know the derivation, where we assume the likelihood to be Gaussian.
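As a quick sanity check, here is a minimal sketch of these relationships in Python; the coefficients beta0 and beta1 are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # the logistic hypothesis: always between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -1.0, 0.8           # made-up coefficients

def log_odds(x):
    return beta0 + beta1 * x

# a unit increase in x adds beta1 to the log-odds ...
print(log_odds(3) - log_odds(2))   # = beta1

# ... which multiplies the odds by exp(beta1)
print(np.exp(log_odds(3)) / np.exp(log_odds(2)))   # = exp(beta1)

# but the change in probability depends on where you start: not linear
print(sigmoid(log_odds(2)) - sigmoid(log_odds(1)))
print(sigmoid(log_odds(6)) - sigmoid(log_odds(5)))
```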
For the last few days I kept coming across Q-Q plots and thought I should learn more about them.
Many times we want our data to be normal, because normality is an assumption behind many statistical models. So how do we test for normality? Wikipedia has an article about this which lists many methods; one of them is the Q-Q plot.
Here is how to create a Q-Q plot manually (these steps show the theory behind it):
Sort your samples (Call it Y)
Take the n points of the standard normal distribution that divide it into (n+1) equal areas
Standard normal distribution is a distribution with mean = 0 and standard deviation = 1
Call above X
Plot Y against X
For a normal distribution this will be approximately a straight line
However, probability, outliers, and the number of samples all have a role to play
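The steps above can be sketched in Python; this is an illustrative example on synthetic data, assuming numpy and scipy are available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = np.sort(rng.normal(loc=5, scale=2, size=100))  # step 1: sort the samples (Y)

# step 2: points of the standard normal dividing it into n+1 equal areas
n = len(y)
probs = np.arange(1, n + 1) / (n + 1)
x = stats.norm.ppf(probs)                          # step 3: call these X

# step 4: plot Y against X, e.g. plt.scatter(x, y)
# for normal data the points fall near a straight line; a rough check:
print(np.corrcoef(x, y)[0, 1])
```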
We learned about various probability distributions during high school and engineering courses. However, at times we forget them, so here I am providing a simple practical scenario for each distribution, with no theory involved.
When the random variable has just two outcomes
Probability that a drug/medicine will be approved by the government is p = 0.65
Probability that it will not be approved is 0.35
The formulas below work when we have the probability available; in real life we estimate it from data:
Mean = p
Variance (Sigma Square) = p*(1-p)
When you perform the Bernoulli experiment multiple times and want to see how many times a certain outcome appears.
For example, you flip a coin (fair or biased) 10 times and ask for the probability that heads appears x (1, 2, ..., 10) times.
Another more practical example :
Suppose the oil price can increase by 3 bucks or decrease by 1 buck each day
Probability of an increase is p = 0.65, and of a decrease is 0.35
What price can we expect after three days?
Note that (Increase, Increase, Decrease) and (Increase, Decrease, Increase) give the same price.
From another point of view, it counts the number of successes in an experiment:
Number of patients responding to a treatment
Binary classification problem
The formulas below work when we have the probability available; in real life we estimate it from data:
n = number of times the experiment is performed
Mean = n*p
Variance (Sigma Square) = n*p*(1-p)
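The oil-price example above can be worked out with scipy; the numbers follow the scenario described (three days, p = 0.65):

```python
from scipy import stats

n, p = 3, 0.65            # three days, probability of an increase each day
X = stats.binom(n, p)     # X = number of days the price increases

print(X.mean(), X.var())  # n*p and n*p*(1-p)

# expected price change after three days: k increases of +3, (n - k) decreases of -1
expected_change = sum(X.pmf(k) * (3 * k - (n - k)) for k in range(n + 1))
print(expected_change)    # E[3K - (n - K)] = 4*n*p - n = 4.8
```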
Very popular distribution
Observed very often because of central limit theorem (CLT)
% change in Google's stock price from the previous day
Heights and weights of persons
It is good to remember the empirical numbers for the normal distribution:
68% – within one standard deviation
95% – within two standard deviations
99.7% – within three standard deviations
We use the Z-score as a distance from the mean in units of the standard deviation
mean = variance = lambda (average no of events)
For a fixed region, if we know the average number of events, it lets us formulate the probability of any number of events.
The PMF (probability mass function) is a skewed curve
There is just one parameter (lambda)
While the normal has two parameters (mu and sigma)
Bernoulli has just one parameter (p)
Binomial has two parameters (n and p)
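A small sketch with scipy, using a made-up lambda of 4:

```python
from scipy import stats

lam = 4                   # made-up average number of events in the fixed region
X = stats.poisson(lam)

print(X.mean(), X.var())  # mean = variance = lambda

# probability of observing exactly k events
for k in range(7):
    print(k, round(X.pmf(k), 4))
```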
It has just one parameter called df (Degrees of Freedom)
mean = 0
std. deviation = sqrt(df/(df-2)) (defined for df > 2)
As df increases it moves closer and closer to the standard normal curve
In general it is wider than the bell curve.
The reason: from the formula above, its std. deviation is always greater than 1
For the standard bell curve, std. deviation = 1
Area under t distribution is 1
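We can see both facts, the wider spread and the heavier tails shrinking as df grows, with scipy:

```python
from math import sqrt
from scipy import stats

for df in (3, 10, 30, 100):
    sd = sqrt(df / (df - 2))       # always > 1, approaching 1 as df grows
    tail = stats.t.sf(2, df)       # P(T > 2): heavier tail than the normal
    print(df, round(sd, 4), round(tail, 4))

print(round(stats.norm.sf(2), 4))  # the normal tail, for comparison
```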
Fitting the Distribution?
Fitting a distribution means we are using some distribution as the model and want to estimate its parameters. In the case of the Gaussian/normal we estimate mu and sigma; in the case of the Poisson we estimate lambda.
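A minimal sketch of fitting with numpy/scipy on synthetic data; the true parameters (10, 3, and 4) are chosen just for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=3, size=1000)

# fitting a normal = estimating mu and sigma from the data
mu_hat, sigma_hat = stats.norm.fit(data)
print(mu_hat, sigma_hat)   # close to 10 and 3

# fitting a Poisson = estimating lambda, which is just the sample mean
counts = rng.poisson(lam=4, size=1000)
lam_hat = counts.mean()
print(lam_hat)             # close to 4
```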
What are probabilistic models?
Models that propagate the uncertainty of the inputs through to the target variables are probabilistic models. Examples are:
In the previous post we started exploring the statistical domain, and today we will dive in more deeply. Basically, we will try to see what all the values in summary(model) in R suggest.
Here is a screenshot of how this summary looks :
Significance of Residuals?
We want our residuals to be normally distributed and centered around zero
It is like throwing darts at a dartboard
If it is missing in just one direction there is scope for improvement
If it is missing equally in all directions then we can try to reduce the standard deviation
Irreducible error should be observed in all directions simultaneously.
The residual quantiles give us a first look at symmetry
R also gives the standard deviation of the residuals, known as RSE: residual standard error
What is the relationship between t value and p-value in the coefficient section?
With these values R is trying to test whether the variable has any relationship with the output
This is a preset statistical question (the null hypothesis) and you cannot change it.
If the coefficient is zero then the variable is not contributing; otherwise it is.
So the t value is the number of standard errors the estimate is away from zero.
The larger the t value, the more significant the variable.
Actually, all this is related to probability and sampling.
You keep taking samples from the larger population.
For each sample there will be a different coefficient.
For some samples it can be zero as well.
So in the result which R displays we have a mean and a standard deviation.
The coefficient is a random variable centered at the mean (Estimate in the R summary).
The mean is away from zero by t standard errors.
What is the probability of observing a coefficient beyond t standard errors?
This probability is given by the p-value: the probability of observing a coefficient more than t standard errors away from zero.
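The t-value/p-value relationship can be reproduced with scipy; the estimate, standard error, n, and p below are made-up numbers for illustration:

```python
from scipy import stats

# made-up regression output for illustration
estimate, std_error = 2.5, 0.8
n, p = 100, 3                      # observations and predictors

t_value = estimate / std_error     # standard errors away from zero
df = n - p - 1                     # residual degrees of freedom
p_value = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value
print(t_value, p_value)
```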
Role of R^2
How to interpret R^2?
It shows how much of the variance is explained by the model. See formulas for greater understanding.
Why use R^2 over RSE?
R^2 has an advantage over RSE because it is always between 0 and 1
What can be considered as good value of R^2?
A good value of R^2 depends on the problem setting. In physics, when we are sure the data comes from a linear model, it is close to 1, while in the marketing domain only a very small proportion of the variance can be explained by the predictors, so R^2 = 0.1 can also be realistic.
Difference between absolute and adjusted R^2?
R^2 always increases with the number of variables, but adjusted R^2 decreases if an added variable is not significant
The formula for adjusted R^2 contains the number of variables, so when a variable is added and the gain is not significant the result actually decreases.
Sometimes RSE increases while RSS decreases in the formula below
Note that RSS and RSE are not part of R^2 itself; this is just to show the possible formulas
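These quantities are easy to compute by hand; here is a sketch with made-up observed and fitted values:

```python
import numpy as np

# made-up data for illustration
y = np.array([3.0, 5.0, 7.0, 9.0, 12.0])      # observed
y_hat = np.array([3.2, 4.8, 7.1, 9.3, 11.6])  # fitted
n, p = len(y), 1                              # observations, predictors

rss = np.sum((y - y_hat) ** 2)                # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
rse = np.sqrt(rss / (n - p - 1))              # residual standard error
print(r2, adj_r2, rse)
```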
Significance of F-score?
The t-test tells us if a single variable is significant, while the F-test tells us if a group of variables is jointly significant.
F-statistics also has a p value associated to it.
The null hypothesis for the F-test is H0: the intercept-only model and your model are equal.
While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The latter is given by the F-test.
The next question is why we need the F-statistic when we have p-values for the individual coefficients.
It seems that when one of the coefficients is significant (has a good p-value), the overall model will also be significant.
However, this breaks down when the number of variables p is very large.
Good values of F-statistic?
It depends on the values of n and p
n = number of observations in the training set
p = number of independent variables
When n is large, an F-value only a little greater than 1 is enough to reject the null hypothesis.
But it is good to make the decision based on the corresponding p-value, which takes into account both n and p
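The F-statistic can be derived from R^2, n, and p; the values below are made up for illustration:

```python
from scipy import stats

# made-up values for illustration
n, p = 100, 5      # observations and predictors
r2 = 0.12          # a modest R^2, as in the marketing example above

# F-statistic in terms of R^2 (tests H0: all slope coefficients are zero)
f_value = (r2 / p) / ((1 - r2) / (n - p - 1))
p_value = stats.f.sf(f_value, p, n - p - 1)
print(f_value, p_value)
```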
What is degrees of freedom?
Although not highlighted in the screenshot, I just want to share that degrees of freedom is the difference between n and the number of non-zero coefficients, intercept included.
Significance Score *** in coefficient section?
R indicates whether a p-value is good or bad by showing stars against it.
During my recent flight I got a chance to talk to Samarth, who builds models for quantitative finance. He suggested that behind every machine learning model there are several statistical assumptions, and you need to see if your data meets those assumptions. If not, you need to make certain transformations.
So I started reading the ISLR book, and here I am writing some notes about simple linear regression.
Earlier I had taken courses on statistics but was never able to figure out how mean and standard deviation relate to linear regression. It is very well explained that we find the same for the model parameters. These parameters also have confidence intervals, which are similar to the standard deviation of the mean.
Also, linear regression can be used to test hypotheses, like whether y depends on x. This leads to the t-statistic and p-value discussion. The p-value is an area under the curve of the t-distribution, so both things are related. We can find confidence intervals using both the t-score and the z-score; the t-distribution is used when data is scarce.
Then there was a discussion about Residual Standard Error (RSE) and the R^2 statistic, which measure the inherent error in selecting the model. The reason for this error can be that the outcome also depends on some other variable, or that our linear model is not a good fit, etc. There were formulas for both of them. R^2 is independent of the scale of the outcome variable and ranges from 0 to 1. It is a measure of how much variance is reduced by the model, or how much the model explains the data.
Later on it was said that R^2 equals the square of Pearson's correlation coefficient. So for simple regression R^2 does not have a specific role, but it will play a greater role in multiple linear regression.
There are also some lab exercises in R; I will present them soon with some practical examples.
When you apply the model to real-world data, you want a number which says how confident you are, say 80%.
What if you divide data in two parts? Train, Test
In case you don't need to choose an appropriate model from several rival approaches, you can re-partition your set so that you basically have only a training set and a test set, without performing validation of your trained model. I personally partition them 70/30 then.
What if you divide the data in three parts? Train, CV, Test
We use the test data to forecast how the model would perform in the real world
We cannot do this just with the CV set, because
We were applying multiple models to the CV set
So our model would be somewhat biased toward the CV set
Here are some sample code examples in sklearn :
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)  # a model must be fit before scoring
from sklearn import linear_model, datasets
logreg = linear_model.LogisticRegression(C=1)
logreg.fit(X_train, y_train)
knn.score(X_test, y_test)  # Ans = 0.76791277258566981
logreg.score(X_test, y_test)  # Ans = 0.78348909657320875
So sometimes we get very many feature variables, say 50 for now (it can be much more in practice). And a model is always better with fewer features, unless they are necessary. Technically this is known as dimensionality reduction, and it keeps the model from over-fitting.
Here is the great blog about this. It talks about correlation and mutual information.
And I was wondering how to do this with Python, so here I am pasting some code for the Titanic data-set.
The following code produces a heatmap without the values, which is useful for a quick glance when we have more variables.
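Since the original snippet is not shown here, below is a minimal sketch of how such a heatmap can be produced with seaborn; the small inline DataFrame stands in for the Titanic trainDF, which in practice comes from pd.read_csv('train.csv'):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# stand-in for the Titanic trainDF; in practice: trainDF = pd.read_csv('train.csv')
trainDF = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Pclass':   [3, 1, 3, 2, 1],
    'Age':      [22, 38, 26, 35, 54],
    'Fare':     [7.25, 71.28, 7.92, 26.0, 51.86],
})

corr = trainDF.corr()                            # pairwise correlation matrix
sns.heatmap(corr, annot=False, cmap='coolwarm')  # annot=False hides the values
plt.show()
```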
Recently I was wondering about the question: what do you do first when you get the data?
I found this really nice article which helped me get some ideas. Then I tried to apply it to the Titanic dataset, and this and the upcoming blogs contain my learnings and thoughts about the same.
So one thing I learned was about knowing your variables. But how do you get to know them? A safe way to start is by plotting a histogram, which gives a rough idea about a variable's distribution, spread, etc. A histogram also helps you see its effect on the target variable (which in the case of our dataset is whether people survived or not). For that we plot two overlapping histograms.
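A sketch of such overlapping histograms; the tiny inline DataFrame is a stand-in for the Titanic trainDF (in practice pd.read_csv('train.csv')), with Sex mapped to integers since pyplot needs numeric input:

```python
import pandas as pd
import matplotlib.pyplot as plt

# stand-in for the Titanic trainDF; in practice: trainDF = pd.read_csv('train.csv')
trainDF = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# map the categorical variable to integers so pyplot can histogram it
sex_num = trainDF.Sex.map({'male': 0, 'female': 1})

# two overlapping histograms: one per value of the target variable
plt.hist(sex_num[trainDF.Survived == 1], alpha=0.5, label='Survived')
plt.hist(sex_num[trainDF.Survived == 0], alpha=0.5, label='Did not survive')
plt.ylabel('No of Passengers')
plt.legend()
plt.show()
```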
The next image shows the output of the above code. As we can see from the figure, although the number of female passengers is lower, their survival rate is pretty high.
Another problem I faced was that matplotlib does not understand categorical variables, and we need to assign them integers for plotting histograms. Here is an example of doing that.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(trainDF.Sex)  # learn the categories first
trainDF.Sex = le.transform(trainDF.Sex)  # converts to numeric values
trainDF.Sex = le.inverse_transform(trainDF.Sex)  # transforms back to the categorical variable
Also, pyplot's hist does not handle NaN values, but pandas' histogram method does.