Panel Data And Analysis

From Wikipedia

In statistics and econometrics, panel data or longitudinal data[1][2] are multi-dimensional data involving measurements over time. Panel data contain observations of multiple phenomena obtained over multiple time periods for the same firms or individuals.

Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter).

A study that uses panel data is called a longitudinal study or panel study.

 

Cross-sectional data:

Cross-sectional data, or a cross section of a study population, in statistics and econometrics is a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.

Analysis of cross-sectional data usually consists of comparing the differences among the subjects.

 

Examples

 

Panel data: [figure: example panel data table]

Cross-sectional data: [figure: example cross-sectional data table]

Time series: [figure: example time-series table]
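
As a concrete illustration (not from the Wikipedia excerpt; the firm names and numbers are made up), the same observations can be arranged in pandas as a panel, a cross section, or a time series:

import pandas as pd

# Panel data: the same firms observed over several years (firm x year index)
panel = pd.DataFrame(
    {'income': [10, 12, 9, 11]},
    index=pd.MultiIndex.from_product([['firm_A', 'firm_B'], [2019, 2020]],
                                     names=['firm', 'year']))

cross_section = panel.xs(2020, level='year')    # all firms at a single point in time
time_series = panel.xs('firm_A', level='firm')  # a single firm across time

print(panel, cross_section, time_series, sep='\n\n')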


Gradient Descent vs Newton’s Method

Newton’s method for finding roots:

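The update rule for finding a root of f is:

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}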

In optimization we are essentially finding the roots of the derivative.
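
Applying the same update to f' (whose root is the stationary point) gives:

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}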

 


We approximate the function with a second-order Taylor series and find the minimum of this approximation. That minimizer becomes our new guess; this perspective is used in the derivation.

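Concretely, the second-order Taylor approximation around the current guess x_n is

f(x) \approx f(x_n) + f'(x_n)(x - x_n) + \tfrac{1}{2} f''(x_n)(x - x_n)^2,

and setting its derivative to zero recovers the same update. In the multivariate case this becomes \theta \leftarrow \theta - H^{-1} \nabla J(\theta), where H is the Hessian; this is the form referred to in the comparison below.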

 

Andrew Ng’s lecture [2]

  • Simplicity: gradient descent is simpler; Newton’s method is slightly more complex (it requires computing and inverting the Hessian).
  • Parameters: gradient descent needs a choice of learning rate alpha; Newton’s method has no learning rate to tune.
  • Iterations: gradient descent needs more iterations; Newton’s method needs fewer.
  • Cost per iteration: a gradient-descent iteration is cheap, O(n) where n is the number of features; a Newton iteration is costly, since the Hessian is (n+1) x (n+1) and inverting a matrix is roughly O(n^3).
  • When to use: gradient descent when the number of features is large (e.g. n > 10,000); Newton’s method when it is small (e.g. n < 1,000).
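
As a toy illustration of the trade-off (not from the lecture; the function and step size are arbitrary choices), here is a comparison of the two update rules on a 1-D convex function:

# Minimize f(x) = x^4 starting from x = 2; the minimum is at x = 0.
def grad(x):
    return 4 * x ** 3            # f'(x)

def hess(x):
    return 12 * x ** 2           # f''(x), the 1-D "Hessian"

x_gd, x_newton, alpha = 2.0, 2.0, 0.01
for _ in range(20):
    x_gd -= alpha * grad(x_gd)                    # gradient descent: needs a learning rate alpha
    x_newton -= grad(x_newton) / hess(x_newton)   # Newton: no learning rate, but needs f''

print(x_gd, x_newton)  # Newton ends up much closer to 0 in the same number of iterations

In higher dimensions the second derivative becomes the Hessian, so each Newton step additionally pays for building and inverting an (n+1) x (n+1) matrix.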

 

 

References:

[1] https://www.youtube.com/watch?v=42zJ5xrdOqo

[2] https://www.youtube.com/watch?v=iwO0JPt59YQ

 

No Free Lunch

When averaged across all possible problems, every algorithm performs equally well 🙂

 

This is especially relevant in supervised learning: validation or cross-validation is commonly used to assess the predictive accuracy of multiple models of varying complexity in order to find the best model. A model that works well could also be trained by multiple algorithms – for example, linear regression can be fit either by the normal equations or by gradient descent.
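
A minimal sketch of that kind of comparison, assuming scikit-learn is available (the models and synthetic data here are only illustrative):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)

# Compare two models of different complexity by 5-fold cross-validated R^2
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=3)):
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(type(model).__name__, scores.mean().round(3))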

 

http://www.no-free-lunch.org/

http://scholar.google.com/scholar?q=%22no+free+lunch%22+Wolpert

http://www.statsblogs.com/2014/01/25/machine-learning-lesson-of-the-day-the-no-free-lunch-theorem/

 

 

 

Classification – One vs Rest and One vs One

 

In the blog post on Cost Function And Hypothesis for LR we noted that LR (Logistic Regression) inherently models binary classification. Here we will describe two approaches used to extend it to multiclass classification.

The one-vs-rest approach takes one class as positive and all the remaining classes as negative, and trains a classifier for that split. So for data with n classes it trains n classifiers. In the classification phase each of the n classifiers predicts the probability of its own class, and the class with the highest probability is selected.

One-vs-one considers each binary pair of classes and trains a classifier on the subset of data containing those two classes. So it trains a total of n*(n-1)/2 classifiers. During the classification phase each classifier predicts one class (in contrast to one-vs-rest, where each classifier predicts a probability), and the class that has been predicted most often is the answer.

Example

For example, consider a four-class problem with classes A, B, C, and D.

One vs Rest

  • We train classifier_A, classifier_B, classifier_C and classifier_D
  • During prediction, suppose these are the probabilities we get:
    • classifier_A = 40%
    • classifier_B = 30%
    • classifier_C = 60%
    • classifier_D = 50%
  • We assign it class C (the highest probability)

One vs One

  • We train a total of six classifiers, each on the subset of data containing the two classes involved
    • classifier_AB
    • classifier_AC
    • classifier_AD
    • classifier_BC
    • classifier_BD
    • classifier_CD
  • And during classification
    • classifier_AB assigns class A
    • classifier_AC assigns class A
    • classifier_AD assigns class A
    • classifier_BC assigns class B
    • classifier_BD assigns class D
    • classifier_CD assigns class C
  • We assign it to class A
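
Both strategies are available as ready-made wrappers in scikit-learn; here is a minimal sketch using logistic regression on the iris dataset (assuming scikit-learn is installed):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# n classifiers for one-vs-rest, n*(n-1)/2 for one-vs-one
print(len(ovr.estimators_), len(ovo.estimators_))   # 3 and 3
print(ovr.predict(X[:2]), ovo.predict(X[:2]))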

 

More notes

  • One vs rest trains fewer classifiers and hence is faster overall, so it is usually preferred
  • Each individual classifier in one vs one is trained on a subset of the data, so a single classifier trains faster in one vs one
  • One vs one is less prone to class imbalance in the dataset (dominance of particular classes)

 

Inconsistency

  • What if two classes get an equal number of votes in the one vs one case?
  • What if the probabilities are almost equal in the one vs rest case?
  • We will discuss these issues in further blog posts

Dimensionality Reduction – Correlation

Sometimes we get many feature variables, say 50 (it can be much more in practice). A model is usually better off with fewer features unless the extra ones are really necessary. Reducing the feature set is known as dimensionality reduction, and it helps keep the model from over-fitting.

Here is a great blog post about this; it talks about correlation and mutual information.

I was wondering how to do this in Python, so here is some code for the Titanic dataset.

The following code produces a heatmap without values; it is useful for a quick glance when we have many variables.

 

import seaborn as sns

# 'Sex' must already be numerically encoded (see the LabelEncoder example below) for corr() to include it
sampleDF = trainDF.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
cm = sampleDF.corr()
sns.heatmap(cm, square=True)

[Figure: correlation heatmap without annotated values]

 

The following code gives a heatmap along with the values.

From the first plot it may seem that there is a significant correlation between Survived and Age, but when we look at the number it comes out to be just 0.54.

sampleDF = trainDF.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
# seaborn's corrplot has been removed; an annotated heatmap shows the same thing
sns.heatmap(sampleDF.corr(), annot=True, square=True)

[Figure: correlation heatmap with annotated values]

However, the first plot (without values) still has its use when we have 50-odd variables; here is one such plot.
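
The linked post also talks about mutual information. Here is a hedged sketch of ranking the same Titanic features by mutual information with the target using scikit-learn; it assumes 'Sex' has already been label-encoded and that filling missing values with the median is acceptable for a quick look.

from sklearn.feature_selection import mutual_info_classif

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = trainDF[features].fillna(trainDF[features].median())   # crude NaN handling for illustration
y = trainDF['Survived']

# Mark which columns are discrete vs continuous
discrete = [True, True, False, True, True, False]
mi = mutual_info_classif(X, y, discrete_features=discrete, random_state=0)
for name, score in sorted(zip(features, mi), key=lambda t: -t[1]):
    print(name, round(score, 3))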

 

 

Exploratory Analysis

Recently I was wondering about a question: what do you do first when you get the data?

I found a really nice article which helped me get some ideas. Then I tried to apply them to the Titanic dataset, and this and upcoming blog posts contain my learnings and thoughts about it.

One thing I learned was the importance of knowing your variables. But how do you get to know them? A safe way to start is to plot a histogram of each variable, which gives a rough idea about its distribution, spread, etc. A histogram also helps you see a variable’s effect on the target variable (which in the case of our dataset is whether people survived or not). For that we plot two overlapping histograms.

from matplotlib import pyplot

# 'Sex' must already be encoded as integers (see the LabelEncoder example below)
pyplot.hist(trainDF.Sex, label='Total')
pyplot.hist(trainDF[trainDF['Survived'] == 1].Sex, label='Survived')
pyplot.legend(loc='upper center')
pyplot.xlabel('Sex 0=Female 1=Male')
pyplot.ylabel('No of Passengers')

The next image shows the output of the above code. As we can see from the figure below, although the number of female passengers is smaller, their survival rate is pretty high.

 

[Figure: overlapping histograms of Sex for all passengers vs survivors]

Another problem I faced was that matplotlib does not understand categorical variables; we need to map them to integers before plotting histograms. Here is an example of doing that.


from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(trainDF.Sex.unique())
trainDF.Sex = le.transform(trainDF.Sex)          # converts to numeric values
trainDF.Sex = le.inverse_transform(trainDF.Sex)  # transforms back to the categorical values

 

Also, pyplot’s hist does not handle NaN values, but pandas’ histogram method does.

trainDF.Age.hist(label='Total')
trainDF[trainDF['Survived']==1].Age.hist(label='Survived')
pyplot.legend(loc='upper center')
pyplot.xlabel('Age')
pyplot.ylabel('No of Passengers')