On Clustering

K-means is probably the most popular clustering algorithm and the most taught in academia. However, it has many limitations; here are some of them:

  • You need to specify the value of k
  • It will happily "cluster" data that has no cluster structure
  • Sensitive to the scale of the features
  • Even on perfect data sets, it can get stuck in a local minimum
  • Cluster means are continuous, which may be meaningless for discrete features
  • Hidden assumption: SSE (sum of squared errors) is worth minimizing
  • K-means often serves more as vector quantization than as true clustering
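
A minimal sketch of the first and last points, using scikit-learn on a made-up blob dataset: k must be supplied up front, and the SSE objective K-means minimizes is exposed as `inertia_`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: three well-separated blobs in 2-D.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# We must choose k ourselves -- KMeans will not infer it from the data.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# inertia_ is the SSE that K-means minimizes (the hidden assumption above).
print(km.inertia_)
print(len(set(labels)))  # always exactly k clusters, even on non-clustered data
```

Note that `fit_predict` returns exactly k cluster labels no matter what the data looks like; nothing in the API warns you when the cluster structure isn't real.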


In hierarchical clustering you don't need to specify the value of k: you can sample any level from the tree it builds, either by a top-down or a bottom-up approach. Such a tree is called a dendrogram.
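
A quick sketch of the bottom-up (agglomerative) variant, assuming SciPy is available and using a made-up toy array: the linkage tree is built once, and cutting it at different levels yields different numbers of clusters without re-fitting:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D points forming two obvious groups.
X = np.array([[1.0], [1.2], [1.1], [8.0], [8.2], [8.1]])

# Build the bottom-up merge tree (the dendrogram) once.
Z = linkage(X, method='ward')

# "Sample" the tree at different levels -- no fixed k was given up front.
two = fcluster(Z, t=2, criterion='maxclust')
three = fcluster(Z, t=3, criterion='maxclust')
print(len(set(two)), len(set(three)))  # 2 3
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself.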

Scikit-learn also supports a variety of clustering algorithms, including DBSCAN, and lists which one suits which situation: http://scikit-learn.org/stable/modules/clustering.html






No Free Lunch

When averaged across all possible situations, every algorithm performs equally well 🙂


This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracy of multiple models of varying complexity to find the best model. A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.
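
To make that last point concrete, here is a small sketch on made-up data, fitting the same linear model by both routes; they land on essentially the same weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(-1, 1, 100)]   # bias column + one feature
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 0.01, 100)

# Route 1: the normal equations (closed form).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: batch gradient descent on the same squared-error cost.
w_gd = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (X @ w_gd - y) / len(y)
    w_gd -= 0.1 * grad

print(np.allclose(w_closed, w_gd, atol=1e-3))  # True
```

Both minimize the same convex cost, so with enough iterations gradient descent converges to the closed-form solution.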









Classification – One vs Rest and One vs One


In the blog post on Cost Function and Hypothesis for LR we noted that LR (logistic regression) inherently models binary classification. Here we describe two approaches used to extend it to multiclass classification.

The One-vs-Rest approach takes one class as positive and all the rest as negative, and trains a classifier on that labeling. So for data with n classes it trains n classifiers. In the classification phase, each of the n classifiers predicts the probability of its particular class, and the class with the highest probability is selected.

One-vs-One considers each binary pair of classes and trains a classifier on the subset of the data containing those two classes. So it trains a total of n*(n-1)/2 classifiers. During the classification phase, each classifier predicts one class. (This is in contrast to one-vs-rest, where each classifier predicts a probability.) The class that has been predicted most often is the answer.


For example, consider a four-class problem with classes A, B, C, and D.

One vs Rest

  • Models classifier_A, classifier_B, classifier_C and classifier_D
  • During prediction, here are the probabilities we get:
    • classifier_A = 40%
    • classifier_B = 30%
    • classifier_C = 60%
    • classifier_D = 50%
  • We assign it class C (the highest probability)

One vs One

  • We train a total of six classifiers, each on the subset of data containing the two classes involved
    • classifier_AB
    • classifier_AC
    • classifier_AD
    • classifier_BC
    • classifier_BD
    • classifier_CD
  • And during classification:
    • classifier_AB assigns class A
    • classifier_AC assigns class A
    • classifier_AD assigns class A
    • classifier_BC assigns class B
    • classifier_BD assigns class D
    • classifier_CD assigns class C
  • We assign it to class A (most votes: three out of six)
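
The vote counting above can be sketched in a few lines of Python (the six pairwise predictions are hard-coded from the example):

```python
from collections import Counter

# Predictions of the six pairwise classifiers from the example above.
votes = ['A', 'A', 'A', 'B', 'D', 'C']

# The most frequently predicted class wins.
winner, count = Counter(votes).most_common(1)[0]
print(winner, count)  # A 3
```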


More notes

  • One-vs-rest trains fewer classifiers and hence is faster overall, which is why it is usually preferred
  • Each individual classifier in one-vs-one uses only a subset of the data, so a single classifier trains faster in one-vs-one
  • One-vs-one is less prone to imbalance in the dataset (dominance of particular classes)
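
Both schemes are available in scikit-learn as meta-estimators, so none of the bookkeeping above needs to be done by hand. Here is a sketch on a made-up four-class dataset, mirroring the A/B/C/D example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Toy four-class dataset (like classes A, B, C, D above).
X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # n = 4 classifiers
print(len(ovo.estimators_))  # n*(n-1)/2 = 6 classifiers
```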



  • What if two classes get an equal number of votes in the one-vs-one case?
  • What if the probabilities are almost equal in the one-vs-rest case?
  • We will discuss these issues in further blog posts

Dimensionality Reduction – Correlation

Sometimes we get many, many feature variables, say 50 for now (it can be much more in practice). A model is generally better off with fewer features, unless they are all necessary. Technically this is known as dimensionality reduction, and it helps keep the model from over-fitting.

Here is a great blog about this. It talks about correlation and mutual information.
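
For the mutual-information part, scikit-learn provides `mutual_info_classif`; here is a small sketch on made-up data (the two synthetic features below are illustrative, not Titanic columns):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)                  # binary target
informative = y + rng.normal(0, 0.3, 500)    # strongly related to y
noise = rng.normal(0, 1, 500)                # unrelated to y

X = np.c_[informative, noise]
mi = mutual_info_classif(X, y, random_state=0)
print(mi)  # the informative feature should score clearly higher than the noise
```

Unlike correlation, mutual information also picks up non-linear dependence, which is why the blog discusses both.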

I was wondering how to do this with Python, and here is some code for the Titanic data-set.

The following code produces a heatmap without values, which is useful for a quick glance when we have many variables.


sampleDF = trainDF.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
cm = sampleDF.corr(numeric_only=True)  # correlation matrix; non-numeric columns like Sex are skipped
sns.heatmap(cm, square=True)



The following code gives the heatmap along with the values.

From the first plot it may seem that there is a significant correlation between Survived and Age. But when we look at the number, it comes out to be just 0.54.

sampleDF = trainDF.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
cm = sampleDF.corr(numeric_only=True)
sns.heatmap(cm, square=True, annot=True)  # annot=True writes each correlation value into its cell


However, the first (value-free) plot has its own use when we have 50-odd variables; here is one such plot.




Exploratory Analysis

Recently I was wondering about a question: what do you do first when you get the data?

I found a really nice article which helped me get some ideas. Then I tried to apply it to the Titanic dataset, and this and upcoming blogs contain my learnings and thoughts about the same.

So one thing I learned was about knowing your variables. But how do you get to know them? A safe way to start is by plotting a histogram, which gives a rough idea about a variable's distribution, spread, etc. A histogram also helps you see a variable's effect on the target (which in our dataset is whether people survived or not). For that we plot two overlapping histograms.

pyplot.hist(trainDF.Sex, label='Total')
pyplot.hist(trainDF[trainDF['Survived']==1].Sex, label='Survived')
pyplot.legend(loc='upper center')
pyplot.xlabel('Sex 0=Female 1=Male')
pyplot.ylabel('No of Passengers')

The next image shows the output of the above code. As we can see from the figure below, although the number of female passengers is smaller, their survival rate is pretty high.



Another problem I faced: matplotlib does not understand categorical variables, so we need to assign each category an integer before plotting histograms. Here is an example of doing this.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
trainDF.Sex=le.fit_transform(trainDF.Sex) #Fits the encoder and converts to numeric values
trainDF.Sex=le.inverse_transform(trainDF.Sex) #Transforms back to the categorical variable


Also, pyplot's hist does not handle NaN values, but pandas' histogram method does (the Age column, for example, has NaNs).

trainDF.Age.hist(label='Total') #pandas drops NaN values before plotting
trainDF[trainDF['Survived']==1].Age.hist(label='Survived')
pyplot.legend(loc='upper center')
pyplot.ylabel('No of Passengers')