Clustering Metrics

Here are some metrics available for validating clustering; an explanation of each one is available in the scikit-learn documentation [0], and a short usage sketch follows the lists below.

If ground truth labels are available:

  • Adjusted Rand Index
  • Mutual Information Based scores
  • Homogeneity, completeness and V-measure
  • Fowlkes-Mallows scores

If ground truth labels are not available:

  • Silhouette Coefficient
    • Range: (-1, 1)
    • A value close to 1 means the point is similar to the other points in its own cluster
  • Calinski-Harabasz Index
  • Davies-Bouldin Index
  • Contingency Matrix
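Below is a minimal usage sketch for these metrics with scikit-learn, assuming we have the data `X`, ground truth labels `labels_true` (when available) and a clustering output `labels_pred`; the toy data and variable names are mine, not from [0].

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics
from sklearn.metrics.cluster import contingency_matrix

# Toy data with known ground-truth labels.
X, labels_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Metrics that need ground truth labels.
print(metrics.adjusted_rand_score(labels_true, labels_pred))
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
print(metrics.fowlkes_mallows_score(labels_true, labels_pred))

# Metrics that only need the data and the predicted labels.
print(metrics.silhouette_score(X, labels_pred))
print(metrics.calinski_harabasz_score(X, labels_pred))
print(metrics.davies_bouldin_score(X, labels_pred))

# Contingency matrix: raw co-occurrence counts rather than a single score.
print(contingency_matrix(labels_true, labels_pred))
```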

 

Calculating SSE

It is the sum of squared distances between each point and its cluster center.

SSE = Σ_k Σ_{x ∈ C_k} ||x − μ_k||², where μ_k is the center of cluster C_k.
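A quick sketch of computing SSE by hand and checking it against `KMeans.inertia_`, which is exactly this quantity; the toy data is made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# SSE: squared distance from every point to its own cluster center.
sse = sum(np.sum((X[km.labels_ == k] - center) ** 2)
          for k, center in enumerate(km.cluster_centers_))

print(sse, km.inertia_)  # the two values should match up to floating-point error
```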

Silhouette Score

It is calculated for each point, and then we take the average over all points.

s(i) = (b(i) − a(i)) / max(a(i), b(i))

a(i) = (1 / (|C| − 1)) · Σ_{j ∈ C, j ≠ i} d(i, j),   b(i) = min_{C' ≠ C} (1 / |C'|) · Σ_{j ∈ C'} d(i, j), where C is the cluster of point i

a(i) is the average distance from point i to the other points in the same cluster.

b(i) is the minimum, over all other clusters, of the average distance from point i to the points in that cluster. It gives the distance to the nearest neighbouring cluster.

s(i) close to 1 means the data point is appropriately clustered; close to -1 means it is badly clustered.

Setting s(i) to 0 when a cluster contains only one point ensures that the curve is not monotonically decreasing.
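A small sketch that computes a(i), b(i) and s(i) for a single point directly from the definitions above and compares the result with `silhouette_samples`; the helper function is hypothetical, written only to illustrate the formula.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

def silhouette_of_point(i, X, labels):
    d = np.linalg.norm(X - X[i], axis=1)   # distances from point i to all points
    same = (labels == labels[i])
    same[i] = False
    a = d[same].mean()                     # a(i): own cluster
    b = min(d[labels == k].mean()          # b(i): nearest other cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

print(silhouette_of_point(0, X, labels))
print(silhouette_samples(X, labels)[0])  # should agree
```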

 

Elbow method and Silhouette Analysis

Notebook is available at https://github.com/arcarchit/datastories/blob/master/Silhouette.ipynb

(Plots from the notebook: the elbow curve and the silhouette analysis.)
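The notebook has the full plots; a minimal sketch of the underlying loop (my own reconstruction, not the notebook's code) looks like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))

# Elbow method: pick the k where inertia (SSE) stops dropping sharply.
# Silhouette analysis: prefer the k with the highest average silhouette.
```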

Rand Index

  • When ground truth labels are available we can use this metric
  • It basically checks the similarity between two cluster assignments
    • Labels can also be seen as one type of cluster assignment
    • The score basically tells us how similar the two cluster assignments are
  • It works by taking pairs of points
    • Out of all pairs, how many pairs agree between the two assignment mechanisms
    • A pair agrees when
      • The two points are in the same cluster in both mechanisms, or
      • The two points are in different clusters in both mechanisms
  • The Rand index has a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the clusterings are exactly the same

RI = (a + b) / C(n, 2), where a = number of pairs placed in the same cluster in both assignments, b = number of pairs placed in different clusters in both, and C(n, 2) is the total number of pairs.

  • One drawback of the Rand index is that it can give a non-zero value even for a random assignment of clusters. To mitigate that, there is a metric called the Adjusted Rand Index [2]; both are illustrated in the sketch after this list.
    • The plain Rand index in particular does not work well when the number of clusters is high
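A tiny sketch of the pair-counting definition, checked against scikit-learn's `rand_score` and `adjusted_rand_score` (`rand_score` needs scikit-learn ≥ 0.24); the label vectors are made up.

```python
from itertools import combinations

from sklearn.metrics import adjusted_rand_score, rand_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

agree, n_pairs = 0, 0
for i, j in combinations(range(len(labels_true)), 2):
    n_pairs += 1
    same_true = labels_true[i] == labels_true[j]
    same_pred = labels_pred[i] == labels_pred[j]
    if same_true == same_pred:    # same cluster in both, or different in both
        agree += 1

print(agree / n_pairs)                                # plain Rand index
print(rand_score(labels_true, labels_pred))           # should match
print(adjusted_rand_score(labels_true, labels_pred))  # chance-corrected version
```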

 

 

References

[0] : https://scikit-learn.org/stable/modules/clustering.html

[1] : https://github.com/anthonyng2/udemy-the-complete-machine-learning-course-with-python

[2] : https://davetang.org/muse/2017/09/21/adjusted-rand-index/

 

Probabilistic Clustering

This post serves as a lecture summary of a course offered by the University of Washington, which can be accessed at the following link: University of Washington – ML: Clustering and Retrieval.

Sample code related to the course can be found on GitHub at: Sample Code for EM Algorithm.

In this lecture, we will cover several important topics:

  1. Why we need probabilistic clustering: We will explore the motivations behind using probabilistic clustering methods, which allow for soft assignments to clusters rather than rigid assignments.
  2. Mixture models: We will dive into the concept of mixture models, which are a probabilistic approach to clustering. Mixture models allow data points to belong to multiple clusters with different probabilities.
  3. Expectation Maximization (EM): We will study the Expectation Maximization algorithm, a powerful iterative algorithm used to estimate the parameters of mixture models. EM alternates between the expectation step, where cluster assignments are updated based on current parameter estimates, and the maximization step, where the parameters are updated based on current cluster assignments.

Why do we need probabilistic clustering?

To understand the need for probabilistic clustering, let’s consider the problem of determining user preferences based on their feedback. Suppose users provide feedback on whether they liked or disliked an article. We want to gather this feedback and group it into clusters to gain insights into their preferences.

For instance, imagine a user who likes an article about the first human landing on Mars. This article could be relevant to both world news and science news categories. In such cases, assigning articles to clusters with a hard assignment (where an article belongs to only one cluster) may not capture the complete picture. Soft assignment to clusters, where an article can belong to multiple clusters with different probabilities, is more suitable.

Now, let’s explore some scenarios where the traditional K-means algorithm fails:

  1. Clusters with Different Spreads: When clusters have varying spreads, simply deciding based on a single sample may not provide an accurate representation. Instead, it is important to consider the number of data points within each cluster. Assigning an ambiguous point to a cluster that has more data points can be a more meaningful approach.
  2. Overlapping Clusters: In cases where clusters overlap, soft assignment becomes crucial. It allows for assigning data points to multiple clusters based on the degree of membership.
  3. Clusters with Different Shapes or Orientations: When clusters have distinct shapes or orientations, measuring the distance to the cluster center alone is insufficient. Considering the shape of the clusters is necessary for accurate clustering.

One approach to address these challenges is using K-means with scaled Euclidean distance. This modification results in clusters represented by axis-aligned ellipses, accommodating different spreads in each dimension. However, determining appropriate weights for each dimension remains a question that needs to be addressed, and the shape is still an axis-aligned ellipse, which is a limitation in itself.

Motivating application

Let’s consider the task of clustering images based on their color composition. One approach is to compute the average values of the red (R), green (G), and blue (B) channels for each image. This results in representing each image as a single 3-dimensional vector [R, G, B], capturing the average color values.
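A minimal sketch of this step, assuming the image is already loaded as an H × W × 3 array (the array below is random, just to keep the snippet self-contained):

```python
import numpy as np

# Stand-in for a real image: height x width x 3 (R, G, B) values.
image = np.random.randint(0, 256, size=(480, 640, 3))

# Average over the spatial dimensions -> one 3-dimensional vector [R, G, B].
rgb_vector = image.mean(axis=(0, 1))
print(rgb_vector)  # a clear-sky photo would have a relatively large blue component
```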

By applying clustering algorithms to these image vectors, we can group similar images together based on their dominant color characteristics. Although our data is unlabeled, we can observe potential clusters such as:

  1. Sky Cluster (Blue Dominated): Images that predominantly contain blue colors, such as those depicting clear skies, can be grouped into this cluster.
  2. Sunset Cluster (Red Dominated): Images showcasing vibrant sunsets with warm hues and red tones can be grouped into this cluster.
  3. Forest Cluster (Green Dominated): Images featuring lush green forests, landscapes, or vegetation will likely be grouped into this cluster due to their dominant green color composition.

Mixture Models

  • Here is how the distribution of the blue channel might look across all the images
(Figure: distribution of the blue-channel values across all images.)

We can model image clustering using a mixture of Gaussian distributions. The model consists of three Gaussian components, each with its own parameters (μ and σ). The final distribution is a convex combination of these components: π1·g1 + π2·g2 + π3·g3, where 0 ≤ πk ≤ 1 and the πk values sum to 1.

Each mixture component represents a unique cluster, characterized by its own set of parameters (πk, μk, σk). In the case of multidimensional data, such as image vectors, each individual distribution is a multivariate Gaussian distribution.

Without examining the image vector, the prior probability that it belongs to cluster k is simply πk. Given an image vector x, the likelihood of x under cluster k is P(x | μk, Σk), where Σk is the covariance matrix of the multivariate Gaussian.

Clustering with soft assignment involves calculating a responsibility vector for each data point. The length of the responsibility vector corresponds to the number of clusters, which in this case is three (sky, forest, sunset). Each element of the responsibility vector represents the probability of a data point belonging to a specific cluster.

In the case of document clustering, where the dimensionality is high (proportional to the number of words in a vocabulary), keeping track of parameters for each Gaussian becomes computationally expensive. To address this, we simplify the model by considering only diagonal values in the covariance matrix, effectively restricting the model to axis-aligned ellipses. This approach is similar to K-means with scaled Euclidean distance, which also allows for axis-aligned ellipses.

However, the advantage of the mixture model is twofold (a short GaussianMixture sketch follows this list):

  1. We do not need to specify weights for each dimension in advance; the model learns them from the data.
  2. Different clusters can have varying weights for each dimension, allowing for more flexible modeling of the data.
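This maps directly onto scikit-learn's `GaussianMixture`; a minimal sketch with three components and diagonal covariances (the restriction discussed above), on made-up stand-ins for the [R, G, B] vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy stand-in for the 3-dimensional [R, G, B] image vectors.
X, _ = make_blobs(n_samples=600, centers=3, n_features=3, random_state=3)

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=3).fit(X)

print(gmm.weights_)              # the mixture weights pi_k (they sum to 1)
print(gmm.means_)                # mu_k for each component
print(gmm.covariances_)          # diagonal of Sigma_k for each component
print(gmm.predict_proba(X[:3]))  # soft assignments, i.e. responsibility vectors
```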

EM

Responsibility vector calculation with known cluster parameters

  • The responsibility vector can be calculated if we know the cluster parameters: π_k, μ_k, and ∑_k.
  • The responsibility vector is different from the π values.
  • Given a data point, we can find its responsibility vector by evaluating, for each cluster, the prior πk times the likelihood of the point under that cluster.
  • The responsibility vector needs to be normalized after evaluating the probabilities.
r_ik = π_k · N(x_i | μ_k, Σ_k) / Σ_j π_j · N(x_i | μ_j, Σ_j)
  • This process is a consequence of Bayes’ rule; a small sketch of this E-step follows the list.
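A minimal E-step sketch under stated assumptions: the mixture parameters are known (the values below are made up) and `multivariate_normal` comes from SciPy.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Known (made-up) parameters of a 3-component mixture over 2-d data.
pis = np.array([0.5, 0.3, 0.2])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
covs = [np.eye(2), 2 * np.eye(2), np.diag([0.5, 1.5])]

x = np.array([2.5, 2.0])  # one data point

# Unnormalised responsibilities: pi_k * N(x | mu_k, Sigma_k).
resp = np.array([pi * multivariate_normal(mu, cov).pdf(x)
                 for pi, mu, cov in zip(pis, mus, covs)])

resp /= resp.sum()  # normalise so the responsibility vector sums to 1 (Bayes' rule)
print(resp)
```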

Estimating cluster parameters with known responsibility vectors

  • It is similar to parameter estimation of a multivariate Gaussian.
  • Each data point x_i is weighted by its responsibility r_ik, and the denominator is no longer N (the total number of data points) but the soft count N_k^soft.
  • N_k^soft = Σ_i r_ik,   π_k = N_k^soft / N,   μ_k = (1 / N_k^soft) · Σ_i r_ik · x_i,   Σ_k = (1 / N_k^soft) · Σ_i r_ik · (x_i − μ_k)(x_i − μ_k)^T
  • N_k^soft tells us how many points belong to cluster k
    • It is a soft (fractional) count; a small M-step sketch follows this list
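And the matching M-step, assuming a responsibility matrix `resp` of shape (N, K) is already available; here it is random, only to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # N = 100 points in 2 dimensions
resp = rng.dirichlet(np.ones(3), 100)  # rows sum to 1 -> soft assignments to K = 3 clusters

N_k_soft = resp.sum(axis=0)             # soft count of points per cluster
pis = N_k_soft / len(X)                 # mixture weights pi_k
mus = (resp.T @ X) / N_k_soft[:, None]  # responsibility-weighted means mu_k

# Responsibility-weighted covariance Sigma_k for each cluster.
covs = []
for k in range(3):
    diff = X - mus[k]
    covs.append((resp[:, k, None] * diff).T @ diff / N_k_soft[k])

print(pis, mus, covs[0], sep="\n")
```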

EM Algorithm

  • E-step: Estimation of the responsibility vector given parameters.
  • M-step: Maximum likelihood estimation of the parameters given the responsibility vector.
  • These two steps are repeated until convergence.
  • The EM algorithm can be derived as a coordinate ascent algorithm and converges to a local mode.
  • The biased coin example provides a good illustration of the EM algorithm; see the figure below.

(Figure: EM on the biased-coin example.)

Challenges

  • The likelihood becomes infinite when the variance shrinks to zero.
  • This can occur when there is only one data point in a cluster, and the mean is itself with zero variance.
  • In high-dimensional spaces, this often happens, such as when a document does not have a specific word at all.
  • To solve this issue, a small value is added to the diagonal elements of the covariance matrix, similar to adding a prior in Bayesian approaches (see the snippet after this list).
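A tiny sketch of that fix, on a made-up near-singular covariance matrix; scikit-learn's `GaussianMixture` does the same thing via its `reg_covar` parameter.

```python
import numpy as np

cov = np.array([[1e-12, 0.0],   # near-zero variance in the first dimension
                [0.0,   2.0]])

eps = 1e-6
cov_regularised = cov + eps * np.eye(cov.shape[0])  # small value added to the diagonal

print(np.linalg.det(cov), np.linalg.det(cov_regularised))
```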

Relationship to K-means:

  • When each Gaussian has the same variance in all dimensions and that variance shrinks to zero, the responsibilities become either 0 or 1.
  • This behavior is similar to hard assignment, resembling the K-means algorithm.

Initialisation

  • Choose k observations at random as the initial centroids and assign each observation to the nearest centroid
  • Do the above, but pick the initial centroids via k-means++
  • Start from the solution of k-means
  • Grow the mixture model by splitting clusters (and sometimes merging them) until K clusters are formed

I have coded a simple EM at [0].

References

[0] : https://github.com/arcarchit/datastories/blob/master/EM.ipynb

K-Means Clustering

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Application

We want to discover groups of articles (e.g., sports, world news) without knowing the labels in advance. We can assign labels after finding clusters. Our goal is to learn user preferences, e.g. whether a user likes sports, technology, etc.

Users provide feedback on whether they liked an article or not. We can accumulate this feedback within specific clusters to understand preferences.

Contrast this with building a classification model that predicts whether a user will like a specific article. For that, we would need to build a classifier for each user and score each (article, user) pair.

Alternatively, we can determine the types of articles each user prefers and show nearby articles using K-nearest neighbors (KNN). Clustering offers a more elegant and simpler approach.

Unsupervised Task

Clustering is an unsupervised task since there are no input-output labels. Unsupervised learning poses challenges:

  • Defining clusters can be easy or hard, depending on the case.
  • Our algorithm works well for cases with intermediate difficulty.
  • There are still unsolved clustering problems.

Clustering in General

A cluster is defined by its center or centroid and its shape, often represented by an ellipsoid.

K-Means

Scoring in K-means involves assigning a point to a given cluster. The scoring mechanism is the distance to the cluster center.

K-means alternates between two steps:

  1. Assign data to clusters.
  2. Update cluster centers.
z_i ← argmin_k ||μ_k − x_i||²   (assignment step)
μ_k ← (1 / N_k) · Σ_{i : z_i = k} x_i   (update step)

Both steps are minimization steps, making K-means resemble a coordinate descent algorithm. We alternate between minimizing in two directions: z (the cluster assignments) and μ (the cluster centers).
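A minimal NumPy sketch of these two alternating steps on made-up data; a real run would also use a smarter initialization (see below).

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]  # naive random initialization

for _ in range(20):
    # Step 1: assign each point to its nearest center (update z).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    z = dists.argmin(axis=1)
    # Step 2: move each center to the mean of its assigned points (update mu).
    centers = np.array([X[z == j].mean(axis=0) if np.any(z == j) else centers[j]
                        for j in range(k)])

print(centers)
```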

Convergence

K-means is a non-convex optimization problem, so the global minimum is not guaranteed. We aim to find the assignments z and the centers μ that minimize the objective function.

Objective: Σ_j Σ_{i : z_i = j} ||μ_j − x_i||², minimized over both the assignments z and the centers μ.

By optimizing via coordinate descent, K-means can converge to a local minimum. However, convergence depends on the initialization. To improve convergence, we use smart initialization as described below.

Smart Initialization

  1. Pick the first center randomly.
  2. Calculate the distance to the nearest center for all data points.
  3. Choose the next center with a probability proportional to the squared distance.

To perform probability sampling, calculate the cumulative probability as described in [1]. When this initialization is followed by standard K-means, the algorithm is known as K-means++.
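A small sketch of this initialization, my own reconstruction of the scheme above using NumPy's weighted sampling instead of an explicit cumulative sum:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # 1. Pick the first center uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # 2. Squared distance from every point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # 3. Sample the next center with probability proportional to d^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(kmeans_pp_init(X, 3, rng))
```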

Trade-offs

K-means++ takes more time for initialization but converges faster. Additionally, the converged solution is better than that of regular K-means.

Assessing Quality

Quality can be assessed by evaluating the objective value after convergence, which is referred to as cluster heterogeneity. A smaller heterogeneity indicates a better algorithm. Tighter clusters are more homogeneous.

Beware of overfitting: as K increases, heterogeneity always decreases.

Choosing K

To choose K, you can heuristically plot heterogeneity for various values of K and select the one at the elbow. Alternatively, you can use hierarchical clustering, which doesn’t require specifying K in advance. We can also perform Silhouette Analysis.

References

[1]: Stack Overflow: How exactly does k-means++ work?

On Clustering

K-means is probably the most popular and most widely taught clustering algorithm in academia. However, it has many limitations; some of them are listed here:

  • You need to specify the value of k
  • It can "cluster" data that has no real cluster structure
  • It is sensitive to scale
  • Even on perfect data sets, it can get stuck in a local minimum
  • Means are continuous
  • Hidden assumption: SSE is worth minimizing
  • K-means serves more as a quantization method

 

In hierarchical clustering you don’t need to specify the value of k; you can cut the tree it builds, either top-down or bottom-up, at any level. Such a tree is called a dendrogram.

Scikit-learn also supports a variety of clustering algorithms, including DBSCAN, and lists which one suits which situation: http://scikit-learn.org/stable/modules/clustering.html
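A minimal sketch of hierarchical (agglomerative) clustering with SciPy, where the dendrogram can be cut at any level afterwards; the two-blob data is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])

Z = linkage(X, method="ward")                    # bottom-up merge tree (dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(np.bincount(labels))
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full tree with matplotlib.
```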

 

References

https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means