
Explained: 4 Clustering Algorithms Every Data Scientist Must Know

In data science, clustering, or cluster analysis, is the task of grouping data objects into clusters such that objects within a cluster are similar to each other with respect to some trait and dissimilar to objects in other clusters. The aim of clustering is to segregate objects based on the similarity of the identified traits. The variance of objects within a cluster should be low, while the variance across clusters should be high. In other words, there should be a high degree of intra-cluster similarity and a low degree of inter-cluster similarity.

Clustering is a form of unsupervised learning: the data set is unlabeled, and you aren't aware of the patterns within it. The algorithm you choose does its best to discover those patterns and form clusters accordingly.

Why is clustering important?

A prime use of clustering is customer segmentation: in marketing, for instance, data scientists employ clustering algorithms to discover distinct groups within a customer base. Marketers can then devise targeted marketing strategies based on the hidden patterns discovered.

Some other practical uses of clustering are:

- anomaly and fraud detection
- image segmentation
- grouping documents or news articles by topic
- building recommendation systems

Based on the data set at hand, the right choice of clustering technique can generate impressive results. Here are 4 of the most common clustering algorithms every data scientist must know.

K-means clustering

It is perhaps the most common clustering algorithm and, having been described in 1967, it is also one of the oldest. The aim of the K-means algorithm is to partition n data points into k clusters in such a way that the sum of squared distances between the data points and their cluster centroids is minimized. A cluster centroid is the arithmetic mean of the points within the cluster.

The K-means algorithm starts by guessing the cluster centers. It then uses an expectation-maximization-style approach to converge: each data point is assigned to the nearest cluster center, and each cluster center is recomputed as the mean of the data points assigned to it. The algorithm repeats these two steps and converges when the assignments no longer change.
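As a quick illustration, here is a minimal K-means sketch with scikit-learn; the synthetic blob data and the parameter values (3 clusters, 10 random restarts) are illustrative assumptions, not recommendations.

```python
# A minimal K-means sketch using scikit-learn; the blob data and
# parameter values below are illustrative assumptions, not a recipe.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a toy data set with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_clusters is the k you must choose upfront; n_init restarts the
# algorithm from several random initializations and keeps the best run.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # the learned centroids
print(kmeans.inertia_)           # within-cluster sum of squared distances
```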

K-means clustering:

- is simple to understand and implement
- scales well to large data sets
- is guaranteed to converge

However, it has certain limitations:

- you have to specify the number of clusters k upfront
- the result depends on the initial placement of the centroids
- it assumes roughly spherical clusters of similar size and is sensitive to outliers

Mean-shift clustering

This is another centroid-based clustering algorithm, and a considerable advantage it has over the K-means algorithm is that it does not require you to specify the number of clusters upfront. The mean-shift algorithm works iteratively to discover blobs within the data set, updating candidate centroids based on the density of points in a region. Each candidate centroid is the mean of the points within its region, and near-duplicate candidates are filtered out in a post-processing stage.

In the mean-shift algorithm, every data point starts out as a candidate cluster centroid. The algorithm then slides a kernel window, whose size is set by the bandwidth hyperparameter, over the data and repeatedly shifts each window in the direction of the mean-shift vector, that is, towards the region of highest point density. The process converges when the centroids can no longer move.
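Here is a minimal mean-shift sketch with scikit-learn; the synthetic data and the quantile used to estimate the bandwidth are illustrative assumptions.

```python
# A minimal mean-shift sketch with scikit-learn; the data and the
# quantile used to estimate the bandwidth are illustrative assumptions.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# The bandwidth (kernel size) is the key hyperparameter; scikit-learn
# can estimate it from the data instead of requiring a manual choice.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

# The number of clusters is discovered, not specified upfront.
print("clusters found:", len(np.unique(labels)))
print(ms.cluster_centers_)
```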

Some advantages of mean-shift clustering are:

- it does not require you to choose the number of clusters upfront
- it can find clusters of arbitrary shape
- it is fairly robust to outliers

Some drawbacks of mean-shift clustering are:

- the output is highly sensitive to the choice of bandwidth
- it is computationally expensive on large data sets
- it does not scale well to high-dimensional data

Agglomerative hierarchical clustering

It is the most common type of hierarchical clustering algorithm. It takes a bottom-up approach, so that ultimately all the data points end up in one root cluster. To start, each data point is treated as a cluster of its own; with each iteration, the two most similar clusters are merged. When all the clusters have been merged you get a tree representation called a dendrogram. How the similarity between two clusters is measured is set by the linkage criterion:

- single linkage: the minimum distance between points in the two clusters
- complete linkage: the maximum distance between points in the two clusters
- average linkage: the average distance between all pairs of points across the two clusters
- Ward linkage: the increase in total within-cluster variance caused by the merge

The algorithm works iteratively, at each step merging the pair of clusters with the smallest linkage distance (or, for Ward, the smallest increase in variance).
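Below is a minimal sketch of agglomerative clustering with scikit-learn, plus the SciPy helpers for drawing a dendrogram; the toy data, Ward linkage, and the cut at 3 clusters are illustrative assumptions.

```python
# A minimal agglomerative clustering sketch with scikit-learn; the data,
# the choice of 3 clusters, and Ward linkage are illustrative assumptions.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)

# Bottom-up merging with Ward linkage, cut so that 3 clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# SciPy can build and plot the full merge tree (dendrogram) directly.
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="level", p=4)
plt.show()
```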

Agglomerative clustering:

- does not force you to fix the number of clusters upfront, since you can cut the dendrogram at any level
- produces a dendrogram that is easy to interpret
- works with any distance metric

However, it is computationally expensive and memory-hungry on large data sets, and once two clusters have been merged the decision cannot be undone.

DBSCAN algorithm

Density-based spatial clustering of applications with noise (DBSCAN) groups data points that lie in high-density regions into clusters, while points in low-density regions are marked as outliers. For DBSCAN a cluster is a dense group of points: if a point belongs to a cluster, there must be several other points lying near it. This relationship is described by two parameters:

- ε (eps): the radius of the neighborhood around a point
- minPts: the minimum number of points required within that radius for the neighborhood to be considered dense

DBSCAN starts with an arbitrary point; if that point has at least minPts points within a radius of ε, a cluster is formed. The cluster is expanded by checking whether the newly added points also have at least minPts points within their ε-neighborhood. When there are no more points to add to the cluster, a new arbitrary unvisited point is selected and the process starts again. At the end, DBSCAN labels the data points as:

- core points: points with at least minPts neighbors within ε
- border points: points within ε of a core point but with fewer than minPts neighbors of their own
- noise points: points that are neither core nor border points, i.e., the outliers
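Here is a minimal DBSCAN sketch with scikit-learn; the moon-shaped toy data and the eps and min_samples values are illustrative assumptions.

```python
# A minimal DBSCAN sketch with scikit-learn; the moon-shaped data and the
# eps / min_samples values are illustrative assumptions for this toy set.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np

# Two interleaving half-moons: a shape K-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples plays the role of minPts.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 are the ones DBSCAN marks as noise/outliers.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", np.sum(labels == -1))
```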

A few upsides of DBSCAN are that:

- it does not require you to specify the number of clusters
- it can find clusters of arbitrary shape
- it is robust to outliers, which it explicitly labels as noise

Some downsides of DBSCAN are that:

- it struggles when clusters have very different densities, since a single ε and minPts must fit them all
- the results are sensitive to the choice of ε and minPts
- it does not perform well on high-dimensional data, where distance-based density estimates break down

Learning about these 4 clustering algorithms merely scratches the surface. As a data scientist, you have to dig deeper and often use different, more advanced techniques to find meaningful patterns in the data. To get started, it is well worth getting your hands dirty with some data sets: download a few from Kaggle or another source and test different algorithms using Python and scikit-learn.

It also pays to land a job at a large company that deals with big data on a regular basis. For that, simply sign up with Talent500. Our dynamic skill assessment algorithms match your data science capabilities to the best job openings at Fortune 500 companies. Taking our test is really the best way to cluster together with a group of talented professionals who will help you take your data science career to the next level.
