
Explained: 4 Clustering Algorithms Every Data Scientist Must Know

In data science, clustering, or cluster analysis, is the task of grouping data objects into clusters such that objects within a cluster are similar to each other with respect to some trait and dissimilar to objects in other clusters. The aim of clustering is to segregate objects based on the similarity of the identified traits. The variance of objects within a cluster should be low, while the variance across clusters should be high. In other words, there should be a high degree of intra-cluster similarity and a low degree of inter-cluster similarity.

Clustering is a form of unsupervised learning: the data set is unlabeled, and you aren't aware of the patterns within it. The algorithm you choose does its best to discover those patterns and form clusters accordingly.

Why is clustering important?

A prime use of clustering is customer segmentation: in marketing, for instance, data scientists employ clustering algorithms to discover distinct groups within a customer base. Marketers can then devise targeted marketing strategies based on the hidden patterns discovered.

Some other practical uses of clustering are:

- anomaly and fraud detection
- image segmentation
- grouping documents or news articles by topic
- building recommendation systems

Based on the data set at hand, the right choice of clustering technique can generate impressive results. Here are 4 of the most common clustering algorithms every data scientist must know.

K-means clustering

It is perhaps the most common clustering algorithm and, having been described in 1967, it is also one of the oldest. The aim of the K-means algorithm is to partition n data points into k clusters in such a way that the sum of squared distances between the data points and their cluster centroids is minimized. A cluster centroid is the arithmetic mean of the points within the cluster.

The K-means algorithm starts by guessing the cluster centers. It then uses an expectation-maximization-style approach to converge: each data point is assigned to the nearest cluster center, and each cluster center is recomputed as the mean of the data points assigned to it. The algorithm repeats these two steps and converges when the assignments no longer change.
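As a quick illustration, here is a minimal K-means sketch with scikit-learn; the synthetic blob data and the parameter values (3 clusters, 10 random restarts) are illustrative assumptions, not recommendations.

```python
# A minimal K-means sketch using scikit-learn; the blob data and
# parameter values below are illustrative assumptions, not a recipe.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a toy data set with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_clusters is the k you must choose upfront; n_init restarts the
# algorithm from several random initializations and keeps the best run.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # the learned centroids
print(kmeans.inertia_)           # within-cluster sum of squared distances
```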

K-means clustering:

- is simple to understand and implement
- scales well to large data sets
- is guaranteed to converge

However, it has certain limitations:

- you have to specify the number of clusters k upfront
- the result depends on the initial placement of the centroids
- it assumes roughly spherical clusters of similar size and is sensitive to outliers

Mean-shift clustering

This is another centroid-based clustering algorithm, and a considerable advantage it has over the K-means algorithm is that it does not require you to specify the number of clusters upfront. The mean-shift algorithm works iteratively to discover blobs within the data set, updating candidate centroids based on the density of points in a region. Each candidate centroid is the mean of the points within its region, and near-duplicate candidates are filtered out in a post-processing stage.

In the mean-shift algorithm, every data point starts out as a candidate cluster centroid. The algorithm then slides a kernel window, whose size is set by the bandwidth hyperparameter, over the data and repeatedly shifts each window in the direction of the mean-shift vector, that is, towards the region of highest point density. The process converges when the centroids can no longer move.
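Here is a minimal mean-shift sketch with scikit-learn; the synthetic data and the quantile used to estimate the bandwidth are illustrative assumptions.

```python
# A minimal mean-shift sketch with scikit-learn; the data and the
# quantile used to estimate the bandwidth are illustrative assumptions.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# The bandwidth (kernel size) is the key hyperparameter; scikit-learn
# can estimate it from the data instead of requiring a manual choice.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

# The number of clusters is discovered, not specified upfront.
print("clusters found:", len(np.unique(labels)))
print(ms.cluster_centers_)
```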

Some advantages of mean-shift clustering are:

- it does not require you to choose the number of clusters upfront
- it can find clusters of arbitrary shape
- it is fairly robust to outliers

Some drawbacks of mean-shift clustering are:

- the output is highly sensitive to the choice of bandwidth
- it is computationally expensive on large data sets
- it does not scale well to high-dimensional data

Agglomerative hierarchical clustering

It is the most common type of hierarchical clustering algorithm. It takes a bottom-up approach, so that ultimately all the data points end up in one root cluster. To start, each data point is treated as a cluster of its own; with each iteration, the two most similar clusters are merged. When all the clusters have been merged you get a tree representation called a dendrogram. How the similarity between two clusters is measured is set by the linkage criterion:

- single linkage: the minimum distance between points in the two clusters
- complete linkage: the maximum distance between points in the two clusters
- average linkage: the average distance between all pairs of points across the two clusters
- Ward linkage: the increase in total within-cluster variance caused by the merge

The algorithm works iteratively, at each step merging the pair of clusters with the smallest linkage distance (or, for Ward, the smallest increase in variance).
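Below is a minimal sketch of agglomerative clustering with scikit-learn, plus the SciPy helpers for drawing a dendrogram; the toy data, Ward linkage, and the cut at 3 clusters are illustrative assumptions.

```python
# A minimal agglomerative clustering sketch with scikit-learn; the data,
# the choice of 3 clusters, and Ward linkage are illustrative assumptions.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)

# Bottom-up merging with Ward linkage, cut so that 3 clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# SciPy can build and plot the full merge tree (dendrogram) directly.
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="level", p=4)
plt.show()
```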

Agglomerative clustering:

- does not force you to fix the number of clusters upfront, since you can cut the dendrogram at any level
- produces a dendrogram that is easy to interpret
- works with any distance metric

However, it is computationally expensive and memory-hungry on large data sets, and once two clusters have been merged the decision cannot be undone.

DBSCAN algorithm

Density-based spatial clustering of applications with noise (DBSCAN) groups data points that lie in high-density regions into clusters, while points in low-density regions are marked as outliers. For DBSCAN a cluster is a dense group of points: if a point belongs to a cluster, there must be several other points lying near it. This relationship is described by two parameters:

- ε (eps): the radius of the neighborhood around a point
- minPts: the minimum number of points required within that radius for the neighborhood to be considered dense

DBSCAN starts with an arbitrary point; if that point has at least minPts points within a radius of ε, a cluster is formed. The cluster is expanded by checking whether the newly added points also have at least minPts points within their ε-neighborhood. When there are no more points to add to the cluster, a new arbitrary unvisited point is selected and the process starts again. At the end, DBSCAN labels the data points as:

- core points: points with at least minPts neighbors within ε
- border points: points within ε of a core point but with fewer than minPts neighbors of their own
- noise points: points that are neither core nor border points, i.e., the outliers
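Here is a minimal DBSCAN sketch with scikit-learn; the moon-shaped toy data and the eps and min_samples values are illustrative assumptions.

```python
# A minimal DBSCAN sketch with scikit-learn; the moon-shaped data and the
# eps / min_samples values are illustrative assumptions for this toy set.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np

# Two interleaving half-moons: a shape K-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples plays the role of minPts.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 are the ones DBSCAN marks as noise/outliers.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", np.sum(labels == -1))
```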

A few upsides of DBSCAN are that:

- it does not require you to specify the number of clusters
- it can find clusters of arbitrary shape
- it is robust to outliers, which it explicitly labels as noise

Some downsides of DBSCAN are that:

- it struggles when clusters have very different densities, since a single ε and minPts must fit them all
- the results are sensitive to the choice of ε and minPts
- it does not perform well on high-dimensional data, where distance-based density estimates break down

Learning about these 4 clustering algorithms merely scratches the surface. As a data scientist, you have to dig deeper and often use different, more advanced techniques to find meaningful patterns in the data. To get started, it is well worth getting your hands dirty with some data sets: download a few from Kaggle or another source and test different algorithms using Python and scikit-learn.

It also pays to land a job at a large company that deals with big data on a regular basis. For that, simply sign up with Talent500. Our dynamic skill assessment algorithms match your data science capabilities to the best job openings at Fortune 500 companies. Taking our test is really the best way to cluster together with a group of talented professionals who will help you take your data science career to the next level.
