In data science, clustering or cluster analysis is the task of grouping data objects into clusters, such that the objects within one cluster are similar to each other, based on some trait, and dissimilar to objects within other clusters. The aim of clustering is to segregate objects based on the similarity of the identified traits. Within a cluster the variance of the objects should be low, while the variance should be high across the clusters. In other words, there should be a high degree of intra-class similarity, and a low level of inter-class similarity.
Clustering is a form of unsupervised learning, meaning that the data set is unlabeled, and you aren’t aware of patterns within the data. The algorithm you choose would do its best to discover those patterns and form clusters accordingly.
Why is clustering important?
A prime use of clustering is customer segmentation, for instance, in marketing, when data scientists employ algorithms to discover unique groups within a customer base. Marketeers can then devise targeted marketing strategies based on the hidden patterns discovered.
Some other practical uses of clustering are:
- Insurance: To identify health policy holders with a high average claim cost
- Town planning: To segregate houses as per location, type, value, etc.
- Fake news: To weed out sensationalized and clickbait articles
- Spam email: To flag emails based on sender, headline, contents
- Medicine: To differentiate between types of tissue in medical imaging
Based on the data set at hand, the right choice of clustering technique can generate impressive results. Here are X of the most common clustering algorithms every data scientist must know.
It is perhaps the most common clustering algorithm and being described in 1967, it is one of the oldest. The aim of the K-means algorithm is to segregate n data points into k clusters in such a way that the sum of the distances between the data points and their cluster centroid is minimized. A cluster centre is the arithmetic mean of the points within a cluster.
The K-means algorithm starts out by guessing cluster centers. Then it employs an expectation-maximization approach to converge. The iterative steps involve assigning each data point to the nearest cluster center and defining the location of the cluster center by taking the mean of the data points in it. The algorithm progresses iteratively and converges when the assignments remain fixed.
- Is quite easy to implement
- Can be computationally fast with large data sets
- Guarantees convergence
- Adapts to new examples
- Generalizes to clusters of varying shapes/ sizes
However, it has certain limitations:
- It is tough to predict ‘k’, the number of clusters
- The final result rely a lot on the initial values
- Centroids can be affected by outliers
- Requires attention in terms of scaling the data
This is another centroid-based clustering algorithm and a considerable advantage it has over the K-means algorithm is that it does not require you to specify the number of clusters upfront. The mean-shift algorithm works iteratively to discover blobs within the data set while updating centroids based on the density of points in a region. The centroids are the mean of the points in a region and the algorithm ensures that near-duplicates are filtered out.
In the mean-shift algorithm, all the data points are initialized as clusters/ cluster centroids. The algorithm then proceeds to find centroids based on the data point density and it does so by tuning the bandwidth hyperparameter and shifting the kernel according to the mean-shift vector. The process converges when the centroids cannot move any further.
Some advantages of mean-shift clustering are:
- You do not have to specify the number of clusters
- The algorithm will converge
- Outliers do not pose a threat
- It can model complex clusters and does not require model assumptions
Some drawbacks of mean-shift clustering are:
- It is not very scalable owing to the fact that each iteration involves multiple nearest neighbor searches
- No control over the number of clusters
Agglomerative hierarchical clustering
It is the most common type of hierarchical clustering algorithm and adopts a bottom-up approach with the result being that, ultimately, all the data points are part of one root cluster. To start off, each data point is considered a cluster in itself and with each iteration, clusters are merged based on the similarity they share. When all the clusters have been successfully merged you get a tree representation called a dendrogram. The clusters are merged according to the linkage criteria:
- Single linkage
- Maximum linkage
- Average linkage
The algorithm works iteratively, merging clusters that increase the chosen linkage distance minimally.
- Provides a handy visualization of the clusters
- Works well with finding small clusters
- Does not require you to specify the number of clusters
- Allows for the dendrogram to be split at the desired level
- Once clusters are merged, the step cannot be undone
- The linkage criterion chosen may be sensitive to outliers and noise, break large clusters, be biased to global clusters, not handle irregular shapes and different sizes well
Density-based spatial clustering of applications with noise (DBSCAN) is a clustering technique that groups together data points in high-density regions into clusters. Data points in low-density regions are marked as outliers. For DBSCAN a cluster is a dense group of points. So, if a point belongs to a cluster, there must be several other points lying nearby. This relationship is described by two parameters:
- epsilon (ε): Defines how near points must be to be considered part of the same cluster
- minPts: Defines what ‘dense’ means, i.e, the minimum amount of points required to form a cluster
DBSCAN starts with an arbitrary point, and if that point has more than minPts number of points within a radius of ε then a cluster is formed. The cluster is expanded by checking if the new points also have more than minPts points within their ε. When there are no more points to add to the cluster, a new arbitrary point is selected and the process starts again. At the end DBSCAN labels the data points as:
- Core point: A point that has at least minPts points within a radius of ε
- Border point: A point that is within a cluster, within an ε radius of a core point, but does not have minPts points within its ε radius
- Noise point: A point that does not belong to any cluster. It is an outlier.
A few upsides of DBSCAN are that:
- It does not require you to specify the number of clusters
- It works well with clusters of arbitrary shape and size
- Is robust to outliers
Some downsides of DBSCAN are that:
- You have to define ε and minPts
- Border points could be part of more than one cluster, i.e., DBSCAN is not entirely deterministic
- Fails with clusters of varying density
- Does not work well with high-dimensional data
Learning about these 4 clustering algorithms is merely to scratch the surface. As a data scientist, you have to dig deep and often use different and advanced techniques to find meaningful patterns in the data. To get yourself started it is well worth getting your hands dirty with some datasets. Download some from Kaggle or another source, and test different algorithms using Python and scikit-learn.
It also pays to land a job at a large company that deals with big data on a regular basis. For that, simply sign up with Talent500. Our dynamic skill assessment algorithms match your data science capabilities to the best job openings at Fortune 500 companies. Taking our test is really the best way to cluster together with a group of talented professionals who will help you take your data science career to the next level.