After introducing classification in our previous blog article, we will now move on to another popular machine learning task: clustering.
What is clustering?
Clustering is an unsupervised learning technique: the task of dividing data points into groups such that points in the same group are more similar to one another than to points in other groups. Fundamentally, all clustering methods follow the same approach: first we compute similarities between data points, and then we use those similarities to group the points into clusters. A simple example is shown in Figure 1. Data points that lie close together can be assigned to a single group, so we can identify three clusters in this input data.
The process of clustering, then, is essentially grouping objects on the basis of similarity and dissimilarity between data points. It matters because these algorithms uncover the intrinsic grouping in unlabeled data. There is no universal criterion for a good clustering; it depends on the user and on whatever criteria satisfy their needs. For example, in a library management system, one manager might want to cluster books by topic, while another might prefer to cluster them by year of publication.
Popular clustering algorithms
There are various clustering algorithms, which can be categorized by their cluster model: for example, connectivity-based (hierarchical) clustering, centroid-based, distribution-based, and density-based clustering. The following are some classic, widely used algorithms: Gaussian mixture models, \(k\)-means clustering, and DBSCAN (density-based spatial clustering of applications with noise). Because there are so many clustering algorithms nowadays, only a few classics are covered here; others are left for future posts.
Gaussian mixture models (GMMs)
One prominent clustering method is the Gaussian mixture model (GMM). A GMM models the data set with a fixed number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data. The expectation-maximization (EM) algorithm is usually used to fit mixture-of-Gaussian models; note that the EM algorithm can also produce confidence ellipsoids for multivariate models. The Bayesian information criterion (BIC) can also be computed to assess the number of clusters in the data. Once a GMM has been learned from training data, it can cluster new samples by assigning each one to the Gaussian to which it most probably belongs. In a GMM, data samples can have soft membership, unlike \(k\)-means, where membership is hard.
A 1-D GMM example with two Gaussian components
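A setup like the figure's can be sketched with scikit-learn's `GaussianMixture`. This is a minimal illustration, not the exact code behind the figure: the two component means and standard deviations below are made-up values for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (assumed parameters)
data = np.concatenate([rng.normal(-3.0, 1.0, 200),
                       rng.normal(4.0, 1.5, 200)]).reshape(-1, 1)

# Fit a two-component GMM with the EM algorithm
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

labels = gmm.predict(data)        # hard assignment: most probable component
probs = gmm.predict_proba(data)   # soft membership: per-component probabilities
```

The `predict_proba` output is what makes GMM membership "soft": each row gives the probability of the sample under each component, rather than a single cluster label.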
\(K\)-means is another very popular algorithm. Unlike a GMM, each \(k\)-means cluster is defined by a single point (its centroid), and data samples are assigned to whichever cluster centroid they are closest to. Figure 2 shows an example of \(k\)-means with 4 clusters; the white crosses mark the four centroids.
\(K\)-means training starts with arbitrary centroids and then alternates between two steps: assigning each observation to its nearest cluster centroid, and updating each centroid based on the new membership. The GMM may be thought of as generalizing \(k\)-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians. \(K\)-d trees are a popular technique for the nearest-neighbor search, and together with intelligent initialization they can speed up \(k\)-means clustering.
Different from the basic idea of GMM and \(k\)-means, DBSCAN starts by finding core samples in regions of high density and expands clusters outward from them. After points in high-density regions are grouped together, points that lie alone in low-density regions are marked as outliers. There is no need to find a centroid for each cluster; instead, we look for dense regions in the data space, separated by regions of lower point density. Figure 3 is a DBSCAN demo that finds core samples of high density and expands clusters from them; the black points are outliers (noise).
The DBSCAN algorithm is based on the intuitive notions of "clusters" and "noise." The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. OPTICS (Ordering Points To Identify the Clustering Structure) is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter \(\varepsilon\), and produces a hierarchical result related to that of linkage clustering.
- Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD-96 (pp. 226–231).
- Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 49–60). ACM Press.
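The radius and minimum-point-count idea maps directly onto scikit-learn's `DBSCAN`, whose `eps` and `min_samples` parameters play exactly those roles. The data and parameter values below are assumptions chosen so that three dense groups emerge clearly.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: three dense, well-separated groups (assumed, for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)

# eps is the neighborhood radius; min_samples is the minimum point count
db = DBSCAN(eps=0.6, min_samples=5).fit(X)

labels = db.labels_  # -1 marks outliers (noise), like the black points in Figure 3
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Unlike \(k\)-means, the number of clusters is not specified up front; it falls out of the density structure of the data.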
Comparison of clustering methods
Figure 4: A comparison of the clustering algorithms
Figure 4 is a comparison of clustering algorithms generated with scikit-learn (Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011) in Python; it covers \(k\)-means, mini-batch \(k\)-means, GMM, mean shift, DBSCAN, OPTICS, and BIRCH clustering. As the figure shows, there is no objectively "correct" or "best" clustering algorithm. The most appropriate algorithm for a particular problem often has to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. That is to say, the right choice of clustering algorithm may depend heavily on how your data is distributed. In general, an algorithm designed for one kind of model will probably fail on a data set that contains a radically different kind of model. For example, \(k\)-means cannot find non-convex clusters.
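The non-convex failure mode can be demonstrated on the classic two-moons data set; this is a small sketch in the spirit of the scikit-learn comparison, with parameter values (`noise`, `eps`) chosen for illustration.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: clusters that are dense but non-convex
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true moons
km_score = adjusted_rand_score(y, km_labels)
db_score = adjusted_rand_score(y, db_labels)
```

Here the density-based DBSCAN traces each moon, while \(k\)-means, which carves the plane into convex regions around its two centroids, splits the moons incorrectly and scores much lower.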
In a nutshell …
Clustering, or cluster analysis, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups. It is a common technique for statistical data analysis that can be used in many fields. Some common applications for clustering include the following:
- Market segmentation
- Social network analysis
- Search result grouping
- Anomaly detection
- Pattern recognition
- …and more
To be more specific, let's take the medical industry as an example of what cluster analysis can do:
- \(K\)-means and GMM are widely used for medical image segmentation.
- In medical institutes, \(k\)-means and BIRCH clustering are commonly applied to patients' medical reports for personalized medical-resource decision making.
- DBSCAN and OPTICS are powerful tools in biomedical research, for example in atom probe tomography (APT).
Hope you liked this article; more articles about machine learning applications are on the way!
Editor: Chieh-Feng Cheng
Ph.D. in ECE, Georgia Tech
Technical Writer, inwinSTACK