Clustering Algorithms Unsupervised Learning Part-2


CLUSTER ANALYSIS – BASIC CONCEPTS

In virtually every scientific field dealing with empirical data, people try to get a first impression of their data by identifying groups of “similar behavior” in the data. Consider a supermarket: you will observe how different items are grouped and arranged. Items are grouped and clustered based on similarity. Clustering is one of the most widely used techniques for exploratory data analysis. Its applications range from statistics and computer science to biology, the social sciences, and psychology.


Figure-1


Figure-2:



Figure-3
Nature’s color cluster
Here is another example. Suppose you have a YouTube channel. YouTube may have a lot of data about the subscribers of your channel. You don’t need to tell the algorithm which group a subscriber belongs to; the algorithm can find those connections without your help. For example, it may tell you that 35% of your subscribers are from India, while 20% of them are from the United States. This is only one simple example of how clustering algorithms are useful in real-life data analysis, data mining, and many other applications. Clustering is mainly an unsupervised machine learning task.

In unsupervised learning, we have data that has no labels. Other than the feature values in the dataset, we don’t really know anything else. There is no information about the class to which each data sample belongs. Clustering algorithms are used to find out how the data can be grouped into classes.
Clusters Definition
Clusters are defined as groups of data in an n-dimensional space containing a relatively high density of points, separated by regions of relatively low density of points.


See figure below which illustrates various types of clusters.
Figure-4: Compact, elongated, and shell-shaped clusters



What is Clustering?

The term “clustering” is used to describe methods for grouping unlabeled data. Clustering is the task of dividing a population of data into a number of groups such that members of the same group are more similar to one another than to members of other groups.
The structure revealed through the clustering process allows us to set up a classifier. The “anchor” points of the classifier are the prototypes of the various clusters in the unlabeled data. Each cluster forms a single class, which can be labelled C1, C2, . . ., CK. The prototype points μ1, μ2, . . ., μK are representative points of the respective clusters.

Clustering problem
Consider a training data set S = {x1, x2, . . ., xN} of unlabeled feature vectors.

The problem is to partition the set S into some finite number K of blocks or clusters. In other words, we partition S so that

S = C1 ∪ C2 ∪ · · · ∪ CK, with Ci ∩ Cj = Ø for i ≠ j,

where Ck is the kth cluster.

Clustering results in “data organization” and involves partitioning the data set; the data are organized based on feature similarity. There are about 100 published clustering algorithms; we discuss the major ones here.
The classification rules generate the corresponding decision regions in the feature space. To determine this similarity/dissimilarity, different proximity measures are used. Different proximity measures (distance functions) result in different geometries of the classification boundaries. 


PROXIMITY MEASURES
Definitions
Proximity measures may be defined between vectors, and they can also be defined between subsets of the data set S.

Euclidean (L2 norm) distance metric:

d2(x, y) = ( Σi (xi − yi)² )^(1/2)

Hamming or city-block (L1 norm) distance metric:

d1(x, y) = Σi |xi − yi|

Tchebyschev (L∞ norm) distance metric:

d∞(x, y) = maxi |xi − yi|

Minkowski (Lp norm) distance metric:

dp(x, y) = ( Σi |xi − yi|^p )^(1/p)

Figure-5:
Different metrics of distance measurements
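
As a quick numerical illustration of the four metrics above, here is a minimal NumPy sketch; the vectors x and y and the value of p are arbitrary examples, not values taken from the text.

import numpy as np

# Two arbitrary example vectors (illustrative values only)
x = np.array([2.0, -1.0, 3.0])
y = np.array([0.0, 1.0, 1.0])

# Euclidean (L2): square root of the sum of squared differences
d_euclidean = np.sqrt(np.sum((x - y) ** 2))

# Hamming / city-block (L1): sum of absolute differences
d_cityblock = np.sum(np.abs(x - y))

# Tchebyschev (L-infinity): largest absolute difference over the coordinates
d_chebyshev = np.max(np.abs(x - y))

# Minkowski (Lp) for a chosen p, here p = 3
p = 3
d_minkowski = np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(d_euclidean, d_cityblock, d_chebyshev, d_minkowski)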


Stages in clustering
Figure-6:


The basic steps to develop a clustering task:

Feature selection and Extraction. Features must be properly selected so as to encode as much information as possible concerning the task of interest. Minimizing information redundancy among the features is a major goal. This preprocessing step results in pattern representation.

Interpattern Similarity or Proximity measure. This measure quantifies how “similar” or “dissimilar” two patterns are to each other. It is essential for the next stage of grouping the pattern representations.

Grouping using a clustering criterion and clustering algorithm. A clustering criterion is  defined based on a cost function. The type of clustering criterion may be chosen according to the type of data distribution. The choice of a specific algorithmic scheme can unravel the clustering structure of the data set.

Feedback and Validation of the results. Once the results of the clustering algorithm have been obtained, their correctness must be verified. The algorithm’s performance results are fed back to the previous stages for modification until the desired performance is achieved. Interpretation of the results: in many cases, the results of clustering may be integrated with other experimental evidence and analysis in order to draw the right conclusions.

In some cases a step known as assessing clustering tendency should be included. This involves various tests that indicate whether or not the available data possess a clustering structure.

Applications
Clustering algorithms find major applications in diverse areas viz. search engines, disease diagnosis, wireless networks, landmine detection, social and academic activity prediction, identifying fake news, spam classification, document analysis, fraudulent activity detection and many more.


The four basic areas in which clustering can be used are as follows.

Data reduction. A major application of clustering is data compression. In several cases, the amount of the available data is often very large and as a consequence, its processing becomes very demanding. Consider that the number of data points N is very large. Cluster analysis can be used in order to group the data into a number of “sensible” clusters, m (<<N), and to process each cluster as a single entity. For example, in data transmission, a representative for each cluster is defined. Then, instead of transmitting the data samples, we transmit a code number corresponding to the representative of the cluster in which each specific sample lies. Thus, data compression is achieved.
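
As a rough sketch of this idea, the snippet below (using K-means, which is discussed later, as the grouping step) compresses a synthetic dataset into m cluster representatives and transmits only the code number of each sample; the data and the value of m are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: N samples to be compressed into m << N representatives
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 2))           # N = 10,000 two-dimensional samples

m = 16                                    # number of "sensible" clusters / representatives
codebook = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)

# Instead of transmitting each sample, transmit the code number (cluster index)
codes = codebook.predict(X)               # one small integer per sample
representatives = codebook.cluster_centers_

# The receiver reconstructs every sample by the representative of its cluster
X_reconstructed = representatives[codes]
print(codes[:10], X_reconstructed.shape)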

Figure-7:

Hypothesis generation. In this case cluster analysis is applied to a data set in order to infer some hypotheses concerning the nature of the data. Thus, clustering is used here as a tool to suggest hypotheses. These hypotheses must then be verified using other data sets.


Figure-8:

Hypothesis testing. In this context, cluster analysis is used to verify the validity of a specific hypothesis. Consider, for example, the following hypothesis: “Big companies invest abroad.” One way to verify whether this is true is to apply cluster analysis to a large and representative set of companies. Suppose that a company is represented by its size, its activities abroad, and its ability to successfully complete applied research projects. If, after applying cluster analysis, a cluster is formed that corresponds to companies that are large and have investments abroad (regardless of their ability to successfully complete applied research projects), then the hypothesis is supported by the cluster analysis.

Figure-9:

Prediction based on groups. In this case, cluster analysis is applied to the available data set, and the resulting clusters are characterized based on the characteristics of the patterns by which they are formed. For a given unknown pattern, we can determine the cluster to which it can be classified, and we characterize it based on the characterization of the respective cluster.
Figure-10:

 

Various approaches in clustering

There are several approaches in clustering which can be categorized as

Sequential: k-means, neural network methods;

Hierarchical: Agglomerative and Divisive;

Objective function based / Partitional: hard and soft, e.g., k-means, mean shift, boundary detection, fuzzy based (e.g., fuzzy C-means), probability model based (e.g., expectation maximization); and

Others:  Branch and bound, Density based e.g., DBSCAN; Genetic, Stochastic Relaxation, Valley-Seeking, Competitive and so on.

In the following part of the notes we discuss a few important and most widely used clustering algorithms.  

Hard and Soft clustering

Hard clustering allows an instance of data to belong exclusively to a single class. Soft clustering, however, allows an instance of data to belong to one or more classes.
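
A minimal sketch of the distinction, assuming scikit-learn is available: K-means (a hard method) assigns each sample to exactly one cluster, while a Gaussian mixture model fitted by EM (a soft method) returns a degree of membership in every cluster. The two-blob data are synthetic.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two synthetic, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])

# Hard clustering: each sample belongs exclusively to one class
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each sample gets a membership probability for every class
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)   # each row sums to 1 over the 2 clusters

print(hard_labels[:5])
print(soft_memberships[:2].round(3))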

Figure-11:

 

Hard clustering algorithms


Flat clustering algorithms

In flat clustering, the algorithm starts by assuming the number of categories K into which the data are to be clustered.

 

Flat clustering algorithmic steps. 


Start with a random (partial) partitioning of data into K groups. 

Refine the clusters iteratively by optimizing a partitioning criterion.


Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Flat clustering initially assumes that there is a fixed number of partitions K in the available data. The algorithm starts by randomly assigning K cluster centers. These centers are then refined iteratively by optimizing a partitioning criterion until no further improvement is achieved. An example of flat clustering is K-means clustering.

K-Means Clustering

The k-means clustering algorithm is a special case of the generalized hard clustering algorithmic scheme. It is a centroid-based clustering algorithm. K-means makes use of Lloyd’s algorithm, a.k.a. Voronoi iteration, which iteratively updates the clusters and the cluster centroids. It is popular because of its simplicity and fast convergence. It is suited to compact clusters when point representatives are used to minimize the squared Euclidean distances (within-cluster variances) between the data vectors xn and their mean cluster representatives (centroids) θj. This results in a partitioning of the data space into Voronoi cells; therefore it is a partitional clustering algorithm.

K-means is a hard clustering algorithm; it is equivalent to the soft-clustering expectation-maximization (EM) algorithm in the limit of a small, all-equal, diagonal covariance matrix. It is a semi-parametric approach (since the number of parameters does not increase as the data size N increases). It can be considered a special case of the expectation-maximization algorithm, or of the fuzzy clustering method, in which the covariances are held fixed and only the cluster means are to be estimated.

The algorithm minimizes the within-cluster sum of squared distances

J = Σk Σ(xn ∈ Ck) ||xn − θk||²,

where θk is the centroid (mean) of cluster Ck.
Given enough time, k-means will always converge, though possibly to a local minimum of this cost function.

After clustering, none of the K clusters is empty and no pair of clusters overlaps; that is,

Ci ≠ Ø and Ci ∩ Cj = Ø for i ≠ j, with i, j ∈ {1, 2, · · · , K}.

Variations of K-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (K-medoids), choosing medians (K-medians clustering), choosing the initial centers less randomly (K-means++), or allowing a fuzzy cluster assignment (fuzzy c-means). K-means clustering works efficiently for compact or globular clusters. The centroids of the clusters are exemplar or prototype data points; these points represent their respective clusters rather than domain-specific parameters. One limitation of the above methods is that they are sensitive to the initial choice of the K exemplars at the beginning of the iterations.
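
Below is a minimal K-means sketch in scikit-learn on synthetic compact clusters; the data, K = 3, and n_init = 10 (the “best of multiple runs” idea mentioned above) are illustrative choices, not a prescribed configuration.

import numpy as np
from sklearn.cluster import KMeans

# Three synthetic compact (globular) clusters
rng = np.random.default_rng(42)
centers = [(0, 0), (4, 0), (2, 4)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

# Lloyd's algorithm with K = 3; n_init repeats it from several random
# initializations and keeps the run with the lowest within-cluster variance
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = km.labels_               # hard cluster assignment for each sample
centroids = km.cluster_centers_   # the prototype / representative points
inertia = km.inertia_             # sum of squared distances to the nearest centroid
print(centroids)
print(inertia)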


Figure-12:
National Park Service Voronoi maps

MiniBatch K-Means

MiniBatch K-Means updates the centroids using small random batches of the data; it is faster than standard K-means but gives slightly different results.
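
A hedged sketch of the mini-batch variant in scikit-learn; the blob data and batch_size are illustrative assumptions.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 blob-shaped clusters
X, _ = make_blobs(n_samples=10000, centers=3, random_state=0)

# Mini-batch K-means updates the centroids from small random batches,
# trading a little accuracy for a large gain in speed on big datasets
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=0).fit(X)
print(mbk.cluster_centers_)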

Will K-means clustering work for the following cluster types?

Figure-13:

 

Mean-Shift Clustering

A natural way to characterize clusters in a dataset is to find regions of the data space with a high density of data points separated by low-density regions. This can be done by placing a weighting function, e.g., a Gaussian kernel, at the data points to obtain a kernel density estimate (KDE) of the data. (The word “kernel” is a fancy mathematical term for a weighting function, generally used in convolution.)
A KDE is a generalization of the histogram that gives smooth density estimates in any dimension. It is a non-parametric method for estimating the underlying distribution, also called the probability density function, of a set of data. It works by placing a kernel on each point in the data set.
They simplify the mathematical and computational treatment of densities and, crucially, enable one to use continuous optimization to find maxima of the density.  The mean shift algorithm is therefore a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters. 

Mean shift is a hill-climbing algorithm which involves shifting the kernel iteratively to a higher-density region until convergence. Every shift is defined by a mean shift vector. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates and form the final set of centroids. It is also known as the mode-seeking algorithm. The downside of mean shift is that it is computationally expensive, O(N²). The method was proposed by Fukunaga and Hostetler in 1975.

Figure-14:
Clusters in intergalactic space

The various steps

Given a set of data points, the algorithm iteratively shifts each data point towards the closest cluster centroid. The centroids may be thought of as the peaks of the hills in the region of the data points.

The mean shift method exploits the KDE idea by imagining what the points would do if they all climbed uphill to the nearest peak on the KDE surface. It does so by iteratively moving each point uphill until it reaches a peak. Depending on the kernel bandwidth used, the KDE surface (and the resulting clustering) will be different.

The algorithm stops when every point in the dataset is assigned to a cluster.

Steps
1. Define a window (the bandwidth of the kernel) and place the window on a data point.

Figure-15:


2. Calculate the mean of all the points in the window.

Figure-16:

3. Move the center of the window to the location of the mean.


Figure17:


4. Repeat steps 2 and 3 for all data points until there is convergence (a code sketch follows the figures below).

Figure-18:

Figure-19:
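
The sketch below runs mean shift with scikit-learn on synthetic blobs; estimate_bandwidth plays the role of choosing the window (kernel bandwidth) in step 1 above, and the quantile value is an illustrative assumption.

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data with three dense regions
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=0)

# Choose the window size (kernel bandwidth) from the data itself
bandwidth = estimate_bandwidth(X, quantile=0.2)

# Mean shift: no need to specify the number of clusters in advance
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

print("clusters found:", len(ms.cluster_centers_))
print("labels of the first few points:", ms.labels_[:10])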

Density-Based Methods: 
These methods consider clusters to be dense regions of the data space, separated from regions of lower density. They have good accuracy and the ability to merge two clusters. Examples are DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points to Identify Clustering Structure).

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

While dealing with spatial clusters of different density, size, and shape, it can be challenging to detect the cluster of points. The task can be even more complicated if the data contain noise and outliers. To deal with large spatial databases, Martin Ester and his co-authors proposed Density-Based Spatial Clustering of Applications with Noise (DBSCAN). According to Ester et al., the main reasons for using the algorithm are:

1. It requires minimal domain knowledge.
2. It can discover clusters of arbitrary shape.
3. It is efficient for large databases, i.e., sample sizes of more than a few thousand.

    Figure-20:
    Density based clustering basic idea (From Ester et al. 1996)

    DBSCAN algorithm main concept

    DBSCAN is based on this intuitive notion of “clusters” and “noise”. The density of points in a cluster is considerably higher than the density of points outside the cluster (“areas of noise”). The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Density of a region around a point: to identify a dense region around a point, two parameters are required.

• Epsilon neighborhood: for each point p in the cluster, the radius epsilon (ϵ) defines a neighborhood around p, i.e., the circle of radius ϵ centered at the point.
• Minimum number of points: for each point in the cluster, the circle of radius ϵ must contain at least a minimum number of points (MinPts).

The ϵ-neighborhood of a point p in the database D is defined as

Nϵ(p) = {q ∈ D | dist(p, q) ≤ ϵ}

The following figure illustrates the three different kinds of points: core points, border points, and outliers (noise points).

    Core Point
A point p can be classified as a core point if |Nϵ(p)| ≥ MinPts. Core points, as the name suggests, usually lie within the interior of a cluster.


    Border Point 
A border point (frontier point) has fewer than MinPts points within its ϵ-neighborhood Nϵ, but it lies in the neighborhood of a core point.

    Noise
Noise is any data point that is neither a core point nor a border point. See the picture below.

    Figure-21:

Since MinPts is a parameter of the algorithm, setting it to a low value in order to include the border points in a cluster can make it difficult to eliminate the noise.

Directly density reachable:
A data point q is directly density reachable from a point p if
• |Nϵ(p)| ≥ MinPts, i.e., p is a core point, and
• q ∈ Nϵ(p), i.e., q is in the ϵ-neighborhood of p.

The notion of directly density reachable is not symmetric. Consider a border point q and a core point p: the border point q is reachable from the core point p. However, even though the core point p falls in the ϵ-neighborhood of the border point q, q does not have MinPts points in its neighborhood, and therefore p is not directly density reachable from q.

Density reachable: a point q is density reachable from a point p with respect to ϵ and MinPts if
• there is a chain of points q1, q2, q3, . . ., qn with q1 = p and qn = q such that qi+1 is directly density reachable from qi.
Density reachability is transitive in nature but, just like direct density reachability, it is not symmetric.

Density connected: there can be cases in which two border points belong to the same cluster but do not share a specific core point. We say that they are density connected if there exists a common core point from which both border points are density reachable. As you can see, density connectivity is symmetric. The definition from the Ester et al. paper is given below.

    “A point q is density connected to a point p with respect to ϵ and MinPts, if there is a core point o  such that, both p and q are density reachable from o w.r.t. to ϵ and MinPts.”


    Figure-22:

Steps of the DBSCAN algorithm:
Using the above definitions, the steps of the DBSCAN algorithm are as follows (a short code sketch follows the list).
1. The algorithm starts with an arbitrary point which has not been visited, and its neighborhood information is retrieved using the ϵ parameter.
2. If the ϵ-neighborhood of this point contains at least MinPts points, cluster formation starts. Otherwise the point is labeled as noise. This point may later be found within the ϵ-neighborhood of a different point and thus be made part of a cluster. The concepts of density reachable and density connected points are important here.
3. If a point is found to be a core point, then the points within its ϵ-neighborhood are also part of the cluster. All points found within the ϵ-neighborhood are therefore added, along with their own ϵ-neighborhoods if they are also core points.
4. The above process continues until the density-connected cluster is completely found.
5. The process restarts with a new point, which can become part of a new cluster or be labeled as noise.
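
The short sketch below runs DBSCAN with scikit-learn on a synthetic two-moons dataset; eps corresponds to the radius ϵ and min_samples to MinPts, and both values are illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons plus noise: a shape K-means handles poorly
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

# eps plays the role of the radius ϵ and min_samples the role of MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                 # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise points")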

    Drawbacks of DBSCAN 
• If the data points form clusters of varying density, DBSCAN fails to cluster them well, since the clustering depends on the ϵ and MinPts parameters, which cannot be chosen separately for each cluster.
• If the data and features are not well understood by a domain expert, setting ϵ and MinPts can be tricky and may require comparing several runs with different values of ϵ and MinPts.
Other density-based clustering algorithms are HDBSCAN (hierarchical DBSCAN) and OPTICS (Ordering Points to Identify Clustering Structure).

    Hierarchical Algorithms
Unlike the K-means method, hierarchical algorithms do not require the number of clusters to be pre-specified. These are connectivity-based methods. As the name suggests, they are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther apart. The clusters formed by these methods have a tree-like structure based on the hierarchy; hence they are called hierarchical algorithms. There are two approaches:

Agglomerative (bottom-up approach) and Divisive (top-down approach).

    Other examples are CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing Clustering and using Hierarchies) etc.
Unlike flat clustering methods, hierarchical clustering allows the machine to decide how many clusters to create.

Hierarchical algorithms proceed as follows:
1. Create a hierarchy.
2. Proceed bottom-up (agglomerative) or top-down (divisive).
    Agglomerative Clustering
All data points are initially treated as singleton clusters, which are then merged as the distance threshold increases. The algorithm therefore starts at the bottom with as many clusters as there are data points and progresses up to the top until only a single cluster remains. It creates a hierarchy of clusters by initially assuming, at the bottom, that each data point is a cluster center, and building up to the top by grouping the data points based on some similarity measure. Hence this method is also called agglomerative or bottom-up clustering.



    Figure-23:
    Bottom-to-Top and Top-to-Bottom approaches of Hierarchical clustering

    Divisive Clustering
Top-down clustering requires a method for dividing a cluster. It starts with a single cluster containing the whole data set and proceeds by splitting clusters recursively until all individual data points have been separated into singleton clusters. This approach allows the unsupervised algorithm to decide how many clusters are present in the data (see the code sketch after Figure-24).

    Figure-24
    Fixing a threshold to determine the no. of desired clusters
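
A minimal agglomerative sketch in scikit-learn, mirroring the idea of Figure-24: with a distance threshold set (the 5.0 here is an illustrative assumption on synthetic blobs), the algorithm itself decides how many clusters remain.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with four blob-shaped clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

# Bottom-up clustering: start from singleton clusters and keep merging
# until the linkage distance exceeds the chosen threshold; with
# distance_threshold set, n_clusters must be None
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="ward").fit(X)

print("number of clusters found:", agg.n_clusters_)
print("first few labels:", agg.labels_[:10])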


    Spectral Clustering
For both the K-means algorithm and the EM algorithm, we assume K centroids that are adjusted in an iterative process to find the clusters.

But the main problems are:

They make assumptions about the shape of the data (a round sphere, a radial basis).
They may require multiple restarts to find a good local minimum (i.e., the best clustering).

The spectral clustering algorithm helps to solve these two problems. It relies on the power of graphs and the proximity between the data points in order to cluster them, which makes it possible to avoid the compact, sphere-shaped clusters that the K-means algorithm forces us to assume. Spectral theory extends the theory of eigenvectors and eigenvalues of a single square matrix to a variety of linear vector spaces.

The three basic stages of spectral clustering:
1. Pre-processing: construct a matrix representation of the graph.
2. Decomposition: compute the eigenvalues and eigenvectors of the matrix, and map each point to a lower-dimensional representation based on one or more eigenvectors.
3. Grouping: assign points to two or more clusters based on the new representation.
Graph and similarity matrix: consider a dataset X = {x1, x2, . . ., xn}. The proximity between any pair of data points xi and xj can be expressed as a similarity measure sij ≥ 0. The intuitive goal of clustering is to divide the data points into several groups such that points in the same group are similar and points in different groups are dissimilar to each other. Assuming no other measures need to be considered, the dataset can be represented by a graph G = (V, E). The vertices vi of this graph represent the data points xi. Two vertices are connected if the similarity sij between the corresponding data points xi and xj is positive (or larger than a certain threshold), and the edge is weighted by sij.
    The problem of clustering can now be reformulated using the similarity graph. We can now find a partition of the graph such that the edges between different groups have a very low weight and the edges within a group have high weight.

Similarity graph construction: we assume that the graphs are undirected, so that sij = sji. The similarity matrix S is an n × n real-valued square matrix. The choice of the similarity metric depends on the domain the data come from. There are different choices for the construction of the similarity graph, depending on the parameter which governs its connectedness (e.g., the parameter ε of the ε-neighborhood graph, the parameter k of the k-nearest-neighbor graph, or whether it is fully connected).

    Figure-25:



Computing the eigenvalues and choosing the number of clusters: to implement spectral clustering in practice, one has to compute the first k eigenvectors of a potentially large Laplacian matrix. The eigenvalues of this matrix are ordered by increasing magnitude, respecting multiplicities, and the first k smallest eigenvalues and their eigenvectors are used. If we use the k-nearest-neighbor graph or the ε-neighborhood graph, all the Laplacian matrices are sparse, and efficient methods exist to compute the first eigenvectors of sparse matrices.

(Ref: A Tutorial on Spectral Clustering, Ulrike von Luxburg, August 2006, Max Planck Institute for Biological Cybernetics)


    Figure-26:
    Comparing K-means and Spectral clustering

    Spectral clustering is a popular unsupervised machine learning algorithm which often outperforms other approaches. In addition, spectral clustering is very simple to implement and can be solved efficiently by standard linear algebra methods. 
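
A hedged scikit-learn sketch of the pipeline described above: build a k-nearest-neighbor similarity graph, embed the points using the leading eigenvectors of its Laplacian, and group them with K-means in that space. The concentric-circles data and the parameter values are illustrative assumptions.

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Two concentric circles: non-globular clusters where plain K-means struggles
X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# k-nearest-neighbor similarity graph -> graph Laplacian eigenvectors
# -> K-means on the low-dimensional spectral embedding
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans",
                        random_state=0)
labels = sc.fit_predict(X)
print(labels[:20])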

Grid-based Methods: in these methods the data space is divided into a finite number of cells that form a grid-like structure. All the clustering operations performed on these grids are fast and independent of the number of data objects. Examples are STING (Statistical Information Grid), WaveCluster, and CLIQUE (CLustering In Quest).

    Affinity Propagation
    This method creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given. Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this purpose, the two important parameters are the preference, which controls how many exemplars are used, and the damping factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these messages.

The main drawback of affinity propagation is its complexity. The algorithm has a time complexity of the order O(N²T), where N is the number of samples and T is the number of iterations until convergence.
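
A short scikit-learn sketch; damping and preference are the two parameters named above, and the values chosen here (and the blob data) are illustrative assumptions.

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data with three blob-shaped clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# damping smooths the responsibility/availability message updates;
# preference=None leaves it at the default (the median similarity),
# which indirectly controls how many exemplars are selected
ap = AffinityPropagation(damping=0.9, preference=None, random_state=0).fit(X)

print("exemplar indices:", ap.cluster_centers_indices_)
print("number of clusters:", len(ap.cluster_centers_indices_))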

    References:
1. Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, 4th Edition.
2. A Clustering Based Data Reduction for Very Large Spatio-Temporal Data, Nhein-An Le Khac et al., semanticscholar.org.
    3. Types-of-clustering-methods-overview-and-quick-start, datanovia.com
    4. Model-based-clustering-essentials, datanovia.com. 
    5. A review of mean-shift algorithms for clustering, Miguel A. Carreira-Perpinan, Electrical Engineering and Computer Science, University of California, Merced, http://eecs.ucmerced.edu. 
    6. Dbscan, github.com. 
7. A Tutorial on Spectral Clustering, Ulrike von Luxburg, Max Planck Institute for Biological Cybernetics.
    8. Unsupervised-learning-explained, infoworld.com
    Figure Credits:
    1. Figure-1: i.insider.com. 
    2. Figure-2: i.pinimg.com. 
    3. Figure-3: previews.123rf.com. 
    4. Figure-5: wikimedia.org. 
5. Figure-7: wikimedia.org.
    6. Figure-8: economist.com. 
    7. Figure-9: null-hypothesis-vs-alternative-hypothesis,amazonaws.com. 
    8. Figure-10:  The Utility of Clustering in Prediction Tasks, Shubhendu Trivedi, Zachary A. Pardos and Neil T. Heffernan, ttic.uchicago.edu. 
    9. Figure-11: image.slidesharecdn.com. 
10. Figure-12: engaging-data.com.
11. Figure-13: mathworks.com.
12. Figure-14: earthsky.org.
13. Figure-19: mdpi.com.
14. Figure-20: github.com.
15. Figure-21: miro.medium.com. Figure-22: camo.githubusercontent.com.
    16. Figure-23: Conceptual dendrogram for agglomerative and divisive Hierarchical based clustering, researchgate.net. 
    17. Figure-24: www.mathworks.com. 
18. Figure-25: Spectral clustering builds a graph connecting data samples in a local neighborhood, researchgate.net.
    19. Figure-26: Comparing K-means and Spectral clustering, tbn0.gstatic.com

