Clustering Algorithms – Unsupervised Learning Part-2
CLUSTER ANALYSIS – BASIC CONCEPTS
In virtually every scientific field dealing with empirical data, people try to get a first impression of their data by identifying groups of “similar behavior” in them. Consider a supermarket: you will observe how different items are grouped and arranged. Items are grouped and clustered based on similarity. Clustering is one of the most widely used techniques for exploratory data analysis, with applications ranging from statistics, computer science and biology to the social sciences and psychology.
Nature’s color cluster
Here is another example. Let’s say you have a YouTube channel. YouTube may have a lot of data about the subscribers of your channel. You don’t need to tell the algorithm which group a subscriber belongs to; the algorithm can find those connections without your help. For example, it may tell you that 35% of your subscribers are from India, while 20% of them are from the United States. This is only one simple example of how clustering algorithms are useful in real-life data analysis, data mining and many other applications. Clustering is mainly an unsupervised machine learning task.
In unsupervised learning, the data have no labels. Other than the feature values in the dataset we don’t really know anything else; there is no information about the class to which each data sample belongs. Clustering algorithms are used to find out how the data can be grouped into classes.
Clusters Definition
Clusters are defined as groups of data points in an n-dimensional space with relatively high density, separated by regions of relatively low density.
See the figure below, which illustrates various types of clusters.
Figure-4: Compact, elongated and shell-shaped clusters.
What is Clustering?
The term “clustering” is used to describe methods for grouping unlabeled data. Clustering is the task of dividing a set of data into a number of groups such that members of the same group are more similar to each other than to members of other groups.
The structure revealed through the clustering process allows us to set up a classifier. The “anchor” points of the classifier are the prototypes of the various clusters in the unlabeled data. Each cluster forms a single class, which can be labelled C1, C2, . . ., CK. The prototype points μ1, μ2, . . ., μK are representative points of the respective clusters.
Clustering
problem
Consider a training data set S = {x1, x2, . . ., xN} of N samples. The problem is to partition the set S into some finite number of blocks or clusters. In other words, we partition S so that
S = C1 ∪ C2 ∪ · · · ∪ CK, with Ci ∩ Cj = ∅ for i ≠ j,
where Ck is the kth cluster.
Clustering results in “data organization” and involves partitioning of the data set. Data are organized based on feature similarity. There are about 100 published clustering algorithms; we discuss the major ones here.
The classification rules generate corresponding decision regions in the feature space. To determine similarity or dissimilarity, different proximity measures are used. Different proximity measures (distance functions) result in different geometries of the classification boundaries.
PROXIMITY MEASURES
Definitions
Proximity measures may be defined between vectors; they can also be defined between subsets of the data set S.
Hamming or Cityblock (L1 norm) distance metric: d(x, y) = Σi |xi − yi|.
Minkowski (Lp norm) distance metric: d(x, y) = (Σi |xi − yi|^p)^(1/p), where p = 1 gives the cityblock distance and p = 2 the Euclidean distance.
Different
metrics of distance measurements
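As a quick illustration (a minimal sketch in Python using NumPy and SciPy; the vectors x and y are made-up examples), these distance metrics can be computed directly:

```python
# Minimal sketch: computing common proximity measures with NumPy/SciPy.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Cityblock / Manhattan (L1 norm): sum of absolute coordinate differences.
d_l1 = distance.cityblock(x, y)      # = np.sum(np.abs(x - y))

# Euclidean (L2 norm).
d_l2 = distance.euclidean(x, y)      # = np.sqrt(np.sum((x - y) ** 2))

# Minkowski (Lp norm) for p = 3: (sum |xi - yi|^p)^(1/p).
d_lp = distance.minkowski(x, y, p=3)

print(d_l1, d_l2, d_lp)
```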
Stages
in clustering
The basic steps in developing a clustering task are:
Feature selection and Extraction.
Features must be properly selected so as to encode as much information as
possible concerning the task of interest. Minimizing information redundancy
among the features is a major goal. This preprocessing step results in pattern
representation.
Interpattern similarity or proximity measure.
This measure quantifies how “similar” or “dissimilar” two pattern representations are to each other. It is essential for the next stage of grouping the pattern representations.
Grouping using a clustering criterion and a clustering algorithm. A clustering criterion is defined based on a cost function, and the type of criterion may be chosen according to the type of data distribution. The choice of a specific algorithmic scheme then unravels the clustering structure of the data set.
Feedback and validation of the results.
Once the results of the clustering algorithm have been obtained, their correctness must be verified. The results are fed back to the previous stages, and modifications are made until the desired performance is achieved. Interpretation of the results. In many cases, the results of clustering must be integrated with other experimental evidence and analysis in order to draw the right conclusions.
In some cases a step known as clustering tendency should be included. This involves various tests that indicate whether or not the available data possess a clustering structure.
Applications
Clustering algorithms find major applications in
diverse areas viz. search engines, disease diagnosis, wireless networks,
landmine detection, social and academic activity prediction, identifying fake
news, spam classification, document analysis, fraudulent activity detection and
many more.
The four basic areas in which clustering can be used are as follows.
Data reduction. A
major application of clustering is data
compression. In several cases, the amount of the available data is often
very large and as a consequence, its processing becomes very demanding. Consider
that the number of data points N is
very large. Cluster analysis can be used in order to group the data into a
number of “sensible” clusters, m (<<N), and to process each
cluster as a single entity. For example, in data transmission, a representative
for each cluster is defined. Then, instead of transmitting the data samples, we
transmit a code number corresponding to the representative of the cluster in
which each specific sample lies. Thus, data compression is achieved.
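A minimal sketch of this data-reduction idea (vector quantization), assuming scikit-learn’s KMeans and synthetic data; the choice of m = 16 clusters is purely illustrative:

```python
# Sketch of clustering-based data reduction (vector quantization).
# Instead of transmitting every sample, we keep only the m cluster
# representatives plus, for each sample, the index of its cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))   # N = 10,000 samples, 4 features (synthetic)

m = 16                             # m << N "sensible" clusters (illustrative)
km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)

codebook = km.cluster_centers_     # m representatives, transmitted once
codes = km.labels_                 # one small integer code per sample

X_reconstructed = codebook[codes]  # lossy reconstruction at the receiver
```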
Hypothesis generation. In
this case cluster analysis is applied to a data set in order to infer some
hypotheses concerning the nature of the data. Thus, clustering is used here as
a tool to suggest hypotheses. These hypotheses must then be verified using
other data sets.
Hypothesis testing. In
this context, cluster analysis is used for the verification of the validity of
a specific hypothesis. Consider, for example, the following hypothesis: “Big
companies invest abroad.” One way to verify whether this is true is to apply
cluster analysis to a large and representative set of companies. Suppose that a
company is represented by its size, its activities abroad, and its ability to
complete successfully projects on applied research. If, after applying cluster
analysis, a cluster is formed that corresponds to companies that are large and
have investments abroad (regardless of their ability to complete successfully
projects on applied research), then the hypothesis is supported by the cluster
analysis.
Prediction based on groups. In
this case, cluster analysis is applied to the available data set, and the
resulting clusters are characterized based on the characteristics of the
patterns by which they are formed. For a given unknown pattern, we can
determine the cluster to which it can be classified, and we characterize it
based on the characterization of the respective cluster.
Figure-10: The utility of clustering in prediction tasks.
Various approaches in
clustering
There are several approaches to clustering, which can be categorized as:
- Sequential algorithms; k-means; neural network methods;
- Hierarchical: agglomerative and divisive;
- Objective-function based, partitional: hard and soft, e.g., k-means, mean shift, boundary detection; fuzzy based, e.g., fuzzy c-means; probability-model based, e.g., expectation maximization;
- Others: branch and bound, density based (e.g., DBSCAN), genetic, stochastic relaxation, valley-seeking, competitive learning and so on.
In the following part of the notes we discuss a few of the most important and widely used clustering algorithms.
Hard and Soft
clustering
Hard clustering allows a data instance to belong exclusively to only one class. Soft clustering, in contrast, allows an instance of data to belong to one or more classes.
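The contrast can be illustrated with a short sketch (synthetic data, scikit-learn assumed): KMeans returns one hard label per point, whereas a Gaussian mixture returns soft membership probabilities.

```python
# Sketch: hard vs. soft cluster assignments on synthetic 2-D data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# hard_labels[i] is a single cluster index (exclusive membership)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)
# soft_memberships[i] is a probability over all 3 clusters (rows sum to 1)
```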
Hard clustering algorithms
Flat clustering
algorithms
In flat clustering the algorithm starts by assuming the number of categories K into which the data are to be clustered.
Flat clustering algorithmic steps:
- Start with a random (partial) partitioning of the data into K groups.
- Refine the clusters iteratively by optimizing a partitioning criterion.
Flat clustering creates a flat set of clusters without any explicit structure that would relate the clusters to each other. Flat clustering initially assumes that there is a fixed number of partitions K in the available data. The algorithm starts by randomly assigning K cluster centers. These centers are then refined iteratively by optimizing a partitioning criterion until no further improvements are obtained. An example of flat clustering is K-means clustering.
K-Means Clustering
The k-means clustering algorithm is a special case of the generalized hard clustering algorithmic scheme. It is a centroid-based clustering algorithm. K-means makes use of Lloyd’s algorithm, a.k.a. Voronoi iteration, which iteratively updates the clusters and the cluster centroids. It is popular because of its simplicity and fast convergence. It is suited to compact clusters when point representatives are used to minimize the squared Euclidean distances (within-cluster variances) between the data vectors xn and their mean cluster representatives (centroids) θj. This results in a partitioning of the data space into Voronoi cells; K-means is therefore a partitional algorithm that performs hard clustering.
K-means is also equivalent to the soft clustering expectation-maximization algorithm with a small, all-equal, diagonal covariance matrix. It is a semi-parametric approach, in the sense that the number of parameters does not increase as the data size N increases. It can be considered a special case of the expectation-maximization algorithm, or of the fuzzy clustering method, in which all covariances are known and only the cluster means are to be estimated.
Given
enough time, k-means will always
converge.
Once clustering is complete, none of the K clusters is empty and no pair of clusters overlaps; that is,
Ci ≠ ∅ and Ci ∩ Cj = ∅ for i ≠ j and i, j ∈ {1, 2, · · · , K}.
Variations of K-means often include optimizations such as choosing the best of multiple runs, restricting the centroids to members of the data set (K-medoids), choosing medians (K-medians clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster assignment (fuzzy c-means). K-means clustering works efficiently for compact or globular clusters. The centroids of the clusters are exemplar or prototype data points; these points represent their respective clusters rather than domain-specific parameters. One limitation of the above methods is that they are sensitive to the initial choice of the K exemplars at the beginning of the iterations.
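A minimal scikit-learn sketch of K-means on synthetic blobs; the value K = 4 and the other parameters are illustrative assumptions, not recommendations:

```python
# Sketch: k-means with k-means++ initialisation on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=42)

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)          # hard cluster label per sample

print(km.cluster_centers_)          # the K centroids θj
print(km.inertia_)                  # within-cluster sum of squared distances
```

Here fit_predict runs Lloyd’s iterations from k-means++ initial centers; inertia_ is the within-cluster sum of squared distances that the algorithm minimizes.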
MiniBatch
K – Means
MiniBatch K-Means works on small random batches of the data at each iteration; it is faster than standard K-means but gives slightly different results.
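For comparison, a short sketch of scikit-learn’s MiniBatchKMeans on synthetic data; the batch size is an illustrative choice.

```python
# Sketch: MiniBatchKMeans trades a little accuracy for speed on large data.
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=100_000, centers=4, random_state=0)

mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, n_init=10, random_state=0)
labels = mbk.fit_predict(X)
```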
Will K-means clustering work for the following cluster types?
Mean-Shift Clustering
A natural way to characterize clusters in a dataset is to find regions in the data space with a high density of data separated by low-density regions. This can be done by placing a weighting function, called a kernel density estimator (KDE), e.g. a Gaussian, in the data space to determine the density of data points. (The word "kernel" is a fancy mathematical word for a weighting function, generally used in convolution.)
A KDE is a generalization of histograms that defines smooth density estimates in any dimension. KDE is a non-parametric method for estimating the underlying distribution, also called the probability density function (e.g. Gaussian), of a set of data. It works by placing a kernel on each point in the data set.
KDEs simplify the mathematical and computational treatment of densities and, crucially, enable one to use continuous optimization to find maxima of the density. The mean shift algorithm is therefore a nonparametric clustering technique which does not require prior knowledge of the number of clusters and does not constrain the shape of the clusters.
Mean shift is a hill-climbing algorithm which involves shifting the kernel iteratively to a higher-density region until convergence. Every shift is defined by a mean shift vector. It is a centroid-based algorithm which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates and form the final set of centroids. It is also known as the mode-seeking algorithm. The downside of mean shift is that it is computationally expensive, of order O(N²). The method was proposed by Fukunaga and Hostetler in 1975.
Clusters in intergalactic space
The
various steps
Given a set of data points, the algorithm iteratively shifts each data point towards the closest cluster centroid. The centroids may be considered the peaks of the hills in the region of the data points.
The mean shift method exploits the KDE idea by imagining what the points would do if they all climbed uphill to the nearest peak on the KDE surface. It does so by iteratively shifting each point uphill until it reaches a peak. Depending on the kernel bandwidth used, the KDE surface (and the resulting clustering) will be different.
The algorithm stops when every point in the dataset has been assigned to a cluster.
Steps
1. Define a window (the bandwidth of the kernel) and place the window on a data point.
2. Calculate the mean of all the points in the window.
3. Move the center of the window to the location of the mean.
4. Repeat steps 2 and 3 for all data points until convergence.
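A minimal sketch of mean shift with scikit-learn; the bandwidth (window size) is estimated from the data, and the quantile used here is an assumption:

```python
# Sketch: mean shift clustering; the bandwidth plays the role of the kernel window.
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

bandwidth = estimate_bandwidth(X, quantile=0.2)   # window size (assumed quantile)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

print(ms.cluster_centers_)     # the density peaks (modes) found
print(len(set(ms.labels_)))    # number of clusters is discovered, not pre-set
```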
Density-Based Methods:
These methods consider clusters as dense regions of similar points, distinct from the lower-density regions of the space. They have good accuracy and the ability to merge two clusters. Examples are DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).
Density-Based Spatial
Clustering of Applications with Noise (DBSCAN)
While dealing with spatial clusters of different density, size and shape, it can be challenging to detect the cluster of points. The task becomes even more complicated if the data contain noise and outliers. To deal with large spatial databases, Martin Ester and his co-authors proposed Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The main reasons for using the algorithm, according to Ester et al., are:
- It requires minimal domain knowledge.
- It can discover clusters of arbitrary shape.
- It is efficient for large databases, i.e. sample sizes of more than a few thousand.
Density based clustering basic idea (From Ester et al.
1996)
DBSCAN
algorithm main concept
DBSCAN is based on this intuitive
notion of “clusters” and “noise”. The density of points in a cluster is
considerably higher than the density of points outside the cluster (“areas of
noise”). The key idea is that for each point of a cluster, the neighborhood of
a given radius has to contain at least a minimum number of points.
Density of a region around a point: to identify a dense region around a point, two parameters are required.
- Epsilon neighbourhood: for each point p in the cluster, the radius epsilon (ϵ) around point p. This is the circle of radius ϵ around the point.
- Minimum number of points: for each point in the cluster, the circle of radius ϵ must contain at least a minimum number of points (MinPts).
The epsilon neighborhood of a point p in the database D is defined as
Nϵ(p) = {q ∈ D | dist(p, q) ≤ ϵ}
The following figure illustrates the concept of three different kinds of points: core points, border points, and outlier or noise points.
Core Point
A point p is classified as a core point if |Nϵ(p)| ≥ MinPts. Core points, as the name suggests, usually lie within the interior of a cluster.
Border Point
A border point (frontier point) has fewer than MinPts within its ϵ-neighborhood (Nϵ), but it lies in the neighborhood of a core point.
Noise
Noise is any data point that is neither a core point nor a border point. See the picture below for a better understanding.
Since MinPts is a parameter of the algorithm, setting it to a low value in order to include the border points in a cluster can make it difficult to eliminate noise.
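The three kinds of points can be computed directly from these definitions; below is a small NumPy sketch on synthetic data, with illustrative (not recommended) values of ϵ and MinPts:

```python
# Sketch: labelling points as core / border / noise from the definitions above.
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=0)
eps, min_pts = 0.5, 5   # illustrative values, not recommendations

# Pairwise distances and epsilon-neighbourhood sizes (each point counts itself).
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
neighbourhood = dists <= eps
n_neighbours = neighbourhood.sum(axis=1)

is_core = n_neighbours >= min_pts
# A border point is not core but has at least one core point in its neighbourhood.
is_border = ~is_core & (neighbourhood & is_core[None, :]).any(axis=1)
is_noise = ~is_core & ~is_border
```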
Directly Density Reachable:
A data point q is directly density reachable from a point p if
- |Nϵ(p)| ≥ MinPts, i.e. p is a core point, and
- q ∈ Nϵ(p), i.e. q is in the epsilon neighborhood of p.
The notion of directly density reachable is not symmetric. Consider a border point q and a core point p: the border point q is reachable from the core point p. However, even though the core point p falls in the epsilon neighborhood of the border point q, the border point does not have MinPts neighbours, and therefore p is not directly density reachable from q.
Density Reachable: A point q is density reachable from a point p with respect to ϵ and MinPts if
- there is a chain of points q1, q2, q3, . . ., qn with q1 = p and qn = q such that qi+1 is directly density reachable from qi.
Density Connected: There can be cases where two border points belong to the same cluster but do not share a specific core point; we then say they are density connected if there exists a common core point from which both border points are density reachable. As can be seen, density connectivity is symmetric. The definition from the Ester et al. paper is given below:
“A point q is density connected to a point p with respect to ϵ and MinPts, if there is a core point o such that both p and q are density reachable from o w.r.t. ϵ and MinPts.”
Steps of DBSCAN Algorithm:
Using the above definitions, the steps of the DBSCAN algorithm are as below; a small usage sketch follows the list.
- The algorithm starts with an arbitrary point which has not been visited, and its neighborhood information is retrieved using the ϵ parameter.
- If this point has at least MinPts within its ϵ neighborhood, cluster formation starts. Otherwise the point is labeled as noise. This point may later be found within the ϵ neighborhood of a different point and thus be made part of a cluster. The concepts of density reachable and density connected points are important here.
- If a point is found to be a core point, then the points within its ϵ neighborhood are also part of the cluster. So all the points found within the ϵ neighborhood are added, along with their own ϵ neighborhoods if they are also core points.
- The above process continues until the density-connected cluster is completely found.
- The process restarts with a new point, which can be part of a new cluster or labeled as noise.
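A minimal usage sketch with scikit-learn’s DBSCAN on noisy, non-globular data; eps and min_samples are illustrative values:

```python
# Sketch: DBSCAN on noisy non-globular data; noise points are labelled -1.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                    # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters;", (labels == -1).sum(), "noise points")
```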
Drawbacks
of DBSCAN
- If the database has data points that form clusters of varying density, DBSCAN fails to cluster them well, since the clustering depends on the ϵ and MinPts parameters and these cannot be chosen separately for every cluster.
- If the data and features are not well understood by a domain expert, setting ϵ and MinPts can be tricky and may require comparing several iterations with different values of ϵ and MinPts.
Other density based clustering
algorithms are HDBSCAN (hierarchical DBSCAN)
and OPTICS (Ordering Points to Identify Clustering Structure )
Hierarchical Algorithms
Unlike the K-means method, hierarchical algorithms do not require the number of clusters to be pre-specified. These are connectivity-based methods. As the name suggests, these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. The clusters formed by these methods have a tree-like structure based on the hierarchy; hence they are called hierarchical algorithms. There are two approaches: agglomerative (bottom-up) and divisive (top-down). Other examples are CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
Unlike flat clustering methods, in hierarchical clustering the algorithm is allowed to decide how many clusters to create.
Hierarchical algorithms proceed as follows:
- Create a hierarchy.
- Bottom-up (agglomerative) or top-down (divisive).
Agglomerative Clustering
All data points are initially treated as singleton clusters, which are then merged as the distance increases. The algorithm therefore starts at the bottom with as many clusters as there are data points and progresses up to the top until there is only a single cluster. It creates a hierarchy of clusters by initially assuming, at the bottom, that each data point is a cluster center and then builds up to the top, grouping the data points based on some similarity measure. Hence this method is also called agglomerative or bottom-up clustering.
Bottom-to-Top
and Top-to-Bottom approaches of Hierarchical clustering
Divisive Clustering
Top-down clustering requires a method for dividing a cluster that contains the whole data set; it proceeds by splitting clusters recursively until all individual data points have been split into singleton clusters. This form of clustering lets the unsupervised algorithm decide how many clusters are present in the data.
Fixing
a threshold to determine the no. of desired clusters
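A short scikit-learn sketch of agglomerative clustering, cutting the hierarchy either at a fixed number of clusters or at a distance threshold as in the figure above; the parameter values are illustrative:

```python
# Sketch: agglomerative (bottom-up) hierarchical clustering.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Cut the tree at a fixed number of clusters ...
agg_k = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

# ... or let a distance threshold decide how many clusters emerge.
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0,
                                linkage="ward").fit(X)

print(agg_k.n_clusters_, agg_t.n_clusters_)
```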
Spectral
Clustering
For both the K-means algorithm and the EM algorithm we assume K centroids, adjusted in an iterative process, to find the clusters. But the main problems are:
- They make assumptions about the shape of the data (a round sphere, a radial basis).
- They require multiple restarts at times to find the best local minimum (i.e. the best clustering).
The spectral clustering algorithm helps to solve these two problems. The algorithm relies on the power of graphs and the proximity between the data points in order to cluster them, which makes it possible to avoid the compact, sphere-shaped clusters that the K-means algorithm forces us to assume. Spectral theory extends the theory of eigenvectors and eigenvalues of a single square matrix to a variety of linear vector spaces.
The three basic stages of spectral clustering
- Pre-processing: construct a matrix representation of the graph.
- Decomposition: compute the eigenvalues and eigenvectors of the matrix, and map each point to a lower-dimensional representation based on one or more eigenvectors.
- Grouping: assign points to two or more clusters based on the new representation.
Graph and similarity matrix:
Consider a dataset X = {x1, x2, . . ., xn}. The proximity between any pair of data points xi and xj can be expressed as a similarity measure sij ≥ 0. The intuitive goal of clustering is to divide the data points into several groups such that points in the same group are similar and points in different groups are dissimilar to each other. Assuming no other measures need to be considered, the dataset can be represented by a graph G = (V, E). The vertices vi in this graph represent the data points xi. Two vertices are connected if the similarity sij between the corresponding data points xi and xj is positive (or larger than a certain threshold), and the edge is weighted by sij.
The problem of
clustering can now be reformulated using the similarity graph. We can now find
a partition of the graph such that the edges between different groups have a
very low weight and the edges within a group have high weight.
Similarity graph construction: we assume that the graphs are undirected, so that sij = sji. The similarity matrix S is an n × n real-valued square matrix. The choice of similarity metric depends on the domain the data come from. There are different choices for the construction of the similarity graph, depending on the parameter which governs its connectedness (e.g. the parameter ε of the ε-neighborhood graph, the parameter k of the k-nearest-neighbor graph, or whether the graph is fully connected).
Computing the eigenvalues and choosing the number of clusters: to implement spectral clustering in practice one has to compute the first k eigenvectors of a potentially large Laplacian matrix. We determine the eigenvalues of this matrix and order them by increasing magnitude, respecting multiplicities. The k smallest eigenvalues and their eigenvectors are considered. If we use the k-nearest-neighbor graph or the ε-neighborhood graph, then all Laplacian matrices are sparse, and efficient methods exist to compute the first eigenvectors of sparse matrices.
(Ref: A Tutorial on Spectral Clustering, Ulrike von Luxburg, August 2006, Max Planck Institute for Biological Cybernetics.)
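A compact sketch of this pipeline (k-nearest-neighbour graph, unnormalised Laplacian, first k eigenvectors, then k-means in the embedding), assuming NumPy and scikit-learn; the neighbourhood size and k are illustrative:

```python
# Sketch of the spectral clustering pipeline: k-NN graph -> Laplacian ->
# first k eigenvectors -> k-means on the embedded points.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
k = 2                                        # assumed number of clusters

# 1) Pre-processing: symmetric k-nearest-neighbour similarity graph W.
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
W = 0.5 * (W + W.T)                          # make the graph undirected

# 2) Decomposition: unnormalised graph Laplacian L = D - W and the
#    k eigenvectors belonging to the smallest eigenvalues.
D = np.diag(np.asarray(W.sum(axis=1)).ravel())
L = D - W.toarray()
eigvals, eigvecs = np.linalg.eigh(L)         # eigenvalues in ascending order
embedding = eigvecs[:, :k]                   # rows = low-dimensional points

# 3) Grouping: run k-means in the spectral embedding.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
```

In everyday use, scikit-learn’s SpectralClustering(n_clusters=2, affinity="nearest_neighbors") performs equivalent steps (with a normalised Laplacian) in a single call.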
Comparing
K-means and Spectral clustering
Spectral clustering is a popular
unsupervised machine learning algorithm which often outperforms other
approaches. In addition, spectral clustering is very simple to implement and
can be solved efficiently by standard linear algebra methods.
Grid-based Methods: In these methods the data space is divided into a finite number of cells that form a grid-like structure. All the clustering operations performed on these grids are fast and independent of the number of data objects. Examples are STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
Affinity Propagation
This method creates clusters by sending messages
between pairs of samples until convergence. A dataset is then described using a
small number of exemplars, which are identified as those most representative of
other samples. The messages sent between pairs represent the suitability for
one sample to be the exemplar of the other, which is updated in response to the
values from other pairs. This updating happens iteratively until
convergence, at which point the final exemplars are chosen, and hence the final
clustering is given. Affinity Propagation can be interesting as it
chooses the number of clusters based on the data provided. For this purpose,
the two important parameters are the preference, which controls
how many exemplars are used, and the damping factor which
damps the responsibility and availability messages to avoid numerical
oscillations when updating these messages.
The main drawback of Affinity Propagation is its complexity: the algorithm has a time complexity of the order O(N²T), where N is the number of samples and T is the number of iterations until convergence.
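A minimal scikit-learn sketch; the damping and preference values are illustrative and would normally be tuned:

```python
# Sketch: affinity propagation chooses the number of clusters itself.
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ap = AffinityPropagation(damping=0.9, preference=-50, random_state=0).fit(X)

print(len(ap.cluster_centers_indices_), "exemplars chosen")
labels = ap.labels_
```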
References:
- Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, 4th Edition.
- Nhien-An Le-Khac et al., A Clustering-Based Data Reduction for Very Large Spatio-Temporal Data, semanticscholar.org.
- Types of clustering methods: overview and quick start, datanovia.com.
- Model-based clustering essentials, datanovia.com.
- Miguel A. Carreira-Perpinan, A Review of Mean-Shift Algorithms for Clustering, Electrical Engineering and Computer Science, University of California, Merced, http://eecs.ucmerced.edu.
- DBSCAN, github.com.
- Ulrike von Luxburg, A Tutorial on Spectral Clustering, Max Planck Institute for Biological Cybernetics.
- Unsupervised learning explained, infoworld.com.
Figure Credits:
- Figure-1: i.insider.com.
- Figure-2: i.pinimg.com.
- Figure-3: previews.123rf.com.
- Figure-5: wikimedia.org.
- Figure-7: wikimedia.org.
- Figure-8: economist.com.
- Figure-9: null-hypothesis-vs-alternative-hypothesis,amazonaws.com.
- Figure-10: The Utility of Clustering in Prediction Tasks, Shubhendu Trivedi, Zachary A. Pardos and Neil T. Heffernan, ttic.uchicago.edu.
- Figure-11: image.slidesharecdn.com.
- Figure-12: engaging-data.com.
- Figure-13: mathworks.com.
- Figure-14: earthsky.org.
- Figure-19: mdpi.com.
- Figure-20: github.com.
- Figure-21: miro.medium.com.
- Figure-22: camo.githubusercontent.com.
- Figure-23: Conceptual dendrogram for agglomerative and divisive Hierarchical based clustering, researchgate.net.
- Figure-24: www.mathworks.com.
- Figure-25: Spectral clustering builds a graph connecting data samples in a local neighborhood, researchgate.net.
- Figure-26: Comparing K-means and Spectral clustering, tbn0.gstatic.com