Fuzzy Clustering, Expectation Maximization, Gaussian Mixture Modelling, Principal Component Analysis - Unsupervised Algorithms Part-3
Fuzzy Clustering
Fuzzy clustering is a soft clustering method in which a data point can have partial membership in more than one class. The clusters produced by the k-means procedure described in the previous article are also called "hard" or "crisp" clusters, since any feature vector (data point) s either "is" or "is not" a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clustering, in which a feature vector s can have a degree of membership in one or more clusters.
The fuzzy clustering algorithm implements an iterative method similar to k-means, but its objective function differs from the k-means objective by the inclusion of values that define varying degrees of class membership for each data point. Points close to a cluster centre have a higher degree of membership than points far from the centre. The degree to which an element belongs to a given cluster is a numerical value in the interval 0 to 1.
Consider a one-dimensional data set distribution S as shown in Figure-1. The data points belong to two class labels A and B, which are not known a priori.
Figure-1: Data Distribution
A data point can be clustered to either of the two classes. Figures 2 and 3 illustrate the difference between hard (e.g., k-means) and soft (e.g., fuzzy) clustering approaches.
In general the data is multidimensional; we represent a sample data point as sn and assume there are K clusters.
K-means clustering
The k-means clustering method assigns an input data sample (point) sn either to class A or to class B with hard membership values uk(sn), which are either 1 or 0.
Fuzzy clustering procedures assign a fractional degree of membership uk(sn) to each output cluster k. A fuzzy clustering of a dataset S with N data points into K clusters is characterized by membership functions unk which satisfy the following properties:
0 ≤ unk ≤ 1 for any sn and any k,
Σk unk = 1 for any sn,
and 0 < Σn unk < N for any k.
The
membership function values constitute a membership matrix U which is N x K.
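As a small numerical illustration, a valid membership matrix U can be constructed and its properties checked. The following is a minimal sketch assuming NumPy and randomly generated memberships, not values from any of the figures.

```python
import numpy as np

N, K = 6, 2                                   # number of data points and clusters (illustrative)
rng = np.random.default_rng(0)

U = rng.random((N, K))                        # raw, unnormalized membership values
U = U / U.sum(axis=1, keepdims=True)          # normalize so each row sums to 1

# Verify the fuzzy membership properties stated above
assert np.all((U >= 0) & (U <= 1))            # each u_nk lies in [0, 1]
assert np.allclose(U.sum(axis=1), 1.0)        # memberships of each point sum to 1
assert np.all((U.sum(axis=0) > 0) & (U.sum(axis=0) < N))  # no empty or all-absorbing cluster
print(U.round(3))
```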
Fuzzy clustering algorithm steps:
- Initialize the values of unk, K and a stopping criterion ε.
- At every iteration step m, calculate the cluster centers ck using the values of unk.
- After the ck are computed, update the values of unk using ck.
- If ||U(m+1) − U(m)|| < ε, stop the iterations; otherwise repeat steps 2 to 4.
Clustering results and cluster shapes depend on the type of distance metric chosen; a minimal sketch of the procedure is shown below.
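The following is a minimal NumPy sketch of these steps, assuming the standard fuzzy c-means updates with Euclidean distance and a fuzzifier m = 2; the function name fuzzy_c_means and the tiny 1D data set are illustrative assumptions, not part of the original article.

```python
import numpy as np

def fuzzy_c_means(X, K, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means clustering of X (N x d) into K clusters.

    Returns the cluster centers (K x d) and the membership matrix U (N x K).
    """
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((N, K))
    U /= U.sum(axis=1, keepdims=True)                       # step 1: initialize memberships

    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]        # step 2: cluster centers from memberships

        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)                         # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)        # step 3: update memberships from centers

        if np.linalg.norm(U_new - U) < eps:                 # step 4: stopping criterion
            U = U_new
            break
        U = U_new
    return centers, U

# Illustrative 1D data with two loose groups (values assumed for the example)
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]])
centers, U = fuzzy_c_means(X, K=2)
print(centers)
print(U.round(3))
```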
Crisp and Fuzzy Sets
Boolean logic is crisp logic that draws a sharp distinction between one class and another: it forces us to draw lines between class members and non-members. Fuzzy or multivalued logic was introduced by Jan Lukasiewicz, a Polish philosopher, in the 1930s. Lukasiewicz extended the range of truth values to the interval between 0 and 1, representing the possibility that a given statement is true or false. This led to an inexact reasoning logic often called "possibility theory"; hence fuzzy clustering is also known as possibilistic clustering.
Difference between Fuzziness and Probability
Consider this scenario:
You are in your apartment, which has two adjacent rooms: the kitchen and the dining room. At any moment you have a membership either "in the kitchen" or "in the dining room" (the latter being the same as "not in the kitchen"). Suppose you are a person who spends more time in the kitchen than in the dining room. In that case the uncertainty about your location can be quantified with probability: given the current state of our knowledge, we can only assign a probability, say 0.6 for the kitchen and 0.4 for the dining room.
However, if you stand in the doorway between the kitchen and the dining room with only part of your foot in the dining room, then you may be considered "partially in the kitchen" and "partially in the dining room". Quantifying this partial state yields a fuzzy set membership. With only one foot in the dining room, we might say your membership is 0.95 "in the kitchen" and 0.05 "in the dining room". Fuzzy logic is all about degrees of truth, that is, partial or relative truths. In the following figure, bottles A and B are shown with fuzzy membership and probability values: bottle A has a 0.9 fuzzy membership of water (the rest is poison), whereas bottle B has a probability of 0.9 that it contains only water.
The Fuzzy-C-Means (Fuzzy-k-Means)
FCM (also referred to as soft clustering or soft k-means) is frequently used in pattern recognition. The main difference from k-means is that the objective function for the fuzzy c-means algorithm allows graded cluster memberships with values ranging from 0 to 1, whereas the stricter k-means objective allows only one cluster membership value equal to 1 and all others equal to 0. The method was developed by Dunn in 1973 and improved by Bezdek in 1981. The degree of membership can also be interpreted probabilistically as the square root of the a posteriori probability that s is in cluster k (Duda and Hart).
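To make the contrast between the two objectives concrete, a tiny self-contained example shows how a fuzzy membership matrix collapses into k-means-style hard memberships; the matrix U below is a made-up illustration, not data from the article.

```python
import numpy as np

# Example fuzzy membership matrix U (N = 4 points, K = 2 clusters); values are illustrative.
U = np.array([[0.95, 0.05],
              [0.80, 0.20],
              [0.30, 0.70],
              [0.10, 0.90]])

# Fuzzy c-means keeps the full row of graded memberships for each point;
# k-means would keep only a one-hot row (a single 1, all other entries 0).
hard_labels = np.argmax(U, axis=1)          # crisp assignment per point
one_hot = np.eye(U.shape[1])[hard_labels]   # the equivalent k-means-style membership matrix
print(hard_labels)
print(one_hot)
```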
Clustering spherical data using FCM for directional data
Fuzzy clustering by Local Approximation of Memberships
(FLAME) is a data clustering algorithm that defines clusters in the dense parts of a
dataset and performs cluster assignment solely based on the neighborhood
relationships among objects. The key feature of this algorithm is that the
neighborhood relationships among neighboring objects in the feature space are
used to constrain the memberships of neighboring objects in the fuzzy
membership space.
Probabilistic clustering methods
Expectation Maximization for mixture density distribution.
The EM algorithm works best for the exponential family of probability distributions. It is assumed that the observed random data form an independent and identically distributed (i.i.d.) dataset, meaning that the observation of any data sample does not in any way influence the observation of any other sample. Let the data model consist of an observed set of random data vectors X and a set of hidden (latent) random variable vectors Z. The observed variables X and the latent variables Z have a joint distribution over all the data. In the following figure there are two such component distributions, both Gaussian, so they belong to the exponential family; their combination forms a mixture distribution, and the data samples belong to two different clusters. The parameters, which are (1) the prior probability of each cluster class, (2) the mean of each class distribution and (3) the cluster variances, define the joint probability distribution and are unknown.
EM algorithm for 1D data
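As a concrete illustration of this 1D case, the following is a minimal NumPy sketch of EM for a two-component Gaussian mixture; the synthetic data and the initial parameter guesses are assumptions made for the example, not values from the article's figure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative 1D data drawn from two Gaussians (the "true" parameters are assumptions)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 100)])

# Initial guesses for the priors, means and variances of the two components
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    """Gaussian pdf evaluated elementwise."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

for _ in range(100):
    # E-step: posterior responsibility of each component for each sample
    resp = pi[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate priors, means and variances from the responsibilities
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk

print(pi.round(3), mu.round(3), var.round(3))
```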
Expectation Maximization (EM) Algorithm
The Expectation Maximization (EM) Algorithm is a mixture decomposition (resolving) approach to cluster analysis, with the underlying assumption that the patterns to be clustered are drawn from one of several distributions. Mixture decomposition schemes make use of the familiar Bayesian classification rule. The goal is to identify the parameters of each of these distributions and also the total number of such distributions. The individual components of the mixture density are assumed to belong to the exponential family, e.g., Gaussian; in that case the parameters of the individual Gaussians are estimated by the EM algorithm. The problem is to iteratively obtain a maximum likelihood estimate of the parameter vectors of the component densities.
In mixture models it is assumed that there are J distributions, so let there be J clusters, Cj, j = 1, . . . , J, underlying the data set. According to the Bayes classification rule, each vector xn, n = 1, . . . , N, is assigned to a cluster Ck if P(Ck|xn) > P(Cj|xn) for all j ≠ k. The class distributions contribute to the formation of the mixture density p(xn) = Σj p(xn|Cj) Pj, j = 1, 2, . . . , J. Thus, each point xn may be "drawn" from any of the J model distributions with probability Pj, j = 1, 2, . . . , J. This type of modeling can approximate any continuous density function for a sufficient number of mixture components J and appropriate model parameters.
The first step of the procedure involves choosing the set of components that represent the likelihood density function p(x|Cj) in parametric form, that is, p(x|Cj;θ). The computation of the unknown parameters θ and the class probabilities P(Cj), j = 1, 2, . . . , J, is then based on the set of available training samples xn, n = 1, 2, . . . , N.
Mixture decomposition schemes are implemented under the following conditions
- no training data with known cluster labeling are available and
- the a priori probabilities P(Cj) ≡ Pj are also not known.
Hence it is an unsupervised learning method. However, the goal is to classify any data point xn into one of the J different distributions. Since there are J different distributions, the set of unknown parameter vectors θj can be represented as a matrix,
ΘJ = [θ1, …, θj, …, θJ]T,
where θj is the parameter vector corresponding to the jth cluster.
Let Pj = P(Cj) represent the a priori probability for the jth cluster, and collect these priors into the vector
P = [P1, . . . , PJ]T.
The unknown parameter matrix ΘJ can then be augmented to form
Θ = [ΘJ, PT]T.
The iterative method
The basic two-step iterative optimization is applied over all the data samples and classes:
- Step 1 (Expectation): using the current estimate of Θ, compute the posterior probabilities P(Cj|xn) for every sample xn and every cluster Cj.
- Step 2 (Maximization): using these posteriors, re-estimate the parameter vectors θj and the priors Pj so that the expected log-likelihood of the data is maximized.
Mixture models and their decomposition: Mixture of Gaussians (MoG)
It is assumed that the complete probability structure (type of pdf) of the random variable X is known and that all patterns in X are generated by some mixture density. That means the forms of the prior probabilities πj = P(Cj) and of the class conditional (likelihood) probabilities p(x|Cj) are known. However, the parameters that best characterize the clusters underlying X are unknown, and the class labels are also unknown. The unknown parameter vector matrix is represented by Θ.
Figure-9:
Figure-10: 3 Gaussian mixtures of 2D data
Figure-11: 3D plot of 4 Gaussians
Mixture model for words
Figure-12: Gaussian Mixture Centers and Covariances
The cost function Q is constructed on the basis of initial (randomly chosen) parameter vectors θj, and the parameter vectors θj are then estimated by optimizing this cost function. We can therefore decompose the mixture into its components by computing the maximum a posteriori probability: the samples x drawn from the set X with the known mixture density are assigned class labels based on the posterior probability, as per the Bayesian classification rule.
When there are K base distributions we get a mixture model represented by p(x) = Σk πk p(x|θk), k = 1, . . . , K; for Gaussian components, p(x|θk) = N(x|μk, Σk).
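In practice, a Gaussian mixture such as the 2D, three-component examples in the figures can be fitted with an off-the-shelf EM implementation. The sketch below uses scikit-learn's GaussianMixture; the synthetic data is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2D data from three clusters (illustrative, not the article's figure data)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4, 4], scale=0.8, size=(100, 2)),
    rng.normal(loc=[0, 5], scale=0.6, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X)     # EM fit, then Bayes-rule assignment to the most probable component

print(gmm.weights_.round(3))              # estimated priors P_j
print(gmm.means_.round(2))                # estimated component means
print(gmm.predict_proba(X[:3]).round(3))  # posterior probabilities for the first few samples
```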
Dimensionality Reduction
In statistics, data mining, machine learning, and information theory, dimensionality reduction (or dimension reduction) is the process of reducing the dimensionality of the random variables under consideration. Many machine learning data sets contain thousands of features for each data instance. This necessitates large model complexity, makes model training slow, and makes obtaining a good solution to the problem difficult. A dataset with multidimensional feature variables can contain many features carrying similar information, which are therefore linearly correlated. The objective of dimensionality reduction is to minimize these correlations and decrease the data dimensionality without losing too much information. One way to do this is to merge all the correlated features into one; this is also called feature extraction. When the metric used to measure distances between data points is Euclidean, this is equivalent to Principal Coordinates Analysis (PCoA), which then coincides with Principal Component Analysis (PCA).
PCA cannot be used when the data is only piecewise linear or nonlinear. As an example, consider the nonlinear case of data distributed in two concentric circular clouds, one with a smaller radius than the other. Since both sets of data have equal variances in all directions, PCA cannot rank directions by decreasing variance and fails as a dimensionality reduction. However, a polar transformation can be used, so that the two-dimensional circular data is reduced to one-dimensional data (the radius).
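A minimal sketch of this idea, assuming scikit-learn's make_circles to generate two concentric circular clouds (the sample counts and noise level are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_circles

# Two concentric circular clouds; PCA finds no preferred direction of variance here
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Polar transformation: the radius alone is a one-dimensional feature
# that separates the inner cloud from the outer one.
radius = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)

for label in (0, 1):
    print(f"class {label}: mean radius = {radius[y == label].mean():.2f}")
```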
It is often good practice to try to reduce the dimensionality of your training data before you feed it to another machine learning algorithm. This makes the data less complex and the algorithm much faster, with reduced memory use. Reducing the dimensionality may, however, lose some information; even though it speeds up training, most of the time it may also make your system perform slightly worse. So dimensionality reduction should be used only if training is too slow; otherwise, try to use the original data.
These are some of the most common dimensionality reduction algorithms in machine learning:
- Principal Component Analysis (PCA).
- Kernel PCA.
- Locally-Linear Embedding (LLE).
- t-Distributed Stochastic Neighbor Embedding (t-SNE).
- Autoencoders.
An example: data visualization benefits from dimensionality reduction
Dimensionality
reduction is essential for visualizing the pdfs
of multidimensional data. Visualization
is the process of creating diagrams, images, graphs, charts, etc., to
communicate some information. This method can be applied using
unsupervised machine learning.
For example, let’s say you are a football coach and you
have some data about your team’s performance in a tournament. You may want to
find all the statistics about the matches quickly.
You can feed the complex and unlabeled data to some
visualization algorithm.
These algorithms can output a two-dimensional or
three-dimensional representation of your data that can easily be plotted. So,
by seeing the plotted graphs, you can easily get a lot of information. This
information will help you to maintain your winning formula, correct your
previous mistakes and win the ultimate trophy.
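As a sketch of this workflow, high-dimensional records can be reduced to a two-dimensional representation suitable for a scatter plot; t-SNE from scikit-learn is one possible choice here, and the random feature matrix below is purely illustrative, not real match statistics.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # 200 records with 50 features each (illustrative stand-in data)

# Reduce to 2D; each row of X_2d can then be drawn as one point on a scatter plot
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                # (200, 2)
```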
Principal component analysis (PCA): an intuitive explanation
Principal Component Analysis (PCA) is one of the simplest and oldest eigen-analysis-based methods. PCA is a statistical procedure that uses a linear orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. There are many applications for PCA; a major one is dimensionality reduction for data visualization.
Geometrically, PCA is a rigid rotation of the original data matrix, and can be defined as a projection of data points onto a new set of axes such that the maximum variance is "extracted" along the first axis, the largest remaining variance after removing the first axis is captured by the second, orthogonal axis, the next largest remaining variance by the third axis, and so on. The figure illustrates dimensionality reduction by PCA as a projection of the original data dimensions onto a lower-dimensional space.
Figure-13: Eigen Analysis and PCA
A detailed discussion of eigen analysis is beyond the scope of our context; however, the basic methodology is described below. (Also read the blog on Linear Algebra for AI and ML for further discussion of eigen decomposition.) Eigen analysis is performed on a square, symmetric correlation (or covariance) matrix R of a zero-mean data set with I dimensions. The matrix R is derived from the data matrix X after subtracting its mean vector from every data point vector.
There is a unique solution to the eigen analysis, no matter the order of the data. The solution decomposes the square matrix R into its eigenvalues and eigenvectors; the eigenvalues define the extremal values of the variances.
- Each eigenvalue is associated with an eigenvector (ordination axis).
- Axes are ranked by their eigenvalues.
- Thus, the first axis has the highest eigenvalue (variance), the second axis has the second highest eigenvalue, and so on.
- In principal component analysis, eigenvalues are the ‘variance extracted’.
- The eigenvectors are the principal component directions.
- The principal component axes are orthogonal to each other.
- The number of eigenvalues and eigenvectors is the same as the dimensionality of the dataset. In a high-dimensional dataset the number of eigen axes is correspondingly large, and visualization in more than three dimensions is impossible.
Hence it is worth reducing the data dimensionality to low values for data interpretation; a sketch of these eigen-analysis steps follows.
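The eigen-analysis steps listed above can be written directly in NumPy. This is a minimal sketch on a small, randomly generated, correlated data matrix (an assumption for illustration, not data from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])   # correlated 3D data (illustrative)

Xc = X - X.mean(axis=0)                 # subtract the mean vector from every data point
R = np.cov(Xc, rowvar=False)            # square, symmetric covariance matrix

eigvals, eigvecs = np.linalg.eigh(R)    # eigen decomposition of R
order = np.argsort(eigvals)[::-1]       # rank axes by eigenvalue (variance extracted)
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals.round(3))                 # variances along the principal axes, in decreasing order
print(eigvecs.round(3))                 # columns are the principal component directions
```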
Geometric Rationale of PCA
The geometric rationale of PCA is illustrated by the following figures with 2D data. The orthogonal axes corresponding to x1 and x2 can be rotated so that the x1-axis aligns with the line along which the data has maximum variance. The objective of PCA is to rigidly rotate the axes of the I-dimensional data space to new directions, called principal axes, that have the following properties:
Principal axis 1 is the direction along which the data projections have the highest variance, principal axis 2 has the next highest variance, and so on, until the Ith principal axis has the lowest variance of all. The covariance between each pair of principal axes is zero (the principal axes are uncorrelated).
PC1 is the direction of maximum variance in the cloud of I-dimensional points. The squared distances of the data points from the PC1 axis are minimal; the line along this axis is the "line of best fit" in the least-squares sense.
PC2 is in the direction of the next highest variance, subject
to the constraint that it has zero covariance with PC1.
PC3 is in the direction of the next highest variance, subject to the constraint that it has zero covariance with both PC1 and PC2, and so on, up to PCI.
Figure-14: (A) Data distribution in 3D space. (B) Finding a new orthogonal coordinate system with axes pointing in the directions of largest variance in data space. (C) Using a subspace axes system to describe the data in reduced dimensionality.
Dimensionality reduction with PCA
The goal of PCA is to find the set of orthogonal axes along which the multidimensional data carries the most information. Given the original data vector x, the l < I principal components may be computed by projecting x onto the l-dimensional subspace. This results in a transformation from the I-dimensional input data space to the l-dimensional subspace. If the dimensionality of the data is I, we need to find a dimensionality l < I that conserves most of the valuable information contained in the original data set; as a simple example, if I = 3 in the figure above, then l = 2.
Since variance carries information, we can find an optimal set of orthogonal directions, which are the eigenvectors of the covariance matrix R of the mean-subtracted data. The eigenvalues then reflect the variance along these basis vectors. We can arrange the I eigenvalues in order of decreasing magnitude and choose the l eigenvectors along which the variance values are highest. An approximation of the data, with dimensionality l < I, can then be reconstructed from the resulting coefficients.
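A short sketch of this reduction and reconstruction using scikit-learn's PCA; the 3D data below is again an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.2])   # 3D data, I = 3 (illustrative)

pca = PCA(n_components=2)               # keep l = 2 principal components
Z = pca.fit_transform(X)                # project onto the l-dimensional subspace
X_rec = pca.inverse_transform(Z)        # reconstruct an approximation in the original space

print(pca.explained_variance_ratio_.round(3))   # fraction of variance kept by each component
print(np.abs(X - X_rec).mean().round(3))        # average reconstruction error
```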
Figure-15: PCA as a projection of data points. Typical data sets will have many more than 3 species.
For further illustrative figures on
PCA please read blog article on “Paradigms of Learning algorithms for AI and
ML” and “Unsupervised Algorithms Part-1”
References
- Sergios Theodoridis, Konstantinos Koutroumbas, "Pattern Recognition", Academic Press, 2006.
- Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2007.
- Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", John Wiley and Sons, 2nd Ed.
- Steven M. Kay, "Fundamentals of Statistical Signal Processing, Detection and Estimation", Prentice Hall.
Image Credits:
Figure-1: wikipedia.org
Figure-2: wikipedia.org
Figure-3: wikipedia.org
Figure-4: researchgate.net
Figure-5: cf2.pptonline.org
Figure-6: slideplayer.com
Figure-7: Clustering algorithm for directional data (FCM4DD).pdf
Figure-9: miro.medium.com
Figure-10: pypr.sourceforge.net
Figure-11: https://www.researchgate.net
Figure-12: lh3.googleusercontent.com
Figure-13: Heartbeat – Fritz AI.com
Figure-14: mengnote.blogspot.com
Figure-15: ordination.okstate.edu