Fuzzy Clustering, Expectation Maximization, Gaussian Mixture Modelling, Principal Component Analysis - Unsupervised Algorithms Part-3
Fuzzy Clustering
Fuzzy clustering is a soft clustering method in which a data point can have partial membership in more than one class. The clusters produced by the k-means procedure described in the previous article are also called "hard" or "crisp" clusters, since any feature vector (data point) s either "is" or "is not" a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clustering, in which a feature vector s can have a degree of membership in one or more clusters.
The fuzzy clustering algorithm implements an iterative method similar to k-means, but its objective function differs from the k-means objective by the inclusion of values that define varying degrees of class membership for each data point. Points close to a cluster centre have a higher degree of membership than points far from the centre. The degree to which an element belongs to a given cluster is a numerical value in the interval 0 to 1.
Consider a one-dimensional data set distribution S as shown in Figure-1. The data points belong to two class labels A and B, which are not known a priori.
Figure-1: Data Distribution
A data point can be clustered to either of the two classes. Figures 2 and 3 illustrate the difference between hard (e.g., k-means) and soft (e.g., fuzzy) clustering approaches.
In general the data is multidimensional; we represent a sample data point as sn and assume there are K clusters.
K-means clustering
The k-means clustering method assigns an input data sample (point) sn either to class A or to class B with hard membership values uk(sn), which are either 1 or 0.
Fuzzy clustering procedures assign a fractional degree of membership uk(sn) to each output cluster k. A fuzzy clustering of a dataset S with N data points into K clusters is characterized by membership functions unk which satisfy the following properties:
0 ≤ unk ≤ 1 for any sn and any k,
Σk unk = 1 for any sn,
and 0 < Σn unk < N for any k.
The
membership function values constitute a membership matrix U which is N x K.
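As a small numerical illustration, a valid membership matrix U can be constructed and its properties checked. The following is a minimal sketch assuming NumPy and randomly generated memberships, not values from any of the figures.

```python
import numpy as np

N, K = 6, 2                                   # number of data points and clusters (illustrative)
rng = np.random.default_rng(0)

U = rng.random((N, K))                        # raw, unnormalized membership values
U = U / U.sum(axis=1, keepdims=True)          # normalize so each row sums to 1

# Verify the fuzzy membership properties stated above
assert np.all((U >= 0) & (U <= 1))            # each u_nk lies in [0, 1]
assert np.allclose(U.sum(axis=1), 1.0)        # memberships of each point sum to 1
assert np.all((U.sum(axis=0) > 0) & (U.sum(axis=0) < N))  # no empty or all-absorbing cluster
print(U.round(3))
```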
Fuzzy clustering algorithm steps:
- Initialize the values of unk, K and a stopping criterion ε.
- At every iteration step m, calculate the cluster centers ck using the values of unk.
- After the ck are computed, update the values of unk using ck.
- If ||U(m+1) − U(m)|| < ε, stop the iterations; otherwise repeat steps 2 to 4.
Clustering results and cluster shapes depend on the type of distance metric chosen; a minimal sketch of the procedure is shown below.
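The following is a minimal NumPy sketch of these steps, assuming the standard fuzzy c-means updates with Euclidean distance and a fuzzifier m = 2; the function name fuzzy_c_means and the tiny 1D data set are illustrative assumptions, not part of the original article.

```python
import numpy as np

def fuzzy_c_means(X, K, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means clustering of X (N x d) into K clusters.

    Returns the cluster centers (K x d) and the membership matrix U (N x K).
    """
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((N, K))
    U /= U.sum(axis=1, keepdims=True)                       # step 1: initialize memberships

    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]        # step 2: cluster centers from memberships

        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)                         # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)        # step 3: update memberships from centers

        if np.linalg.norm(U_new - U) < eps:                 # step 4: stopping criterion
            U = U_new
            break
        U = U_new
    return centers, U

# Illustrative 1D data with two loose groups (values assumed for the example)
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]])
centers, U = fuzzy_c_means(X, K=2)
print(centers)
print(U.round(3))
```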
Crisp and Fuzzy Sets
Boolean logic is crisp logic that draws a sharp distinction between one class and another: it forces us to draw lines between class members and non-members. Fuzzy or multivalued logic was introduced by Jan Lukasiewicz, a Polish philosopher, in the 1930s. Lukasiewicz extended the range of truth values to the interval between 0 and 1, representing the possibility that a given statement is true or false. This led to an inexact reasoning logic often called "possibility theory"; hence fuzzy clustering is also known as possibilistic clustering.
Difference between Fuzziness and Probability
Consider this scenario:
You are in your apartment, which has two adjacent rooms: the kitchen and the dining room. At any moment you have a membership either "in the kitchen" or "in the dining room" (the latter being the same as "not in the kitchen"). Suppose you are a person who spends more time in the kitchen than in the dining room. In that case the uncertainty about your location can be quantified with probability: given the current state of our knowledge, we can only assign a probability, say 0.6 for the kitchen and 0.4 for the dining room.
However, if you stand in the doorway between the kitchen and the dining room with only part of your foot in the dining room, then you may be considered "partially in the kitchen" and "partially in the dining room". Quantifying this partial state yields a fuzzy set membership. With only one foot in the dining room, we might say your membership is 0.95 "in the kitchen" and 0.05 "in the dining room". Fuzzy logic is all about degrees of truth, that is, partial or relative truths. In the following figure, bottles A and B are shown with fuzzy membership and probability values: bottle A has a 0.9 fuzzy membership of water (the rest is poison), whereas bottle B has a probability of 0.9 that it contains only water.
The Fuzzy-C-Means (Fuzzy-k-Means)
FCM (also referred to as soft clustering or soft k-means) is frequently used in pattern recognition. The main difference from k-means is that the objective function for the fuzzy c-means algorithm allows graded cluster memberships with values ranging from 0 to 1, whereas the stricter k-means objective allows only one cluster membership value equal to 1 and all others equal to 0. The method was developed by Dunn in 1973 and improved by Bezdek in 1981. The degree of membership can also be interpreted probabilistically as the square root of the a posteriori probability that s is in cluster k (Duda and Hart).
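To make the contrast between the two objectives concrete, a tiny self-contained example shows how a fuzzy membership matrix collapses into k-means-style hard memberships; the matrix U below is a made-up illustration, not data from the article.

```python
import numpy as np

# Example fuzzy membership matrix U (N = 4 points, K = 2 clusters); values are illustrative.
U = np.array([[0.95, 0.05],
              [0.80, 0.20],
              [0.30, 0.70],
              [0.10, 0.90]])

# Fuzzy c-means keeps the full row of graded memberships for each point;
# k-means would keep only a one-hot row (a single 1, all other entries 0).
hard_labels = np.argmax(U, axis=1)          # crisp assignment per point
one_hot = np.eye(U.shape[1])[hard_labels]   # the equivalent k-means-style membership matrix
print(hard_labels)
print(one_hot)
```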
Clustering spherical data using FCM for directional data
Fuzzy clustering by Local Approximation of Memberships
(FLAME) is a data clustering algorithm that defines clusters in the dense parts of a
dataset and performs cluster assignment solely based on the neighborhood
relationships among objects. The key feature of this algorithm is that the
neighborhood relationships among neighboring objects in the feature space are
used to constrain the memberships of neighboring objects in the fuzzy
membership space.
Probabilistic clustering methods
Expectation Maximization for mixture density distribution.
The EM algorithm works best for the exponential family of probability distributions. It is assumed that the observed random data form an independent and identically distributed (i.i.d.) dataset, meaning that the observation of any data sample does not in any way influence the observation of any other sample. Let the data model consist of an observed set of random data vectors X and a set of hidden (latent) random variable vectors Z. The observed variables X and the latent variables Z have a joint distribution over all the data. In the following figure there are two such component distributions, both Gaussian, so they belong to the exponential family; their combination forms a mixture distribution, and the data samples belong to two different clusters. The parameters, which are (1) the prior probability of each cluster class, (2) the mean of each class distribution and (3) the cluster variances, define the joint probability distribution and are unknown.
EM algorithm for 1D data
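As a concrete illustration of this 1D case, the following is a minimal NumPy sketch of EM for a two-component Gaussian mixture; the synthetic data and the initial parameter guesses are assumptions made for the example, not values from the article's figure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative 1D data drawn from two Gaussians (the "true" parameters are assumptions)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 100)])

# Initial guesses for the priors, means and variances of the two components
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    """Gaussian pdf evaluated elementwise."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

for _ in range(100):
    # E-step: posterior responsibility of each component for each sample
    resp = pi[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate priors, means and variances from the responsibilities
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk

print(pi.round(3), mu.round(3), var.round(3))
```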
Expectation Maximization (EM) Algorithm
The Expectation Maximization (EM) Algorithm is a mixture decomposition (resolving) approach to cluster analysis, with the underlying assumption that the patterns to be clustered are drawn from one of several distributions. Mixture decomposition schemes make use of the familiar Bayesian classification rule. The goal is to identify the parameters of each of these distributions and also the total number of such distributions. The individual components of the mixture density are assumed to belong to the exponential family, e.g., Gaussian; in that case the parameters of the individual Gaussians are estimated by the EM algorithm. The problem is to iteratively obtain a maximum likelihood estimate of the parameter vectors of the component densities.
In mixture models it is assumed that there are J distributions, so let there be J clusters, Cj, j = 1, . . . , J, underlying the data set. According to the Bayes classification rule, each vector xn, n = 1, . . . , N, is assigned to a cluster Ck if P(Ck|xn) > P(Cj|xn) for all j ≠ k. The class distributions contribute to the formation of the mixture density p(xn) = Σj p(xn|Cj) Pj, j = 1, 2, . . . , J. Thus, each point xn may be "drawn" from any of the J model distributions with probability Pj, j = 1, 2, . . . , J. This type of modeling can approximate any continuous density function for a sufficient number of mixture components J and appropriate model parameters.
The first step of the procedure involves choosing the set of components that represent the likelihood density function p(x|Cj) in parametric form, that is, p(x|Cj;θ). The computation of the unknown parameters θ and the class probabilities P(Cj), j = 1, 2, . . . , J, is then based on the set of available training samples xn, n = 1, 2, . . . , N.
Mixture decomposition schemes are implemented under the following conditions
- no training data with known cluster labeling are available and
- the a priori probabilities P(Cj) ≡ Pj are also not known.
Hence it is an unsupervised learning method. However, the goal is to classify any data point xn into one of the J different distributions. Since there are J different distributions, the set of unknown parameter vectors θj can be represented as a matrix,
ΘJ = [θ1, …, θj, …, θJ]T,
where θj is the parameter vector corresponding to the jth cluster.
Let Pj = P(Cj) represent the a priori probability for the jth cluster, and collect these priors into the vector
P = [P1, . . . , PJ]T.
The unknown parameter matrix ΘJ can then be augmented to form
Θ = [ΘJ, PT]T.
The iterative method
The basic two-step iterative optimization is applied over all the data samples and classes:
- Step 1 (Expectation): using the current estimate of Θ, compute the posterior probabilities P(Cj|xn) for every sample xn and every cluster Cj.
- Step 2 (Maximization): using these posteriors, re-estimate the parameter vectors θj and the priors Pj so that the expected log-likelihood of the data is maximized.
Mixture models and their decomposition: Mixture of Gaussians (MoG)
It is assumed that the complete probability structure (type of pdf) of the random variable X is known and that all patterns in X are generated by some mixture density. That means the forms of the prior probabilities πj = P(Cj) and of the class conditional (likelihood) probabilities p(x|Cj) are known. However, the parameters that best characterize the clusters underlying X are unknown, and the class labels are also unknown. The unknown parameter vector matrix is represented by Θ.
Figure-9:
Figure-10: 3 Gaussian mixtures of 2D data
Figure-11: 3D plot of 4 Gaussians
Mixture model for words
Figure-12: Gaussian Mixture Centers and Covariances
The cost function Q is constructed on the basis of initial (randomly chosen) parameter vectors θj, and the parameter vectors θj are then estimated by optimizing this cost function. We can therefore decompose the mixture into its components by computing the maximum a posteriori probability: the samples x drawn from the set X with the known mixture density are assigned class labels based on the posterior probability, as per the Bayesian classification rule.
When there are K base distributions we get a mixture model represented by p(x) = Σk πk p(x|θk), k = 1, . . . , K; for Gaussian components, p(x|θk) = N(x|μk, Σk).
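In practice, a Gaussian mixture such as the 2D, three-component examples in the figures can be fitted with an off-the-shelf EM implementation. The sketch below uses scikit-learn's GaussianMixture; the synthetic data is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2D data from three clusters (illustrative, not the article's figure data)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4, 4], scale=0.8, size=(100, 2)),
    rng.normal(loc=[0, 5], scale=0.6, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X)     # EM fit, then Bayes-rule assignment to the most probable component

print(gmm.weights_.round(3))              # estimated priors P_j
print(gmm.means_.round(2))                # estimated component means
print(gmm.predict_proba(X[:3]).round(3))  # posterior probabilities for the first few samples
```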
Dimensionality Reduction
In statistics, data mining, machine learning, and information theory, dimensionality reduction (or dimension reduction) is the process of reducing the dimensionality of the random variables under consideration. Many machine learning data sets contain thousands of features for each data instance. This necessitates large model complexity, makes model training slow, and makes obtaining a good solution to the problem difficult. A dataset with multidimensional feature variables can contain many features carrying similar information, which are therefore linearly correlated. The objective of dimensionality reduction is to minimize these correlations and decrease the data dimensionality without losing too much information. One way to do this is to merge all the correlated features into one; this is also called feature extraction. When the metric used to measure distances between data points is Euclidean, this is equivalent to Principal Coordinates Analysis (PCoA), which then coincides with Principal Component Analysis (PCA).
PCA cannot be used when the data is only piecewise linear or nonlinear. As an example, consider the nonlinear case of data distributed in two concentric circular clouds, one with a smaller radius than the other. Since both sets of data have equal variances in all directions, PCA cannot rank directions by decreasing variance and fails as a dimensionality reduction. However, a polar transformation can be used, so that the two-dimensional circular data is reduced to one-dimensional data (the radius).
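A minimal sketch of this idea, assuming scikit-learn's make_circles to generate two concentric circular clouds (the sample counts and noise level are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_circles

# Two concentric circular clouds; PCA finds no preferred direction of variance here
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Polar transformation: the radius alone is a one-dimensional feature
# that separates the inner cloud from the outer one.
radius = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)

for label in (0, 1):
    print(f"class {label}: mean radius = {radius[y == label].mean():.2f}")
```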
It is often good practice to try to reduce the dimensionality of your training data before you feed it to another machine learning algorithm. This makes the data less complex and the algorithm much faster, with reduced memory use. Reducing the dimensionality may, however, lose some information; even though it speeds up training, most of the time it may also make your system perform slightly worse. So dimensionality reduction should be used only if training is too slow; otherwise, try to use the original data.
These are some of the most common dimensionality reduction algorithms in machine learning:
- Principal Component Analysis (PCA).
- Kernel PCA.
- Locally-Linear Embedding (LLE).
- t-Distributed Stochastic Neighbor Embedding (t-SNE).
- Autoencoders.
An example: data visualization benefits from dimensionality reduction
Dimensionality
reduction is essential for visualizing the pdfs
of multidimensional data. Visualization
is the process of creating diagrams, images, graphs, charts, etc., to
communicate some information. This method can be applied using
unsupervised machine learning.
For example, let’s say you are a football coach and you
have some data about your team’s performance in a tournament. You may want to
find all the statistics about the matches quickly.
You can feed the complex and unlabeled data to some
visualization algorithm.
These algorithms can output a two-dimensional or
three-dimensional representation of your data that can easily be plotted. So,
by seeing the plotted graphs, you can easily get a lot of information. This
information will help you to maintain your winning formula, correct your
previous mistakes and win the ultimate trophy.
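As a sketch of this workflow, high-dimensional records can be reduced to a two-dimensional representation suitable for a scatter plot; t-SNE from scikit-learn is one possible choice here, and the random feature matrix below is purely illustrative, not real match statistics.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # 200 records with 50 features each (illustrative stand-in data)

# Reduce to 2D; each row of X_2d can then be drawn as one point on a scatter plot
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                # (200, 2)
```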
Principal component analysis (PCA): an intuitive explanation
Principal Component Analysis (PCA) is one of the simplest and oldest eigen-analysis-based methods. PCA is a statistical procedure that uses a linear orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. There are many applications for PCA; a major one is dimensionality reduction for data visualization.
Geometrically, PCA is a rigid rotation of the original data matrix, and can be defined as a projection of data points onto a new set of axes such that the maximum variance is "extracted" along the first axis, the largest remaining variance after removing the first axis is captured by the second, orthogonal axis, the next largest remaining variance by the third axis, and so on. The figure illustrates dimensionality reduction by PCA as a projection of the original data dimensions onto a lower-dimensional space.
Figure-13: Eigen Analysis and PCA
A detailed discussion of eigen analysis is beyond the scope of our context; however, the basic methodology is described below. (Also read the blog on Linear Algebra for AI and ML for further discussion of eigen decomposition.) Eigen analysis is performed on a square, symmetric correlation (or covariance) matrix R of a zero-mean data set with I dimensions. The matrix R is derived from the data matrix X after subtracting its mean vector from every data point vector.
There is a unique solution to the eigen analysis, no matter the order of the data. The solution decomposes the square matrix R into its eigenvalues and eigenvectors; the eigenvalues define the extremal values of the variances.
- Each eigenvalue is associated with an eigenvector (ordination axis).
- Axes are ranked by their eigenvalues.
- Thus, the first axis has the highest eigenvalue (variance), the second axis has the second highest eigenvalue, and so on.
- In principal component analysis, eigenvalues are the ‘variance extracted’.
- The eigenvectors are the principal component directions.
- The principal component axes are orthogonal to each other.
- The number of eigenvalues and eigenvectors is the same as the dimensionality of the dataset. In a high-dimensional dataset the number of eigen axes is correspondingly large, and visualization in more than three dimensions is impossible.
Hence it is worth reducing the data dimensionality to low values for data interpretation; a sketch of these eigen-analysis steps follows.
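The eigen-analysis steps listed above can be written directly in NumPy. This is a minimal sketch on a small, randomly generated, correlated data matrix (an assumption for illustration, not data from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])   # correlated 3D data (illustrative)

Xc = X - X.mean(axis=0)                 # subtract the mean vector from every data point
R = np.cov(Xc, rowvar=False)            # square, symmetric covariance matrix

eigvals, eigvecs = np.linalg.eigh(R)    # eigen decomposition of R
order = np.argsort(eigvals)[::-1]       # rank axes by eigenvalue (variance extracted)
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals.round(3))                 # variances along the principal axes, in decreasing order
print(eigvecs.round(3))                 # columns are the principal component directions
```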
Geometric Rationale of PCA
The geometric rationale of PCA is illustrated by the following figures with 2D data. The orthogonal axes corresponding to x1 and x2 can be rotated so that the x1-axis aligns with the line along which the data has maximum variance. The objective of PCA is to rigidly rotate the axes of the I-dimensional data space to new directions, called principal axes, that have the following properties:
Principal axis 1 is the direction along which the data projections have the highest variance, principal axis 2 has the next highest variance, and so on, until the Ith principal axis has the lowest variance of all. The covariance between each pair of principal axes is zero (the principal axes are uncorrelated).
PC1 is the direction of maximum variance in the cloud of I-dimensional points. The squared distances of the data points from the PC1 axis are minimal; the line along this axis is the "line of best fit" in the least-squares sense.
PC2 is in the direction of the next highest variance, subject
to the constraint that it has zero covariance with PC1.
PC3 is in the direction of the next highest variance, subject to the constraint that it has zero covariance with both PC1 and PC2, and so on, up to PCI.
Figure-14: (A) Data distribution in 3D space. (B) Finding a new orthogonal coordinate system with axes pointing in the directions of largest variance in data space. (C) Using a subspace axes system to describe the data in reduced dimensionality.
Dimensionality reduction with PCA
The goal of PCA is to find the set of orthogonal axes along which the multidimensional data carries the most information. Given the original data vector x, the l < I principal components may be computed by projecting x onto the l-dimensional subspace. This results in a transformation from the I-dimensional input data space to the l-dimensional subspace. If the dimensionality of the data is I, we need to find a dimensionality l < I that conserves most of the valuable information contained in the original data set; as a simple example, if I = 3 in the figure above, then l = 2.
Since variance carries information, we can find an optimal set of orthogonal directions, which are the eigenvectors of the covariance matrix R of the mean-subtracted data. The eigenvalues then reflect the variance along these basis vectors. We can arrange the I eigenvalues in order of decreasing magnitude and choose the l eigenvectors along which the variance values are highest. An approximation of the data, with dimensionality l < I, can then be reconstructed from the resulting coefficients.
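A short sketch of this reduction and reconstruction using scikit-learn's PCA; the 3D data below is again an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.2])   # 3D data, I = 3 (illustrative)

pca = PCA(n_components=2)               # keep l = 2 principal components
Z = pca.fit_transform(X)                # project onto the l-dimensional subspace
X_rec = pca.inverse_transform(Z)        # reconstruct an approximation in the original space

print(pca.explained_variance_ratio_.round(3))   # fraction of variance kept by each component
print(np.abs(X - X_rec).mean().round(3))        # average reconstruction error
```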
Figure-15: PCA as a projection of data points. Typical data sets will have many more than 3 species.
For further illustrative figures on
PCA please read blog article on “Paradigms of Learning algorithms for AI and
ML” and “Unsupervised Algorithms Part-1”
References
- Sergios Theodoridis, Konstantinos Koutroumbas, "Pattern Recognition", Academic Press, 2006.
- Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2007.
- Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", John Wiley and Sons, 2nd Ed.
- Steven M. Kay, "Fundamentals of Statistical Signal Processing, Detection and Estimation", Prentice Hall.
Image Credits:
Figure-1: wikipedia.org
Figure-2: wikipedia.org
Figure-3: wikipedia.org
Figure-4: researchgate.net
Figure-5: cf2.pptonline.org
Figure-6: slideplayer.com
Figure-7: Clustering algorithm for directional data (FCM4DD).pdf
Figure-9: miro.medium.com
Figure-10: pypr.sourceforge.net
Figure-11: https://www.researchgate.net
Figure-12: lh3.googleusercontent.com
Figure-13: Heartbeat – Fritz AI.com
Figure-14: mengnote.blogspot.com
Figure-15: ordination.okstate.edu