Supervised Learning Algorithms
Supervised Learning – Classification and Regression
Supervised learning makes use of a priori information about output targets for the input data. When data instances are given with known targets corresponding to correct outputs, the learning method is called supervised. These methods construct a model based on available data with known targets. The type of input data features may be continuous, categorical or binary. Supervised algorithms can generally be categorized into classification and regression tasks. When the outputs are discrete valued, it is a classification learning problem; if the targets are continuous valued, it is regression. The following figures explain the idea of supervised learning tasks for classification and regression.
Classification
Consider a binary classification problem. There are only two classes of data: the black and the white dots. The outer box contains the entire data for both classes. The inner box holds a subset of data observations, which forms the training set with known labels. The task is to learn the target labels from the training set. The model learns to predict, with high accuracy, the classes of unknown data that were not available during training. When the number of classes is more than two, the task is known as a multiclass problem.
Classification
Figure-1
Regression
Linear regression
Figure-2 shows the relationship between two continuous valued univariates (only one feature value). The independent variable "x" is represented along the horizontal axis and the dependent variable "y" along the vertical axis. The ideal value of y varies linearly with x. The training set contains the observed data, shown as blue dots; it includes random noise and hence does not fall exactly on the red line, which represents the ideal relation between x and y. The task is to learn the linear relationship from the available training data and predict the value of y with minimum error for a value of x not included in the training set.
Linear Regression
Figure-2
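A minimal sketch of fitting such a linear relationship by least squares with NumPy; the generated noisy data and the query point x_new below are illustrative, not taken from the figure:

```python
import numpy as np

# illustrative training data: y varies linearly with x, plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# predict y for a value of x not included in the training set
x_new = 4.3
y_pred = slope * x_new + intercept
```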
Nonlinear regression
Figure-3 illustrates a nonlinear relationship between x and y. The red curve shows the true relation between x and y. The circles are the data points, which include random noise.
Non-linear Regression
Figure-3
Training and Predicting
During training with the known targets, the learning model develops a hypothesis that approximates the true mapping from the input data variables to their target labels or values. The trained model can then be tested on future incoming data for which no target outputs are available.
Important supervised machine learning algorithms
A list of functional categories of supervised methods is given below, followed by more detailed descriptions.
Statistical learning algorithms
Statistical learning algorithms rely on Bayesian learning.
Figure-4:
The simplest statistical machine learning model is the naive Bayes model. Another variation is the Averaged One-Dependence Estimators (AODE).
Artificial Neural Network based algorithms
These are connectionist models. Examples are Single-layered (Perceptrons), Multi-layered Networks, Deep Neural Networks, Radial Basis Function Networks, etc.
Figure-5:
Convolutional Neural Networks (CNNs), Recurrent Neural Networks, Recursive Neural Networks.
Figure-6:
Support Vector Machines (SVMs)
These methods are based on maximizing the margins between the classes of data, thereby improving generalization performance.
Figure-7:
Logic based algorithms
Decision Trees, Random Forests, Classification and Regression Trees (CART), Iterative Dichotomizers (ID3), Chi-squared Automatic Interaction Detection (CHAID).
Figure-8:
The learning methods above are sometimes called Eager learners; they take more time for training and less time to predict on new input data.
The other type of methods learn to recognize new data based on related data in the stored training database. This type of learning is also called memory based learning or instance based learning. They take less time to learn but more time to predict, and hence are sometimes called Lazy learners. Instance based algorithms are of this type.
Instance Based Algorithms (Memory Based methods) – Nearest Neighbor and k-Nearest Neighbor classifiers; Case based reasoning methods are similar to k-NN.
They classify new data by analyzing similar data stored in memory and ignoring those in memory that are different. They solve new problems based on the solutions of similar past problems.
Nearest Neighbor
Figure-9:
k-Nearest Neighbor
Figure-10:
All the above approaches assume that the input
feature vector data are of fixed dimensionality.
The k-Nearest Neighbours algorithm does not build a model of the labelled training data, but instead uses the labelled dataset directly each time a prediction is required. Such algorithms are called transductive algorithms, and the k-Nearest Neighbours algorithm is a classical example.
The Bayesian Methods
These
methods are based on statistical decision theory and rely on Bayes' theorem.
Statistical learning algorithms are characterized by an explicit underlying probability distribution, which provides the probability of an observed data point belonging to a certain class. Observed data is evidence of a random variable belonging to a certain distribution. There are several types of discrete as well as continuous probability distributions that define the randomness of the generated data. A few examples of distributions for discrete valued variables are the Binomial, Bernoulli, Poisson, Multinomial, etc. A few examples of distributions for continuous valued variables are the Uniform, Gaussian, gamma, beta, log-normal, Weibull, exponential, Erlang, Chi-squared, Student's t, etc. For more details refer to the blog post on Probability and Statistics for Machine Learning and Artificial Intelligence Part-1.
Statistical learning methods range from simple calculation of averages to the construction of complex models such as Bayesian networks and neural networks. They have applications throughout computer science, engineering, neurobiology, psychology, and physics.
Bayesian learning methods formulate learning as a form of probabilistic inference, using Bayes' theorem. The Bayesian learning technique incrementally updates our prior belief by computing the posterior probability P(H|D) of the hypothesis H whenever new data evidence D is observed.
It relies on the fundamental result of probability theory known as Bayes' theorem, which says that the update P(H|D) can be computed by multiplying the likelihood P(D|H) of the data occurring given the hypothesis by the prior probability of the hypothesis P(H), and dividing the product by the probability of the data P(D): P(H|D) = P(D|H) P(H) / P(D).
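As a small numerical illustration of this update, the spam-filtering scenario and all the numbers below are hypothetical:

```python
# Hypothesis H: "an email is spam"; evidence D: "the email contains the word 'offer'"
p_h = 0.2               # prior P(H): 20% of emails are spam
p_d_given_h = 0.6       # likelihood P(D|H)
p_d_given_not_h = 0.05  # likelihood P(D|not H)

# marginal probability of the evidence, summing over both hypotheses
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# posterior P(H|D) via Bayes' theorem
p_h_given_d = p_d_given_h * p_h / p_d
print(round(p_h_given_d, 3))  # 0.75
```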
Bayes' theorem is one of the most important rules of probability theory used in Data Science. It provides us with a way to update our beliefs and inferences based on the observation of new events. In probability and statistics, Bayes' theorem plays a role comparable to that of the Pythagorean theorem in geometry. The method finds applications in a wide range of areas such as finance, science, law, medicine, engineering, sports, philosophy, etc.
Assume that we observe a multi-dimensional feature vector "s" for a two-class pattern recognition problem. We can derive the posterior probability of the class label given the observed data. It is called the posterior since it computes the probability that the given data belongs to the class y1. This is expressed through the conditional probability function P(y1|s).
Fruit – Box Problem
Figure-11:
P(y1|s) is read as the posterior probability for class label "1" given the feature s. The probability P(s|y1) is the likelihood probability of observing the data in the class y1. P(y1) measures our prior belief about class y1 and is called the prior probability of y1. P(s) is the marginal probability of the data feature "s". It is called marginal since it is computed by marginalizing over the class label in the joint probability of the data and the class.
Correspondingly, for the class label "2", P(y2|s) = P(s|y2) P(y2) / P(s).
For the two-class problem, the classification rule as per Bayesian decision theory can now be stated as:
If P(y1|s) > P(y2|s), then s is classified as belonging to y1.
If P(y1|s) < P(y2|s), then s is classified as belonging to y2.
Probabilistic separation
of two classes
Figure-12:
Venn diagrams of the two class regions; the red color marks the region of erroneous prediction.
Figure-13:
Bayesian Networks and DAGs
Bayesian methods are implemented with Bayesian networks, which are Directed Acyclic Graphs (DAGs). DAGs are graphs that have directed links and contain no cycles. A Bayesian network essentially consists of random variables and a graph structure that encodes the dependencies between the variables.
Implementation of the Bayesian Rule
Figure-14:
Supervised learning of the Bayesian Network
Training of Bayesian networks involves learning the relationships between the computational elements given labelled data. Inference in the network involves computing the probability of the class variables once the feature data variables are observed. During training, the true class eventually dominates the class prediction for observed data; this is characteristic of Bayesian learning.
For a multiclass problem, the decision rule
This method can be used for multiclass prediction as well. The decision criterion is now stated as:
Assign s to class yk if gk(s) > gj(s) for all j ≠ k.
Naïve Bayes
Naïve Bayes is a very simple supervised statistical machine-learning algorithm that uses Bayes' theorem. This classifier is a machine learning model that learns to discriminate objects based on their features.
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. These are called Naïve Bayes classifiers since they rely on the naive assumption that all measured features (attributes) are independent of each other. Independence means that a random variable is unaffected by all other variables. (The AODE method was introduced to overcome this limitation.)
For the above posterior probabilities for class labels "j" and "k", the larger probability indicates the class assigned to the feature "s".
Naïve Bayes finds application in spam filtering, data classification, etc. An advantage of the Naive Bayes method is that it requires less training compared to other methods.
Conditional Independence
Assume a simple example in which persons B and C are classmates. The following figure illustrates the concept of conditional independence.
Figure-15:
If there is a road block, represented by variable A, they both come to class late, and the two events depend only on A. However, if the roads are clear and both B and C still come late, are these two events dependent?
NB Gaussian classifier
Development of Python code from scratch involves the following steps (a minimal sketch is given after the step list):
Consider that we use the Iris flower dataset.
- Step 1: Write the code to summarize the dataset by calculating feature-wise mean values and variances.
The Iris dataset
The Iris Flower dataset was used by R.A. Fisher in his classic 1936 paper, "The use of multiple measurements in taxonomic problems", as an example of linear discriminant analysis. The dataset involves predicting the flower species given the measurements (feature data) of iris flowers. It was the botanist Edgar Anderson who catalogued the attributes of Iris flowers to study their variations.
It is a multiclass classification
problem. The number of observations for each class is balanced. There are 150
observations with 4 input variables and 1 output variable. The variable names
are as follows:
- Sepal length in cm.
- Sepal width in cm.
- Petal length in cm.
- Petal width in cm.
- Class (known a priori).
There are 4 features and one target, which is categorical since we have 3 classes (Setosa, Versicolor and Virginica, encoded as 0, 1 and 2).
Figure-16:
Further steps
- Step 2: Summarize the data by class, by calculating feature-wise mean values and variances for each class.
- Step 3: Code the Gaussian probability density function as per the equation.
- Step 4: Determine the class probabilities using Bayes' theorem.
- Step 5: Write the code to separate the classes.
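A minimal sketch of these steps, assuming the Iris data is loaded via scikit-learn; the helper names and the train/test split below are illustrative, not the author's original code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Steps 1-2: summarize the data feature-wise, per class (means, variances, priors)
def summarize_by_class(X, y):
    summaries = {}
    for c in np.unique(y):
        Xc = X[y == c]
        summaries[c] = (Xc.mean(axis=0), Xc.var(axis=0), len(Xc) / len(X))
    return summaries

# Step 3: Gaussian probability density function
def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Step 4: class scores via Bayes' theorem (the common factor P(s) is dropped);
# the naive independence assumption lets us multiply per-feature likelihoods
def class_scores(summaries, x):
    return {c: prior * np.prod(gaussian_pdf(x, mean, var))
            for c, (mean, var, prior) in summaries.items()}

# Step 5: assign each test point to the class with the largest score
def predict(summaries, X):
    preds = []
    for x in X:
        scores = class_scores(summaries, x)
        preds.append(max(scores, key=scores.get))
    return np.array(preds)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = summarize_by_class(X_train, y_train)
print("test accuracy:", np.mean(predict(model, X_test) == y_test))
```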
Artificial Neural Networks (ANNs) and Deep Learning – Single-layered (Perceptrons), Multilayered, Recurrent Networks
Artificial Neural Networks are coarse and abstract mathematical models of biological neural networks. They aim to emulate the structure and functions of the human brain. ANNs have an input layer, an output layer and possibly hidden layers. Each layer can have several nodes. The input layer is generally not counted as a layer. Therefore a single-layer network will have an input layer and an output layer, while a two-layer network will have an input layer, a hidden layer and an output layer.
A deep neural network (DNN) is an ANN with two or more hidden layers between the input and output layers. ANNs with fewer than two hidden layers are called non-deep neural networks or shallow networks. DNNs can model complex non-linear relations between input data and targeted outputs. DNNs perform better than shallow networks when the input data is high dimensional, and they exhibit superior performance as the data size grows large compared to other learning models such as decision trees and support vector machines. As an example, shallow networks are not feasible models for facial recognition problems, for which deep learning methods are necessary.
The Biological Neuron
Every biological neuron has four basic elements: soma, dendrites, axon and synaptic terminals.
Figure-17:
McCulloch-Pitts model
The McCulloch-Pitts (MP) model is a very coarse and simplistic mathematical approximation of a biological neuron.
Simplest Perceptron
Figure-18:
Graphical Representation of Training Data
Figure-19:
The architecture of a network is closely related to the training algorithm that is used to train the network. Based on the feed-forward and feedback connections, network architectures are classified into three classes.
Single-Layer Feedforward Network: In a single-layer feed-forward network there is an input layer of source nodes and an output layer. The architecture is illustrated in Figure-20.
Figure-20:
Multi-layered Feedforward Networks: This class of networks consists of multiple layers of computational units.
Figure-21:
Recurrent Networks: The third type of neural network architecture is the recurrent neural network. Recurrent networks contain feedback connections between the output layer and inner layers.
Figure-22:
Other types of ANNs
There are other types of ANN architectures, such as Radial Basis Function Networks, Self-Organizing Maps (a.k.a. Kohonen networks), Hopfield Networks and their variations like Boltzmann machines, and deep learning architectures such as Convolutional Neural Networks, Gated Recurrent Neural Networks, Long Short-Term Memory Networks (LSTMs), Restricted Boltzmann Machines, Deep Belief Networks, Auto-Encoders, Variational Auto-Encoders and Generative Adversarial Networks. For an overview see Figure-6. These topics will be discussed in later blogs.
Training the ANNs
Artificial Neural Networks make use of an iterative learning algorithm.
Perceptron
A single-layered perceptron with a single neuron is the simplest of artificial neural networks.
The neuron computes the sum of its weighted inputs (a linear combination). This summation, including the bias value, is called its activation. The activation is transformed into the neuron output by a nonlinear transfer function called the activation function.
All perceptron networks use the McCulloch-Pitts model with a threshold output function as the activation function.
Training the Perceptron
Perceptron learning algorithm
The output values are compared to the target values, resulting in an error function or a cost function. The optimal values of the weights are determined by iteratively updating the weight values according to the perceptron learning algorithm. The network thus learns to model the mapping between the input data and the target outputs.
The perceptron learning algorithm has a simple interpretation, as follows. Let the training patterns be presented in a stochastic manner, one at each iteration step. For each pattern s(m) we evaluate the perceptron output. If the pattern is correctly classified, the weight vector remains unchanged.
For class C1 (t(m) = 1), if misclassified: we add a fraction of the input to the weight vector.
For class C2 (t(m) = -1), if misclassified: we subtract a fraction of the input from the weight vector.
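A minimal sketch of this update rule; the feature matrix X, the targets t in {+1, -1} and the learning rate eta are illustrative names, not the author's notation:

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, epochs=50):
    """Train a single-neuron perceptron with targets t in {+1, -1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for s, target in zip(X, t):
            # threshold activation: predict +1 or -1
            y = 1 if np.dot(w, s) + b >= 0 else -1
            if y != target:
                # misclassified: add a fraction of the input for class C1 (target +1),
                # subtract it for class C2 (target -1)
                w += eta * target * s
                b += eta * target
    return w, b
```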
Classification of two persons using voice recognition
- Task: Learn to discriminate between two different voices saying "Hello"
- Data
  - Sources: Steve, David
  - Format: frequency distribution (60 bins); analogy: the cochlea
Frequency distribution of speech waveform
Figure-23:
- Network architecture
  - Feed-forward network
  - 60 inputs (one for each frequency bin)
  - 6 hidden units
  - 2 outputs (0-1 for "Steve", 1-0 for "David")
Classification of Steve
Figure-24:
Classification of David
Figure-25:
Logistic Regression for classification
Logistic regression is a regression analysis model that can be used for probability prediction and binary classification. It makes use of the sigmoid function. The value of the dependent variable ranges in the interval [0, 1], while the independent variable can range from negative infinity to positive infinity (−∞, ∞).
Logistic Regression of two classes
Figure-25:
It is widely used for data mining applications, classification, detection and so on in diverse fields.
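A minimal sketch of the sigmoid and the resulting class probability; the parameter names w and b and the example values are illustrative:

```python
import numpy as np

def sigmoid(z):
    # maps any real activation z in (-inf, inf) to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # probability that x belongs to class 1 under a logistic regression model
    return sigmoid(np.dot(w, x) + b)

# classify as class 1 when the predicted probability exceeds 0.5
x = np.array([1.2, -0.7])
w, b = np.array([0.8, 0.5]), -0.1
label = int(predict_proba(x, w, b) > 0.5)
```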
Single layer network with linear neurons
Adapting the weights of a linear neuron
Linear
neuron networks can be used to construct linear adaptive filters.
The output of the neuron is a nonlinear function of the activation xj. The activation function f(.) is also sometimes called the transfer function of the artificial neuron.
A single neuron
with multiple inputs and bias
Figure-26:
The output values are compared to the target values, resulting in an error function or a cost function. The optimal values of the weights are determined by iteratively updating the weight values with a gradient descent algorithm that minimizes the error function over the training data. The network thus learns to model the mapping between the input data and the target outputs.
Gradient Descent Algorithm
The gradient descent (steepest descent) algorithm and the
perceptron learning rule
According to the gradient descent algorithm, the weight update at the mth iteration is w(m+1) = w(m) − η g(m), where the total error due to misclassifications for a perceptron at the mth iteration is a function of w(m) given by a summation over the misclassified patterns, η is called the learning rate, and g(m) is the gradient of the error function at the mth iteration. The total error function may also be called the cost function.
Applying the gradient descent rule, the weight of each individual synapse is updated as wi(m+1) = wi(m) − η ∂E(m)/∂wi. The method is called gradient descent since the direction of the incremental updates is the negative of the direction of the gradient.
The following figures show the gradient descent in 1D weight space.

Gradient descent in 1D weight space
Figure-27:
Offline/Full Batch learning or learning by Epoch
Also called off-line learning, the weight update is performed only after the presentation of the entire set of data pairs, i.e., only after an epoch.
The steps involved are listed below (a minimal sketch follows the list).
- Initialize the weight and bias parameters.
- For every iteration, compute the error for all n data points and sum them up.
- Compute the error gradient of the summed error.
- Update the weight and bias values.
- Repeat steps 2 to 4 until the error is minimized.
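A minimal sketch of full-batch gradient descent for a single linear neuron with a squared-error cost; the data X, targets t and learning rate eta are illustrative names:

```python
import numpy as np

def full_batch_gd(X, t, eta=0.01, epochs=100):
    """Full-batch gradient descent for a linear neuron y = X @ w + b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # compute the error over the entire data set (one epoch)
        err = X @ w + b - t
        # gradient of the summed squared error (up to a constant factor)
        grad_w = X.T @ err
        grad_b = err.sum()
        # update weights and bias in the negative gradient direction
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```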
Online learning / Stochastic Gradient Descent learning or Pattern-by-Pattern learning
When the training pairs are presented pattern-by-pattern, at each iteration step "m" a single training sample is presented.
The steps involved are
- Initialize the weight and bias parameters.
- For every iteration, compute the error for the individual data point.
- Compute the error gradient of that error.
- Update the weight and bias values.
- Complete steps 2 to 4 for the entire data set.
- Repeat steps 2 to 5 until the error function is minimized.
Mini-Batch learning
The disadvantage of both stochastic gradient descent and full-batch gradient descent training is that convergence is very slow. A solution to speed up training with the gradient descent method is to update the weight values after summing the errors over smaller sets (mini-batches) of training data samples. The weights are updated after summing the squared errors over each mini-batch; hence this method is called mini-batch training.
The error function is defined over the mini-batch of data; the number of data points in a mini-batch is MB. The steps involved are listed below (a minimal sketch follows the list).
- Initialize the weight and bias parameters.
- For every iteration, select small data subsets called mini-batches.
- Compute the error for the MB data points in the mini-batch and sum them up.
- Compute the error gradient of the summed error.
- Update the weight and bias values.
- Repeat steps 2 to 5 until the error is minimized.
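A minimal sketch of the mini-batch variant, reusing the illustrative linear-neuron setup above; setting batch_size to 1 reduces it to stochastic, pattern-by-pattern learning:

```python
import numpy as np

def mini_batch_gd(X, t, eta=0.01, epochs=100, batch_size=16):
    """Mini-batch gradient descent for a linear neuron y = X @ w + b."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)         # shuffle before forming mini-batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - t[idx]         # error summed only over the mini-batch
            w -= eta * (X[idx].T @ err)
            b -= eta * err.sum()
    return w, b
```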
Comparison of error gradient paths in 2D weight
space
Figure-28:
Support Vector Machines
Support-vector machines (SVMs, sometimes also called support-vector networks) are non-probabilistic supervised learning models, with associated learning algorithms, that analyse data for classification and regression analysis.
Support Vector Machines for classification
SVMs
belong to a family of generalized linear classifiers and can be interpreted as
an extension of the perceptron. They can also be considered a special case of
Tikhonov regularization. SVMs have the capability to simultaneously minimize
the classification error and maximize the geometric margin; hence
they are also known as maximum margin classifiers. An SVM is a
classifier that separates a region of space using a maximum-margin hyperplane.
Figure-29:
The above figure illustrates the canonical planes w·x − b = +1/−1 and the optimal hyperplane w·x − b = 0. The parameters of the maximum-margin (optimal) hyperplane are derived by solving an optimization problem. There exist several specialized algorithms for quickly solving the quadratic programming (QP) problem that arises from SVMs.
Support vector classifiers are essentially linear classifiers, which learn to classify data into two classes. However, in addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. Linearly inseparable data in a low-dimensional space can be projected into a higher-dimensional space where it is linearly separable by a hyperplane. This hyperplane corresponds to a non-linear hypersurface in the original lower-dimensional space.
SVM in higher dimensional space
Figure-30:
A popularly used kernel function is the Gaussian radial basis function. Valid kernels exist only if they fulfil certain non-stringent conditions, given by Mercer's conditions.
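A minimal sketch of a kernelized SVM classifier using scikit-learn's SVC with a Gaussian RBF kernel; the toy dataset and the parameter values are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# a linearly inseparable two-class toy dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the RBF kernel implicitly maps the data to a high-dimensional space;
# C trades off margin width against classification error
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```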
Multiclass SVMs
The support
vector machine is fundamentally a two-class classifier. In practice, however,
we often have to tackle problems involving K > 2
classes.
Multiclass SVM
Figure-31:
From the figure above it is observed that multiclass separation is similar to binary classification. It can be achieved by training the model with the data samples from a class yk as the positive examples and the data from the remaining K − 1 classes as the negative examples. This is known as the one-versus-the-rest approach.
Figure-32:
Another approach is to train K(K−1)/2 different 2-class SVMs on all
possible pairs of classes, and then to classify test points according to which
class has the highest number of ‘votes’, an approach that is sometimes called one-versus-one.
Figure-33:
SVM for Regression
The support vector method can also be applied to regression problems by introducing an appropriate loss function. Support vector regression (SVR) was introduced in 1996. The figure shows the regression function that can be learned by the SVR using an epsilon-insensitive tube.
SVR with epsilon insensitive tube
Figure-34:
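A minimal sketch of support vector regression with scikit-learn's SVR, where epsilon controls the width of the insensitive tube; the generated data and parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# noisy samples of a nonlinear function
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# errors smaller than epsilon incur no penalty (the epsilon-insensitive tube)
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X, y)
y_pred = reg.predict(X)
```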
Bayesian SVM
In 2011 it was shown
by Polson and Scott that the SVM admits a Bayesian interpretation through the
technique of data augmentation. In this approach the SVM is
viewed as a graphical model (where the parameters are
connected via probability distributions). This extended view allows the
application of Bayesian techniques to SVMs, such as
flexible feature modeling, automatic hyperparameter tuning,
and predictive uncertainty quantification. Recently, a scalable version of the Bayesian SVM was developed by Wenzel et al.
enabling the application of Bayesian SVMs to big data.
Support Vector method for clustering
When data are unlabelled, supervised learning is not
possible, and an unsupervised learning approach is required, which
attempts to find natural clustering of the data to groups, and then
map new data to these formed groups. The support-vector clustering algorithm,
created by Hava Siegelmann and Vladimir Vapnik, applies the
statistics of support vectors, developed in the support vector machines
algorithm, to categorize unlabeled data, and is one of the most widely used
clustering algorithms in industrial applications.
Decision trees
The Bayes classifier, artificial neural network and SVM methods considered so far for pattern recognition problems are based on feature vectors which may be real or discrete valued numbers. In all such cases there exists some notion of a metric, a natural measure of distance or similarity/dissimilarity between the feature variables. However, pattern recognition problems can also involve qualitative data which may be nominal, categorical, ordinal or a list of attributes. These features are not measurable in the sense of a metric.
Decision trees are directed networks that build classification or regression models in the form of a tree structure. A decision tree is a flowchart with a hierarchical structure which has several decision nodes and leaf nodes connected by directed links or branches.
For given data, when the terminal nodes predict a label or a category, the tree is called a classification tree. When the terminal nodes predict a real-valued number, or equations that yield a specific output value, the tree is called a regression tree.
The topmost decision node is called the root node and is linked to descendant decision nodes. The process of decision making starts at the root node and progresses recursively through internal nodes until we reach a terminal or leaf node (a descendant node without children). The nodes between the root node and the leaf nodes are called internal nodes.
Decision trees learn by utilizing if-then rule sets which are mutually exclusive and exhaustive. The rules are learned sequentially, one at a time, using the training data. This process is continued on the training set until a termination condition is met. The tree is constructed in a top-down, recursive, divide-and-conquer manner.
The 20-questions game is an example of sequential decision making, which is useful for classification using nonmetric features. All questions can be answered as true/false, yes/no, or with a value (property) from a set of possible nonmetric values.
Structure of Decision Tree
Figure-35:
The number of splits depends on how many decision outcomes there are at a node. For binary-valued splits the number of branches is only two; for multivalued splits there are more than two. The number of splits may vary throughout the tree.
The basic method of the training algorithm is as follows (a minimal sketch follows the steps):
1. Select the best feature at every decision node using an impurity measure or attribute selection measure.
2. Split the data set at the decision node using the feature value.
3. Progress recursively until a terminal leaf node is reached, i.e., all remaining data has the same impurity value, there are no more remaining attributes, or there is no more data.
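A minimal sketch of this procedure using scikit-learn's DecisionTreeClassifier on the Iris data; the Gini impurity criterion and the max_depth value are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# at each node the feature/threshold giving the best impurity reduction is chosen
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=["sepal length", "sepal width",
                                       "petal length", "petal width"]))
```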
Pros
- Decision trees are easy to interpret and visualize.
- They can easily capture non-linear patterns.
- They require less data pre-processing from the user; for example, there is no need to normalize columns.
- They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection.
- The decision tree makes no assumptions about the data distribution because of the non-parametric nature of the algorithm.
Cons
- Sensitive to noisy data; it can overfit noisy data.
- A small variation (or variance) in the data can result in a different decision tree. This can be reduced by bagging and boosting algorithms.
- Decision trees are biased with imbalanced datasets, so it is recommended to balance the dataset before creating the decision tree.
Classification accuracy of rule-based learning algorithms (such as decision trees) can be improved by combining several classifiers.
Multiclass classification with Decision Tree
Figure-36:
Memory Based Learning
In memory based learning, a set of past experiences {(sn, tn)}, n = 1, …, N, is stored as input-output pattern pairs, where sn denotes the nth sample of the input vector and tn denotes the corresponding desired response. The set of input-output pairs represents an association between a set of input vectors and the desired output vectors. The desired response (target) may also be a scalar quantity tn. When a test vector stest which is not included in the stored patterns is presented, it is classified into one of the labeled groups based on its local neighborhood.
Therefore all memory based learning algorithms involve two criteria:
- The criterion used for defining a local neighborhood of a test vector stest.
- The learning rule applied to the training examples in the local neighborhood of stest.
Nearest Neighbor rule.
Let the training data set, denoted by D = {s1, s2, …, sN}, consist of N labelled prototypes, and let stest be an unknown vector. The local neighborhood of the test vector stest is defined as the region that lies in its immediate neighborhood. A stored vector sk is considered to be the nearest neighbour of stest if its Euclidean distance to stest is the minimum over all vectors sn, n = 1, 2, …, N. The label associated with this minimum-distance prototype is reported as the classification of stest, i.e., the nearest neighbour's class is assigned to stest.
The response (output) of the network due to the input stest will be the same as the response due to its nearest neighbor.
The above rule is independent of the underlying distribution responsible for generating the training vectors. An example of a memory based classifier that performs classification using pairs of data is the radial basis function network.
k-nearest neighbour classifier
A variant of the nearest neighbour classifier is the k-nearest neighbour classifier. The learning rule is as follows (a minimal sketch is given after the list):
- For some integer value of k, identify the k labelled prototype patterns that lie nearest to the test vector stest.
- Assign the test vector to the class that is most frequently represented among the k nearest neighbours of stest.
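A minimal sketch of the k-nearest neighbour rule using the Euclidean distance; the stored prototypes D (a 2D array), the label array and the value of k are illustrative names:

```python
import numpy as np
from collections import Counter

def knn_predict(D, labels, s_test, k=3):
    """Classify s_test by a majority vote among its k nearest stored prototypes."""
    # Euclidean distances from s_test to every stored prototype
    distances = np.linalg.norm(D - s_test, axis=1)
    # indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # majority vote; k = 1 reduces to the plain nearest neighbour rule
    return Counter(labels[nearest]).most_common(1)[0][0]
```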
Before and after classification with k-nearest neighbors
Figure-37:
References:
- Simon Haykin, Neural Networks, Pearson Education Asia (2001), 2nd edition.
- Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, 4th Edition.
- Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, 2nd Edition.
- Kevin P. Murphy, Machine Learning: A Probabilistic Perspective.
- Christopher Bishop, Pattern Recognition and Machine Learning.
Figure Credits:
- Figure-2: wikimedia.org.
- Figure-3: blogs.mathworks.com.
- Figure-5: upload.wikimedia.org.
- Figure-6: hackernoon.com.
- Figure-7: agdal1125.github.io.
- Figure-8: www.sfu.ca.
- Figure-9: slideplayer.com.
- Figure-10: pbs.twimg.com.
- Figure-16: thegoodpython.com.
- Figure-17: teaching.ncl.ac.uk.
- Figure-21: qph.fs.quoracdn.net.
- Figure-22: ailabpagecom.files.wordpress.com.
- Figure-27: i0.wp.com/kraj3.com.np.
- Figure-29,30: upload.wikimedia.org.
- Figure-31: Multiclass classification problem solved by multiclass support vector machines, researchgate.net.
- Figure-33: Binary and multiclass strategies for machine learning, slidesharecdn.com.
- Figure-34: miro.medium.com.
- Figure-35: static.javatpoint.com