Supervised Learning Algorithms


Supervised learning – Classification and Regression

Supervised learning makes use of a priori information about output targets for the input data. When data instances are given with known targets corresponding to correct outputs, the learning method is called supervised. These methods construct a model from available data with known targets. The input features may be continuous, categorical or binary. Supervised algorithms are generally categorized into classification and regression tasks. When the outputs are discrete valued, it is a classification problem; when the targets are continuous valued, it is a regression problem.


The following figures explain the idea of supervised learning for classification and regression.

Classification

Consider a binary classification problem. There are only two classes of data: the black and the white dots. The outer box contains the entire data for both classes. The inner box holds a subset of observations that forms the training set, with known labels. The task is to learn the target labels from the training set. The model learns to predict, with high accuracy, the classes of unknown data that were not available during training. When the number of classes is more than two, the task is known as a multiclass problem.

Classification 
Figure-1
Regression

Linear regression
Figure-2 shows the relationship between two continuous-valued univariates (only one feature value). The independent variable “x” is represented along the horizontal axis and the dependent variable “y” along the vertical axis. The ideal value of y varies linearly with x. The training set contains the observed data, shown as blue dots; these include random noise and hence do not fall exactly on the red line, which represents the ideal relation between x and y. The task is to learn the linear relationship from the available training data and predict the value of y with minimum error for a value of x not included in the training set.
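As an illustration, such a line can be fitted by ordinary least squares. A minimal sketch, assuming NumPy and synthetic data (the slope, intercept and noise level below are made-up illustrative values):

```python
import numpy as np

# Synthetic training data: y = 2x + 1 plus random noise (illustrative values)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)

# Fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Predict y for an x value not seen during training
x_new = 4.2
print(f"predicted y({x_new}) = {slope * x_new + intercept:.2f}")
```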


Linear Regression
Figure-2

Nonlinear regression
Figure-3 illustrates a nonlinear relationship between x and y. The red curve shows the true relation between x and y. The circles are the data points, which include random noise.
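Such a curve can be approximated, for example, with a low-degree polynomial fit. A minimal sketch assuming NumPy (the sine-shaped target below is only an illustrative choice, not the function in the figure):

```python
import numpy as np

# Synthetic nonlinear data: a sine curve plus noise (illustrative choice)
rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)

# Approximate the nonlinear relation with a cubic polynomial
coeffs = np.polyfit(x, y, deg=3)
print(f"predicted y(1.5) = {np.polyval(coeffs, 1.5):.3f}")
```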
Non-linear Regression
Figure-3

Training and Predicting
During training with the known targets, the learning model develops a hypothesis that approximates the true mapping between the input variables and their target labels or values. The trained model can then be tested on future incoming data for which no target outputs are available.

Important supervised machine learning algorithms
A list of functional categories of supervised methods is given below, followed by more detailed descriptions.

Statistical learning algorithms
Statistical learning algorithms rely on Bayesian learning. 


Figure-4:

The simplest statistical machine learning model is the naive Bayes model. Another variation is the Averaged One-Dependence Estimators (AODE). 

Artificial Neural Network based algorithms

These are connectionist models. Examples are Single-layered Networks (Perceptrons), Multi-layered Networks, Deep Neural Networks, Radial Basis Function Networks, etc.


Figure-5:

Deep Learning algorithms
Convolutional Neural Networks (CNNs), Recurrent Neural Networks, Recursive Neural Networks.



Figure-6:

Support Vector Machines - SVMs
These methods are based on maximizing the margins between the classes of data, thereby improving generalization performance.

Figure-7:

Logic based algorithms
Decision Trees, Random Forests, Classification and Regression Trees (CART), Iterative Dichotomiser 3 (ID3), Chi-squared Automatic Interaction Detection (CHAID), etc.

Figure-8:

The learning methods above are sometimes called eager learners; they take more time for training and less time to predict on new input data.

The other type of methods learns to recognize new data based on related data in a stored training database. This type of learning is also called memory-based or instance-based learning. These methods take less time to learn but more time to predict, and hence are sometimes called lazy learners. Instance-based algorithms are of this type.


Instance Based Algorithms (Memory Based methods) –
Nearest Neighbor and k-Nearest Neighbor classifiers; Case-based reasoning methods are similar to k-NN.

They classify new data by analyzing similar data stored in memory and ignoring stored data that is different. They solve new problems based on the solutions of similar past problems.

Nearest Neighbor


Figure-9:

k Nearest Neighbor 

Figure-10:

All the above approaches assume that the input feature vector data are of fixed dimensionality.

The k-Nearest Neighbours algorithm does not build a model of the labelled training data, but instead uses the labelled dataset directly each time a prediction is required. Such algorithms are called transductive algorithms, and the k-Nearest Neighbours algorithm is a classical example.

The Bayesian Methods
These methods are based on statistical decision theory and rely on Bayes' theorem.

Statistical learning algorithms are characterized by an explicit underlying probability distribution, which provides the probability of an observed data point belonging to a certain class. Observed data is evidence of a random variable belonging to a certain distribution. There are several types of discrete as well as continuous probability distributions that define the randomness of the generated data. A few examples of distributions for discrete-valued variables are the Binomial, Bernoulli, Poisson and Multinomial distributions. A few examples of distributions for continuous-valued variables are the Uniform, Gaussian, gamma, beta, log-normal, Weibull, exponential, Erlang, Chi-squared and Student’s t distributions. For more details refer to the blogpost Probability and Statistics for Machine Learning and Artificial Intelligence Part-1.

Statistical learning methods range from simple calculation of averages to the construction of complex models such as Bayesian networks and neural networks. They have applications throughout computer science, engineering, neurobiology, psychology, and physics. 

Bayesian learning methods formulate learning as a form of probabilistic inference based on Bayes’ theorem. The Bayesian learning technique incrementally updates our prior belief by computing the posterior probability P(H|D) of the hypothesis H whenever new data evidence D is observed.
It relies on the fundamental result of probability theory known as Bayes’ theorem. Bayes’ theorem says that the updated P(H|D) is computed by multiplying the likelihood P(D|H) of the data occurring given the hypothesis by the prior probability of the hypothesis P(H), and dividing the product by the probability of the data P(D). The posterior and likelihood probabilities are conditional probabilities. P(H) is our prior belief about the hypothesis, and the observed data D is our evidence.
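In symbols, this update rule is:

```latex
P(H \mid D) \;=\; \frac{P(D \mid H)\, P(H)}{P(D)}
```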

Bayes’ theorem is one of the most important rules of probability theory used in data science. It provides a way to update our beliefs and inferences based on the observation of new events; in probability and statistics, Bayes’ theorem plays a role comparable to that of the Pythagorean theorem in geometry. The method finds applications in a wide range of areas such as finance, science, law, medicine, engineering, sports and philosophy.

Assume that we observe a multi-dimensional feature vector “s” for a two-class pattern recognition problem. We can derive the posterior probability of the class label given the observed data. It is called the posterior since it computes the probability that the given data belongs to class y1. This is expressed through the conditional probability function below.
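Applying Bayes’ theorem with the class label y1 as the hypothesis and the feature s as the evidence:

```latex
P(y_1 \mid s) \;=\; \frac{P(s \mid y_1)\, P(y_1)}{P(s)}
```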


Fruit – Box Problem
Figure-11:

P(y1|s) is read as the posterior probability of class label “1” given the feature s. The probability P(s|y1) is the likelihood of observing the data within class y1. P(y1) measures our prior belief about class y1 and is called the prior probability of y1. P(s) is the marginal probability of the data feature “s”; it is called marginal since it is computed by marginalizing out the class label in the joint probability of data and class.

Correspondingly, for the class label “2”:
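By the same application of Bayes’ theorem:

```latex
P(y_2 \mid s) \;=\; \frac{P(s \mid y_2)\, P(y_2)}{P(s)}
```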


For the two-class problem, the classification rule as per Bayesian decision theory can now be stated as:

If P(y1|s) > P(y2|s), then s is classified as belonging to y1.
If P(y1|s) < P(y2|s), then s is classified as belonging to y2.


Probabilistic separation of two classes
Figure-12:

Venn diagrams of the two classes; the red colour marks the region of erroneous prediction.
Figure-13:


Bayesian Networks and DAGs

Bayesian methods are implemented with Bayesian networks, which are Directed Acyclic Graphs (DAGs). A DAG is a graph with directed links and no cycles. A Bayesian network essentially consists of random variables and a graph structure that encodes the dependencies between the variables.

Implementation of the Bayesian Rule
Figure-14:

Supervised learning of the Bayesian Network
Training a Bayesian network involves learning the relationships between its computational elements from labelled data. Inference in the network involves computing the probabilities of the class variables once the feature data variables are observed. During training, the true class eventually comes to dominate the class prediction for the observed data; this is characteristic of Bayesian learning.

The decision rule for a multiclass problem
This method can be used for multiclass prediction as well. The decision criterion is now stated as:

Assign s to class yk if gk(s) > gj(s) for all j ≠ k.
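Here gk(s) denotes the discriminant function for class yk; when gk(s) is taken to be the posterior probability (or any monotone function of it), the rule amounts to choosing the class with the largest posterior:

```latex
\hat{y} \;=\; \arg\max_{k}\; P(y_k \mid s)
```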

Naïve Bayes

Naïve Bayes is a very simple supervised statistical machine-learning algorithm based on Bayes’ theorem. The classifier learns to discriminate objects based on their features.

Naive Bayes methods are a set of supervised learning algorithms that apply Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. They are called Naïve Bayes classifiers since they rely on the naive assumption that all measured features (attributes) are independent of each other; independence means that a random variable is unaffected by all the other variables. (The AODE method was introduced to overcome this limitation.)
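Under this independence assumption the posterior factorizes, so for features x1, …, xn and class y:

```latex
P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```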

Given the posterior probabilities for class labels “j” and “k” above, the larger probability indicates the class to which the feature “s” belongs.

Naïve Bayes finds application in spam filtering, data classification, etc. An advantage of the Naive Bayes method is that it requires less training than other methods.


Conditional Independence
Assume a simple example in which persons B and C are classmates. The following figure illustrates the concept of conditional independence.
Figure-15:

If there is a road block, represented by the variable A, they both come to class late, and the two events depend only on A. However, if the roads are clear and both B and C still come late, are these two events dependent?
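Formally, B and C are conditionally independent given A when the joint conditional probability factorizes:

```latex
P(B, C \mid A) \;=\; P(B \mid A)\, P(C \mid A)
```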


A Gaussian Naïve Bayes (NB) classifier can be developed in Python from scratch through the following steps.
Consider that we use the Iris flower data set.
  • Step 1: Write the code to summarize the dataset by calculating feature-wise mean values and variances.

The Iris data set

The Iris flower dataset was used by R. A. Fisher in his classic 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. The task is to predict the flower species given the measurements (feature data) of iris flowers. It was the botanist Edgar Anderson who catalogued the attributes of the Iris flowers to study their variation.

It is a multiclass classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows: 
  • Sepal length in cm. 
  • Sepal width in cm. 
  • Petal length in cm. 
  • Petal width in cm. 
  • Class (known a priori).
There are 4 features and one target, which is categorical since we have 3 classes (Setosa, Versicolor and Virginica, encoded as 0, 1 and 2).
Figure-16:
Further steps (a code sketch covering all five steps follows this list):
  • Step 2: Summarize the data by class, calculating feature-wise mean values and variances for each class. 
  • Step 3: Code the Gaussian probability density function. 
  • Step 4: Determine the class probabilities using Bayes’ theorem. 
  • Step 5: Write the code to separate the classes.
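A minimal from-scratch sketch of these five steps, assuming only NumPy plus scikit-learn’s load_iris helper for fetching the data (helper names such as summarize_by_class and gaussian_pdf are illustrative, not from any particular library):

```python
import numpy as np
from sklearn.datasets import load_iris

def summarize_by_class(X, y):
    """Steps 1-2: per-class, feature-wise means and variances."""
    return {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0))
            for c in np.unique(y)}

def gaussian_pdf(x, mean, var):
    """Step 3: Gaussian probability density, evaluated feature-wise."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict(x, summaries, priors):
    """Steps 4-5: class scores via Bayes' theorem (the common factor P(s) is dropped)."""
    scores = {c: priors[c] * np.prod(gaussian_pdf(x, mean, var))
              for c, (mean, var) in summaries.items()}
    return max(scores, key=scores.get)

X, y = load_iris(return_X_y=True)
summaries = summarize_by_class(X, y)
priors = {c: np.mean(y == c) for c in np.unique(y)}

preds = np.array([predict(x, summaries, priors) for x in X])
print("training accuracy:", np.mean(preds == y))
```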

Artificial Neural Networks (ANNs) and Deep Learning – Single-layered (Perceptrons), Multi-layered and Recurrent Networks

Artificial Neural Networks are coarse and abstract mathematical models of biological neural networks. They aim to emulate the structure and functions of the human brain. ANNs have an input layer, an output layer and possibly hidden layers, and each layer can have several nodes. The input layer is generally not counted as a layer; therefore a single-layer network has an input layer and an output layer, while a two-layer network has an input layer, one hidden layer and an output layer.

A deep neural network (DNN) is an ANN with two or more hidden layers between the input and output layers. ANNs with fewer than two hidden layers are called non-deep or shallow networks. DNNs can model complex non-linear relations between input data and target outputs. DNNs perform better than shallow networks when the input data is high dimensional, and they show superior performance as the data size grows large compared with other learning models such as decision trees and support vector machines. For example, shallow networks are not feasible models for facial recognition problems, for which deep learning methods are necessary.


The Biological Neuron
Every biological neuron has four basic elements: the soma, dendrites, axon and synaptic terminals.



Figure-17:

McCulloch-Pitts model

The McCulloch-Pitts (MP) model is a very coarse and simplistic mathematical approximation of a biological neuron.



Simplest Perceptron
Figure-18:

In the general form the input signal s is a vector of dimensionality I, and there can be N learning examples. The set of all possible input vectors is known as the input space.


Graphical Representation of Training data


Figure-19:

The architecture of a network is closely related to the training algorithm used to train it. Based on the feedforward and feedback connections, network architectures are classified into three classes.

Single-Layer Feedforward network: In a single layer feed forward network, there is an input layer of source nodes and an output layer.

The architecture is illustrated in the figure below.

Figure-20:


Multi-layered Feedforward Networks: This class of networks consists of multiple layers of computational units.
Figure-21:

Recurrent Networks: The other type of neural network architecture is the recurrent neural networks. Recurrent networks contain feedback connections between the output layer and inner layers.



Figure-22:

Other types of ANNs
There are other types of ANN architectures, such as Radial Basis Function Networks, Self-Organizing Maps (a.k.a. Kohonen networks), Hopfield Networks and variations like Boltzmann machines, and deep learning architectures such as Convolutional Neural Networks, Gated Recurrent Networks, Long Short-Term Memory networks (LSTMs), Restricted Boltzmann Machines, Deep Belief Networks, Auto-Encoders, Variational Auto-Encoders and Generative Adversarial Networks. For an overview see Figure-6. These topics will be discussed in later blogs.


Training the ANNs
Artificial Neural Networks make use of an iterative learning algorithm.

Perceptron

A single-layered perceptron with a single neuron is the simplest artificial neural network.

The neuron computes the sum of its weighted inputs (a linear combination). This summation, including the bias value, is called its activation. The activation is transformed into the neuron output by a nonlinear transfer function called the activation function.
All perceptron networks use the McCulloch-Pitts model with a threshold output function as the activation function.

Training the Perceptron

Perceptron learning algorithm

The output values are compared to the target values, resulting in an error function or cost function. The optimal weight values are determined by iteratively updating the weights according to the perceptron learning algorithm. The network thus learns to model the mapping between the input data and the target outputs.
The perceptron learning algorithm has a simple interpretation, as follows. Let the training patterns be presented in a stochastic manner, one at each iteration step. For each pattern s(m) we evaluate the perceptron output. If the pattern is correctly classified, the weight vector remains unchanged.

For class C1 (t(m) = +1), if the pattern is misclassified:
we add a fraction of the input to the weight vector.


For class C2 (t(m) = -1), if the pattern is misclassified:
we subtract a fraction of the input from the weight vector.
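A minimal sketch of this update rule, assuming NumPy; the learning rate, number of epochs and toy data below are illustrative choices, not values from the text:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Perceptron learning: add/subtract a fraction of the input on mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, t):          # targets are +1 (class C1) or -1 (class C2)
            y = 1 if (w @ x + b) >= 0 else -1
            if y != target:                  # misclassified pattern: move the weights
                w += eta * target * x        # add a fraction for C1, subtract for C2
                b += eta * target
    return w, b

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, t)
print("weights:", w, "bias:", b)
```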


A simple application example
Classification of two persons using voice recognition
  • Task: learn to discriminate between two different voices saying “Hello”.
  • Data
    • Sources: Steve and David.
    • Format: frequency distribution (60 bins).
    • Analogy: the cochlea.

Frequency distribution of speech waveform
Figure-23:


  • Network architecture: a feedforward network with
    • 60 inputs (one for each frequency bin)
    • 6 hidden units
    • 2 outputs (0-1 for “Steve”, 1-0 for “David”)

Classification



Classification of Steve
Figure-24:



Classification of David
Figure-25:

Logistic Regression for classification
Logistic regression is a regression analysis model that can be used for probability prediction and binary classification. It makes use of the sigmoid function: the value of the dependent function lies in the interval [0, 1], while the independent variable can range from negative infinity to positive infinity (-∞, ∞).
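Concretely, with the logistic sigmoid applied to a linear combination of the inputs, the predicted class-1 probability is:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b)
```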

Logistic Regression of two classes
Figure-25:

It is widely used for data mining applications, classification, detection and so on in diverse fields.
               
Single layer network with linear neurons
Adapting the weight of a linear neuron

Linear neuron networks can be used to construct linear adaptive filters.
The output of the neuron is a function f(.) of the activation xj (for a general artificial neuron this function is nonlinear; for a linear neuron it is the identity). The activation function f(.) is also sometimes called the transfer function of the artificial neuron.

A single neuron with multiple inputs and bias
Figure-26:



The Least Mean Squares (LMS) algorithm, also known as the Delta rule or the Widrow-Hoff rule, is the basis for implementing linear adaptive filtering.

The output values are compared to the target values, resulting in an error function or cost function. The optimal weight values are determined by iteratively updating the weights with a gradient descent algorithm that minimizes the error function over the training data. The network thus learns to model the mapping between the input data and the target outputs.


Gradient Descent Algorithm

The gradient descent (steepest descent) algorithm and the perceptron learning rule


According to the gradient descent algorithm, the weight update at the mth iteration is


where the total error due to misclassifications of the perceptron at the mth iteration, a function of w(m), is given by the summation below,

and η is called the learning rate and g(m) is the gradient of the error function at the mth iteration. The total error function may also be called the cost function.
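For reference, a standard form of this update (written with the same symbols, where E denotes the total error function) is:

```latex
\mathbf{w}(m+1) \;=\; \mathbf{w}(m) \;-\; \eta\, \mathbf{g}(m),
\qquad
\mathbf{g}(m) \;=\; \nabla_{\mathbf{w}} E(\mathbf{w}) \big|_{\mathbf{w} = \mathbf{w}(m)}
```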

Applying the gradient descent rule, the weights of the individual synapses are updated as follows:



It is called gradient descent since the direction of the incremental updates is opposite to the direction of the gradient. The following figure shows gradient descent in a 1D weight space.




Gradient descent in 1D weight space
Figure-27:

Offline / Full-Batch learning, or learning by Epoch
In off-line (batch) learning, the weight update is performed only after the presentation of the entire set of data pairs, i.e., only after an epoch.

The steps involved are:
  1. Initialize the weight and bias parameters. 
  2. For every iteration, compute the error for all N data points and sum them up. 
  3. Compute the gradient of the summed error. 
  4. Update the weight and bias values. 
  5. Repeat steps 2 to 4 until the error is minimized.
The summed-up error, also called the expected error over the entire training set at any iteration step, is


where e²(m) is the squared error for each data presentation. This is the expected squared-error function over the entire data. The procedure has a smoothing effect on the gradient descent over the error surface: in batch learning the true gradient is computed at every iteration “m”, so the steepest-descent step at that iteration reduces the error.

Online learning / Stochastic Gradient Descent learning, or Pattern-by-Pattern learning
When the training pairs are presented pattern by pattern, at each iteration step “m” the training samples are presented one by one.
The error function in this case is defined as


                
The steps involved are:
  1. Initialize the weight and bias parameters. 
  2. For every iteration, compute the error for one individual data point. 
  3. Compute the gradient of this error. 
  4. Update the weight and bias values. 
  5. Repeat steps 2 to 4 for the entire data set. 
  6. Repeat steps 2 to 5 until the error function is minimized.
Mini-Batch learning
A disadvantage shared by stochastic gradient descent and full-batch gradient descent is that convergence can be very slow. A way to speed up training with the gradient descent method is to update the weight values after summing the errors over smaller subsets (mini-batches) of the training data samples. The weights are updated after summing the squared errors over each mini-batch; hence this method is called mini-batch training. A code sketch comparing the three modes follows the step list below.

The error function is defined over the mini-batch of data, where the number of data points is MB.

The steps involved are:
  1. Initialize the weight and bias parameters. 
  2. For every iteration, select a small data subset (mini-batch). 
  3. Compute the error for the MB data points in the mini-batch and sum them up. 
  4. Compute the gradient of the summed error. 
  5. Update the weight and bias values. 
  6. Repeat steps 2 to 5 until the error is minimized.
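A minimal sketch of training a single linear neuron (LMS) under all three modes, assuming NumPy: setting batch_size to N gives full-batch learning, 1 gives stochastic (pattern-by-pattern) learning, and an intermediate value such as 16 gives mini-batch learning. The data, learning rate and epoch count are illustrative:

```python
import numpy as np

def train_linear_neuron(X, t, batch_size, eta=0.1, epochs=100):
    """Gradient descent on the squared error of a single linear neuron."""
    rng = np.random.default_rng(0)
    w, b = np.zeros(X.shape[1]), 0.0
    N = len(X)
    for _ in range(epochs):
        order = rng.permutation(N)                  # present patterns in random order
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            e = X[idx] @ w + b - t[idx]             # errors for this (mini-)batch
            w -= eta * (X[idx].T @ e) / len(idx)    # gradient of the mean squared error
            b -= eta * e.mean()
    return w, b

# Toy data: t = 3*x1 - 2*x2 + noise (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
t = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

for bs in (len(X), 1, 16):                          # full batch, stochastic, mini-batch
    w, b = train_linear_neuron(X, t, batch_size=bs)
    print(f"batch_size={bs:3d}  w={np.round(w, 2)}  b={b:.3f}")
```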
Comparison of error gradient paths in 2D weight space
Figure-28:

Support Vector Machines

Support-vector machines (SVMs, sometimes also called support-vector networks) are non-probabilistic supervised learning models, with associated learning algorithms, that analyse data for classification and regression.

Support Vector machines for classification

SVMs belong to a family of generalized linear classifiers and can be interpreted as an extension of the perceptron. They can also be considered a special case of Tikhonov regularization. SVMs have the capability to simultaneously minimize the classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. An SVM is a classifier that separates a region of space using a maximum-margin hyperplane.

Canonical hyper planes and Optimal hyper planes
Figure-29:

The above figure illustrates the canonical hyperplanes w·x + b = +1 and w·x + b = −1 and the optimal hyperplane w·x + b = 0. The parameters of the maximum-margin (optimal) hyperplane are derived by solving an optimization problem. There exist several specialized algorithms for quickly solving the quadratic programming (QP) problem that arises from SVMs.


Support vector classifiers are essentially linear classifiers that learn to separate data into two classes. However, in addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. Linearly inseparable data in the low-dimensional space can be projected into a higher-dimensional space where it is linearly separable by a hyperplane; this hyperplane corresponds to a non-linear hypersurface in the original lower-dimensional space.
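A minimal sketch of a kernelized SVM classifier, assuming scikit-learn; the RBF kernel and the C and gamma settings below are illustrative defaults rather than tuned values:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional feature space
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```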

SVM in higher dimensional space
Figure-30:

A popularly used kernel function is the Gaussian radial basis function. Kernels are valid only if they fulfil certain non-stringent conditions, given by Mercer’s conditions.

Multiclass SVMs
The support vector machine is fundamentally a two-class classifier. In practice, however, we often have to tackle problems involving K > 2 classes.


Multiclass SVM
Figure-31:


From the figure above it can be observed that multiclass separation is similar to binary classification. Multiclass separation can be done by taking the training samples from a class y as positive examples and the data from the remaining K − 1 classes as negative examples, and training one such model per class. This is known as the one-versus-the-rest approach.


Figure-32:

Another approach is to train K(K − 1)/2 different two-class SVMs on all possible pairs of classes, and then to classify test points according to which class receives the highest number of ‘votes’; this approach is sometimes called one-versus-one.
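Both strategies are available as generic wrappers in scikit-learn; a small illustrative sketch:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # K = 3 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # K binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # K(K-1)/2 binary SVMs

print("one-vs-rest accuracy:", ovr.score(X, y))
print("one-vs-one accuracy:", ovo.score(X, y))
```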



Figure-33:

SVM for Regression

The support vector method can also be applied to regression problems through the introduction of an appropriate loss function. Support vector regression (SVR) was introduced in 1996. The figure shows a regression function learned by SVR using the epsilon-insensitive tube.
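A minimal scikit-learn sketch of epsilon-insensitive support vector regression; the epsilon and C values below are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1D regression data (illustrative)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Errors smaller than epsilon fall inside the insensitive tube and are not penalized
reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)
reg.fit(X, y)

print("prediction at x = 2.5:", reg.predict([[2.5]])[0])
```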


SVR with epsilon insensitive tube
Figure-34:

Bayesian SVM


In 2011 it was shown by Polson and Scott that the SVM admits a Bayesian interpretation through the technique of data augmentation. In this approach the SVM is viewed as a graphical model (where the parameters are connected via probability distributions). This extended view allows the application of Bayesian techniques to SVMs, such as flexible feature modeling, automatic hyperparameter tuning, and predictive uncertainty quantification. Recently, a scalable version of the Bayesian SVM was developed by Wenzel et al. enabling the application of Bayesian SVMs to big data.

Support Vector method for clustering

When data are unlabelled, supervised learning is not possible and an unsupervised learning approach is required, one which attempts to find a natural clustering of the data into groups and then maps new data to these groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabelled data; it is one of the most widely used clustering algorithms in industrial applications.

Decision trees

The Bayes classifier, artificial neural networks and SVM methods considered so far for pattern recognition problems are based on feature vectors of real or discrete-valued numbers. In all such cases there exists some notion of a metric, i.e., a natural measure of distance (similarity/dissimilarity) between the feature variables. However, pattern recognition problems can also involve qualitative data, which may be nominal, categorical or ordinal data or a list of attributes. Such features are not measurable in the sense of a metric.
Decision trees are directed networks that build classification or regression models in the form of a tree structure. A decision tree is a flowchart-like hierarchical structure with several decision nodes and leaf nodes connected by directed links or branches.
When the terminal nodes predict a label or category for given data, the tree is called a classification tree. When the terminal nodes predict a real-valued number, or equations that yield a specific output value, the tree is called a regression tree.

The topmost decision node, called the root node, is linked to descendant decision nodes. The process of decision making starts at the root node and progresses recursively through internal nodes until a terminal or leaf node (a node without children) is reached. The nodes between the root node and the leaf nodes are called internal nodes.

Decision trees learn if-then rule sets that are mutually exclusive and exhaustive. The rules are learned sequentially from the training data, one at a time, and the process continues on the training set until a termination condition is met. The tree is constructed in a top-down, recursive, divide-and-conquer manner.

The 20-questions game is an example of sequential decision making, which is useful for classification with nonmetric features. Every question is answered with true/false, yes/no, or a value (property) from a set of possible nonmetric values.



Structure of  Decision Tree
Figure-35:

The number of splits at a node depends on how many decision outcomes that node has. For binary-valued splits there are only two branches; for multivalued splits there are more than two. The number of splits may vary throughout the tree.

The basic method of the training algorithm (a sketch of the impurity-based split selection follows the list):
1.      Select the best feature at every decision node using an impurity measure or attribute-selection measure.
2.      Split the data set at the decision node using the selected feature's value.
3.      Progress recursively until a terminal leaf node is reached, i.e., until the data at the node is pure (all of one class), no attributes remain, or no data remains.
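A minimal sketch of the impurity-based split selection, assuming NumPy; the helper names gini and best_split and the toy data are illustrative:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Pick the (feature, threshold) pair that minimizes the weighted child impurity."""
    best = (None, None, np.inf)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, thr, score)
    return best

# Tiny illustrative data set: the second feature separates the classes perfectly
X = np.array([[2.0, 1.0], [3.0, 1.2], [2.5, 3.0], [1.0, 3.3]])
y = np.array([0, 0, 1, 1])
print("best (feature, threshold, weighted impurity):", best_split(X, y))
```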

Pros

  • Decision trees are easy to interpret and visualize.
  • They can easily capture non-linear patterns.
  • They require less data pre-processing from the user; for example, there is no need to normalize columns.
  • They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection.
  • Decision trees make no assumptions about the data distribution because of the non-parametric nature of the algorithm.

Cons

  • Sensitive to noisy data; they can overfit noisy data.
  • Small variations (variance) in the data can result in a different decision tree; this can be reduced by bagging and boosting algorithms.
  • Decision trees are biased on imbalanced datasets, so it is recommended to balance the dataset before building the tree.


The classification accuracy of rule-based learning algorithms (such as decision trees) can be improved by combining several classifiers.


Multiclass classification with Decision Tree
Figure-36:

Memory Based Learning
In memory-based learning a set of past experiences {(s_n, t_n)}, n = 1, …, N, is stored as input-output pattern pairs, where s_n denotes the nth sample of the input vector and t_n the corresponding desired response. The set of input-output pairs represents an association between a set of input vectors and the desired output vectors. The desired response (target) may also be a scalar quantity t_n. When a test vector s_test that is not included in the stored patterns is presented, it is classified into one of the labelled groups based on its local neighborhood.

Therefore all memory-based learning algorithms involve two criteria:
  1. The criterion used for defining a local neighborhood of the test vector s_test.
  2. The learning rule applied to the training examples in the local neighborhood of s_test.

Nearest Neighbor rule
Let the training data set, denoted D = {s_1, s_2, …, s_N}, consist of N labelled prototypes, and let s_test be an unknown vector. The local neighborhood of s_test is defined as the region that lies in its immediate vicinity. The memory-based learning rule is then as follows.

A vector s' in D is considered the nearest neighbour of s_test if

d(s', s_test) = min_n d(s_n, s_test),

where

d(s_n, s_test) = || s_n − s_test ||

is the Euclidean distance between s_n, n = 1, 2, …, N, and s_test.

The label associated with the prototype at minimum distance is reported as the classification of s_test, i.e., the nearest neighbour’s class is assigned to s_test.

The response (output) of the classifier for the input s_test is the same as its response for that nearest neighbour.
The above rule is independent of the underlying distribution responsible for generating the training vectors. An example of a memory-based classifier that performs classification using stored pairs of data is the radial basis function network.

k-nearest neighbour classifier
A variant of the nearest neighbour classifier is the k-nearest neighbour classifier.
The learning rule is as follows (a code sketch follows the list):
  1. For some integer value of k, identify the k labelled prototype patterns that lie nearest to the test vector s_test.
  2. Assign the test vector to the class most frequently represented among the k nearest neighbours of s_test.
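A minimal from-scratch sketch of this rule, assuming NumPy; the function name knn_predict and the toy prototypes are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(s_test, prototypes, labels, k=3):
    """Classify s_test by majority vote among its k nearest stored prototypes."""
    distances = np.linalg.norm(prototypes - s_test, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                       # indices of the k closest
    return Counter(labels[nearest]).most_common(1)[0][0]      # most frequent class label

# Toy stored prototypes with labels (illustrative)
prototypes = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1, 1])

print("predicted class:", knn_predict(np.array([4.5, 5.0]), prototypes, labels, k=3))
```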
Before and after classification with k-nearest neighbors
Figure-37:



References:
  1. Simon Haykin, Neural Networks, Pearson Education Asia (2001), 2nd edition. 
  2. Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, 4th edition. 
  3. Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification, 2nd edition. 
  4. Kevin P. Murphy, Machine Learning: A Probabilistic Perspective. 
  5. Christopher Bishop, Pattern Recognition and Machine Learning.



Figure Credits:


  1. Figure-2: wikimedia.org. 
  2. Figure-3: blogs.mathworks.com. 
  3. Figure-5: upload.wikimedia.org. 
  4. Figure-6: hackernoon.com. 
  5. Figure-7: agdal1125.github.io. 
  6. Figure-8: www.sfu.ca. 
  7. Figure-9: slideplayer.com. 
  8. Figure-10: pbs.twimg.com. 
  9. Figure-16: thegoodpython.com. 
  10. Figure-17: teaching.ncl.ac.uk. 
  11. Figure-21: qph.fs.quoracdn.net. 
  12. Figure-22: ailabpagecom.files.wordpress.com. 
  13. Figure-27: i0.wp.com/kraj3.com.np. 
  14. Figure-29,30: upload.wikimedia.org. 
  15. Figure-31: Multiclass classification problem solved by multiclass support vector machines, researchgate.net. 
  16. Figure-33: Binary and multiclass strategies for machine learning, slidesharecdn.com. 
  17. Figure-34: miro.medium.com. 
  18. Figure-35: static.javatpoint.com
