Modeling Threshold Logic Neural Networks: McCulloch-Pitts model and Rosenblatt’s Perceptrons


Biological neurons exhibit a spiking output behavior in response to their inputs (refer to Figures 10 and 11 of "Essential biology and electrochemistry for Deep Learning"). Input signals from the dendrites collectively raise the internal electric potential of the cell. Once this internal potential reaches or crosses the threshold value, the neuron fires. This is a nonlinear behavior.

During the formative years of ANNs (1943-1958), McCulloch and Pitts introduced the idea of neural networks as computing machines. Frank Rosenblatt in 1962, and later Minsky and Papert in 1988, developed a large class of artificial neural networks called Perceptrons.

McCulloch-Pitts model

Warren Sturgis McCulloch, a neurophysiologist, and Walter Pitts, a logician, published the paper “A logical calculus of the ideas immanent in nervous activity” (1943), which introduced the simplest artificial model of the biological neuron. McCulloch and Pitts modeled the nonlinear behavior with a threshold logic function. The model had both excitatory and inhibitory inputs and a bias. The output of this model could only take binary values.


Figure-1:

Mathematical model of biological neuron 




Figure-2:

McCulloch-Pitts Model

 

The model was so simplistic that its weights and bias were fixed. Excitatory and inhibitory input synapses had positive and negative weight values respectively. These values could not be adjusted but had to be determined analytically. However, this simple model had substantial computing potential: any task that can be represented by Boolean logic functions could be modeled by a network of MP neurons. The MP neuron is also called a threshold gate, since its output can take only logical states, as shown in the figure above. A network of these neurons could represent a finite state automaton.
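As a small illustration (a minimal sketch, not taken from the original paper; the names mp_neuron, AND, OR and NOT are illustrative), the following Python snippet shows how an MP threshold gate with analytically chosen weights and threshold realizes elementary Boolean functions:

```python
# Minimal sketch of a McCulloch-Pitts (MP) threshold gate.
# Weights and threshold are fixed (chosen analytically), not learned.

def mp_neuron(inputs, weights, threshold):
    """Fire (output 1) if the weighted sum of inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Boolean AND: both excitatory inputs must be active (threshold = 2).
AND = lambda x1, x2: mp_neuron([x1, x2], weights=[1, 1], threshold=2)

# Boolean OR: any one excitatory input suffices (threshold = 1).
OR = lambda x1, x2: mp_neuron([x1, x2], weights=[1, 1], threshold=1)

# Boolean NOT: a single inhibitory (negative-weight) input with threshold 0.
NOT = lambda x: mp_neuron([x], weights=[-1], threshold=0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```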

First generation neural networks - Rosenblatt Perceptrons

In 1958, Frank Rosenblatt introduced a major advancement called the Perceptron. Using the McCulloch-Pitts neuron and the findings of Canadian psychologist Donald O. Hebb, Rosenblatt developed the first perceptron. Unlike the MP neuron, Rosenblatt’s perceptron had learnable weight and bias values. The major difference between the Perceptron and the MP neuron is that the perceptron’s weight and bias values can be determined either analytically or by a learning algorithm. Learning is achieved by an iterative process of weight adjustments until the desired output response is produced. Rosenblatt’s perceptrons paved the way for the development of ANNs, which progressed into deep learning and its applications in AI.

The Rosenblatt Perceptron consisted of a single-layer McCulloch-Pitts model neuron which sums up all the weighted inputs and produces a binary/bipolar output using a threshold activation (discriminant) function. The Perceptron with a single discriminating neuron is the simplest form of a neural network, used for classification of patterns that are linearly separable. Such networks and their variations are collectively called Perceptrons.

Rosenblatt proved that if the input data signals (vectors), representing two linearly separable classes drawn from a multidimensional space, are used to train the MP model neuron, the perceptron learning algorithm will converge. The learning algorithm thus determines a discriminant function that defines the decision hyperplane separating the training data into two classes. All perceptron networks use the McCulloch-Pitts model with a threshold output function as the discriminant function. Such networks can therefore be used to solve classification problems.


Figure-3:

Original Perceptron


The original perceptron had a sensory layer at the input, an associator layer (which is hidden) and a response layer at the output. The inputs from the sensor units are binary. The output from the response unit is either binary (0, +1) or bipolar (-1, +1). The network is trained to associate an output with a given set of inputs. The desired output is compared with the actual output and the error is used to adjust the weights of the network.

Figure-4:

Rosenblatt with the Mark I Perceptron

The Mark I was the first computer that could learn new skills by trial and error, using a type of neural network that simulates human thought processes. Rosenblatt’s Perceptron was initially simulated on an IBM 704 computer at the Cornell Aeronautical Laboratory in 1957.

ROSENBLATT’S PERCEPTRON LEARNING RULE (ALGORITHM) AND CONVERGENCE THEOREM

As stated earlier, Rosenblatt’s perceptron (or simply the Perceptron) is built around the McCulloch-Pitts model of a neuron. The goal of a single-layer, single-neuron perceptron is to perform a dichotomy, i.e., to classify the given data correctly into two classes C1 and C2. For a perceptron to learn this dichotomy, the classes must be linearly separable by a hyperplane. The training paradigm adopted by the perceptron is supervised learning.

The procedure for incrementally adjusting the weights of the interconnections is called a learning law or learning algorithm.

In its initial configuration the perceptron is incapable of distinguishing the patterns of interest, since its weights are initialized to random values. Through a training process it can be made capable of distinguishing several patterns.

The perceptron learning rule introduced by Rosenblatt is a typical error correction learning method for a single-layer feedforward network with a bipolar or binary threshold logic activation. The perceptron learns by adjusting its weights in a step-by-step manner over several iterations. For a given input vector the actual output is compared with the desired (target) output, and the error is used to adjust the weight vector. The process is repeated iteratively until the weight values converge or the error becomes zero. The process of learning is described below.

Consider a single-layer, single-neuron perceptron with bipolar inputs which receives its input from an I-dimensional signal vector s. Let the connection strength (weight) vector w from the input to the neuron at the mth iteration be represented by w(m). Let the input data belong to two linearly separable classes C1 and C2, corresponding to the positive and negative classes respectively.

The activation of the output neuron is

x(m) = \sum_{i=1}^{I} w_i(m)\, s_i(m) + \theta(m)

where θ(m) is the bias signal at the mth iteration step.

The output signal y = f(x) is a “hard-limit transfer function” such that if x(m) ≥ 0 then y(m) = +1, and if x(m) < 0 then y(m) = -1 (bipolar) or y(m) = 0 (binary).
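A minimal Python sketch of this activation and hard-limit output for the bipolar case (the function names activation and hard_limit are illustrative):

```python
import numpy as np

def activation(w, s, theta):
    """x(m) = w·s + θ : weighted sum of the inputs plus the bias."""
    return np.dot(w, s) + theta

def hard_limit(x):
    """Bipolar threshold output: +1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1
```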

The learning proceeds as follows (let y = +1 for C1 and y = -1 or 0 for C2)

We assume that our training set contains N samples and at iteration step m we present the data sample s(m).

Case 1: (Network prediction is correct)

If s(m) belongs to C1 then the target t(m) = +1.

Then y(m) = +1 = t(m), since x(m) ≥ 0.

If s(m) belongs to C2 then the target t(m) = -1.

Then also y(m) = -1 = t(m), since x(m) < 0.

In either case the weights are not updated, i.e.,

w(m+1) = w(m)

Case 2: (Misclassified output)

If s(m) belongs to C1 then the target t(m) = +1.

Due to misclassification x(m) < 0.

Then y(m) = -1 ≠ t(m).

Similarly, if s(m) belongs to C2 then the target t(m) = -1.

Due to misclassification x(m) ≥ 0.

Then also y(m) = +1 ≠ t(m).

In this case the weights are updated,

w(m+1) = w(m) + Δw(m)

where Δw(m) is the change in the weight vector needed to eliminate the classification errors.

The perceptron cost function is a function of the weight vector w.

Not all input data samples are misclassified. The total error due to misclassifications is given by the summation

E_p(w) = - \sum_{m \in \mathcal{M}} t(m)\, w^{T} s(m)

where \mathcal{M} denotes the set of misclassified patterns and its size M < N is the total number of misclassified patterns in the training data.

In the above equation the weight vector w is the augmented vector which includes the synaptic weight values from the signal inputs as well as the bias. Similarly, the vector s is the augmented input vector which includes the bias input s0 = 1.

The contribution to the error associated with a particular misclassified pattern is a linear function of w in regions of w space where the pattern is misclassified and zero in regions where it is correctly classified. The total error function is therefore piecewise linear. The learning algorithm proceeds by seeking a weight vector w* such that it minimizes the perceptron criterion. 
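The perceptron criterion above can be sketched in Python as follows (a hedged illustration assuming augmented vectors with s0 = 1; the function name perceptron_criterion is illustrative):

```python
import numpy as np

def perceptron_criterion(w, S, t):
    """E_p(w) = -sum over misclassified patterns of t(m) * (w·s(m)).

    w : augmented weight vector (bias included as w[0])
    S : (N, I+1) array of augmented input vectors, each with s[0] = 1
    t : (N,) array of targets in {+1, -1}
    """
    scores = S @ w                      # w·s(m) for every pattern
    misclassified = t * scores <= 0     # t(m)·x(m) <= 0 marks a misclassification
    return -np.sum(t[misclassified] * scores[misclassified])
```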

Perceptron learning algorithm

The perceptron learning algorithm has a simple interpretation, as follows. Let the training patterns be presented in a stochastic manner, one at each iteration step. For each pattern s(m) we evaluate the perceptron output. If the pattern is correctly classified, then the weight vector remains unchanged. If the pattern is misclassified, the weight vector is updated according to

w(m+1) = w(m) + \eta\, t(m)\, s(m)

where η is called the learning rate.
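A compact Python sketch of this training loop under the stated assumptions (bipolar targets, augmented weight vector with the bias as the first component; names such as train_perceptron and eta are illustrative):

```python
import numpy as np

def train_perceptron(S, t, eta=1.0, max_epochs=100):
    """Rosenblatt perceptron learning rule.

    S   : (N, I) array of input patterns
    t   : (N,) array of bipolar targets (+1 or -1)
    eta : learning rate
    Returns the augmented weight vector (bias first) after convergence
    or after max_epochs passes over the data.
    """
    N, I = S.shape
    S_aug = np.hstack([np.ones((N, 1)), S])          # prepend s0 = 1 for the bias
    w = np.random.randn(I + 1) * 0.01                # small random initial weights
    for epoch in range(max_epochs):
        errors = 0
        for s_m, t_m in zip(S_aug, t):
            y_m = 1 if np.dot(w, s_m) >= 0 else -1   # hard-limit output
            if y_m != t_m:                           # misclassified pattern
                w += eta * t_m * s_m                 # w(m+1) = w(m) + η t(m) s(m)
                errors += 1
        if errors == 0:                              # converged: zero misclassifications
            break
    return w

# Example: learn the linearly separable AND function with bipolar coding.
S = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
t = np.array([-1, -1, -1, 1])
w = train_perceptron(S, t)
print("learned weights (bias first):", w)
```

With bipolar coding the AND problem is linearly separable, so the loop terminates with zero misclassifications after a few epochs.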

Illustration of convergence

In the following illustration the red dots belong to the positive class and the blue dots belong to the negative class. When initialized, the weight vector (colored black) points to the negative side. The misclassified pattern is shown encircled in green.



Figure-5

Illustration on Perceptron convergence

During training, the effect of each update in the perceptron learning algorithm is to reduce the distance between the updated separating hyperplane and the feasible solution. In other words, the updated weight vector moves closer to the set of feasible weight vectors. The order of the figures is top left, bottom left, top right and bottom right. The contribution to the error from a misclassified pattern progressively reduces during the training iterations. Finally, the training stops when the misclassification error reduces to zero.

Rosenblatt proved that the perceptron algorithm will converge after a finite number of iterations if the training patterns are linearly separable.

PERCEPTRON CONVERGENCE THEOREM

The perceptron convergence theorem states that

Given that C1 and C2 are two linearly separable subsets of a training set X, if there exists a feasible weight vector w* that can perform the dichotomy, then the perceptron learning algorithm is guaranteed to find an exact solution w* in a finite number of steps, for any choice of initial weight values.

(The proof of this theorem can be found in References 7 and 8.)

Simplified algorithm when t(m) = +1 or -1

When the patterns are correctly classified

  • do nothing

When the patterns are misclassified:

  • For class C1, add a fraction of the pattern vector to the weight vector.
  • For class C2, subtract a fraction of the pattern vector from the weight vector.

Special case of learning by error correction

This can be seen as a special case of the error correction algorithm, where the learning rate equals 1/2 for bipolar targets and 1 for binary targets. Please read the post "Artificial Neural Networks - Definition and Learning methods for Deep Learning" for more details on the error correction learning rule.
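As a worked illustration of this equivalence (using the standard error-correction update, not a formula taken from the original text):

\Delta w(m) = \eta\,\bigl(t(m) - y(m)\bigr)\, s(m)

For bipolar targets a misclassification gives t(m) - y(m) = \pm 2, so \eta = \tfrac{1}{2} yields \Delta w(m) = \pm s(m); for binary targets t(m) - y(m) = \pm 1, so \eta = 1 yields the same step \Delta w(m) = \pm s(m), i.e., the pattern vector is added or subtracted exactly as in the simplified algorithm above.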

Single Layer Perceptrons with K outputs

With a single-layer, single-neuron perceptron only one discriminant function can be implemented.


 

Figure-6:

Single Layer perceptron with K outputs


A single-layered perceptron with I inputs (an I-dimensional pattern space) and K threshold logic output units can produce K distinct decision boundaries (hyperplanes) in the pattern space.

The gradient descent algorithm and the perceptron learning rule

The perceptron learning algorithm can be derived using the gradient descent algorithm. The gradient descent algorithm updates the weights using a fraction (controlled by η) of the gradient (derivative) of the error function Ep(w(m)) with respect to w(m). The gradient points in the direction of steepest increase of the error. The algorithm is called gradient descent since the direction of the weight update is opposite (negative) to the direction of the gradient.

In the above figure, consider any output neuron indexed by “k” and a single synapse indexed by “i”. Applying the gradient descent rule, the weights of the individual synapses are updated as follows:

w_{ki}(m+1) = w_{ki}(m) - \eta\, \frac{\partial E_p(w(m))}{\partial w_{ki}(m)} = w_{ki}(m) + \eta\, t_k(m)\, s_i(m)

The corresponding vector equation for the weight update is

w_k(m+1) = w_k(m) - \eta\, \nabla_{w_k} E_p(w(m)) = w_k(m) + \eta\, t_k(m)\, s(m)

where

\nabla_{w_k} E_p(w(m)) = - t_k(m)\, s(m) \quad \text{(for a misclassified pattern } s(m)\text{)}
Since the cost function is piecewise linear, it is not continuously differentiable with respect to the weight vector w. Therefore the weight update does not follow a true gradient descent algorithm.

(As an assignment, modify the above equations for the case k = 1.)
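A hedged, vectorized Python sketch of this per-pattern update for a single layer with K threshold output units (the names perceptron_gradient_step, W and eta are illustrative, and W is assumed to be a float array):

```python
import numpy as np

def perceptron_gradient_step(W, s, t, eta=0.1):
    """One gradient-descent-style update for K threshold output units.

    W   : (K, I+1) augmented float weight matrix (bias in column 0)
    s   : (I,) input pattern
    t   : (K,) bipolar targets, one per output unit
    """
    s_aug = np.concatenate(([1.0], s))          # s0 = 1 carries the bias
    x = W @ s_aug                               # activations x_k = w_k · s
    y = np.where(x >= 0, 1, -1)                 # hard-limit outputs
    misclassified = (y != t)                    # units whose prediction is wrong
    # For misclassified units: w_k <- w_k + η t_k s  (negative gradient of E_p)
    W[misclassified] += eta * np.outer(t[misclassified], s_aug)
    return W
```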


Multilayered threshold logic neuron networks and Linear inseparability

Rosenblatt’s perceptrons were the first-generation neural networks. A perceptron with a single neuron is limited to pattern classification with only two classes that are linearly separable, i.e., it can perform a dichotomy (binary classification). By increasing the number of discriminant units at the output, the number of binary classifications can be increased. These outputs can further be used as inputs to a second layer of discriminant units. Neurons in the second layer can combine several linearly separated regions to form convex regions in the input space. Thus, a two-layer perceptron network can solve nonlinear pattern classification problems that require separation of a convex region from the rest of the input space, as the sketch below shows. By adding a third layer of perceptrons, convex regions from the second layer can be combined to form more complex regions, including concave regions. Thus, a three-layer perceptron network can be used to solve arbitrary pattern recognition problems.
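For instance, the XOR problem is not linearly separable, yet two first-layer threshold units can carve out two half-planes whose combination by a second-layer threshold unit reproduces XOR. The sketch below uses hand-chosen weights purely for illustration (it is not a trained network):

```python
def step(x):
    """Hard-limit (threshold) activation: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def xor_two_layer(x1, x2):
    """XOR built from threshold units: (x1 OR x2) AND NOT (x1 AND x2)."""
    h1 = step(x1 + x2 - 0.5)     # first-layer unit: half-plane implementing OR
    h2 = step(x1 + x2 - 1.5)     # first-layer unit: half-plane implementing AND
    return step(h1 - h2 - 0.5)   # second-layer unit combines the half-planes

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_two_layer(a, b))
```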

A multiclass pattern classification problem can be viewed as the problem of determining a set of hyperplanes and combining these hyperplanes to separate out regions in a multidimensional space corresponding to the different classes.


Three-layered Perceptron

A three-layered network with two hidden layers is still more general. There are no convexity restrictions: the layer-3 neurons receive their inputs from a group of convex polygons, and the logical combination of these convex regions need not be convex. This makes it possible to enclose a region of space of any desired arbitrary shape. Thus linearly inseparable (hard) problems can also be solved by a multilayer feedforward network with threshold logic neurons.



Figure-7:

In Figure 7(a), if the output unit in the third layer performs a logical AND operation, i.e., C2 AND C1, it corresponds to the intersection of the two regions. In figure (a) C2 and C1 are disjoint, hence there is no intersection. If however the operation is C2 OR C1, then the output defines either C1 or C2, as shown by the shaded regions. In figure (b) the third layer unit performs the combined function C2 AND NOT C1. This defines a concave region: the shaded area of C2 excluding the area common to C2 and C1.

Figure-8:

Summarizing the capabilities of perceptrons

Thus it can be shown that a three-layered network of hard-limiting neurons, with two hidden layers and one output layer, can perform arbitrarily complex pattern classification tasks. Patterns that are linearly non-separable (hard problems) can be separated by multilayered perceptrons with at least three layers. Examples are shown in the figure above.

References:

  1. Simon Haykin, Neural Networks: A Comprehensive Foundation, 2nd edition, Pearson Education Asia, 2001.

  2. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, 2nd edition, Wiley, 2001.

  3. Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

  4. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

  5. Martin T. Hagan, Howard B. Demuth, Mark H. Beale, Orlando De Jesús, Neural Network Design, 2nd edition.

  6. J. P. Marques de Sá, Pattern Recognition: Concepts, Methods and Applications, Springer, 2001.

  7. Rosenblatt, F., Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books, 1962.

  8. Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press, 1995.

  9. news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon

Image Credits:

 

Figure-1: researchgate.net

Figure-3: Reference-1

Figure-4: reddit.com

Figure-5: Reference-4

Figure-7: Reference-1

Figure-8: Reference-1

 



