Artificial Neural Networks - Definition and Learning methods for Deep Learning

Artificial Neural Networks and their learning methods

There are many definitions of an artificial neural network. Some definitions emphasize its connectionist and parallel structure, others its processing elements and learning algorithms, and still others are based on graph theory.

Neural Computing

Artificial neural networks were developed from the concepts of neural computing and connectionism. Neural computing is a style of computing that models biological neural systems: it is based on learning from experience, as opposed to classical, tightly specified, rule-based algorithms.

Aleksander and Morton define neural computing as

“Neural computing is the study of networks with adaptable nodes which, through a process of learning from task examples, store experiential knowledge and make it available for use.”

Artificial neural networks are models that put neural computing into practice. We therefore use the following definition of ANNs, as given by Simon Haykin.

Artificial Neural Networks – a definition

An artificial neural network is a massively parallel network made up of simple nonlinear processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:

  1. Knowledge is acquired by the network from its environment through a learning process.
  2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

The neural network derives its computing power through its massively parallel distributed structure and its ability to learn and generalize.


Figure-1

Neurons connected in parallel

Learning in Artificial Neural Networks

Learning Process

Learning in the context of a neural network is defined as a process by which the free parameters (synaptic weights and biases) of an ANN are adjusted, or adapted, through stimulation by the environment so as to produce the targeted response. A learning algorithm is a set of well-defined rules that governs the learning process.

Any learning process requires the following:

  1. Stimulation of the network from the environment. This provides the inputs.
  2. The network must produce a response. This is the output.
  3. The free parameters of the network, comprising weights and biases, are adjusted by an iterative process in order to reduce the difference between the output and the targeted response.

Learning algorithms are dependent on the network architecture. The goal of learning algorithms is to generalize beyond the training data. Driven by their input data, artificial neural networks learn by adapting their parameters to optimize an objective function (details of objective functions will follow in later posts).

A learning rule is determined by the manner in which changes in the parameters are implemented.

The first learning algorithm, Rosenblatt's perceptron algorithm (see the previous post, Modeling Threshold Logic Neurons), was inspired by the learning process postulated by the Canadian psychologist Donald O. Hebb (1949). Hebb's postulate is based on the learning behaviour of a biological neuron and is the oldest of all learning rules.

Figure-2

Donald O. Hebb

Hebb’s postulate

“When an axon of neuron cell A is near enough to excite another cell B and repeatedly or persistently takes part in firing it (cell B), some growth process or metabolic change takes place in one or both cells such that cell A’s efficiency as one of the cells firing B is increased.”


 Figure-3

Hebbian synapse

In other words, input synapses that receive a higher frequency of pulses are strengthened, and those receiving lower frequencies are weakened. The strengths of the synapses are therefore continuously modified, and this continuous modification of connection strengths gives rise to the complex process of learning in biological networks.

The basic rules of learning in ANNs can be grouped into five: 1) Hebbian, 2) error correction, 3) memory based, 4) competitive and 5) stochastic.

Learning by association (Hebbian rule) in artificial neural networks

All models of cognition in the human brain use association of patterns in one form or another. The cerebral cortex is full of recurrent connections, and there is solid evidence for Hebbian synapse modification there. Hence the cerebrum is believed to function as an associative memory.

Animals and humans tend to learn by associating things that occur simultaneously, and this led Hebb to postulate his learning rule. Associations between an input stimulus sn and an output response yn can be learned if the stimulus and response pattern pairs occur together frequently. A pair of patterns (sn, yn) is associated if, when the network is stimulated by sn, it produces yn as the response. Once the network is trained, it can be used for pattern recognition and pattern recall.

 

The Hebbian learning rule for the synaptic strength between two neurons can be phrased as a two-part rule.

  1. If two neurons j and k on either side of a synapse fire (generate output) with only a very small delay between them, i.e. almost simultaneously or synchronously, then the strength of the synapse between the two is increased. Equivalently, if the activation of neuron j at a given time can result in the activation of neuron k, then the synaptic weight is increased, so the next time neuron j fires, the probability that neuron k also fires increases.
  2. If two neurons j and k on either side of a synapse fire asynchronously, then the synaptic strength between them is weakened or eliminated.

Figure-4



In other words, the simple Hebbian learning rule for ANNs is stated as

Adjust the weight wkj of the connection between units j and k in proportion to the product of their simultaneous activations.

Such a synapse is known as a Hebbian synapse.

Mathematical Modeling of basic Hebbian Learning

We denote the output of a neuron unit by “y”. According to the Hebbian rule, the weight change at the mth iteration step is given by

Δwkj(m) = F(yk(m), yj(m))

This indicates that the weight change is a function of both the postsynaptic signal yk and the presynaptic signal yj. The simplest form is described by

Δwkj(m) = η yk(m) yj(m)

where η is a positive constant in the range 0 to 1, known as the learning rate. This rule is referred to as the activity product rule.
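To make the activity product rule concrete, here is a minimal Python/NumPy sketch; the network size, the data, and the helper name hebbian_update are illustrative assumptions, not from the original post:

```python
import numpy as np

def hebbian_update(W, y_pre, y_post, eta=0.1):
    """Activity product rule: delta w_kj = eta * y_post[k] * y_pre[j]."""
    return W + eta * np.outer(y_post, y_pre)

# Illustrative example: 3 presynaptic units feeding 2 postsynaptic units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(2, 3))   # synaptic weights w_kj
s = np.array([1.0, 0.0, 1.0])             # presynaptic activities y_j
y = W @ s                                 # linear postsynaptic outputs y_k
W = hebbian_update(W, s, y)               # strengthens co-active connections
```

Note that repeated application of this plain rule makes the weights grow without bound, one of the issues alluded to below.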

Figure-5

Output linear to input

Further discussion of the issues with the Hebbian rule is beyond the scope of this article; see the references for further reading. Hebb's learning rule has both supervised and unsupervised versions.

Learning by error correction (Error-correction rule)

The objective of this learning method is to find the set of weight and bias values, the trainable (free) network parameters, denoted w*, for which the error between the targets and the outputs is minimum. This rule assumes that the error of an output neuron k is directly measurable. Hence this rule is applicable to the synapses of the final output layer of a multi-layered network.

Consider the following figure, where an output neuron k is driven by the input signal vector yh(m) produced by the hidden neurons in the previous layer. In the case of a single-layer network, the input s(m) is provided directly by the environment.

Figure-6

The argument m denotes the iteration step, i.e. the mth discrete time instant. The neurons in the hidden layers are driven by the input vector s(m) from the environment, which is provided by the output of the source nodes in the input layer. The output signal of neuron k in the output layer is denoted by yk(m) and is compared with the desired output, denoted tk(m). The error signal at the mth iteration is thus defined by

ek(m) = tk(m) − yk(m)

The error signal of the kth neuron is used to make corrective adjustments to the synaptic weights from its previous layer, such that the error between the output signal yk(m) and the desired output tk(m) is minimized in a step-by-step manner.

The total output error of the network, E(m), called the loss function (also known as the cost function), is defined by the following equation:

E(m) = ½ Σk ek²(m)

which is the instantaneous value (at the mth instant) of the total error energy (it is a squared error function).

Iterations are repeated until the minimization of the cost function E(m) leads to a stabilization of the synaptic weights, or the system output reaches a steady-state value.

The updated weight equation is given by

wkj(m+1) = wkj(m) + Δwkj(m)

The change in the weight value, Δwkj(m), is proportional to the gradient of the cost function. The total error energy is a function of the weight values w(m), and hence can be denoted E(w(m)). Hence

Δwkj(m) = −η ∂E(w(m)) / ∂wkj(m)

This type of learning rule is commonly known as the gradient descent rule (please see the next post, on the Gradient Descent Algorithm, for further details). It is so called because the updates are made in the direction opposite to the gradient of the error with respect to the weights. This is a derivative-based optimization method. When the value of the learning rate η is small, convergence to the optimal value is slow, but the minimum error achieved is smaller than when large values of η are used. Large values of η can speed up convergence, but may cause oscillations near the minimum error region.

The gradient descent algorithm is a search in the multidimensional weight space for the optimal w* that minimizes the cost function. The error value is reduced in small steps by updating the weight values iteratively; the update of the weight values is proportional to the gradient of the error with respect to the weights, and η is called the learning parameter or step-size parameter.
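As a hedged illustration, the following Python/NumPy sketch applies this error-correction rule to a single linear output layer; the shapes, the data, and the helper name delta_rule_step are assumptions made for the example:

```python
import numpy as np

def delta_rule_step(W, s, t, eta=0.05):
    """One error-correction step for a linear output layer y = W @ s.

    With E(m) = 0.5 * sum_k e_k(m)^2 and e = t - y, the gradient is
    dE/dw_kj = -e_k * s_j, so gradient descent gives W <- W + eta * e s^T.
    """
    y = W @ s                        # output signals y_k(m)
    e = t - y                        # error signals e_k(m) = t_k(m) - y_k(m)
    W = W + eta * np.outer(e, s)     # step opposite to the error gradient
    return W, 0.5 * float(e @ e)     # updated weights and cost E(m)

# Illustrative iteration on a single input/target pair.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(2, 3))
s = np.array([0.5, -1.0, 2.0])
t = np.array([1.0, 0.0])
for m in range(100):
    W, E = delta_rule_step(W, s, t)
print(f"cost after 100 steps: {E:.6f}")   # approaches 0
```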

This simple error correction rule is applicable only for adapting the output-layer synaptic weights, since deviations of the outputs from the target values can be measured only at the final layer. To adapt the neurons in the inner hidden layers we need the backpropagation algorithm.

The steepest descent version of this gradient descent rule searches for the direction in weight space along which the cost function decreases most rapidly, i.e. the direction opposite to the gradient of the cost function.



Figure-7

Descending downhill


Learning by memorizing: Memory-Based Learning (MBL) or Instance-Based Learning (IBL)

Memory-based or instance-based learning (IBL) algorithms store all, or a subset of, the training data set in memory and use the stored data to predict the output for a given input sample (see the figures below). After the pre-processing steps, the analysis of the data is deferred until the moment a prediction for a particular input is needed. The best-known instance-based learning algorithm is the k-Nearest Neighbour (kNN) algorithm.

A set of past experiences, {(sn, tn)}, n = 1, …, N, is stored as input-output pattern pairs, where the vector sn denotes the nth sample of the input vector and tn denotes the corresponding desired response. The set of input-output pairs represents an association between a set of input vectors and the desired output vectors. The desired response (target) tn may also be a scalar quantity. When a test vector stest that is not included in the stored patterns is presented, it is classified into one of the labelled groups by searching through its local neighbourhood.

Therefore all memory-based learning algorithms involve two criteria:

  1. The criterion used for defining the local neighbourhood of a test vector stest.
  2. The learning rule applied to the training examples in the local neighbourhood of stest.

Nearest Neighbour rule

Let the input training data set, called the prototype set, be denoted by D = {s1, s2, …, sN}, consisting of N labelled prototypes. Let stest be an unknown vector; the local neighbourhood of stest is defined as the region that lies in the immediate vicinity of the test vector.

Assume the test vector is closest to some vector s* ∈ D. Then the memory-based learning rule is as follows:

  • The vector s* ∈ {s1, s2, …, sN} is considered to be the nearest neighbour of stest if

min{n=1,…,N} d(sn, stest) = d(s*, stest)

where d(sn, stest) is the Euclidean distance between each vector sn, n = 1, 2, …, N, and stest.

  • The label associated with s*, i.e. the nearest neighbour's class among the prototypes, is reported as the classification of stest.

The response (output) of the network to the input stest will be the same as that of its nearest neighbour. As more samples are added to the stored set, the neighbourhood boundaries change. This rule is independent of the underlying distribution responsible for generating the training vectors.
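A minimal Python/NumPy sketch of the nearest neighbour rule follows; the prototype set D, its labels, and the function name are illustrative assumptions:

```python
import numpy as np

def nearest_neighbour(prototypes, labels, s_test):
    """Return the label of the stored prototype s* closest to s_test
    under the Euclidean distance d(s_n, s_test)."""
    d = np.linalg.norm(prototypes - s_test, axis=1)  # distances to all s_n
    return labels[np.argmin(d)]                      # class of the nearest s*

# Illustrative prototype set D with two labelled classes.
D = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array(["A", "A", "B", "B"])
print(nearest_neighbour(D, y, np.array([0.8, 0.9])))  # prints "B"
```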

k-Nearest Neighbour classifier

A variant of the nearest neighbour classifier is the k-nearest neighbour classifier.

The learning rule is as follows (a short code sketch follows the list):

  1. For some integer value of k, identify the k labelled prototype patterns that lie nearest to the test vector stest.
  2. Assign the test vector to the class that is maximally represented among the k nearest neighbours of stest.
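Extending the previous sketch, a hedged kNN version with majority voting; again, the names and parameters are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(prototypes, labels, s_test, k=5):
    """Assign s_test to the class most represented among its k nearest
    labelled prototypes; k = 1 recovers the plain nearest neighbour rule."""
    d = np.linalg.norm(prototypes - s_test, axis=1)
    nearest = np.argsort(d)[:k]                  # indices of the k closest
    return Counter(labels[nearest]).most_common(1)[0][0]
```

Called with the prototype set from the previous sketch, knn_classify(D, y, np.array([0.8, 0.9]), k=3) again returns “B”.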

Figure-8

Classification of a test vector using its k = 5 nearest neighbours


Some application examples

Image search: typical image features used are texture moments and colour histograms.

 


Figure-9

Classification of a query image

 

Try to do the following:

  1. Given a web page, find the 10 most similar pages on the web.
  2. Find the 3 nearest cities to your dwelling place.

As the size of the training data set approaches infinity and the value of k grows large (while remaining a vanishing fraction of the data), the kNN algorithm becomes Bayes optimal.

(Read the blog post on Supervised Learning for more visual examples.)

Learning by competition (Competitive Learning a.k.a. Kohonen learning)

Like the basic Hebbian rule and certain of its variants, competitive learning is an unsupervised learning method for artificial neural networks. The neurons of the output layer compete among themselves to produce their outputs, i.e. to fire. In a network that uses competitive learning, only a single output neuron wins and is active at any one time.

Figure-10

This method is well suited to discovering the statistical features of a set of input patterns and using those features to cluster the input data, with each output neuron representing a different cluster.

In its simplest form the network has a single layer of output neurons, each of which is fully connected to the input (source) nodes. Figure-11 shows the standard schematic of a single-layer ANN that implements the competitive learning algorithm.

Figure-11

Competitive network

These networks include intralayer feedback connections among the neurons of the output layer. These feedback synapses perform lateral inhibition: the weights of the lateral synapses are negative, and each neuron tends to inhibit the neurons to which it is laterally connected. The feedforward synaptic weights, however, are all positive, i.e. excitatory.

For an output neuron k to be the winning neuron, its activation value xk for a given input s must be the largest among all the neurons in the output layer. The output yk of the winner is then set to 1; the outputs of all the neurons that lose the competition are set to zero:

yk = 1 if xk > xj for all j ≠ k, and yk = 0 otherwise

The activation (induced local field) xk represents the combined action of all the feedforward and feedback inputs to neuron k.

The weights wki from all input nodes i to an output node k are positive and distributed such that

Σi wki = 1 for every output neuron k

According to the standard competitive learning rule, when a neuron receives its inputs si directly from the environment, the change in the weights, Δwki, for the winning neuron is defined by

Δwki = η (si − wki) if neuron k wins the competition, and Δwki = 0 if it loses

If, however, the neuron receives its inputs yi from a previous hidden layer, the rule becomes

Δwki = η (yi − wki) for the winning neuron k

where η is the learning parameter.

This method of weight update is also called the Kohonen learning rule.

It is important to observe that the weight update under the competitive learning rule applies only to the feedforward connections, not to the intralayer connections.

The neuron that wins is called a winner-takes-all neuron; hence this algorithm is also called the winner-takes-all (WTA) algorithm.
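The following Python/NumPy sketch illustrates one winner-takes-all update step under the rule above; the network size, the data, and the normalization scheme are illustrative assumptions:

```python
import numpy as np

def competitive_step(W, s, eta=0.1):
    """One winner-takes-all update: the output neuron with the largest
    activation x_k = w_k . s wins (y_k = 1, all others 0) and moves its
    weight vector toward the input: delta w_ki = eta * (s_i - w_ki)."""
    x = W @ s                      # activations of all output neurons
    k = int(np.argmax(x))          # index of the winning neuron
    W[k] += eta * (s - W[k])       # Kohonen update for the winner only
    return W, k

# Illustrative clustering of 2-D inputs with 3 output neurons.
rng = np.random.default_rng(2)
W = rng.random((3, 2))
W /= W.sum(axis=1, keepdims=True)      # each row initially sums to 1
for s in rng.random((200, 2)):
    s = s / s.sum()                    # normalized inputs preserve the row sums
    W, _ = competitive_step(W, s)
```

Because the update moves the winner a fraction η of the way toward the input, inputs normalized to sum to 1 keep each weight row summing to 1, matching the distribution condition stated above.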


Instar and Outstar Learning

Two types of network configuration can be trained with this learning rule: the instar configuration, proposed by Kohonen, and the outstar configuration, proposed by Grossberg. The WTA algorithm is in fact a modification of the instar algorithm.


Figure-12

 

Instar neurons win by adapting their weights to the inputs received from the previous layer. In the outstar configuration, neurons win by adapting to produce the output values received by the neurons of the subsequent layer. Instar and outstar networks can be connected together to form complex networks.


Learning under uncertainty (Boltzmann Learning)

The Boltzmann machine is a stochastic recurrent neural network introduced by Hinton and Sejnowski in 1983. Its stochastic learning algorithm is derived from the principles of statistical mechanics and is hence named after Ludwig Boltzmann. The Boltzmann machine can be seen as the stochastic, generative counterpart of the Hopfield network. Like the Boltzmann machine, the Hopfield network is a symmetric network, but it has no hidden units; the hidden units are what make Boltzmann machines more powerful than Hopfield networks.


 
Figure-13

Hopfield Network (all nodes are visible)

In the simplest introductory terms, Boltzmann machines are primarily divided into two categories: energy-based models (EBMs) and restricted Boltzmann machines (RBMs).

Figure-14

Boltzmann machine: visible (orange) and hidden (green) units.

A neural network designed on the basis of the Boltzmann learning rule is called a Boltzmann machine. Boltzmann machines are symmetric networks (i.e., wkj = wjk) with hidden layers.


  

Figure-15

The Boltzmann machine is a widely used neural network model for solving difficult combinatorial optimization problems. It can find near-optimum solutions to hard problems such as graph partitioning and the Travelling Salesman problem.

The state of a neuron unit j in a Boltzmann machine is updated asynchronously according to the stochastic activation rule

yj = +1 with probability P(xj), and yj = −1 with probability 1 − P(xj)

where

P(xj) = 1 / (1 + exp(−2xj / T))

with xj the induced local field of unit j and T the pseudo-temperature.


At very low temperatures the output becomes deterministic; at large values of the temperature the output becomes more unpredictable.
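Assuming the sigmoidal form of P(xj) reconstructed above, here is a minimal Python/NumPy sketch of one asynchronous stochastic update; all names, shapes, and data are illustrative:

```python
import numpy as np

def boltzmann_unit_update(y, W, b, T, rng):
    """Asynchronously update one randomly chosen unit j: set y_j = +1 with
    probability P(x_j) = 1 / (1 + exp(-2 * x_j / T)), otherwise y_j = -1.
    Assumes states in {-1, +1}, symmetric W, and zero self-weights w_jj."""
    j = rng.integers(len(y))
    x_j = W[j] @ y + b[j]                     # induced local field of unit j
    p = 1.0 / (1.0 + np.exp(-2.0 * x_j / T))
    y[j] = 1 if rng.random() < p else -1
    return y

# Illustrative usage: a small 5-unit machine at pseudo-temperature T = 1.0.
rng = np.random.default_rng(3)
W = rng.normal(size=(5, 5)); W = 0.5 * (W + W.T); np.fill_diagonal(W, 0.0)
b = np.zeros(5)
y = rng.choice([-1, 1], size=5).astype(float)
for _ in range(1000):
    y = boltzmann_unit_update(y, W, b, T=1.0, rng=rng)
```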


Learning in Boltzmann machine

There are two functional groups of neurons in a Boltzmann machine: visible and hidden. The visible neurons are the units in the input/output layers that interact with the external environment. The states of these units can be clamped (fixed) to values determined by the environment, or unclamped (free to change). The hidden neurons always operate freely. There are two modes of operation, positive and negative.

  1. Positive mode (clamped condition): the visible neurons are all clamped onto specific states determined by the training data.
  2. Negative mode (unclamped condition): all the neurons (visible and hidden) are allowed to operate freely and no input data is given.

During training, the states of the input and output units are clamped. After training, the network is tested with the input units clamped and the outputs free, allowing the network to find the correct output states.

Weight update equation

Let the expected (average) product of yk and yj during training, with the input and output nodes fixed at a training input/output pattern, be represented by

⟨yk yj⟩⁺

This term denotes the correlation between the output states of neurons j and k in the clamped condition; the hidden nodes remain free to update.

Similarly, let the corresponding expected (average) product of yk and yj when all nodes, including the visible input/output nodes, run freely be represented by

⟨yk yj⟩⁻

This term denotes the correlation between the states of the same neurons j and k in the free-running condition.

Then, according to the Boltzmann learning rule, the weight update, obtained from the gradient of the log-likelihood function, is

Δwkj = η (⟨yk yj⟩⁺ − ⟨yk yj⟩⁻), j ≠ k

Both ⟨yk yj⟩⁺ and ⟨yk yj⟩⁻ take values in the range −1 to +1, and η is the learning parameter. The bias values bk are trained in a similar manner, but using only single-node activity:

Δbk = η (⟨yk⟩⁺ − ⟨yk⟩⁻)
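A hedged Python/NumPy sketch of these weight and bias updates follows; here the clamped and free-running correlations are passed in as precomputed matrices (an illustrative simplification; in practice they are estimated by averaging over many equilibrium samples of the machine):

```python
import numpy as np

def boltzmann_learning_step(W, b, corr_clamped, corr_free,
                            mean_clamped, mean_free, eta=0.01):
    """Boltzmann rule: delta w_kj = eta * (<y_k y_j>+ - <y_k y_j>-) and
    delta b_k = eta * (<y_k>+ - <y_k>-); all correlations lie in [-1, +1]."""
    dW = eta * (corr_clamped - corr_free)
    np.fill_diagonal(dW, 0.0)          # no self-connections: w_kk stays 0
    W = W + 0.5 * (dW + dW.T)          # keep the network symmetric, w_kj = w_jk
    b = b + eta * (mean_clamped - mean_free)
    return W, b
```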
References

  1. Simon S. Haykin, “Neural Networks and Learning Machines”, Pearson Education.
  2. Robert J. Schalkoff, “Artificial Neural Networks”, McGraw-Hill International Editions.
  3. Christopher Bishop, “Neural Networks for Pattern Recognition”, Oxford University Press.
  4. Laurene V. Fausett, “Fundamentals of Neural Networks”, Pearson Education.
  5. Erkam Guresen, Gulgun Kayakutlu, “Definition of artificial neural networks with comparison to other networks”, WCIT-2010, Procedia Computer Science 3 (2011) 426–433.
  6. Lars Haendel, “The PNC2 Cluster Algorithm: An Integrated Learning Algorithm for Rule Induction”, PhD thesis, University of Dortmund, Faculty of Electrical Engineering and Information Technologies, 2003.
  7. J. Mark Bishop, “History and Philosophy of Neural Networks”.
  8. https://www.neuraldesigner.com/blog/5_algorithms_to_train_a_neural_network

 

Image Credits

Figure-1 researchgate.net
Figure-2 cdnmedhall.org
Figure-3 mcb.berkeley.edu
Figure-7 Down Hill
Figure-8 mc.ai
Figure-10 officeguycartoons.com
Figure-12 electronicshub.org
Figure-13 asimovinstitute.org
Figure-14 polychord.io
Figure-15 researchgate.net




