Artificial Neural Networks - Definition and Learning methods for Deep Learning
Artificial Neural Networks and their learning methods
There are many definitions of an artificial neural network. Some definitions emphasize its connectionist and parallel structure, others its processing elements and learning algorithms, and still others are based on graph theory.
Neural Computing
Artificial neural networks were developed from the concepts of neural computing and connectionism. Neural computing is a style of computing that models biological neural systems based on learning from experience, as opposed to classical, tightly specified, rule-based algorithmic computing.
Aleksander and Morton define neural computing as
“Neural computing is the study of networks with adaptable nodes which, through a process of learning from task examples, store experiential knowledge and make it available for use.”
Artificial neural networks are models that put neural computing into practice. We therefore use the following definition of ANNs, given by Simon Haykin.
Artificial Neural Networks – a definition
Artificial neural networks are massively parallel networks made up of simple nonlinear processing units, which have a natural propensity for storing experiential knowledge and making it available for use. Such a network resembles the brain in two respects:
- Knowledge is acquired by the network from its environment through a learning process.
- Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.
The neural network derives its computing power through its massively parallel distributed structure and its ability to learn and generalize.
Figure 1: Neurons connected in parallel
Learning in Artificial Neural Networks
Learning Process
Learning in the context of neural networks is defined as a process by which the free parameters (synaptic weights and biases) of an ANN are adjusted or adapted, under stimulation from the environment, so as to produce the targeted response. A learning algorithm is a set of well-defined rules that governs the learning process.
Any learning process requires the following:
- Stimulation of the network from the environment. This provides the inputs.
- The network must produce a response. This is the output.
- The free parameters of the network, comprising the weights and biases, are adjusted by an iterative process in order to reduce the difference between the output and the targeted response.
Learning algorithms depend on the network architecture; their goal is to generalize beyond the training data. Driven by their input data, artificial neural networks learn by adapting their parameters to optimize an objective function (details of objective functions will follow in later posts).
A learning rule is determined by the manner in which changes in the parameters are implemented.
The first learning algorithm, the perceptron algorithm developed by Rosenblatt (refer to the previous post on Modeling Threshold Logic Neurons), was inspired by the learning process postulated by the Canadian psychologist Donald O. Hebb (1949). Hebb's postulate is based on the learning behaviour of a biological neuron and is the oldest of all learning rules.
Figure 2: Donald O. Hebb
Hebb’s postulate
“When an axon of neuron cell A is near enough to excite another cell B and repeatedly or persistently takes part in firing it (cell B) some growth process or metabolic change takes place in one or both cells such that cell A’s efficiency as one of the cells firing B is increased.”
Figure 3: Hebbian synapse
In other words, input synapses that receive a higher frequency of pulses are strengthened, and those receiving lower frequencies are weakened; the strength of synapses is thus continuously modified. This continuous modification of connection strengths gives rise to the complex process of learning in biological networks.
The basic rules of learning in ANNs can be grouped into five classes: 1) Hebbian, 2) error correction, 3) memory based, 4) competitive, and 5) stochastic.
Learning by association (Hebbian rule) in artificial neural networks
All models of cognition in the human brain use association of patterns in one form or another. The cerebral cortex is full of recurrent connections, and there is solid evidence of Hebbian synapse modification there. Hence, the cerebrum is believed to function as an associative memory.
Animals and humans tend to learn by associating things that occur together; this observation led Hebb to postulate his learning rule. Associations between an input stimulus s_n and an output response y_n can be learned if the stimulus and response pattern pairs occur frequently. A pair of patterns (s_n, y_n) is associated if, when the network is stimulated by s_n, it produces y_n as the response. Once the network is trained, it can be used for pattern recognition and pattern recall.
The Hebbian learning rule for the synaptic strength between two neurons can be phrased as a two-part rule.
- If two neurons j and k on either side of a synapse fire (generate output) with only a very small delay between them (almost simultaneously or synchronously), then the strength of the synapse between the two is increased. Equivalently, if the activation of neuron j at a given time can result in the activation of neuron k, then the synaptic weight is increased; the next time neuron j fires, the probability that neuron k also fires is higher.
- If two neurons j and k on either side of a synapse fire asynchronously, then the synaptic strength between them is weakened or eliminated.
In other words, the simple Hebbian learning rule for ANNs is stated as follows: adjust the weight w_kj of the connection between units j and k in proportion to the product of their simultaneous activation. Such a synapse is known as a Hebbian synapse.
Mathematical Modeling of Basic Hebbian Learning
We denote the output of a neuron unit by y. According to the Hebbian rule, the weight change for the mth iteration step is given by

$$\Delta w_{kj}(m) = F\big(y_k(m),\, y_j(m)\big)$$
This indicates that the weight change is a function of both the postsynaptic and presynaptic signals. The simplest form makes the weight change linear in the product of output and input:

$$\Delta w_{kj}(m) = \eta\, y_k(m)\, y_j(m)$$

where η is a positive learning-rate parameter, y_j is the presynaptic signal, and y_k is the postsynaptic output.
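As an illustration, the following is a minimal sketch of this linear Hebbian update in Python; the array shapes, the learning rate, and the helper name `hebbian_update` are assumptions made for this example, not part of any standard library.

```python
import numpy as np

def hebbian_update(w, y_pre, y_post, eta=0.01):
    """One Hebbian step: delta_w[k, j] = eta * y_post[k] * y_pre[j],
    i.e., the weight change is proportional to the product of the
    postsynaptic and presynaptic signals."""
    return w + eta * np.outer(y_post, y_pre)

# Toy example: 3 presynaptic units driving 2 postsynaptic units
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(2, 3))   # synaptic weights w[k, j]
y_pre = np.array([1.0, 0.0, 1.0])        # presynaptic activity y_j
y_post = w @ y_pre                       # linear postsynaptic response y_k
w = hebbian_update(w, y_pre, y_post)     # correlated units are strengthened
```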
Further discussion of the issues with the Hebbian rule is beyond the scope of this article; for further reading, see the references. The Hebbian learning rule has supervised as well as unsupervised versions.
Learning by error correction (Error-correction rule)
The objective of this learning method is to find the set of weight and bias values, called the trainable (free) network parameters, that is optimal, denoted by w*, for which the error between the targets and the outputs is minimum. This rule assumes that the error of an output neuron k is directly measurable. Hence this rule is applicable to the synapses of the final output layer of a multi-layered network.
Consider the following figure, where an output neuron k is driven by the signal vector y_h(m) produced by the hidden neurons in the previous layer. In the case of a single-layer network, the input s(m) is provided directly from the environment.
Figure 6
The argument m denotes the iteration step, i.e., the mth discrete time instant. The neurons in the hidden layers are driven by the input vector s(m) from the environment; this input is provided by the output of the source nodes in the input layer. The output signal of a neuron k in the output layer is denoted by y_k(m). This output signal from neuron k is compared with the desired output, denoted by t_k(m). The error signal at the mth iteration is thus defined by

$$e_k(m) = t_k(m) - y_k(m)$$
The error signal of the kth neuron is used to make corrective adjustments to the synaptic weights from its previous layer, such that the error between the output signal y_k(m) and the desired output t_k(m) is minimized in a step-by-step manner.
The total output error of the network, E(m), called the loss function (also known as the cost function), is defined by the following equation:

$$E(m) = \frac{1}{2} \sum_{k} e_k^2(m)$$

which is the instantaneous value (at the mth instant) of the total error energy (it is a squared-error function).
Iterations are repeated until the minimization of the cost function E(m) leads to a stabilization of the synaptic weights, or until the system output reaches a steady-state value.
The updated weight equation is given by

$$w_{kj}(m+1) = w_{kj}(m) + \Delta w_{kj}(m)$$

The change in the weight value, Δw_kj(m), is proportional to the gradient of the cost function. The total error energy is a function of the weight values w(m), so it can be denoted E(w(m)). Hence

$$\Delta w_{kj}(m) = -\eta\, \frac{\partial E(\mathbf{w}(m))}{\partial w_{kj}(m)}$$
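For a single linear output neuron, with $y_k(m) = \sum_j w_{kj}(m)\, y_j(m)$, the chain rule makes this gradient explicit (a standard step, spelled out here for completeness):

$$\frac{\partial E(m)}{\partial w_{kj}(m)} = \frac{\partial E(m)}{\partial e_k(m)} \cdot \frac{\partial e_k(m)}{\partial y_k(m)} \cdot \frac{\partial y_k(m)}{\partial w_{kj}(m)} = e_k(m) \cdot (-1) \cdot y_j(m)$$

so the update reduces to the familiar delta rule, $\Delta w_{kj}(m) = \eta\, e_k(m)\, y_j(m)$.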
This type of learning rule is commonly known as the gradient descent rule (please see the next post on the Gradient Descent Algorithm for further details). It is so called because the updates are performed in the direction opposite to the gradient of the error with respect to the weights. This is a derivative-based optimization method. When the learning rate η is small, convergence to the optimal value is slow, but the minimum error achieved is smaller than with large η values; large values of η can speed up convergence, but may cause oscillations near the minimum-error region.
The gradient descent algorithm is a search in the multidimensional weight space for the optimal w* that minimizes the cost function. The error value is reduced in small steps by updating the weight values iteratively. The update of the weight values is proportional to the gradient of the error with respect to the weights; η is called the learning-rate or step-size parameter.
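A minimal sketch of this iterative error-correction loop for a single linear neuron is given below; the toy data, learning rate, and epoch count are assumptions chosen purely for illustration.

```python
import numpy as np

def train_delta_rule(S, t, eta=0.05, epochs=200):
    """Error-correction (delta rule) training of one linear neuron.
    S: (N, d) array of input vectors; t: (N,) desired responses."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=S.shape[1])
    b = 0.0
    for _ in range(epochs):
        for s_n, t_n in zip(S, t):
            y = w @ s_n + b          # network response y_k(m)
            e = t_n - y              # error signal e_k(m) = t_k(m) - y_k(m)
            w += eta * e * s_n       # delta-rule step, down the error gradient
            b += eta * e             # bias update
    return w, b

# Toy target: t = 2*s1 - s2 + 1; the learned values approach [2, -1] and 1
S = np.random.default_rng(1).uniform(-1, 1, size=(100, 2))
t = 2 * S[:, 0] - S[:, 1] + 1
w, b = train_delta_rule(S, t)
print(w, b)
```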
This simple error-correction rule is applicable only for adapting the output-layer synaptic weights, since deviations of outputs from target values can be measured only at the final layer. To reach neurons in the inner hidden layers we need the backpropagation algorithm.
The steepest-descent version of this gradient descent rule moves along the direction in weight space in which the cost function decreases most rapidly, i.e., opposite to the direction of maximum derivative.
Figure 7: Descending downhill
Learning by memorizing - Memory Based Learning (MBL) or Instance Based Learning (IBL)
Memory-based learning or instance-based learning (IBL) algorithms store all or a subset of the training data set in memory and use the stored data to predict the output for a given input sample (see the figures below). After the pre-processing steps, the analysis of a data sample is done only at the moment when a prediction for a particular input is needed. The best-known instance-based learning algorithm is the k-Nearest Neighbour (kNN) algorithm.
A set of past experiences {(s_n, t_n)}, n = 1, …, N, is stored as input-output pattern pairs, where the vector s_n denotes the nth sample of the input and t_n denotes the corresponding desired response. The set of input-output pairs represents an association between a set of input vectors and the desired output vectors; the desired response (target) may also be a scalar quantity t_n. When a test vector s_test that is not included in the stored patterns is presented, it is classified into one of the labelled groups by searching through its local neighbourhood.
Therefore, all memory-based learning algorithms involve two criteria:
- The criterion used for defining the local neighbourhood of the test vector s_test.
- The learning rule applied to the training examples in the local neighbourhood of s_test.
Nearest Neighbour rule
Let the input training data set, called the prototype set, be denoted by D = {s_1, s_2, …, s_N}, consisting of N labelled prototypes. Let s_test be an unknown vector; the local neighbourhood of s_test is defined as the region that lies in its immediate vicinity.
Assume the test vector is closest to some vector s′ ∈ D; then the memory-based learning rule is as follows:
- Vector s′ ∈ {s_1, s_2, …, s_N} is considered to be the nearest neighbour of s_test if

$$d(s', s_{test}) = \min_{n} d(s_n, s_{test})$$

where d(s_n, s_test) is the Euclidean distance between s_test and each vector s_n, n = 1, 2, …, N.
- The label associated with s′, i.e., the nearest neighbour's class among the prototypes, is reported as the classification of s_test.
The response (output) of the network for the input s_test is the same as that of its nearest neighbour. As more test samples arrive, the neighbourhood boundaries change. This rule is independent of the underlying distribution responsible for generating the training vectors.
k-Nearest Neighbour classifier
A variant of the nearest neighbour classifier is the k-nearest neighbour classifier.
The learning rule is as follows (a code sketch follows the list):
- For some integer value of k, identify the k labelled prototype patterns that lie nearest to the test vector s_test.
- Assign the test vector to the class that is maximally represented among the k nearest neighbours of s_test.
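The following is a minimal sketch of the kNN rule in Python; the function name and toy data are assumptions for the example (with k = 1 it reduces to the plain nearest-neighbour rule).

```python
import numpy as np
from collections import Counter

def knn_classify(prototypes, labels, s_test, k=5):
    """Classify s_test by majority vote among its k nearest prototypes.
    prototypes: (N, d) stored training vectors; labels: length-N labels."""
    # Euclidean distances d(s_n, s_test) to every stored prototype
    dists = np.linalg.norm(prototypes - s_test, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]          # maximally represented class

# Toy example: two clusters of 2-D prototypes
proto = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
lab = np.array(["A", "A", "B", "B", "B"])
print(knn_classify(proto, lab, np.array([0.8, 1.0]), k=3))   # -> "B"
```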
Figure 8: Classification of a test vector using k = 5 nearest neighbours
Some application examples
Image search: the image features used include texture moments and colour histograms.
Figure 9: Classification of a query image
Try the following:
- Given a web page, find the 10 most similar pages on the web.
- Find the 3 nearest cities to your dwelling place.
As the size of the training data set approaches infinity and the value of k grows large (while remaining a vanishing fraction of the data size), the kNN classifier approaches the Bayes-optimal classifier.
(Read the blog post on Supervised Learning for more visual examples.)
Learning by competition (Competitive Learning, a.k.a. Kohonen learning)
Like the basic Hebbian rule and certain of its variants, competitive learning is an unsupervised learning method for artificial neural networks. The neurons of the output layer compete among themselves to fire (produce output). In a network that uses competitive learning, only a single output neuron wins and is active at any one time.
This method is well suited to discovering the statistical features of a set of input data and using those features to cluster the input data, with each output neuron representing a different cluster.
In its simplest form, the network has a single layer of output neurons, each of which is fully connected to the input (source) nodes. The figure below shows the standard schematic of a single-layer ANN that implements the competitive learning algorithm.
Competitive network
These networks include intralayer feedback connections among the neurons of the output layer. These feedback synapses perform lateral inhibition: the weight values of the lateral synapses are negative, so each neuron tends to inhibit the neurons to which it is laterally connected. The feedforward synaptic weights, in contrast, are all positive, i.e., excitatory.
For an output neuron k to be the winning neuron, its activation value x_k for a given input s must be the largest among all the neurons in the output layer. The output y_k of the winner is then set to 1; the outputs of all other neurons, which lose the competition, are set to zero.
The activation (induced local field) xk represents the combined action of all the feedforward and feedback inputs to neuron k.
The weights w_ki for all input nodes i connected to output node k are positive and distributed such that

$$\sum_i w_{ki} = 1 \quad \text{for all } k$$
According to the standard competitive learning rule, when a neuron receives its inputs s_i from the environment, the change in weights Δw_ki for the winning neuron is defined by

$$\Delta w_{ki} = \begin{cases} \eta\,(s_i - w_{ki}) & \text{if neuron } k \text{ wins the competition} \\ 0 & \text{if neuron } k \text{ loses} \end{cases}$$
If instead the neuron receives its inputs y_i from the previous hidden layer, the rule becomes Δw_ki = η (y_i − w_ki) for the winning neuron k.
This method of weight update is also called the Kohonen learning rule.
It is important to observe that the weight update under the competitive learning rule applies only to the feedforward connections, not to the intralayer connections.
The neuron that wins is called a winner-takes-all neuron; hence this algorithm is also called the winner-takes-all (WTA) algorithm.
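The following is a minimal sketch of one WTA step, assuming inputs arrive directly from the environment; renormalising the winner's weights after each step is one simple way to maintain the constraint that each neuron's weights sum to 1, and is an assumption of this example.

```python
import numpy as np

def competitive_step(W, s, eta=0.1):
    """One winner-takes-all step. W: (K, d) weights, one row per output
    neuron; s: (d,) nonnegative input vector. Only the winner moves."""
    x = W @ s                       # activations x_k of the output neurons
    k = int(np.argmax(x))           # winner: largest activation
    W[k] += eta * (s - W[k])        # Kohonen rule: delta_w_ki = eta*(s_i - w_ki)
    W[k] /= W[k].sum()              # keep the winner's weights summing to 1
    return k

# Toy example: 3 output neurons clustering random 2-D inputs
rng = np.random.default_rng(0)
W = rng.uniform(size=(3, 2))
W /= W.sum(axis=1, keepdims=True)   # positive weights, each row sums to 1
for s in rng.uniform(size=(200, 2)):
    competitive_step(W, s)
```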
Instar and Outstar Learning
Two types of network configurations can be trained with this learning rule: the instar network proposed by Kohonen and the outstar configuration proposed by Grossberg. The WTA algorithm is in fact a modification of the instar algorithm.
Figure 12: Instar and outstar configurations
Instar neurons win by adapting their weights to the inputs received from the previous layer. In outstar configurations, neurons win by adapting the output values they deliver to the neurons of the subsequent layer. Instar and outstar networks can be connected together to form complex networks.
Learning under uncertainty (Boltzmann Learning)
The Boltzmann machine is a stochastic recurrent neural network introduced by Hinton and Sejnowski in 1983. Its learning algorithm is stochastic, derived from the principles of statistical mechanics, and the machine is therefore named after Ludwig Boltzmann. It can be seen as the stochastic, generative counterpart of the Hopfield network. Like the Boltzmann machine, the Hopfield network is a symmetric network, but it has no hidden units; Boltzmann machines are therefore more powerful than Hopfield networks.
Figure 13: Hopfield network (all nodes are visible)
In the simplest introductory terms, Boltzmann machines belong to the family of energy-based models (EBMs); the most widely used variant is the Restricted Boltzmann Machine (RBM).
Figure 14: Boltzmann machine with visible (orange) and hidden (green) units
A neural network designed on the basis of the Boltzmann learning rule is called a Boltzmann machine. Boltzmann machines are symmetric networks (i.e., w_kj = w_jk) with hidden layers.
Figure-15
The Boltzmann machine is a widely used neural network model for solving difficult combinatorial optimization problems. It can find near-optimal solutions to hard problems such as graph partitioning and the Travelling Salesman Problem.
The state of a neuron unit j in a Boltzmann machine is updated asynchronously according to the stochastic activation rule

$$P(x_j \rightarrow -x_j) = \frac{1}{1 + \exp(\Delta E_j / T)}$$

where ΔE_j is the change in the energy of the machine that the flip would produce and T is the pseudo-temperature of the system.
There are two functional groups of neurons in a Boltzmann machine: visible and hidden. The visible neurons are the units in the input/output layers that interact with the external environment. The states of these units can be clamped (fixed) to values determined by the environment, or may be unclamped (free to change). The hidden neurons always operate freely. There are two modes of operation, positive and negative:
- Positive mode (clamped condition): the visible neurons are all clamped onto specific states determined by the training data.
- Negative mode (unclamped condition): all the neurons (visible and hidden) are allowed to operate freely and no input data is given.
Weight update equation
Let the expected (average) product of y_k and y_j during training, with the input and output nodes clamped to a training input/output pattern, be represented by $\rho^{+}_{kj}$. This term denotes the correlation between the output states of neurons j and k in the clamped condition, with the hidden nodes free to update.
Similarly, let the corresponding expected (average) product of y_k and y_j when all nodes, including the visible input/output nodes, run freely be represented by $\rho^{-}_{kj}$. This term denotes the correlation between the states of the same neurons j and k in the free-running condition.
Then, according to the Boltzmann learning rule, the weight update follows the gradient of the log-likelihood function:

$$\Delta w_{kj} = \eta \left( \rho^{+}_{kj} - \rho^{-}_{kj} \right)$$

Both correlations range in value between -1 and +1, and η is the learning parameter. Training the bias values is done in a similar manner, but uses only single-node activity:

$$\Delta b_{k} = \eta \left( \langle y_k \rangle^{+} - \langle y_k \rangle^{-} \right)$$
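To make the two phases concrete, here is a heavily simplified sketch in Python; the function names, the bipolar state convention, and the omission of the full correlation-estimation loop are all assumptions of this illustration rather than a reference implementation.

```python
import numpy as np

def stochastic_flip_sweep(W, b, x, clamped, T=1.0, rng=None):
    """One asynchronous sweep over the free units of a Boltzmann machine.
    States x are bipolar (+1/-1); units listed in `clamped` stay fixed.
    Assumes symmetric W with zero diagonal (no self-connections)."""
    rng = rng or np.random.default_rng()
    free = [j for j in range(len(x)) if j not in clamped]
    for j in rng.permutation(free):
        dE = 2.0 * x[j] * (W[j] @ x + b[j])   # energy increase if x_j flips
        if rng.random() < 1.0 / (1.0 + np.exp(dE / T)):
            x[j] = -x[j]                      # accept the stochastic flip
    return x

def boltzmann_weight_update(rho_plus, rho_minus, eta=0.01):
    """Boltzmann learning rule: delta_w = eta * (rho+ - rho-), where the
    rho matrices are correlations estimated in the clamped (positive) and
    free-running (negative) phases."""
    return eta * (rho_plus - rho_minus)
```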
References
- Simon S. Haykin, “Neural Networks and Learning Machines”, Pearson Education.
- Robert J. Schalkoff, “Artificial Neural Networks”, McGraw-Hill International Editions
- Christopher Bishop, “Neural networks for pattern recognition”, Oxford University Press
- Laurene V. Fausett, “Fundamentals of neural networks”, Pearson Education.
- Erkam Guresen, Gulgun Kayakutlu, “Definition of artificial neural networks with comparison to other networks”, WCIT-2010, Procedia Computer Science 3 (2011) 426–433
- Lars Haendel, PhD thesis, “The PNC2 Cluster Algorithm An integrated learning algorithm for rule induction” University of Dortmund, Faculty of Electrical Engineering and Information Technologies, 2003
- J. Mark Bishop, “History and Philosophy of Neural Networks”
- https://www.neuraldesigner.com/blog/5_algorithms_to_train_a_neural_network
Image Credits
Figure - 2 cdnmedhall.org
Figure - 3 mcb.berkeley.edu
Figure - 7 Down Hill
Figure - 8 mc.ai
Figure - 10 officeguycartoons.com
Figure - 12 electronicshub.org
Figure - 13 asimovinstitute.org
Figure - 14 polychord.io
Figure - 15 researchgate.net