Mathematics of Back Propagation for Deep Learning, Simplified

The Back Propagation Algorithm: a short history

The back propagation method, although without reference to neural networks, was perhaps first described in a 1970 master's thesis by Seppo Linnainmaa. In 1973 Stuart E. Dreyfus published a simpler derivative-based method to minimise a cost function by adapting the parameters of controllers in proportion to error gradients.

The first neural-network-specific application of the BPA was proposed by Paul Werbos in 1982. It was later refined by Rumelhart, Hinton and Williams in 1986 and made a radical change in solving supervised learning problems. Their research paper experimentally demonstrated that the BPA produces useful internal representations in the hidden layers and makes it easy to address the synapses of Multi Layered Feed Forward Neural Networks (MLFFNN). This paper significantly contributed to the popularisation of the BPA.

Description of BPA

BP uses ordered derivatives (as termed by Werbos in The Roots of Backpropagation: From Ordered Derivatives to Neural Networks) to adapt the weights of hidden layer synapses. Hidden layer nodes in an MLFFNN do not have target output values. The BPA, also called the generalised delta rule, is a way to create error terms for hidden layer nodes that have no desired or target value. (The ordinary delta rule works only for output layer units, whose visible actual outputs can be compared with target values.) An example network with a single hidden layer is shown in the figure below. The BPA makes it possible to address any synaptic weight anywhere in the network and update its value.

Figure-1

The term backpropagation refers to the manner in which information about gradients of the cost function is propagated into the hidden layers. Since this algorithm is based on the error correction learning rule, it is called the error back-propagation algorithm. We now describe the mathematical derivation of the method.

Do not get intimidated by the seemingly long sequence of simple steps in the following chain-rule derivations. Just enjoy the sheer poetry of derivative calculus that follows.


Backprop is a gradient descent weight update algorithm

The variables and notations

m – iteration index for the weight update step. We include this in all our equations for clarity of logic.

k – subscript index for output layer neurons.

j – subscript index for hidden layer neurons (assume only one hidden layer for simplicity).

i – subscript index for input units.

w(m) – the set of all trainable weight parameters of the network. This is often a weight matrix for MLFFNNs.

E(w(m)) – error function at the output as a function of the weight parameters.

η – learning rate.

$s_i(m)$ – signal received from the ith input unit.

$w_{kj}^{o}(m)$ – scalar synaptic weight from the jth neuron of the hidden layer to the kth neuron of the output layer at the mth iteration step.

$x_{k}^{o}(m)$ – scalar activation value of the kth output neuron.

$f(x_{k}^{o}(m))$ – activation function of the kth output neuron.

$y_{k}^{o}(m)$ – activation function output of the kth output neuron.

$e_{k}(m)$ – error at the output of the kth neuron.

$\delta_{k}^{o}(m)$ – local error gradient information of the kth output neuron.

$w_{ji}^{h}(m)$ – scalar synaptic weight from the ith unit of the input layer to the jth neuron of the hidden layer.

$x_{j}^{h}(m)$ – scalar activation value of the jth hidden neuron.

$f(x_{j}^{h}(m))$ – activation function of the jth hidden neuron.

$y_{j}^{h}(m)$ – activation function output of the jth hidden neuron.

$\delta_{j}^{h}(m)$ – local error gradient information for the jth hidden neuron.

The algorithm for Output and Hidden layer weight update 

Gradient descent update equation of  Output Layer weights

The basic weight update equation for the (m+1)th step for the output layer is given by

$$w_{kj}^{o}(m+1) = w_{kj}^{o}(m) - \eta\, \frac{\partial E(w(m))}{\partial w_{kj}^{o}(m)}$$   - (1)

The figure below shows how input data is fed forward through the network to the output layer.

Figure-2

Deriving the update equation for output layer weights in steps

Using the chain rule,

$$\frac{\partial E}{\partial w_{kj}^{o}(m)} = \frac{\partial E}{\partial e_{k}(m)} \cdot \frac{\partial e_{k}(m)}{\partial y_{k}^{o}(m)} \cdot \frac{\partial y_{k}^{o}(m)}{\partial x_{k}^{o}(m)} \cdot \frac{\partial x_{k}^{o}(m)}{\partial w_{kj}^{o}(m)}$$   - (2)

Step 1

With the squared error cost function $E(w(m)) = \frac{1}{2}\sum_{k} e_{k}^{2}(m)$,

$$\frac{\partial E}{\partial e_{k}(m)} = e_{k}(m)$$   - (3)

Step 2

Since $e_{k}(m) = t_{k}(m) - y_{k}^{o}(m)$, where $t_{k}(m)$ is the target value,

$$\frac{\partial e_{k}(m)}{\partial y_{k}^{o}(m)} = -1$$   - (4)

Step 3

Derivative of the output layer neuron activation function w.r.t. its activation:

$$\frac{\partial y_{k}^{o}(m)}{\partial x_{k}^{o}(m)} = f'(x_{k}^{o}(m))$$   - (5)

where

$$x_{k}^{o}(m) = \sum_{j=0}^{J} w_{kj}^{o}(m)\, y_{j}^{h}(m)$$

and J is the total number of hidden neurons connected to the kth output neuron. The summation interval includes the bias input (j = 0).

Step 4: Defining the local error gradient of output neurons

The local error gradient information at the kth output neuron is

$$\delta_{k}^{o}(m) = e_{k}(m)\, f'(x_{k}^{o}(m))$$   - (6)

This is equal to the product of its error and the derivative of its output function. This local error gradient is also called the error information. (It is also called the sensitivity; see the MIT notes Backpropagation.pdf.)

Step 5

The derivative of the activation value of the kth output-layer unit w.r.t. the synaptic weight from the jth hidden unit equals the input received from that unit:

$$\frac{\partial x_{k}^{o}(m)}{\partial w_{kj}^{o}(m)} = y_{j}^{h}(m)$$   - (7)

Step 6

Therefore

$$\frac{\partial E}{\partial w_{kj}^{o}(m)} = -\,e_{k}(m)\, f'(x_{k}^{o}(m))\, y_{j}^{h}(m) = -\,\delta_{k}^{o}(m)\, y_{j}^{h}(m)$$   - (8)

The weight change is

$$\Delta w_{kj}^{o}(m) = -\eta\, \frac{\partial E}{\partial w_{kj}^{o}(m)} = \eta\, \delta_{k}^{o}(m)\, y_{j}^{h}(m)$$   - (9)

The weight value is updated by the product of the learning rate η, the local error gradient $\delta_{k}^{o}(m)$ of the kth output neuron, and the input signal $y_{j}^{h}(m)$ from the jth hidden neuron. In the ordinary delta rule, $\delta_{k}^{o}(m)$ is replaced by the actual error $e_{k}(m)$.

Thus we see that for the BPA, computing the local error gradient (error information) requires knowledge of the derivative of the activation function f(x). The logistic sigmoid and the hyperbolic tangent are non-linear functions that are differentiable everywhere.
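As a quick check, the logistic sigmoid's derivative has the convenient closed form f'(x) = f(x)(1 − f(x)). A minimal Python sketch (the function names are our own) verifies this against a numerical finite difference:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid activation f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Closed-form derivative: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# verify against a central finite difference at a sample point
x, h = 0.3, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid_prime(x)
```

The same trick works for tanh, whose derivative is 1 − tanh²(x).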

Finally, the weight update equation for the output layer is

$$w_{kj}^{o}(m+1) = w_{kj}^{o}(m) + \eta\, \delta_{k}^{o}(m)\, y_{j}^{h}(m)$$   - (10)

The key factor of the above equation in calculating the weight change is the error information signal $\delta_{k}^{o}(m)$ of the kth output neuron.
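Equations (6), (9) and (10) can be sketched in a few lines of NumPy. The layer sizes, hidden outputs and targets below are made-up illustrative values, and for sigmoid units the derivative f'(x_k^o) is written as y_k^o(1 − y_k^o):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
y_h = rng.random(3)            # y_j^h(m): outputs of 3 hidden neurons (made up)
W_o = rng.random((2, 3))       # w_kj^o(m): weights to 2 output neurons (made up)
t = np.array([1.0, 0.0])       # t_k: target values (made up)
eta = 0.1                      # learning rate

x_o = W_o @ y_h                # activations x_k^o(m)
y_o = sigmoid(x_o)             # outputs y_k^o(m)
e = t - y_o                    # errors e_k(m)
delta_o = e * y_o * (1 - y_o)             # eq (6): delta_k^o = e_k f'(x_k^o)
W_o = W_o + eta * np.outer(delta_o, y_h)  # eq (10): one gradient descent step
```

One such step moves each output toward its target; iterating drives the squared error down.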

Gradient descent update equation of Hidden Layer weights

A change in the weight values of a hidden neuron can influence all the output neurons connected to it.

For hidden layers, the weight update equation for the synapse between the jth hidden neuron and the ith unit in the previous layer at the (m+1)th step is given by

$$w_{ji}^{h}(m+1) = w_{ji}^{h}(m) - \eta\, \frac{\partial E}{\partial w_{ji}^{h}(m)}$$   - (11)

Step 7: Defining local error gradients of hidden neurons

The errors at hidden layer neurons are not directly computable. Hence the error information of a hidden unit must be determined from the error information of all units connected to it in the subsequent layer.

We assume that there is only one hidden layer and that the input to the hidden layer is the signal vector received from the input environment. Similar to the output layer, the error information for the hidden layer is denoted by $\delta_{j}^{h}(m)$.

Similar to the output layer units, the change in the synaptic weight between neuron j in the hidden layer and neuron i in the previous layer is computed as the product of its error information and $s_i(m)$, the signal received from the ith unit in the previous layer, scaled by the learning rate η:

$$\Delta w_{ji}^{h}(m) = \eta\, \delta_{j}^{h}(m)\, s_{i}(m)$$   - (12)

The local error gradient $\delta_{j}^{h}(m)$ of the jth hidden layer neuron can be calculated as the product of the (negative) partial derivative of the total output error w.r.t. its own output $y_{j}^{h}(m)$ and the derivative of its own activation function:

$$\delta_{j}^{h}(m) = -\frac{\partial E}{\partial y_{j}^{h}(m)} \cdot \frac{\partial y_{j}^{h}(m)}{\partial x_{j}^{h}(m)}$$   - (13)

Step 8: Partial derivative of the total output error w.r.t. $y_{j}^{h}(m)$

Differentiating the total squared error at the output w.r.t. the output from hidden neuron $y_{j}^{h}(m)$, we get

$$\frac{\partial E}{\partial y_{j}^{h}(m)} = \sum_{k} e_{k}(m)\, \frac{\partial e_{k}(m)}{\partial y_{j}^{h}(m)}$$   - (14)

Using the chain rule, the derivative of the error of the kth output neuron w.r.t. the jth hidden neuron output can be written as the product of partial derivatives

$$\frac{\partial e_{k}(m)}{\partial y_{j}^{h}(m)} = \frac{\partial e_{k}(m)}{\partial y_{k}^{o}(m)} \cdot \frac{\partial y_{k}^{o}(m)}{\partial x_{k}^{o}(m)} \cdot \frac{\partial x_{k}^{o}(m)}{\partial y_{j}^{h}(m)}$$   - (15)

Since the activation of the kth output neuron is

$$x_{k}^{o}(m) = \sum_{j=0}^{J} w_{kj}^{o}(m)\, y_{j}^{h}(m)$$

we have

$$\frac{\partial x_{k}^{o}(m)}{\partial y_{j}^{h}(m)} = w_{kj}^{o}(m)$$   - (16)

Using also the results from Steps 2, 3 and 4, we get

$$\frac{\partial E}{\partial y_{j}^{h}(m)} = -\sum_{k} \delta_{k}^{o}(m)\, w_{kj}^{o}(m)$$   - (17)

Figure-3

This means that the error information of each hidden neuron is computed using the error information of all neurons in the next layer that are connected to that hidden unit.

Step 9: Derivative of the activation function $f(x_{j}^{h}(m))$

Derivative of the hidden layer neuron activation function w.r.t. its activation:

$$\frac{\partial y_{j}^{h}(m)}{\partial x_{j}^{h}(m)} = f'(x_{j}^{h}(m))$$   - (18)

Step 10: Representing the error information of hidden units using the error information from output units

Substituting equation (17) of Step 8 and equation (18) of Step 9 into equation (13) of Step 7, the local error gradient of the jth hidden layer neuron is given by

$$\delta_{j}^{h}(m) = f'(x_{j}^{h}(m)) \sum_{k} \delta_{k}^{o}(m)\, w_{kj}^{o}(m)$$   - (19)

The three factors of the local gradient of the hidden neurons:

  1. The first factor, $\delta_{k}^{o}(m)$, requires knowledge of the error signals of all the output neurons directly connected to the jth hidden neuron.
  2. The second factor, $w_{kj}^{o}(m)$, consists of the corresponding connection weights of the hidden neuron.
  3. The third factor is the derivative of the activation function of the jth hidden neuron.

The error gradient terms represent the error propagated back to the hidden layers from the subsequent layers. Hence the name back propagation algorithm.
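Equation (19) amounts to a matrix-vector product followed by an elementwise scaling. A small sketch with made-up numbers (2 output neurons, 3 hidden neurons, sigmoid units so that f'(x_j^h) = y_j^h(1 − y_j^h)):

```python
import numpy as np

delta_o = np.array([0.1, -0.2])        # delta_k^o(m): output error gradients (made up)
W_o = np.array([[0.5, 0.1, 0.2],
                [0.3, 0.4, 0.6]])      # w_kj^o(m): output layer weights (made up)
y_h = np.array([0.5, 0.5, 0.5])        # y_j^h(m): hidden outputs (made up)

# eq (19): delta_j^h = f'(x_j^h) * sum_k delta_k^o * w_kj^o
# W_o.T @ delta_o propagates the output error information back to each hidden unit
delta_h = (y_h * (1.0 - y_h)) * (W_o.T @ delta_o)
```

The transpose makes the back propagation explicit: column j of W_o collects the weights leaving hidden neuron j, exactly the sum over k in equation (19).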

Step 11

Similar to Step 5, the derivative of the activation value of the jth hidden unit w.r.t. the synaptic weight from the ith input unit equals its input:

$$\frac{\partial x_{j}^{h}(m)}{\partial w_{ji}^{h}(m)} = s_{i}(m)$$   - (20)

Step 12: Computation of the weight change in hidden units

From Steps 8 to 11 we get (similar to Step 6)

$$\Delta w_{ji}^{h}(m) = \eta\, \delta_{j}^{h}(m)\, s_{i}(m)$$   - (21)

The following figure shows how error information is propagated backward 

Figure-4


Step 13: Putting it all together

Gradient descent weight update of  Hidden layer neurons

Substituting in equation (11),

$$w_{ji}^{h}(m+1) = w_{ji}^{h}(m) + \eta\, \delta_{j}^{h}(m)\, s_{i}(m)$$   - (22)

This has exactly the same form as the update for the output units.

Back propagation algorithm uses the gradient descent weight update equations (10) and (22) iteratively until the output error reaches a minimum value.


Summarising the steps into four stages

Initialization of the network weights: During the first stage the network weights $w_{kj}^{o}$ and $w_{ji}^{h}$ for all the neuron units in the network are initialized randomly to small values.

Feed Forward of input information: During the feed forward stage the input signals $s = [s_0, s_1, s_2, \dots, s_I]^T$ are fed forward from the input layer to the hidden layer. Each hidden layer neuron computes its output using its activation function $f(x_j^h)$; here the activation function is a sigmoid. These outputs are then fed forward to the neurons in the output layer.

Back Propagation of errors: During this stage the output $y_k^o$ of each unit in the output layer is compared with its target value $t_k$, and the local gradients $\delta_k^o(m)$ (for the output layer) and $\delta_j^h(m)$ (for the hidden layer) are computed. The index m denotes the mth step of the learning cycle.

Updating the weight and biases: During this stage the weights including that of the bias input are updated as per the generalized delta learning rule.

The above steps are iterated until the weights converge to steady values and the output error is minimised.
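The four stages above can be put together in a short NumPy sketch. As an illustration (our own toy setup, not from the post) we train a 2-4-1 sigmoid network on the XOR problem with the online updates of equations (10) and (22), biases included:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# toy XOR data (illustrative choice)
S = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # input signals s
T = np.array([[0.], [1.], [1.], [0.]])                  # targets t_k

# Stage 1: initialise weights and biases randomly to small values
rng = np.random.default_rng(42)
W_h = rng.uniform(-0.5, 0.5, (4, 2))   # hidden weights w_ji^h
b_h = rng.uniform(-0.5, 0.5, 4)
W_o = rng.uniform(-0.5, 0.5, (1, 4))   # output weights w_kj^o
b_o = rng.uniform(-0.5, 0.5, 1)
eta = 0.5

def total_error():
    """Total squared error over the whole dataset."""
    Y = sigmoid(W_o @ sigmoid(W_h @ S.T + b_h[:, None]) + b_o[:, None])
    return float(np.sum((T.T - Y) ** 2))

err_before = total_error()

for _ in range(5000):
    for s, t in zip(S, T):
        # Stage 2: feed forward
        y_h = sigmoid(W_h @ s + b_h)
        y_o = sigmoid(W_o @ y_h + b_o)
        # Stage 3: back propagate the error information
        delta_o = (t - y_o) * y_o * (1 - y_o)            # eq (6)
        delta_h = (W_o.T @ delta_o) * y_h * (1 - y_h)    # eq (19)
        # Stage 4: update weights and biases (eqs (10) and (22))
        W_o += eta * np.outer(delta_o, y_h)
        b_o += eta * delta_o
        W_h += eta * np.outer(delta_h, s)
        b_h += eta * delta_h

err_after = total_error()
```

With enough epochs the squared error typically falls close to zero and the network reproduces the XOR truth table.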


References:

  1. Laurene V. Fausett, “Fundamentals of neural networks”, 1994,  Pearson Education.
  2. Simon S. Haykin, “Neural Networks and Learning Machines”, 2nd Ed, Pearson Education.
  3. J. Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, Vol 61, p 85-117, 2015. DOI: 10.1016/j.neunet.2014.09.003
  4. David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams, “Learning Representations by Back-propagating Errors”, Nature, vol. 323, pages 533–536 (1986).
  5. Wikipedia

Image Credits

Figure-1: www.bogotobogo.com



