Mathematics of Back Propagation for Deep Learning simplified
Although formulated without reference to neural networks, the back-propagation method was perhaps first described in a 1970 master's thesis by Seppo Linnainmaa. In 1973 Stuart E. Dreyfus published a simpler derivative-based method to minimise a cost function by adapting the parameters of controllers in proportion to error gradients.
The first neural-network-specific application of BPA was proposed by Paul Werbos in 1982. This was later refined by Rumelhart, Hinton and Williams in 1986 and made a radical change in solving supervised learning problems. Their research paper experimentally demonstrated that BPA can learn useful internal representations in the hidden layers and can address every synapse of a Multi-Layered Feed-Forward Neural Network (MLFFNN). This paper significantly contributed to the popularisation of BPA.
Description of BPA
BP uses ordered derivatives (as termed by Werbos in The Roots of Backpropagation: From Ordered Derivatives to Neural Networks) to adapt the weights of hidden-layer synapses. Hidden-layer nodes in an MLFFNN do not have target output values. BPA, also called the generalised delta rule, is a way to create target values for hidden-layer nodes that have no desired or target value of their own. (The ordinary delta rule works only for output-layer units, whose target values can be compared with their visible actual outputs.) An example of a network with a single hidden layer is shown in the figure below. BPA makes it possible to address any synaptic weight anywhere in the network and update its value.
The term backpropagation refers to the manner in which information about gradients of the cost function is propagated into the hidden layers. Since this algorithm is based on the error-correction learning rule, it is called the error back-propagation algorithm. We describe the mathematical derivation of the method below.
Do not be intimidated by the seemingly long sequence of simple steps in the following derivation using the chain rule. Just enjoy the sheer poetry of derivative calculus.
Backprop is a gradient descent weight update algorithm
The variables and notations
- m – iteration index for the weight-update step. We include this in all our equations for clarity of logic.
- k – subscript index for output-layer neurons.
- j – subscript index for hidden-layer neurons (assume only one hidden layer for simplicity).
- i – subscript index for input units.
- w(m) – the set of all trainable weight parameters of the network. This is often a weight matrix for MLFFNNs.
- E(w(m)) – error function at the output as a function of the weight parameters.
- η – learning rate.
- w^o_kj(m) – scalar synaptic weight from the jth neuron of the hidden layer immediately before to the kth neuron of the output layer at the mth iteration step.
- x^o_k(m) – scalar activation value of the kth output neuron.
- f(x^o_k(m)) – activation function of the kth output neuron.
- y^o_k(m) – activation-function output of the kth output neuron.
- e_k(m) – error at the output for the kth neuron.
- δ^o_k(m) – error-gradient information of the kth output neuron.
- w^h_ji(m) – scalar synaptic weight from the ith neuron of the input layer to the jth neuron of the hidden layer.
- x^h_j(m) – scalar activation value of the jth hidden neuron.
- f(x^h_j(m)) – activation function of the jth hidden neuron.
- y^h_j(m) – activation-function output of the jth hidden neuron.
- δ^h_j(m) – error-gradient information for the jth hidden neuron.
The algorithm for Output and Hidden layer weight update
Gradient descent update equation of Output Layer weights
The basic weight update equation for the (m+1)th step for the output layer is given by
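In the notation defined above, this is the standard gradient-descent rule, with Δw^o_kj(m) denoting the weight change:

```latex
w^{o}_{kj}(m+1) = w^{o}_{kj}(m) + \Delta w^{o}_{kj}(m),
\qquad
\Delta w^{o}_{kj}(m) = -\,\eta\, \frac{\partial E(w(m))}{\partial w^{o}_{kj}(m)}
```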
The figure shows how input data is forwarded to the output layer through the network.
Deriving the update equation for output layer weights in steps
Step -1: Using the chain rule
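Assuming the total squared error E(m) = ½ Σ_k e_k(m)² with e_k(m) = t_k(m) − y^o_k(m), where t_k(m) is the target value of the kth output neuron, the chain rule expands the gradient into factors that the following steps evaluate one by one:

```latex
\frac{\partial E(m)}{\partial w^{o}_{kj}(m)}
= \frac{\partial E(m)}{\partial e_{k}(m)}
  \cdot \frac{\partial e_{k}(m)}{\partial y^{o}_{k}(m)}
  \cdot \frac{\partial y^{o}_{k}(m)}{\partial x^{o}_{k}(m)}
  \cdot \frac{\partial x^{o}_{k}(m)}{\partial w^{o}_{kj}(m)}
```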
Step -2
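With the squared-error cost assumed above, the first two chain-rule factors evaluate to:

```latex
\frac{\partial E(m)}{\partial e_{k}(m)} = e_{k}(m),
\qquad
\frac{\partial e_{k}(m)}{\partial y^{o}_{k}(m)} = -1
```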
Step -3
This step gives the derivative of the output-layer neuron's activation function w.r.t. its activation. Here "J" is the total number of hidden neurons connected to the kth output neuron, and the summation interval includes the bias input (j = 0).
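Writing the activation of the kth output neuron as this weighted sum over the hidden-layer outputs, the two relations used here are:

```latex
x^{o}_{k}(m) = \sum_{j=0}^{J} w^{o}_{kj}(m)\, y^{h}_{j}(m),
\qquad
\frac{\partial y^{o}_{k}(m)}{\partial x^{o}_{k}(m)} = f'\!\left(x^{o}_{k}(m)\right)
```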
Step -4: Defining the local error gradient of output neurons
The local error-gradient information at the kth output neuron is equal to the product of its error and the derivative of its output function. This local error gradient is also called the error information. (It is also called the sensitivity; see the MIT notes Backpropagation.pdf.)
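In symbols:

```latex
\delta^{o}_{k}(m) = e_{k}(m)\, f'\!\left(x^{o}_{k}(m)\right)
```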
Step -5
The derivative of the activation value of the kth output-layer unit w.r.t. the synaptic weight from the jth unit of the hidden layer equals the input from that unit.
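That is, differentiating the weighted-sum activation:

```latex
\frac{\partial x^{o}_{k}(m)}{\partial w^{o}_{kj}(m)} = y^{h}_{j}(m)
```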
Step -6
Therefore, the weight change:
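In the notation above, this takes the standard delta-rule form:

```latex
\Delta w^{o}_{kj}(m) = \eta\, \delta^{o}_{k}(m)\, y^{h}_{j}(m)
```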
The weight value is updated by the product of the learning parameter η, the local error gradient δ^o_k(m) of the kth output neuron, and the input signal y^h_j(m) from the jth hidden neuron. (In the ordinary delta rule the same form applies, but only to output-layer units, whose target values are directly available.)
Thus we see that for BPA the computation of the local error gradient (error information) requires knowledge of the derivative of the activation function f(x). Logistic sigmoid functions and hyperbolic tangent functions are non-linear functions that are differentiable everywhere.
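For these two common choices the derivative has a convenient closed form in terms of the output itself (a standard identity, stated here for completeness):

```latex
f(x) = \frac{1}{1+e^{-x}} \;\Rightarrow\; f'(x) = f(x)\,\bigl(1-f(x)\bigr),
\qquad
f(x) = \tanh(x) \;\Rightarrow\; f'(x) = 1 - f(x)^{2}
```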
Finally, the weight update equation for the output layer is
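Collecting the results of the previous steps:

```latex
w^{o}_{kj}(m+1) = w^{o}_{kj}(m) + \eta\, \delta^{o}_{k}(m)\, y^{h}_{j}(m)
```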
The key factor of the above equation in calculating the weight change is the error-information signal of the kth output neuron, represented by δ^o_k(m).
Gradient descent update equation of Hidden Layer weights
A change in the weight values of a hidden neuron can influence all the output neurons connected to it.
For the hidden layer, the weight update equation for the synapse between the jth hidden neuron and the ith unit in the previous layer at the (m + 1)th step is given by the equation
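Again in the notation above, this is the gradient-descent rule applied to w^h_ji(m):

```latex
w^{h}_{ji}(m+1) = w^{h}_{ji}(m) + \Delta w^{h}_{ji}(m),
\qquad
\Delta w^{h}_{ji}(m) = -\,\eta\, \frac{\partial E(w(m))}{\partial w^{h}_{ji}(m)}
```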
The error at the hidden-layer neurons is not directly computable. Hence the error information of the hidden units must be determined from the error information of all the units connected to them in the subsequent layer.
We assume that there is only one hidden layer and that the input to the hidden layer is the signal vector received from the input environment. As for the output layer, the error information for the hidden layer is denoted by δ^h_j(m).
As for the output-layer units, the change in the synaptic weight between neuron j in the hidden layer and neuron i in the previous layer is computed as the product of its error information and s_i(m), the signal received from the ith unit in the previous layer, scaled by the learning rate η, as follows
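In symbols:

```latex
\Delta w^{h}_{ji}(m) = \eta\, \delta^{h}_{j}(m)\, s_{i}(m)
```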
Step -7: The local error gradient δ^h_j(m) of the jth hidden-layer neuron can be calculated as the product of the partial derivative of the total output error w.r.t. its own output y^h_j(m) (taken with a minus sign, as in gradient descent) and the derivative of its own activation function.
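In symbols (this is the relation the substitution in step 10 refers back to):

```latex
\delta^{h}_{j}(m) = -\,\frac{\partial E(m)}{\partial y^{h}_{j}(m)}\; f'\!\left(x^{h}_{j}(m)\right)
```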
Step -8: Differentiating the total squared error at the output w.r.t. the output of hidden neuron y^h_j(m), we get
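Assuming E(m) = ½ Σ_k e_k(m)² as before:

```latex
\frac{\partial E(m)}{\partial y^{h}_{j}(m)}
= \sum_{k} e_{k}(m)\, \frac{\partial e_{k}(m)}{\partial y^{h}_{j}(m)}
```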
Using the chain rule, the derivative of the error of the kth output neuron w.r.t. the jth hidden neuron's output can be written as the product of partial derivatives,
where the activation of the kth output neuron is the weighted sum of the hidden-layer outputs defined earlier.
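Evaluating the factors with that weighted-sum activation (so that ∂x^o_k(m)/∂y^h_j(m) = w^o_kj(m)):

```latex
\frac{\partial e_{k}(m)}{\partial y^{h}_{j}(m)}
= \frac{\partial e_{k}(m)}{\partial y^{o}_{k}(m)}
  \cdot \frac{\partial y^{o}_{k}(m)}{\partial x^{o}_{k}(m)}
  \cdot \frac{\partial x^{o}_{k}(m)}{\partial y^{h}_{j}(m)}
= -\,f'\!\left(x^{o}_{k}(m)\right) w^{o}_{kj}(m),
\qquad
x^{o}_{k}(m) = \sum_{j=0}^{J} w^{o}_{kj}(m)\, y^{h}_{j}(m)
```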
This means that the error information of each hidden neuron is computed using the error information of all the neurons in the next layer that are connected to that hidden unit.
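Combining the last two relations gives the back-propagated error term:

```latex
\frac{\partial E(m)}{\partial y^{h}_{j}(m)}
= -\sum_{k} e_{k}(m)\, f'\!\left(x^{o}_{k}(m)\right) w^{o}_{kj}(m)
= -\sum_{k} \delta^{o}_{k}(m)\, w^{o}_{kj}(m)
```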
Step -9: Derivative of the activation function f(x^h_j(m))
The derivative of the hidden-layer neuron's activation function w.r.t. its activation is
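In the notation above:

```latex
\frac{\partial y^{h}_{j}(m)}{\partial x^{h}_{j}(m)} = f'\!\left(x^{h}_{j}(m)\right)
```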
Step -10: Substituting equation (17) of step 8 and equation (18) of step 9 into equation (13) of step 7, the local error gradient of the jth hidden-layer neuron is given by
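In the present notation, the result of this substitution is:

```latex
\delta^{h}_{j}(m) = f'\!\left(x^{h}_{j}(m)\right) \sum_{k} \delta^{o}_{k}(m)\, w^{o}_{kj}(m)
```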
The three factors of the local gradient of the hidden neurons:
- The first set of terms, with δ^o_k(m), requires knowledge of the error signals of all the output neurons directly connected to the jth hidden neuron.
- The second set of terms, w^o_kj(m), consists of the corresponding connection weights of the hidden neuron.
- The third term is the derivative of the activation function of the jth hidden neuron.
The error-gradient terms represent the error propagated back to the hidden layers from the subsequent layers. Hence the name back-propagation algorithm.
Step -11
Similar to step 5, the derivative of the activation value of the jth hidden unit w.r.t. the weight w^h_ji(m) equals its input.
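With the hidden activation written as x^h_j(m) = Σ_{i=0}^{I} w^h_ji(m) s_i(m) (the bias taken as i = 0, a convention assumed here to mirror the output layer):

```latex
\frac{\partial x^{h}_{j}(m)}{\partial w^{h}_{ji}(m)} = s_{i}(m)
```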
Step 12: Computation of weight change in hidden units
From steps 8, 9, 10 and 11 we get (similarly to step 6)
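Putting these together, the hidden-layer weight change is:

```latex
\Delta w^{h}_{ji}(m)
= \eta\, \delta^{h}_{j}(m)\, s_{i}(m)
= \eta\, f'\!\left(x^{h}_{j}(m)\right) \left[\sum_{k} \delta^{o}_{k}(m)\, w^{o}_{kj}(m)\right] s_{i}(m)
```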
Figure-4 shows how the error information is propagated backward through the network.
Step 13: Putting it all together
Gradient descent weight update of Hidden layer neurons
Substituting in equation (11), we get
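This gives the hidden-layer update rule:

```latex
w^{h}_{ji}(m+1) = w^{h}_{ji}(m) + \eta\, \delta^{h}_{j}(m)\, s_{i}(m)
```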
This has exactly the same form as the update for the output units.
The back-propagation algorithm uses the gradient-descent weight update equations (10) and (22) iteratively until the output error reaches a minimum value.
Summarising the steps into four stages
Initialization of the network weights: During the first stage the network weights w_kj and w_ji for all the neuron units in the network are initialized randomly to small values.
Feed forward of input information: During the feed-forward stage the input signals s = {s0, s1, s2, s3, …, sI}^T are fed forward from the input layer to the hidden layer. Each hidden-layer neuron unit computes its output based on its activation function f(x_j). The activation function of a hidden-layer unit is a sigmoid function. This output is then fed forward to the neuron units in the output layer.
Back propagation of errors: During this stage the output (y_k) of each unit in the output layer is compared with its target value (t_k), and the local gradients δ^o_k(m) (for the output layer) and δ^h_j(m) (for the hidden layer) are computed. The index m denotes the mth step of the learning cycle.
Updating the weights and biases: During this stage the weights, including those of the bias inputs, are updated as per the generalized delta learning rule.
The above steps are iterated until the weights converge to steady values and the output error is minimised.
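The four stages can be summarised in code. Below is a minimal sketch (not from the original text) of one training iteration for a single-hidden-layer network with sigmoid activations, squared error, and the update rules derived above; variable names are illustrative and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def train_step(s, t, W_h, W_o, eta=0.1):
    """One back-propagation iteration for a single-hidden-layer MLFFNN.

    s   : input signal vector, shape (I,)
    t   : target output vector, shape (K,)
    W_h : hidden-layer weights, shape (J, I)
    W_o : output-layer weights, shape (K, J)
    """
    # Feed forward of input information
    x_h = W_h @ s                                     # hidden activations x_j^h
    y_h = sigmoid(x_h)                                # hidden outputs y_j^h
    x_o = W_o @ y_h                                   # output activations x_k^o
    y_o = sigmoid(x_o)                                # network outputs y_k^o

    # Back propagation of errors
    e = t - y_o                                       # e_k = t_k - y_k
    delta_o = e * y_o * (1.0 - y_o)                   # delta_k^o = e_k f'(x_k^o)
    delta_h = (W_o.T @ delta_o) * y_h * (1.0 - y_h)   # delta_j^h from back-propagated error

    # Updating the weights (generalized delta rule)
    W_o = W_o + eta * np.outer(delta_o, y_h)          # Δw_kj^o = η δ_k^o y_j^h
    W_h = W_h + eta * np.outer(delta_h, s)            # Δw_ji^h = η δ_j^h s_i

    return W_h, W_o, 0.5 * np.sum(e ** 2)
```

Repeating train_step over the training data, starting from small random weights, until the returned error stops decreasing corresponds to the iterative scheme described above.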
References:
- Laurene V. Fausett, "Fundamentals of Neural Networks", Pearson Education, 1994.
- Simon S. Haykin, "Neural Networks and Learning Machines", 2nd Ed., Pearson Education.
- J. Schmidhuber, "Deep Learning in Neural Networks: An Overview", Neural Networks, Vol. 61, pp. 85–117, 2015. DOI: 10.1016/j.neunet.2014.09.003
- David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams, "Learning representations by back-propagating errors", Nature 323, 533–536 (1986).
- Wikipedia