To minimize the cost function, we want small changes in the weights to lead to small changes in the output. We can then use this property to modify the weights so that the network gets closer to what we want, step by step. This is also one reason why we need a smooth activation function rather than a step function with outputs in {0, 1}.
1. The intuitive idea
For a small change $\Delta w$ in the weights, the cost changes by approximately
$$\Delta \text{Cost} \approx \nabla_w \text{Cost} \cdot \Delta w$$
If we want to decrease Cost, which means $\Delta \text{Cost} < 0$, here let
$$\Delta w = -\eta \nabla_w \text{Cost} \quad (\eta > 0 \text{ is the learning rate})$$
so that $\Delta \text{Cost} \approx -\eta \|\nabla_w \text{Cost}\|^2 \le 0$ and the cost keeps decreasing.
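To see this numerically, here is a minimal sketch (the toy quadratic cost, the initial weights, and the learning rate $\eta$ are all assumptions made purely for illustration) showing that repeated steps $\Delta w = -\eta \nabla_w \text{Cost}$ keep shrinking the cost:

```python
import numpy as np

def cost(w):
    # Toy quadratic cost, assumed only for this illustration.
    return np.sum(w ** 2)

def grad_cost(w):
    # Gradient of the toy cost: d/dw_i (w_i^2) = 2 w_i.
    return 2 * w

eta = 0.1                       # learning rate (hand-picked)
w = np.array([1.0, -2.0, 0.5])  # some initial weights

for step in range(5):
    w = w - eta * grad_cost(w)  # Delta w = -eta * grad_w(Cost)
    print(step, cost(w))        # the cost decreases at every step
```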
2. Back-Propagation
Warm-up:
$\nabla_w \text{Cost}$ (or $\frac{\partial \text{Cost}}{\partial w}$) tells us how quickly the Cost changes when we update $w$.
2.1 Introduction to Notations
$w^l_{jk}$ denotes the weight from the $k$-th neuron in layer $l-1$ to the $j$-th neuron in layer $l$; $a^l_j$ and $b^l_j$ denote the activation and the bias of the $j$-th neuron in layer $l$, so that
$$a^l_j = f\Big(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\Big)$$
On the other hand, for convenience, we define the weighted input
$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j, \qquad a^l_j = f(z^l_j)$$
or, in matrix form, $z^l = w^l a^{l-1} + b^l$ and $a^l = f(z^l)$.
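To make the notation concrete, here is a minimal forward-pass sketch in NumPy; the layer sizes, the sigmoid choice of $f$, and the random initialization are assumptions for illustration only:

```python
import numpy as np

def f(z):
    # Sigmoid activation (an assumed choice of smooth f).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # assumed layer sizes: input, hidden, output

# weights[l] has shape (sizes[l+1], sizes[l]); biases[l] has shape (sizes[l+1], 1)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.standard_normal((m, 1)) for m in sizes[1:]]

a = rng.standard_normal((sizes[0], 1))  # input activation a^0
zs, activations = [], [a]
for w, b in zip(weights, biases):
    z = w @ a + b   # z^l = w^l a^{l-1} + b^l
    a = f(z)        # a^l = f(z^l)
    zs.append(z)
    activations.append(a)
```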
2.2 The Fundamental Equations
In the beginning, we define the ERROR of neuron $j$ in layer $l$ as:
$$\delta^l_j \equiv \frac{\partial \text{Cost}}{\partial z^l_j}$$
This definition makes sense. Suppose a little change ($\Delta z^l_j$) is added to the neuron's weighted input, so the neuron outputs $f(z^l_j + \Delta z^l_j)$ instead of $f(z^l_j)$, and the cost changes by approximately $\frac{\partial \text{Cost}}{\partial z^l_j} \Delta z^l_j$.
In this case, if $\frac{\partial \text{Cost}}{\partial z^l_j}$ is close to zero, such a change can hardly improve the cost, so the neuron is already close to optimal; a large value means the neuron still carries a lot of error. That is why $\delta^l_j$ is a natural measure of the error in the neuron.
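One way to check this intuition numerically is to nudge a single $z$ by a small $\Delta z$ and compare the resulting change in the cost with $\delta \,\Delta z$. A minimal sketch, assuming a single output neuron with sigmoid activation and quadratic cost (both assumptions for illustration):

```python
import numpy as np

def f(z):
    # Sigmoid activation (assumed).
    return 1.0 / (1.0 + np.exp(-z))

def cost_from_z(z, y):
    # Quadratic cost of one output neuron, viewed as a function of its weighted input z.
    return 0.5 * (f(z) - y) ** 2

z, y = 0.8, 0.3
a = f(z)
delta = (a - y) * a * (1 - a)   # analytic dCost/dz for this toy setup

dz = 1e-5
numeric = (cost_from_z(z + dz, y) - cost_from_z(z, y)) / dz
print(delta, numeric)           # the finite-difference estimate matches delta closely
```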
- An equation for the error in the output layer:
$$\delta^L = \frac{\partial \text{Cost}}{\partial z^L} \quad (L \text{ denotes the last layer})$$
$$\delta^L = \frac{\partial \text{Cost}}{\partial a^L}\frac{\partial a^L}{\partial z^L} = \frac{\partial \text{Cost}}{\partial a^L} f'(z^L) = \nabla_{a^L}\text{Cost} \odot f'(z^L) \quad (\odot \text{ denotes the componentwise (Hadamard) product})$$
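In code, and assuming a quadratic cost $\frac{1}{2}\|a^L - y\|^2$ with a sigmoid activation (both assumptions for illustration), this output-layer error is just an elementwise product:

```python
import numpy as np

def f(z):
    # Sigmoid activation (assumed).
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1 - s)

def output_error(z_L, y):
    # delta^L = grad_{a^L} Cost ⊙ f'(z^L); for the quadratic cost, grad_{a^L} Cost = a^L - y.
    a_L = f(z_L)
    return (a_L - y) * f_prime(z_L)   # * is the componentwise (Hadamard) product

z_L = np.array([[0.5], [-1.2]])
y   = np.array([[1.0], [0.0]])
delta_L = output_error(z_L, y)
```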
- An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:
$$\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot f'(z^l)$$
More specifically, for an individual neuron:
$$\delta^l_j = \frac{\partial \text{Cost}}{\partial z^l_j} = \sum_k \frac{\partial \text{Cost}}{\partial z^{l+1}_k}\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$
Meanwhile
$$z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} f(z^l_j) + b^{l+1}_k \quad\Rightarrow\quad \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} f'(z^l_j)$$
So
$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, f'(z^l_j)$$
which is the componentwise form of the matrix equation above.
(wl)T∗δl to obtainδl−1 seems propagate error backward through the network. We then take the Hadamard product⊙f′(zl−1) . This moves the error backward through the activation function in layerl−1
- An equation for the rate of change of the cost with respect to any weight in the network:
$$\frac{\partial \text{Cost}}{\partial w^l_{jk}} = \frac{\partial \text{Cost}}{\partial z^l_j}\frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\, a^{l-1}_k$$
In a less index-heavy form:
$$\frac{\partial \text{Cost}}{\partial w} = a_{\text{in}}\, \delta_{\text{out}}$$
where it's understood that $a_{\text{in}}$ is the activation of the neuron that feeds into the weight $w$, and $\delta_{\text{out}}$ is the error of the neuron that the weight $w$ feeds into.
Note that if $a_{\text{in}}$ is small ($a_{\text{in}} \approx 0$), the gradient $\frac{\partial \text{Cost}}{\partial w}$ is also small, so the weight learns slowly: weights coming out of low-activation neurons change little during gradient descent.
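In vectorized form this gradient is just an outer product of the layer's error with the previous layer's activations; a minimal sketch with assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
delta_l = rng.standard_normal((2, 1))  # delta^l, error of layer l
a_prev  = rng.standard_normal((4, 1))  # a^{l-1}, activations feeding the weights

# dCost/dw^l[j, k] = delta^l[j] * a^{l-1}[k], i.e. an outer product
grad_w = delta_l @ a_prev.T            # shape (2, 4), the same as w^l

# If a_prev[k] is close to 0, column k of grad_w is close to 0,
# so the weights coming out of that low-activation neuron learn slowly.
```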
Pseudo code for updating $w$ ($m$ is the size of the mini-batch):
$$w^l \rightarrow w^l - \frac{\eta}{m}\sum_x \delta^{x,l}\,(a^{x,l-1})^T$$
What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial \text{Cost}/\partial w_j$ using just one forward pass through the network, followed by one backward pass through the network.
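Putting the pieces together, here is a minimal sketch of that update for one mini-batch, assuming a sigmoid activation and a quadratic cost (the bias gradients, which the pseudo code above omits, are accumulated in the same pass):

```python
import numpy as np

def f(z):
    # Sigmoid activation (assumed).
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1 - s)

def backprop(weights, biases, x, y):
    """Return the gradients dCost/dw^l and dCost/db^l for a single sample (x, y)."""
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):                 # forward pass
        z = w @ a + b
        zs.append(z)
        a = f(z)
        activations.append(a)

    delta = (activations[-1] - y) * f_prime(zs[-1])   # output error (quadratic cost assumed)
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    grad_w[-1] = delta @ activations[-2].T
    grad_b[-1] = delta
    for l in range(2, len(weights) + 1):              # backward pass
        delta = (weights[-l + 1].T @ delta) * f_prime(zs[-l])  # propagate the error back
        grad_w[-l] = delta @ activations[-l - 1].T             # dCost/dw = delta * a_in^T
        grad_b[-l] = delta
    return grad_w, grad_b

def update_mini_batch(weights, biases, batch, eta):
    """w^l <- w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T over the mini-batch."""
    m = len(batch)
    sum_w = [np.zeros_like(w) for w in weights]
    sum_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        gw, gb = backprop(weights, biases, x, y)
        sum_w = [s + g for s, g in zip(sum_w, gw)]
        sum_b = [s + g for s, g in zip(sum_b, gb)]
    weights = [w - (eta / m) * s for w, s in zip(weights, sum_w)]
    biases  = [b - (eta / m) * s for b, s in zip(biases, sum_b)]
    return weights, biases
```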