Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w_{jk}^{l}$ to denote the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer.
We use a similar notation for the network's biases and activations. Explicitly, we use $b_{j}^{l}$ for the bias of the $j$th neuron in the $l$th layer and $a_{j}^{l}$ for the activation of the $j$th neuron in the $l$th layer. We also write $n^{l}$ for the number of neurons in the $l$th layer.
With these notations, the activation $a_{j}^{l}$ of the $j$th neuron in the $l$th layer is related to the activations in the $(l-1)$th layer by $$ a_{j}^{l} = \sigma \left( \sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1} + b_{j}^{l} \right) $$ where $\sigma$ is the activation function.
To rewrite this expression in a matrix form we define a weight matrix $w^{l}$ for each layer $l$: the entry in the $j$th row and $k$th column of $w^{l}$ is $w_{jk}^{l}$, so $w^{l}$ is an $n^{l} \times n^{l-1}$ matrix. Similarly, we define a bias vector $b^{l}$ with components $b_{j}^{l}$ and an activation vector $a^{l}$ with components $a_{j}^{l}$.
The last ingredient we need to rewrite this in a matrix form is the idea of vectorizing a function such as $\sigma$: we apply $\sigma$ to every element of a vector $v$, so that the vectorized $\sigma(v)$ has components $\sigma(v)_{j} = \sigma(v_{j})$.
With these notations in mind, the above equation can be rewritten in the beautiful and compact vectorized form: $$ a^{l} = \sigma(w^{l}a^{l-1}+b^{l}) $$ That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we've taken to now. Think of it as a way of escaping index hell, while remaining precise about what's going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization.
When using the above vectorized form to compute $a^{l}$, we compute the intermediate quantity $z^{l} \equiv w^{l}a^{l-1}+b^{l}$ along the way. We call $z^{l}$ the weighted input to the neurons in layer $l$.
So we also write $$ a^{l} = \sigma(z^{l}) $$
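To make the vectorized form concrete, here is a minimal NumPy sketch of the feedforward pass (the function and variable names are our own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function, applied to every entry of z."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Propagate the input column vector `a` through the network.

    `weights[i]` and `biases[i]` hold w^l and b^l for layer l = i + 2,
    i.e. the lists cover layers 2, ..., L.
    """
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b   # z^l = w^l a^{l-1} + b^l
        a = sigmoid(z)         # a^l = sigma(z^l)
    return a
```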
To train a neural network you need some measure of error between computed outputs and the desired target outputs of the training data. The most common measure of error is called mean squared error. However, there are some research results that suggest using a different measure, called cross entropy error, is sometimes preferable to using mean squared error.
So, which is better for neural network training: mean squared error or mean cross entropy error? The answer is, as usual, it depends on the particular problem. Research results in this area are rather difficult to compare. If one of the error functions were clearly superior to the other function in all situations, there would be no need for articles like this one. The consensus opinion among my immediate colleagues is that it's best to try mean cross entropy error first; then, if you have time, try mean squared error.
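For reference, the two measures are commonly written, for a single training example with target $y$ and network output $a^{L}$, as
$$
\begin{align}
C_{\text{MSE}} &= \frac{1}{2}\sum_{j} (y_j - a_{j}^{L})^2 \\
C_{\text{CE}} &= -\sum_{j} \left[ y_j \ln a_{j}^{L} + (1-y_j)\ln(1-a_{j}^{L}) \right]
\end{align}
$$
The derivation below uses a generic cost $C$, so it applies to either choice.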
The goal of backpropagation is to compute the partial derivatives $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$ of the cost function $C$ with respect to every weight and bias in the network. For backpropagation to work, we need to make two assumptions about the form of the cost function.
The first assumption is that the cost function can be written as an average over cost functions for individual training examples.
$$
C = \frac{1}{M}\sum_{m=1}^{M}C_m
$$
where $M$ is the number of training examples and $C_m$ is the cost for the individual training example $x^{(m)}$.
What backpropagation actually lets us do is compute the partial derivatives $\frac{\partial C_m}{\partial w_{jk}^{l}}$ and $\frac{\partial C_m}{\partial b_{j}^{l}}$ for a single training example; we then recover $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$ by averaging over the training examples.
The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network.
The cost $C_m$ for a single training example can therefore be written as a function $C_m(a^{L})$ of the output activations $a^{L}$ alone, with the desired output $y$ treated as a fixed parameter. For example, the quadratic cost for a single example, $C_m = \frac{1}{2}\|y - a^{L}\|^{2}$, has this form.
Before we go into the detailed calculation of backpropagation, we need to introduce some background from multivariable calculus.
Given the following multivariable functions $y$ and $x_i$: $$ \begin{align} y &= y(x_1,\cdots,x_i,\cdots,x_n) \\ x_i &= x_i(t_1,\cdots,t_j,\cdots,t_m); \quad 1 \le i \le n \end{align} $$
Based on the Chain rule for partial derivatives of multivariable functions, we can get $$ \frac{\partial y}{\partial t_j} = \sum_{i=1}^{n} \frac{\partial y}{\partial x_i} \times \frac{\partial x_i}{\partial t_j}; 1 \le j \le m $$
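As a quick sanity check of the rule, take $y = x_1 x_2$ with $x_1 = t^2$ and $x_2 = \sin t$ (so $n = 2$ and $m = 1$): $$ \frac{\partial y}{\partial t} = \frac{\partial y}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial y}{\partial x_2}\frac{\partial x_2}{\partial t} = x_2 \cdot 2t + x_1 \cdot \cos t = 2t\sin t + t^2\cos t, $$ which matches differentiating $y = t^2 \sin t$ directly.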
If we slice a neural network and keep only the layers from $l$ to the output layer $L$, the cost $C$ can be viewed as a function of the weighted inputs of layer $l$: $$ C = f^{l}(z_{1}^{l}, \cdots, z_{j}^{l}, \cdots, z_{n^{l}}^{l}) $$
Note:
$C$ corresponds to $y$ in the previous statement of the chain rule for multivariable functions, and $(z_{1}^{l}, \cdots, z_{j}^{l}, \cdots, z_{n^l}^{l})$ corresponds to $(x_1,\cdots,x_i,\cdots,x_n)$.
We can also rewrite $z_{j}^{l}$ as a function of the weighted inputs of the previous layer:
$$
\begin{align}
z_{j}^{l} &= \sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1}+b_j^{l} \\
&= z_{j}^{l}(a_{1}^{l-1}, \cdots, a_{k}^{l-1}, \cdots, a_{n^{l-1}}^{l-1}) \\
&= \sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times \sigma(z_{k}^{l-1})+b_j^{l} \\
&= z_{j}^{l}(\sigma(z_{1}^{l-1}), \cdots, \sigma(z_{k}^{l-1}), \cdots, \sigma(z_{n^{l-1}}^{l-1})) \\
&= s_j^{l}(z_{1}^{l-1}, \cdots, z_{k}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})
\end{align}
$$
where $s_j^{l}$ denotes $z_{j}^{l}$ viewed as a function of the weighted inputs $(z_{1}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})$ of the $(l-1)$th layer.
Note:
$z_{j}^{l}$ corresponds to $x_i$ in the previous statement of the chain rule for multivariable functions, and $(z_{1}^{l-1}, \cdots, z_{k}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})$ corresponds to $(t_1,\cdots,t_j,\cdots,t_m)$.
We will show how to apply this chain rule to the backpropagation algorithm shortly.
Now we introduce a linear-algebraic operation used in backpropagation: the Hadamard product. Suppose $s$ and $t$ are two vectors of the same dimension. Then $s \odot t$ denotes the elementwise product of the two vectors, with components $(s \odot t)_j = s_j \times t_j$.
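In NumPy, as in the other sketches in this article, the Hadamard product of two arrays of the same shape is simply the `*` operator:

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])

# Elementwise (Hadamard) product: (s ⊙ t)_j = s_j * t_j
print(s * t)  # [ 4. 10. 18.]
```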
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$.
To get there, we define the error of neuron $j$ in layer $l$ as $$ \delta_{j}^{l} \equiv \frac{\partial C}{\partial z_{j}^{l}} $$
If we get $\delta_{j}^{l}$, the partial derivative of $C$ with respect to the bias follows immediately, since $\frac{\partial z_{j}^{l}}{\partial b_{j}^{l}} = 1$: $$ \frac{\partial C}{\partial b_{j}^{l}} = \frac{\partial C}{\partial z_{j}^{l}} \times \frac{\partial z_{j}^{l}}{\partial b_{j}^{l}} = \delta_{j}^{l} $$
The above equation can be written in the vectorized form, and we define the derivative of $C$ with respect to the bias vector $b^{l}$ as $$ \nabla_{b^{l}}C = \delta^{l} $$
We can also easily calculate $$ \begin{align} \frac{\partial C}{\partial w_{jk}^{l}} &= \frac{\partial C}{\partial z_{j}^{l}} \times \frac{\partial z_{j}^{l}}{\partial w_{jk}^{l}} \\ &= \delta_{j}^{l} \times \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1}+b_j^{l})}{\partial w_{jk}^{l}} \\ &= \delta_{j}^{l} \times a_{k}^{l-1} \end{align} $$
The above equation can be written in the vectorized form, and we define the derivative of $C$ with respect to the weight matrix $w^{l}$ as $$ \nabla_{w^{l}}C = \delta^{l} \times (a^{l-1})^{T} $$
$w^{l}$ and $\nabla_{w^{l}}C$ are both $n^{l} \times n^{l-1}$ matrices. $\delta^{l}$ is a column vector, i.e. an $n^{l} \times 1$ matrix, and $(a^{l-1})^{T}$ is a $1 \times n^{l-1}$ matrix.
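A small sketch of these shapes in NumPy (the values here are made up purely for illustration):

```python
import numpy as np

delta = np.array([[0.1], [0.3]])           # delta^l, an n^l x 1 column vector (n^l = 2)
a_prev = np.array([[0.2], [0.5], [0.7]])   # a^{l-1}, an n^{l-1} x 1 column vector (n^{l-1} = 3)

nabla_w = np.dot(delta, a_prev.T)  # delta^l (a^{l-1})^T, shape (2, 3), same as w^l
nabla_b = delta                    # gradient with respect to the bias vector b^l
print(nabla_w.shape)               # (2, 3)
```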
When the activation $a_{k}^{l-1}$ is small, $a_{k}^{l-1} \approx 0$, the gradient term $\frac{\partial C}{\partial w_{jk}^{l}}$ will also tend to be small. In this case we say the weight learns slowly, meaning that it doesn't change much during gradient descent. One consequence, then, is that weights attached to low-activation neurons learn slowly.
The key issue is how to calculate $\delta_{j}^{l}$ for every layer.
First, we can easily calculate the error in the output layer $L$: $$ \begin{align} \delta_{j}^{L} &= \frac{\partial C}{\partial z_{j}^{L}} \\ &= \frac{\partial C}{\partial a_{j}^{L}} \times \frac{\partial a_{j}^{L}}{\partial z_{j}^{L}} \\ &= \frac{\partial C}{\partial a_{j}^{L}} \times \sigma^{\prime}(z_{j}^{L}) \end{align} $$
The above equation can be rewritten in the vectorized form: $$ \delta^{L} = \nabla_{a^{L}}C \odot \sigma^{\prime}(z^{L}) $$
Next we express the error in a layer in terms of the error in the layer after it. Using $C = f^{l}(z_{1}^{l}, \cdots, z_{n^{l}}^{l})$ and $z_{j}^{l} = s_{j}^{l}(z_{1}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})$ from the chain-rule setup above, we can start from layer $l$ and compute
$$
\begin{align}
\delta_{k}^{l-1} &= \frac{\partial C}{\partial z_{k}^{l-1}} \\
&= \sum_{j=1}^{n^{l}} \frac{\partial f^{l}}{\partial z_{j}^{l}} \times \frac{\partial s_j^{l}}{\partial z_{k}^{l-1}} \\
&= \sum_{j=1}^{n^{l}} \frac{\partial C}{\partial z_{j}^{l}} \times \frac{\partial z_{j}^{l}}{\partial z_{k}^{l-1}}
\end{align}
$$
We have $$ \begin{align} \frac{\partial z_{j}^{l}}{\partial z_{k}^{l-1}} &= \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1}+b_j^{l})}{\partial z_{k}^{l-1}} \\ &= \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times \sigma(z_{k}^{l-1})+b_j^{l})}{\partial z_{k}^{l-1}} \\ &= \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times \sigma(z_{k}^{l-1})+b_j^{l})}{\partial \sigma(z_{k}^{l-1})} \times \frac{\partial \sigma(z_{k}^{l-1})}{\partial z_{k}^{l-1}} \\ &= w_{jk}^{l} \times \sigma^{\prime}(z_{k}^{l-1}) \end{align} $$
So $$ \delta_{k}^{l-1} = \sum_{j=1}^{n^{l}} \delta_{j}^{l} \times w_{jk}^{l} \times \sigma^{\prime}(z_{k}^{l-1}) $$ The above equation can be written in the vectorized form: $$ \delta^{l-1} = (w^{l})^{T}\delta^{l} \odot \sigma^{\prime}(z^{l-1}) $$
Note:
From the above equation, we can also get $\delta^{l} = (w^{l+1})^{T}\delta^{l+1} \odot \sigma^{\prime}(z^{l})$. Consider the term $\sigma^{\prime}(z_{j}^{l})$: if the activation function $\sigma$ is the sigmoid, it becomes very flat when $\sigma(z_{j}^{l})$ is approximately $0$ or $1$. When this occurs we have $\sigma^{\prime}(z_{j}^{l}) \approx 0$ and hence $\delta_{j}^{l} \approx 0$. Since $\frac{\partial C}{\partial w_{jk}^{l}} = \delta_{j}^{l} \times a_{k}^{l-1}$, the lesson is that a weight into neuron $j$ of layer $l$ will learn slowly if that neuron's activation $\sigma(z_{j}^{l})$ is either low ($\approx 0$) or high ($\approx 1$). In this case it's common to say the neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold for the biases as well.
We can summarize the above results in the vectorized form as follows $$ \begin{align} \delta^{L} &= \nabla_{a^{L}}C \odot \sigma^{\prime}(z^{L}) \\ \delta^{l-1} &= (w^{l})^{T}\delta^{l} \odot \sigma^{\prime}(z^{l-1}) \\ \nabla_{b^{l}}C &= \delta^{l} \\ \nabla_{w^{l}}C &= \delta^{l} \times (a^{l-1})^{T} \end{align} $$
The previous section derived the equations of backpropagation; let's now explicitly write them out in the form of an algorithm.
First, we start with only a single training example $x^{(m)}$, with per-example cost $C_m$:
- **Input vector $x^{(m)}$**: Assign $x^{(m)}$ to the activation $a^1$ for the input layer.
- **Feedforward**: For each layer $l=2, 3, \cdots, L$ compute $z^{l}=w^{l}a^{l-1}+b^{l}$ and $a^{l} = \sigma(z^{l})$.
- **Error in the last layer $\delta^{L}$**: Compute the vector $\delta^{L} = \nabla_{a^{L}}C_m \odot \sigma^{\prime}(z^{L})$.
- **Backpropagate the error to previous layers**: For each $l=L-1, L-2, \cdots, 2$ compute $\delta^{l} = (w^{l+1})^{T}\delta^{l+1} \odot \sigma^{\prime}(z^{l})$.
- **Output**: The gradient of the cost function is given by $\nabla_{b^{l}}C_m = \delta^{l}$ and $\nabla_{w^{l}}C_m = \delta^{l} \times (a^{l-1})^{T}$.
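The steps above translate almost line by line into NumPy. The sketch below is our own, and it assumes sigmoid activations and the quadratic cost $C_m = \frac{1}{2}\|a^{L}-y\|^{2}$ (so that $\nabla_{a^{L}}C_m = a^{L}-y$); `weights[i]` and `biases[i]` hold $w^{l}$ and $b^{l}$ for $l = i+2$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (nabla_b, nabla_w), the gradients of C_m for one example.

    Assumes the quadratic cost, so nabla_a C_m = a^L - y.
    """
    # Feedforward: store all weighted inputs z^l and activations a^l.
    activation = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # Error in the last layer: delta^L = (a^L - y) ⊙ sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].T)

    # Backpropagate: delta^l = (w^{l+1})^T delta^{l+1} ⊙ sigma'(z^l).
    for i in range(2, len(weights) + 1):
        delta = np.dot(weights[-i + 1].T, delta) * sigmoid_prime(zs[-i])
        nabla_b[-i] = delta
        nabla_w[-i] = np.dot(delta, activations[-i - 1].T)
    return nabla_b, nabla_w
```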
In practice, it's common to combine backpropagation with stochastic gradient descent over a mini-batch of $M$ training examples:
- **Input**: $M$ training examples of a mini-batch.
- **For each training example $x^{(m)}$**: Assign $x^{(m)}$ to the activation $a^1$ for the input layer, then perform the following steps:
    - **Feedforward**: For each layer $l=2, 3, \cdots, L$ compute $z^{l}=w^{l}a^{l-1}+b^{l}$ and $a^{l} = \sigma(z^{l})$.
    - **Error in the last layer $\delta^{L}$**: Compute the vector $\delta^{L} = \nabla_{a^{L}}C_m \odot \sigma^{\prime}(z^{L})$.
    - **Backpropagate the error to previous layers**: For each $l=L-1, L-2, \cdots, 2$ compute $\delta^{l} = (w^{l+1})^{T}\delta^{l+1} \odot \sigma^{\prime}(z^{l})$.
- **Gradient descent**: For each $l=L, L-1, \cdots, 2$ update the weights and biases according to the rule $$ \begin{align} w^l &\to w^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{w^{l}}C_m \\ b^l &\to b^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{b^{l}}C_m \end{align} $$ where $\eta$ is the learning rate.
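A sketch of the corresponding mini-batch update, reusing the `backprop` routine sketched above (again, our own naming, not a reference implementation):

```python
import numpy as np

def update_mini_batch(mini_batch, weights, biases, eta):
    """One mini-batch step of gradient descent.

    `mini_batch` is a list of (x, y) pairs of length M; `eta` is the
    learning rate.
    """
    # Accumulate the per-example gradients, layer by layer.
    sum_nabla_b = [np.zeros(b.shape) for b in biases]
    sum_nabla_w = [np.zeros(w.shape) for w in weights]
    for x, y in mini_batch:
        nabla_b, nabla_w = backprop(x, y, weights, biases)
        sum_nabla_b = [sb + nb for sb, nb in zip(sum_nabla_b, nabla_b)]
        sum_nabla_w = [sw + nw for sw, nw in zip(sum_nabla_w, nabla_w)]

    # w^l -> w^l - (eta/M) * sum_m nabla_w C_m, and likewise for b^l.
    M = len(mini_batch)
    weights = [w - (eta / M) * nw for w, nw in zip(weights, sum_nabla_w)]
    biases = [b - (eta / M) * nb for b, nb in zip(biases, sum_nabla_b)]
    return weights, biases
```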
Of course, to implement stochastic gradient descent in practice you also need an outer loop generating the mini-batches of training examples, and an outer loop stepping through multiple epochs of training. We've omitted those for simplicity.
To appreciate why backpropagation is fast, consider a more obvious approach to computing the gradient. You decide to regard the cost as a function of the weights alone, $C = C(w)$ (the biases can be treated the same way), and to estimate each partial derivative numerically: $$ \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_{j})-C(w)}{\epsilon} $$ where $\epsilon > 0$ is a small positive number and $e_{j}$ is the unit vector in the direction of the $j$th weight.
Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to evaluate $C(w+\epsilon e_{j})$, and each evaluation requires forward passes through the network for every training example. Computing the gradient this way therefore requires on the order of a million separate evaluations of the cost.
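As an illustration, numerically estimating just one partial derivative looks something like this sketch, where `cost` is a hypothetical function that evaluates $C(w)$ with a full pass over the training data:

```python
import numpy as np

def numerical_partial(cost, w, j, eps=1e-5):
    """Estimate dC/dw_j via the finite difference (C(w + eps*e_j) - C(w)) / eps.

    Every call to `cost` requires forward passes over the whole training set,
    so estimating the gradient of a million weights this way takes roughly a
    million (plus one) such evaluations.
    """
    w_shifted = w.copy()
    w_shifted[j] += eps              # perturb only the j-th weight
    return (cost(w_shifted) - cost(w)) / eps
```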
What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$ using just one forward pass through the network, followed by one backward pass.
So even though backpropagation appears superficially more complex than the obvious approach above, it's actually much, much faster!
How the backpropagation algorithm works (http://neuralnetworksanddeeplearning.com/chap2.html)
Neural Network Cross Entropy Error (https://visualstudiomagazine.com/Articles/2014/04/01/Neural-Network-Cross-Entropy-Error.aspx?p=1)