Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w_{jk}^{l}$ to denote the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer.
We use a similar notation for the network's biases and activations. Explicitly, we use $b_{j}^{l}$ for the bias of the $j$th neuron in the $l$th layer and $a_{j}^{l}$ for the activation of the $j$th neuron in the $l$th layer. We also write $n^{l}$ for the number of neurons in the $l$th layer.
With these notations, the activation $a_{j}^{l}$ of the $j$th neuron in the $l$th layer is related to the activations in the $(l-1)$th layer by $$ a_{j}^{l} = \sigma \left( \sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1} + b_{j}^{l} \right) $$ where $\sigma$ is the activation function.
To rewrite this expression in a matrix form we define a weight matrix $w^{l}$ for each layer $l$: the entry in the $j$th row and $k$th column of $w^{l}$ is $w_{jk}^{l}$, so $w^{l}$ is an $n^{l} \times n^{l-1}$ matrix. Similarly, we define a bias vector $b^{l}$ with components $b_{j}^{l}$ and an activation vector $a^{l}$ with components $a_{j}^{l}$.
The last ingredient we need to rewrite this in a matrix form is the idea of vectorizing a function such as $\sigma$: we apply $\sigma$ to every element of a vector $v$, so that the vectorized $\sigma(v)$ has components $\sigma(v)_{j} = \sigma(v_{j})$.
With these notations in mind, the above equation can be rewritten in the beautiful and compact vectorized form: $$ a^{l} = \sigma(w^{l}a^{l-1}+b^{l}) $$ That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we've taken to now. Think of it as a way of escaping index hell, while remaining precise about what's going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization.
When using the above vectorized form to compute $a^{l}$, we compute the intermediate quantity $z^{l} \equiv w^{l}a^{l-1}+b^{l}$ along the way. We call $z^{l}$ the weighted input to the neurons in layer $l$.
So we also write $$ a^{l} = \sigma(z^{l}) $$
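To make the vectorized form concrete, here is a minimal NumPy sketch of the feedforward pass (the function and variable names are our own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function, applied to every entry of z."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Propagate the input column vector `a` through the network.

    `weights[i]` and `biases[i]` hold w^l and b^l for layer l = i + 2,
    i.e. the lists cover layers 2, ..., L.
    """
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b   # z^l = w^l a^{l-1} + b^l
        a = sigmoid(z)         # a^l = sigma(z^l)
    return a
```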
To train a neural network you need some measure of error between computed outputs and the desired target outputs of the training data. The most common measure of error is called mean squared error. However, there are some research results that suggest using a different measure, called cross entropy error, is sometimes preferable to using mean squared error.
So, which is better for neural network training: mean squared error or mean cross entropy error? The answer is, as usual, it depends on the particular problem. Research results in this area are rather difficult to compare. If one of the error functions were clearly superior to the other function in all situations, there would be no need for articles like this one. The consensus opinion among my immediate colleagues is that it's best to try mean cross entropy error first; then, if you have time, try mean squared error.
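For reference, the two measures are commonly written, for a single training example with target $y$ and network output $a^{L}$, as
$$
\begin{align}
C_{\text{MSE}} &= \frac{1}{2}\sum_{j} (y_j - a_{j}^{L})^2 \\
C_{\text{CE}} &= -\sum_{j} \left[ y_j \ln a_{j}^{L} + (1-y_j)\ln(1-a_{j}^{L}) \right]
\end{align}
$$
The derivation below uses a generic cost $C$, so it applies to either choice.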
The goal of backpropagation is to compute the partial derivatives $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$ of the cost function $C$ with respect to every weight and bias in the network. For backpropagation to work, we need to make two assumptions about the form of the cost function.
The first assumption is that the cost function can be written as an average over cost functions for individual training examples.
$$
C = \frac{1}{M}\sum_{m=1}^{M}C_m
$$
where $M$ is the number of training examples and $C_m$ is the cost for the individual training example $x^{(m)}$.
What backpropagation actually lets us do is compute the partial derivatives $\frac{\partial C_m}{\partial w_{jk}^{l}}$ and $\frac{\partial C_m}{\partial b_{j}^{l}}$ for a single training example; we then recover $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$ by averaging over the training examples.
The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network.
The cost $C_m$ for a single training example can therefore be written as a function $C_m(a^{L})$ of the output activations $a^{L}$ alone, with the desired output $y$ treated as a fixed parameter. For example, the quadratic cost for a single example, $C_m = \frac{1}{2}\|y - a^{L}\|^{2}$, has this form.
Before we go into the detailed calculation of backpropagation, we need to introduce some background from multivariable calculus.
Given the following multivariable functions $y$ and $x_i$: $$ \begin{align} y &= y(x_1,\cdots,x_i,\cdots,x_n) \\ x_i &= x_i(t_1,\cdots,t_j,\cdots,t_m); \quad 1 \le i \le n \end{align} $$
Based on the Chain rule for partial derivatives of multivariable functions, we can get $$ \frac{\partial y}{\partial t_j} = \sum_{i=1}^{n} \frac{\partial y}{\partial x_i} \times \frac{\partial x_i}{\partial t_j}; 1 \le j \le m $$
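As a quick sanity check of the rule, take $y = x_1 x_2$ with $x_1 = t^2$ and $x_2 = \sin t$ (so $n = 2$ and $m = 1$): $$ \frac{\partial y}{\partial t} = \frac{\partial y}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial y}{\partial x_2}\frac{\partial x_2}{\partial t} = x_2 \cdot 2t + x_1 \cdot \cos t = 2t\sin t + t^2\cos t, $$ which matches differentiating $y = t^2 \sin t$ directly.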
If we slice a neural network and keep only the layers from $l$ to the output layer $L$, the cost $C$ can be viewed as a function of the weighted inputs of layer $l$: $$ C = f^{l}(z_{1}^{l}, \cdots, z_{j}^{l}, \cdots, z_{n^{l}}^{l}) $$
Note:
$C$ corresponds to $y$ in the previous statement of the chain rule for multivariable functions, and $(z_{1}^{l}, \cdots, z_{j}^{l}, \cdots, z_{n^l}^{l})$ corresponds to $(x_1,\cdots,x_i,\cdots,x_n)$.
We can also rewrite $z_{j}^{l}$ as a function of the weighted inputs of the previous layer:
$$
\begin{align}
z_{j}^{l} &= \sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1}+b_j^{l} \\
&= z_{j}^{l}(a_{1}^{l-1}, \cdots, a_{k}^{l-1}, \cdots, a_{n^{l-1}}^{l-1}) \\
&= \sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times \sigma(z_{k}^{l-1})+b_j^{l} \\
&= z_{j}^{l}(\sigma(z_{1}^{l-1}), \cdots, \sigma(z_{k}^{l-1}), \cdots, \sigma(z_{n^{l-1}}^{l-1})) \\
&= s_j^{l}(z_{1}^{l-1}, \cdots, z_{k}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})
\end{align}
$$
where $s_j^{l}$ denotes $z_{j}^{l}$ viewed as a function of the weighted inputs $(z_{1}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})$ of the $(l-1)$th layer.
Note:
$z_{j}^{l}$ corresponds to $x_i$ in the previous statement of the chain rule for multivariable functions, and $(z_{1}^{l-1}, \cdots, z_{k}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})$ corresponds to $(t_1,\cdots,t_j,\cdots,t_m)$.
We will show how to apply this chain rule to the backpropagation algorithm shortly.
Now we introduce a linear-algebraic operation used in backpropagation: the Hadamard product. Suppose $s$ and $t$ are two vectors of the same dimension. Then $s \odot t$ denotes the elementwise product of the two vectors, with components $(s \odot t)_j = s_j \times t_j$.
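In NumPy, as in the other sketches in this article, the Hadamard product of two arrays of the same shape is simply the `*` operator:

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])

# Elementwise (Hadamard) product: (s ⊙ t)_j = s_j * t_j
print(s * t)  # [ 4. 10. 18.]
```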
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$.
To get there, we define the error of neuron $j$ in layer $l$ as $$ \delta_{j}^{l} \equiv \frac{\partial C}{\partial z_{j}^{l}} $$
If we get $\delta_{j}^{l}$, the partial derivative of $C$ with respect to the bias follows immediately, since $\frac{\partial z_{j}^{l}}{\partial b_{j}^{l}} = 1$: $$ \frac{\partial C}{\partial b_{j}^{l}} = \frac{\partial C}{\partial z_{j}^{l}} \times \frac{\partial z_{j}^{l}}{\partial b_{j}^{l}} = \delta_{j}^{l} $$
The above equation can be written in the vectorized form, and we define the derivative of $C$ with respect to the bias vector $b^{l}$ as $$ \nabla_{b^{l}}C = \delta^{l} $$
We can also easily calculate $$ \begin{align} \frac{\partial C}{\partial w_{jk}^{l}} &= \frac{\partial C}{\partial z_{j}^{l}} \times \frac{\partial z_{j}^{l}}{\partial w_{jk}^{l}} \\ &= \delta_{j}^{l} \times \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1}+b_j^{l})}{\partial w_{jk}^{l}} \\ &= \delta_{j}^{l} \times a_{k}^{l-1} \end{align} $$
The above equation can be written in the vectorized form, and we define the derivative of $C$ with respect to the weight matrix $w^{l}$ as $$ \nabla_{w^{l}}C = \delta^{l} \times (a^{l-1})^{T} $$
$w^{l}$ and $\nabla_{w^{l}}C$ are both $n^{l} \times n^{l-1}$ matrices. $\delta^{l}$ is a column vector, i.e. an $n^{l} \times 1$ matrix, and $(a^{l-1})^{T}$ is a $1 \times n^{l-1}$ matrix.
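A small sketch of these shapes in NumPy (the values here are made up purely for illustration):

```python
import numpy as np

delta = np.array([[0.1], [0.3]])           # delta^l, an n^l x 1 column vector (n^l = 2)
a_prev = np.array([[0.2], [0.5], [0.7]])   # a^{l-1}, an n^{l-1} x 1 column vector (n^{l-1} = 3)

nabla_w = np.dot(delta, a_prev.T)  # delta^l (a^{l-1})^T, shape (2, 3), same as w^l
nabla_b = delta                    # gradient with respect to the bias vector b^l
print(nabla_w.shape)               # (2, 3)
```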
When the activation $a_{k}^{l-1}$ is small, $a_{k}^{l-1} \approx 0$, the gradient term $\frac{\partial C}{\partial w_{jk}^{l}}$ will also tend to be small. In this case we say the weight learns slowly, meaning that it doesn't change much during gradient descent. One consequence, then, is that weights attached to low-activation neurons learn slowly.
The key issue is how to calculate $\delta_{j}^{l}$ for every layer.
First, we can easily calculate the error in the output layer $L$: $$ \begin{align} \delta_{j}^{L} &= \frac{\partial C}{\partial z_{j}^{L}} \\ &= \frac{\partial C}{\partial a_{j}^{L}} \times \frac{\partial a_{j}^{L}}{\partial z_{j}^{L}} \\ &= \frac{\partial C}{\partial a_{j}^{L}} \times \sigma^{\prime}(z_{j}^{L}) \end{align} $$
The above equation can be rewritten in the vectorized form: $$ \delta^{L} = \nabla_{a^{L}}C \odot \sigma^{\prime}(z^{L}) $$
Next we express the error in a layer in terms of the error in the layer after it. Using $C = f^{l}(z_{1}^{l}, \cdots, z_{n^{l}}^{l})$ and $z_{j}^{l} = s_{j}^{l}(z_{1}^{l-1}, \cdots, z_{n^{l-1}}^{l-1})$ from the chain-rule setup above, we can start from layer $l$ and compute
$$
\begin{align}
\delta_{k}^{l-1} &= \frac{\partial C}{\partial z_{k}^{l-1}} \\
&= \sum_{j=1}^{n^{l}} \frac{\partial f^{l}}{\partial z_{j}^{l}} \times \frac{\partial s_j^{l}}{\partial z_{k}^{l-1}} \\
&= \sum_{j=1}^{n^{l}} \frac{\partial C}{\partial z_{j}^{l}} \times \frac{\partial z_{j}^{l}}{\partial z_{k}^{l-1}}
\end{align}
$$
We have $$ \begin{align} \frac{\partial z_{j}^{l}}{\partial z_{k}^{l-1}} &= \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times a_{k}^{l-1}+b_j^{l})}{\partial z_{k}^{l-1}} \\ &= \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times \sigma(z_{k}^{l-1})+b_j^{l})}{\partial z_{k}^{l-1}} \\ &= \frac{\partial (\sum_{k=1}^{n^{l-1}} w_{jk}^{l} \times \sigma(z_{k}^{l-1})+b_j^{l})}{\partial \sigma(z_{k}^{l-1})} \times \frac{\partial \sigma(z_{k}^{l-1})}{\partial z_{k}^{l-1}} \\ &= w_{jk}^{l} \times \sigma^{\prime}(z_{k}^{l-1}) \end{align} $$
So $$ \delta_{k}^{l-1} = \sum_{j=1}^{n^{l}} \delta_{j}^{l} \times w_{jk}^{l} \times \sigma^{\prime}(z_{k}^{l-1}) $$ The above equation can be written in the vectorized form: $$ \delta^{l-1} = (w^{l})^{T}\delta^{l} \odot \sigma^{\prime}(z^{l-1}) $$
Note:
From the above equation, we can also get $\delta^{l} = (w^{l+1})^{T}\delta^{l+1} \odot \sigma^{\prime}(z^{l})$. Consider the term $\sigma^{\prime}(z_{j}^{l})$: if the activation function $\sigma$ is the sigmoid, it becomes very flat when $\sigma(z_{j}^{l})$ is approximately $0$ or $1$. When this occurs we have $\sigma^{\prime}(z_{j}^{l}) \approx 0$ and hence $\delta_{j}^{l} \approx 0$. Since $\frac{\partial C}{\partial w_{jk}^{l}} = \delta_{j}^{l} \times a_{k}^{l-1}$, the lesson is that a weight into neuron $j$ of layer $l$ will learn slowly if that neuron's activation $\sigma(z_{j}^{l})$ is either low ($\approx 0$) or high ($\approx 1$). In this case it's common to say the neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold for the biases as well.
We can summarize the above results in the vectorized form as follows $$ \begin{align} \delta^{L} &= \nabla_{a^{L}}C \odot \sigma^{\prime}(z^{L}) \\ \delta^{l-1} &= (w^{l})^{T}\delta^{l} \odot \sigma^{\prime}(z^{l-1}) \\ \nabla_{b^{l}}C &= \delta^{l} \\ \nabla_{w^{l}}C &= \delta^{l} \times (a^{l-1})^{T} \end{align} $$
The previous section derived the equations of backpropagation; let's now explicitly write them out in the form of an algorithm.
First, we start with only a single training example $x^{(m)}$, with per-example cost $C_m$:
- **Input vector $x^{(m)}$**: Assign $x^{(m)}$ to the activation $a^1$ for the input layer.
- **Feedforward**: For each layer $l=2, 3, \cdots, L$ compute $z^{l}=w^{l}a^{l-1}+b^{l}$ and $a^{l} = \sigma(z^{l})$.
- **Error in the last layer $\delta^{L}$**: Compute the vector $\delta^{L} = \nabla_{a^{L}}C_m \odot \sigma^{\prime}(z^{L})$.
- **Backpropagate the error to previous layers**: For each $l=L-1, L-2, \cdots, 2$ compute $\delta^{l} = (w^{l+1})^{T}\delta^{l+1} \odot \sigma^{\prime}(z^{l})$.
- **Output**: The gradient of the cost function is given by $\nabla_{b^{l}}C_m = \delta^{l}$ and $\nabla_{w^{l}}C_m = \delta^{l} \times (a^{l-1})^{T}$.
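The steps above translate almost line by line into NumPy. The sketch below is our own, and it assumes sigmoid activations and the quadratic cost $C_m = \frac{1}{2}\|a^{L}-y\|^{2}$ (so that $\nabla_{a^{L}}C_m = a^{L}-y$); `weights[i]` and `biases[i]` hold $w^{l}$ and $b^{l}$ for $l = i+2$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (nabla_b, nabla_w), the gradients of C_m for one example.

    Assumes the quadratic cost, so nabla_a C_m = a^L - y.
    """
    # Feedforward: store all weighted inputs z^l and activations a^l.
    activation = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # Error in the last layer: delta^L = (a^L - y) ⊙ sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].T)

    # Backpropagate: delta^l = (w^{l+1})^T delta^{l+1} ⊙ sigma'(z^l).
    for i in range(2, len(weights) + 1):
        delta = np.dot(weights[-i + 1].T, delta) * sigmoid_prime(zs[-i])
        nabla_b[-i] = delta
        nabla_w[-i] = np.dot(delta, activations[-i - 1].T)
    return nabla_b, nabla_w
```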
In practice, it's common to combine backpropagation with stochastic gradient descent over a mini-batch of $M$ training examples:
- **Input**: $M$ training examples of a mini-batch.
- **For each training example $x^{(m)}$**: Assign $x^{(m)}$ to the activation $a^1$ for the input layer, then perform the following steps:
    - **Feedforward**: For each layer $l=2, 3, \cdots, L$ compute $z^{l}=w^{l}a^{l-1}+b^{l}$ and $a^{l} = \sigma(z^{l})$.
    - **Error in the last layer $\delta^{L}$**: Compute the vector $\delta^{L} = \nabla_{a^{L}}C_m \odot \sigma^{\prime}(z^{L})$.
    - **Backpropagate the error to previous layers**: For each $l=L-1, L-2, \cdots, 2$ compute $\delta^{l} = (w^{l+1})^{T}\delta^{l+1} \odot \sigma^{\prime}(z^{l})$.
- **Gradient descent**: For each $l=L, L-1, \cdots, 2$ update the weights and biases according to the rule $$ \begin{align} w^l &\to w^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{w^{l}}C_m \\ b^l &\to b^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{b^{l}}C_m \end{align} $$ where $\eta$ is the learning rate.
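A sketch of the corresponding mini-batch update, reusing the `backprop` routine sketched above (again, our own naming, not a reference implementation):

```python
import numpy as np

def update_mini_batch(mini_batch, weights, biases, eta):
    """One mini-batch step of gradient descent.

    `mini_batch` is a list of (x, y) pairs of length M; `eta` is the
    learning rate.
    """
    # Accumulate the per-example gradients, layer by layer.
    sum_nabla_b = [np.zeros(b.shape) for b in biases]
    sum_nabla_w = [np.zeros(w.shape) for w in weights]
    for x, y in mini_batch:
        nabla_b, nabla_w = backprop(x, y, weights, biases)
        sum_nabla_b = [sb + nb for sb, nb in zip(sum_nabla_b, nabla_b)]
        sum_nabla_w = [sw + nw for sw, nw in zip(sum_nabla_w, nabla_w)]

    # w^l -> w^l - (eta/M) * sum_m nabla_w C_m, and likewise for b^l.
    M = len(mini_batch)
    weights = [w - (eta / M) * nw for w, nw in zip(weights, sum_nabla_w)]
    biases = [b - (eta / M) * nb for b, nb in zip(biases, sum_nabla_b)]
    return weights, biases
```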
Of course, to implement stochastic gradient descent in practice you also need an outer loop generating the mini-batches of training examples, and an outer loop stepping through multiple epochs of training. We've omitted those for simplicity.
To appreciate why backpropagation is fast, consider a more obvious approach to computing the gradient. You decide to regard the cost as a function of the weights alone, $C = C(w)$ (the biases can be treated the same way), and to estimate each partial derivative numerically: $$ \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_{j})-C(w)}{\epsilon} $$ where $\epsilon > 0$ is a small positive number and $e_{j}$ is the unit vector in the direction of the $j$th weight.
Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to evaluate $C(w+\epsilon e_{j})$, and each evaluation requires forward passes through the network for every training example. Computing the gradient this way therefore requires on the order of a million separate evaluations of the cost.
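As an illustration, numerically estimating just one partial derivative looks something like this sketch, where `cost` is a hypothetical function that evaluates $C(w)$ with a full pass over the training data:

```python
import numpy as np

def numerical_partial(cost, w, j, eps=1e-5):
    """Estimate dC/dw_j via the finite difference (C(w + eps*e_j) - C(w)) / eps.

    Every call to `cost` requires forward passes over the whole training set,
    so estimating the gradient of a million weights this way takes roughly a
    million (plus one) such evaluations.
    """
    w_shifted = w.copy()
    w_shifted[j] += eps              # perturb only the j-th weight
    return (cost(w_shifted) - cost(w)) / eps
```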
What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\frac{\partial C}{\partial w_{jk}^{l}}$ and $\frac{\partial C}{\partial b_{j}^{l}}$ using just one forward pass through the network, followed by one backward pass.
So even though backpropagation appears superficially more complex than the obvious approach above, it's actually much, much faster!
How the backpropagation algorithm works (http://neuralnetworksanddeeplearning.com/chap2.html)
Neural Network Cross Entropy Error (https://visualstudiomagazine.com/Articles/2014/04/01/Neural-Network-Cross-Entropy-Error.aspx?p=1)