Multivariate Chain Rule

Given a composition of functions:

$$L = f(u), \qquad u = g(\theta)$$

Where:

  • $\theta$ is your parameter (can be scalar, vector, matrix, tensor - any shape)
  • $g$ is some intermediate function with output $u = g(\theta)$ (can output scalar, vector, matrix, tensor - any shape)
  • $L$ is the final output (can be scalar, vector, matrix, tensor - any shape)

The universal multivariable chain rule is given as:

$$\frac{\partial L}{\partial \theta_i} = \sum_j \frac{\partial L}{\partial u_j}\,\frac{\partial u_j}{\partial \theta_i}$$

To translate: In order to determine how $L$ changes with an element $\theta_i$, we need to sum up the contributions of all the elements $u_j$ of $u$ that depend on $\theta_i$.

Why “Sum over all indices of $u$”?

Because $u$ is the intermediate quantity that connects $\theta$ to $L$:

  • Each element $u_j$ depends on $\theta_i$
  • $L$ depends on each element $u_j$
  • To find how $L$ depends on $\theta_i$, you add up contributions from all the $u_j$ (verified numerically in the sketch below)
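
As a quick sanity check, here is a minimal numeric sketch. The functions `g` and `f` below are arbitrary hypothetical choices, not anything fixed by the rule: the point is that explicitly summing the per-component contributions of $u$ reproduces what PyTorch's autograd computes.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(3, requires_grad=True)

# Hypothetical functions chosen only for the demo.
def g(t):                      # u = g(theta): R^3 -> R^2
    return torch.stack([t[0] * t[1], t[1] + t[2] ** 2])

def f(u):                      # L = f(u): R^2 -> scalar
    return (u ** 2).sum()

u = g(theta)
L = f(u)

# dL/du_j, keeping the graph alive for the per-component grads below
dL_du = torch.autograd.grad(L, u, retain_graph=True)[0]

# Chain rule by hand: dL/dtheta_i = sum_j (dL/du_j) * (du_j/dtheta_i)
manual = torch.zeros_like(theta)
for j in range(u.numel()):
    du_j = torch.autograd.grad(u[j], theta, retain_graph=True)[0]
    manual = manual + dL_du[j] * du_j

auto = torch.autograd.grad(L, theta)[0]  # what autograd computes directly
print(torch.allclose(manual, auto))      # True
```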

Examples

Both are vectors

  • $\theta \in \mathbb{R}^n$ (parameter vector)
  • $u = g(\theta) \in \mathbb{R}^m$ (intermediate vector)
  • $L = f(u) \in \mathbb{R}$ (final scalar)

$$\frac{\partial L}{\partial \theta_i} = \sum_{j=1}^{m} \frac{\partial L}{\partial u_j}\,\frac{\partial u_j}{\partial \theta_i}$$

Sum over all $m$ components of the intermediate quantity $u$.
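
A minimal sketch of this case, assuming the hypothetical choices $u = A\theta$ and $L = \sum_j u_j^2$ (neither is prescribed above): the double loop writes the sum out literally, then collapses to the familiar vectorized form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2
A = rng.standard_normal((m, n))
theta = rng.standard_normal(n)

u = A @ theta  # intermediate vector, u_j = sum_i A_ji * theta_i
# L = sum_j u_j^2  =>  dL/du_j = 2 u_j  and  du_j/dtheta_i = A_ji

# Component form: dL/dtheta_i = sum_j (dL/du_j)(du_j/dtheta_i)
grad = np.zeros(n)
for i in range(n):
    for j in range(m):
        grad[i] += (2 * u[j]) * A[j, i]

# The loop collapses to the vectorized form 2 * A.T @ u
print(np.allclose(grad, 2 * A.T @ u))  # True
```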

Parameter is matrix, intermediate is vector

  • $W \in \mathbb{R}^{m \times n}$ (parameter matrix)
  • $u = g(W) \in \mathbb{R}^{K}$ (intermediate vector)
  • $L = f(u) \in \mathbb{R}$ (final scalar)

$$\frac{\partial L}{\partial W_{ij}} = \sum_{k=1}^{K} \frac{\partial L}{\partial u_k}\,\frac{\partial u_k}{\partial W_{ij}}$$

Sum over all $K$ components of the intermediate quantity $u$.
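
A sketch of this case, again with hypothetical choices $u = Wx$ and $L = \sum_k u_k^2$: since $u_k$ only involves row $k$ of $W$, the Kronecker delta in $\partial u_k / \partial W_{ij}$ kills every term of the sum except $k = i$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 3
W = rng.standard_normal((m, n))  # parameter matrix
x = rng.standard_normal(n)       # fixed input vector

u = W @ x  # intermediate vector, u_k = sum_j W_kj * x_j
# L = sum_k u_k^2  =>  dL/du_k = 2 u_k  and  du_k/dW_ij = (k == i) * x_j

# Component form: dL/dW_ij = sum_k (dL/du_k)(du_k/dW_ij)
grad = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for k in range(m):
            du_k_dW_ij = x[j] if k == i else 0.0
            grad[i, j] += (2 * u[k]) * du_k_dW_ij

# Only the k == i term survives, giving the outer product 2 * np.outer(u, x)
print(np.allclose(grad, 2 * np.outer(u, x)))  # True
```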

Parameter is matrix, intermediate is matrix

  • $W \in \mathbb{R}^{m \times n}$ (parameter matrix)
  • $U = g(W) \in \mathbb{R}^{p \times q}$ (intermediate matrix)
  • $L = f(U) \in \mathbb{R}$ (final scalar)

$$\frac{\partial L}{\partial W_{ij}} = \sum_{a=1}^{p} \sum_{b=1}^{q} \frac{\partial L}{\partial U_{ab}}\,\frac{\partial U_{ab}}{\partial W_{ij}}$$

Sum over all $p \times q$ components of the intermediate quantity $U$.
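
Same idea with a matrix intermediate, assuming the hypothetical choices $U = WX$ and $L = \sum_{a,b} U_{ab}^2$: the double sum over $(a, b)$ collapses because $U_{ab}$ only depends on row $a$ of $W$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, q = 2, 3, 4
W = rng.standard_normal((m, n))  # parameter matrix
X = rng.standard_normal((n, q))  # fixed input matrix

U = W @ X  # intermediate matrix, U_ab = sum_c W_ac * X_cb
# L = sum_ab U_ab^2  =>  dL/dU_ab = 2 U_ab  and  dU_ab/dW_ij = (a == i) * X_jb

# Component form: dL/dW_ij = sum_a sum_b (dL/dU_ab)(dU_ab/dW_ij)
grad = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        for a in range(m):
            for b in range(q):
                dU_ab_dW_ij = X[j, b] if a == i else 0.0
                grad[i, j] += (2 * U[a, b]) * dU_ab_dW_ij

# Only a == i survives, collapsing to 2 * U @ X.T
print(np.allclose(grad, 2 * U @ X.T))  # True
```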

How does the chain rule work here?

Suppose we have:

$$L = f(c), \qquad c = h(b), \qquad b = g(a), \qquad a = e(\theta)$$

Where:

  • $\theta$ is your parameter
  • $a = e(\theta)$ is the first intermediate quantity
  • $b = g(a)$ is the second intermediate quantity
  • $c = h(b)$ is the third intermediate quantity
  • $L = f(c)$ is the final scalar

We apply the chain rule repeatedly, but as separate summations, seeded by the direct derivative $\partial L / \partial c$:

Step 1: From $c$ to $b$

$$\frac{\partial L}{\partial b_k} = \sum_{l} \frac{\partial L}{\partial c_l}\,\frac{\partial c_l}{\partial b_k}$$

Step 2: From $b$ to $a$

$$\frac{\partial L}{\partial a_j} = \sum_{k} \frac{\partial L}{\partial b_k}\,\frac{\partial b_k}{\partial a_j}$$

Step 3: From $a$ to $\theta$

$$\frac{\partial L}{\partial \theta_i} = \sum_{j} \frac{\partial L}{\partial a_j}\,\frac{\partial a_j}{\partial \theta_i}$$

If you substitute Step 1 into Step 2 into Step 3:

$$\frac{\partial L}{\partial \theta_i} = \sum_{j}\sum_{k}\sum_{l} \frac{\partial L}{\partial c_l}\,\frac{\partial c_l}{\partial b_k}\,\frac{\partial b_k}{\partial a_j}\,\frac{\partial a_j}{\partial \theta_i}$$
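
To see that the three separate summations and the substituted triple sum agree, here is a sketch where every map is linear (a hypothetical setup, chosen so each Jacobian is just a matrix). Working backward one summation at a time is exactly what backpropagation does; the `einsum` line is the fully substituted expression.

```python
import numpy as np

rng = np.random.default_rng(3)
# Linear maps, so da/dtheta = E, db/da = G, dc/db = H (hypothetical setup)
E, G, H = (rng.standard_normal((4, 4)) for _ in range(3))
theta = rng.standard_normal(4)

a = E @ theta        # a = e(theta)
b = G @ a            # b = g(a)
c = H @ b            # c = h(b)
L = (c ** 2).sum()   # final scalar, so dL/dc = 2c

dL_dc = 2 * c            # seed: direct derivative of f
dL_db = H.T @ dL_dc      # Step 1: sum over l
dL_da = G.T @ dL_db      # Step 2: sum over k
dL_dtheta = E.T @ dL_da  # Step 3: sum over j

# The fully substituted triple sum, written out explicitly
brute = np.einsum("l,lk,kj,ji->i", 2 * c, H, G, E)
print(np.allclose(dL_dtheta, brute))  # True
```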

Of course, deriving gradients like this by hand is a nightmare, which is why we need an easier way to do it. For deep learning, we can follow a derive-as-you-go pattern. In computing, this is usually done with some graph structure like the PyTorch Computational Graph.
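
For example, in this minimal PyTorch sketch (the particular ops are arbitrary choices), each operation is recorded as a node in the computational graph as it runs, and `backward()` replays the chain rule through that graph so none of the summations above are derived by hand:

```python
import torch

theta = torch.randn(4, requires_grad=True)

# Each op below is recorded as a node in PyTorch's computational graph.
a = torch.tanh(theta)      # hypothetical e(theta)
b = a @ torch.randn(4, 4)  # hypothetical g(a)
c = torch.relu(b)          # hypothetical h(b)
L = (c ** 2).sum()         # final scalar f(c)

L.backward()       # walks the graph backward, applying the chain rule per node
print(theta.grad)  # dL/dtheta, with no hand derivation needed
```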

Once you get to a certain point of writing out the component representation of the gradients for the given functions, that’s when you hit a wall of “how do I actually simplify this?”. See The Art of Simplifying Multivariate Gradients