Following the multivariate chain rule, we end up with a long "chain" of gradients that we have to determine.
Working this out by hand can be tiresome, which is why people follow a Derive-as-you-Go pattern.
Setup
We saw previously that the chain rule in multivariate calculus involves some annoying summations and component-wise analysis, which makes it hard to derive gradients in one shot.
Suppose we have $L = f_3(f_2(f_1(\theta)))$, where:
- $\theta$ is your parameter
- $f_1$ is the first intermediate quantity
- $f_2$ is the second intermediate quantity
- $f_3$ is the third intermediate quantity
- $L$ is the final scalar
We apply the chain rule repeatedly, but as separate summations.

Step 1: From $\theta$ to $f_2$:
$$\frac{\partial f_{2,j}}{\partial \theta} = \sum_k \frac{\partial f_{2,j}}{\partial f_{1,k}} \frac{\partial f_{1,k}}{\partial \theta}$$

Step 2: From $f_2$ to $f_3$:
$$\frac{\partial f_{3,i}}{\partial \theta} = \sum_j \frac{\partial f_{3,i}}{\partial f_{2,j}} \frac{\partial f_{2,j}}{\partial \theta}$$

Step 3: From $f_3$ to $L$:
$$\frac{\partial L}{\partial \theta} = \sum_i \frac{\partial L}{\partial f_{3,i}} \frac{\partial f_{3,i}}{\partial \theta}$$

If you substitute Step 1 into Step 2 into Step 3, you get one big nested sum:
$$\frac{\partial L}{\partial \theta} = \sum_i \sum_j \sum_k \frac{\partial L}{\partial f_{3,i}} \frac{\partial f_{3,i}}{\partial f_{2,j}} \frac{\partial f_{2,j}}{\partial f_{1,k}} \frac{\partial f_{1,k}}{\partial \theta}$$
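To see that the nested sum is just a product of Jacobians, here is a small numerical sketch; the functions `f1`, `f2`, `f3` below are invented for illustration. It multiplies the per-step Jacobians together and checks the result against a finite-difference estimate of $\frac{\partial L}{\partial \theta}$.

```python
import numpy as np

theta = np.array([0.3, -1.2])                 # parameter vector, shape (2,)

def f1(t): return np.tanh(t)                                     # first intermediate, shape (2,)
def f2(a): return np.array([a[0] * a[1], a[0] + a[1], a[1]**2])  # second intermediate, shape (3,)
def f3(b): return b * b                                          # third intermediate, shape (3,)
def loss(c): return c.sum()                                      # final scalar L

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian J[i, j] = d f(x)_i / d x_j (forward differences)."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - y0) / eps
    return J

a = f1(theta)
b = f2(a)
c = f3(b)

# Chain rule as separate summations: each matrix product sums over one index.
# dL/dtheta = dL/df3 @ df3/df2 @ df2/df1 @ df1/dtheta
dL_dc = np.ones(c.size)                       # dL/df3 for a sum loss
grad = dL_dc @ jacobian(f3, b) @ jacobian(f2, a) @ jacobian(f1, theta)

# Finite-difference check of the whole composition L(theta)
full = lambda t: loss(f3(f2(f1(t))))
eps = 1e-6
fd = np.array([(full(theta + e) - full(theta)) / eps for e in np.eye(2) * eps])
print(grad)
print(fd)    # should agree closely with grad
```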
How to Computational Graph
The Key Insight
Instead of deriving the full nested sum all at once, we can build a graph structure where:
- Nodes represent values ($\theta$, $f_1$, $f_2$, $f_3$, $L$)
- Edges represent dependencies (how outputs depend on inputs)
Then we compute gradients by traversing the graph backwards.
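A minimal sketch of this idea in Python (the `Node` class and `backward` function are hypothetical, not any particular library's API):

```python
# Each node stores its value, the nodes it depends on (edges), and the local
# gradient d(node)/d(parent) for each edge. Backward walks the graph in reverse.
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents            # edges: which nodes this one depends on
        self.local_grads = local_grads    # d(self)/d(parent), one per parent
        self.grad = 0.0                   # accumulated dL/d(self)

def backward(output):
    # Visit nodes in reverse topological order, so each node's upstream gradient
    # is complete before it gets pushed further back.
    order, seen = [], set()
    def topo(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                topo(p)
            order.append(n)
    topo(output)
    output.grad = 1.0                     # seed: dL/dL = 1
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local   # chain rule; += sums over paths

# Tiny usage: L = (2*theta) + (2*theta) reuses f1, so dL/dtheta = 2 + 2 = 4
theta = Node(3.0)
f1 = Node(2 * theta.value, parents=(theta,), local_grads=(2.0,))
L = Node(f1.value + f1.value, parents=(f1, f1), local_grads=(1.0, 1.0))
backward(L)
print(theta.grad)   # 4.0
```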
Doing it by hand
Given a network architecture (a worked numerical sketch follows this list):
- Sketch the forward pass
- Begin at the loss
- Compute the layer gradients
- Compute the local gradients
- Apply the chain rule
- Pass the gradient with respect to the layer's input back to the previous layer
- Repeat until you reach the parameters
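Here is a worked sketch of that recipe for a tiny made-up network (linear → ReLU → linear → squared error); the names `W1`, `W2`, `x`, `y` are illustrative, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # input
y = np.array([1.0])                    # target
W1 = rng.normal(size=(4, 3))           # first layer weights
W2 = rng.normal(size=(1, 4))           # second layer weights

# 1. Sketch the forward pass, storing every intermediate
z1 = W1 @ x                            # f1
a1 = np.maximum(z1, 0.0)               # f2 (ReLU)
z2 = W2 @ a1                           # f3
L = 0.5 * np.sum((z2 - y) ** 2)        # loss

# 2. Begin at the loss; 3-4. compute local gradients and apply the chain rule
dL_dz2 = z2 - y                        # dL/df3
dL_dW2 = np.outer(dL_dz2, a1)          # layer gradient for W2
dL_da1 = W2.T @ dL_dz2                 # 5. gradient passed back to the previous layer
dL_dz1 = dL_da1 * (z1 > 0)             # local gradient of ReLU
dL_dW1 = np.outer(dL_dz1, x)           # layer gradient for W1

# Finite-difference spot check on one entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
Lp = 0.5 * np.sum((W2 @ np.maximum(W1p @ x, 0.0) - y) ** 2)
print(dL_dW1[0, 0], (Lp - L) / eps)    # the two numbers should agree closely
```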
Forward Pass: Build the Graph
As we compute the forward pass, we:
- Store each intermediate value
- Record the operations used to create them
- Build edges showing what depends on what
Example:
θ → f₁ → f₂ → f₃ → L
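One way to picture this is as a "tape": as each operation runs, we record the value it produced and its local gradient. The particular chain below (square → sin → scale → add) is just an invented example to show the bookkeeping.

```python
import math

# A minimal sketch of building the graph during the forward pass. The operations
# here (square, sin, scale, add) are illustrative, not from the original text.
def forward_with_tape(theta):
    tape = []                                # the graph, recorded in execution order
    f1 = theta ** 2
    tape.append(("square", f1, 2 * theta))   # (operation, value, local gradient df1/dtheta)
    f2 = math.sin(f1)
    tape.append(("sin", f2, math.cos(f1)))   # df2/df1
    f3 = 3 * f2
    tape.append(("scale", f3, 3.0))          # df3/df2
    L = f3 + 1
    tape.append(("add_one", L, 1.0))         # dL/df3
    return L, tape

L, tape = forward_with_tape(0.5)
for op, value, local_grad in tape:
    print(op, value, local_grad)
```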
Backward Pass: Traverse the Graph
Starting from $L$, we propagate gradients backwards through each edge:
Step 1: Initialize: $\frac{\partial L}{\partial L} = 1$
Step 2: From $L$ to $f_3$: compute $\frac{\partial L}{\partial f_3}$
Step 3: From $f_3$ to $f_2$ (using the chain rule): $\frac{\partial L}{\partial f_2} = \frac{\partial L}{\partial f_3} \frac{\partial f_3}{\partial f_2}$
Step 4: From $f_2$ to $f_1$: $\frac{\partial L}{\partial f_1} = \frac{\partial L}{\partial f_2} \frac{\partial f_2}{\partial f_1}$
Step 5: From $f_1$ to $\theta$: $\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial f_1} \frac{\partial f_1}{\partial \theta}$
At each step, we:
- Receive the gradient from the next layer: $\frac{\partial L}{\partial f_{i+1}}$
- Compute the local gradient: $\frac{\partial f_{i+1}}{\partial f_i}$
- Apply the chain rule: $\frac{\partial L}{\partial f_i} = \frac{\partial L}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial f_i}$
- Pass $\frac{\partial L}{\partial f_i}$ to the previous layer
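As a concrete sketch of that per-edge recipe, here is a toy chain written out by hand. The chain $\theta \to f_1 = \theta^2 \to f_2 = \sin(f_1) \to f_3 = 3 f_2 \to L = f_3 + 1$ is an invented example; the loop is the recipe above.

```python
import math

theta = 0.5
f1 = theta ** 2            # forward pass, storing intermediates
f2 = math.sin(f1)
f3 = 3 * f2
L = f3 + 1

# Local gradients along each edge, ordered from L back to theta:
# dL/df3, df3/df2, df2/df1, df1/dtheta
local_grads = [1.0, 3.0, math.cos(f1), 2 * theta]

upstream = 1.0                     # Step 1: initialize with dL/dL = 1
for local in local_grads:
    upstream = upstream * local    # receive upstream gradient, apply chain rule, pass it back

# upstream is now dL/dtheta; compare with the closed form 6*theta*cos(theta^2)
print(upstream, 6 * theta * math.cos(theta ** 2))
```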
Branching: When Multiple Paths Exist
If a node has multiple children (is used in multiple places), we sum the gradients from all paths:
        ┌→ f₂ →┐
θ → f₁ ─┤      ├→ L
        └→ f₃ →┘
This automatically handles the summation in the chain rule!
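PyTorch applies the same rule: when a value feeds several branches, the gradients from each branch are accumulated. The two-path function below is an invented example.

```python
import math
import torch

theta = torch.tensor(2.0, requires_grad=True)
f1 = theta * 3             # f1 = 3*theta, used by two children
f2 = f1 ** 2               # path 1: f2 = 9*theta^2
f3 = torch.sin(f1)         # path 2: f3 = sin(3*theta)
L = f2 + f3
L.backward()

# dL/dtheta = 18*theta + 3*cos(3*theta): contributions from both paths are summed
manual = 18 * 2.0 + 3 * math.cos(6.0)
print(theta.grad.item(), manual)
```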
Node Types in the Graph
1. Leaf Nodes (Parameters)
- Nodes like $\theta$ (weights, biases)
- Require gradients - these are what we want to update
- `.requires_grad = True` in PyTorch
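For example, a minimal PyTorch snippet (the tensor names are illustrative):

```python
import torch

theta = torch.randn(3, requires_grad=True)   # leaf node: a parameter we want to update
L = (theta ** 2).sum()                       # forward pass builds the graph
L.backward()                                 # backward pass fills in theta.grad

print(theta.is_leaf)    # True
print(theta.grad)       # dL/dtheta = 2*theta
```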
2. Intermediate Nodes (Activations)
- Nodes like $f_1$, $f_2$, $f_3$
- Store values during forward pass
- Compute and pass gradients during backward pass
- Can be freed after backward pass to save memory
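In PyTorch this is why non-leaf tensors do not keep a `.grad` by default: their gradients are consumed during the backward pass and then dropped to save memory, unless you opt in with `retain_grad()`. A small illustrative snippet:

```python
import torch

theta = torch.randn(3, requires_grad=True)   # leaf node
f1 = theta * 2                               # intermediate node
f1.retain_grad()                             # opt in to keeping dL/df1 after backward
L = (f1 ** 2).sum()
L.backward()

print(f1.is_leaf)    # False: f1 is an intermediate node
print(f1.grad)       # dL/df1 = 2*f1, kept only because of retain_grad()
print(theta.grad)    # dL/dtheta = (dL/df1) * 2 = 8*theta
```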
3. Output Node (Loss)
- The node $L$
- Starting point for backward pass
- Always has gradient = 1
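Concretely, calling `backward()` on a scalar loss in PyTorch seeds the traversal with that gradient of 1; passing it explicitly gives the same result. A small illustrative snippet:

```python
import torch

theta = torch.tensor([1.0, 2.0], requires_grad=True)
L = (theta ** 3).sum()
L.backward(gradient=torch.tensor(1.0))   # explicit seed dL/dL = 1; same as L.backward()
print(theta.grad)                        # 3*theta**2 -> tensor([ 3., 12.])
```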
Operations Store Local Gradients
Each operation in the graph knows how to compute its local gradient:
| Operation | Forward | Backward (local gradient) |
|---|---|---|
| Add | $z = x + y$ | $\frac{\partial z}{\partial x} = 1$, $\frac{\partial z}{\partial y} = 1$ |
| Multiply | $z = x y$ | $\frac{\partial z}{\partial x} = y$, $\frac{\partial z}{\partial y} = x$ |
| Matrix multiply | $z = W x$ | $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} x^\top$, $\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial z}$ |
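A small sketch of the same idea in code, assuming the rows above are elementary operations like add, multiply, and matrix multiply (the `OPS` dictionary is illustrative, not a library API):

```python
import numpy as np

OPS = {
    # op name: (forward, backward), where backward takes the inputs and the
    # upstream gradient `up` and returns the gradient for each input.
    "add":    (lambda x, y: x + y,
               lambda x, y, up: (up, up)),                      # dz/dx = 1, dz/dy = 1
    "mul":    (lambda x, y: x * y,
               lambda x, y, up: (up * y, up * x)),              # dz/dx = y, dz/dy = x
    "matmul": (lambda W, x: W @ x,
               lambda W, x, up: (np.outer(up, x), W.T @ up)),   # dL/dW, dL/dx
}

forward, backward = OPS["mul"]
z = forward(3.0, 4.0)
print(z, backward(3.0, 4.0, 1.0))   # 12.0, local grads (4.0, 3.0)
```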
