Once you’ve written out the component-wise chain rule, you’re often left with nasty sums full of indices. The art of gradient derivation is recognizing patterns that collapse into clean matrix operations.
Systematic Simplification Process
When faced with a component-wise gradient, follow these steps (a worked numerical sketch follows Step 6):
Step 1: Write Out the Component Form
Start with the chain rule in component form:

$$\frac{\partial L}{\partial \theta_{ij}} = \sum_k \frac{\partial L}{\partial y_k}\,\frac{\partial y_k}{\partial \theta_{ij}}$$
Step 2: Identify Kronecker Deltas
Look for terms like $\frac{\partial x_i}{\partial x_j}$ and replace them with $\delta_{ij}$.
Step 3: Collapse Sums Using Deltas
Use $\sum_j \delta_{ij}\, a_j = a_i$ to eliminate the summed index.
Step 4: Recognize Matrix Operation Patterns
Look at the structure of remaining indices:
- Two free indices with no sum ($a_i b_j$) → outer product
- Sum over a shared middle index ($\sum_k A_{ik} B_{kj}$) → matrix multiply
- Matching indices with no sum ($a_i b_i$) → element-wise (Hadamard) product
Step 5: Write Matrix Form
Translate the simplified component form into matrix notation.
Step 6: Verify Shapes
Check that all matrix dimensions are compatible!
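To make the six steps concrete, here is a minimal NumPy sketch (the quadratic loss, shapes, and variable names are my own illustrative assumptions, not from the derivation above). It takes the input gradient of a linear layer, $\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} W_{ij}$, whose matrix form is $\nabla_x L = W^\top \nabla_y L$, and verifies Step 6 plus the values against brute-force finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def loss(x):
    y = W @ x              # linear layer y = Wx
    return np.sum(y ** 2)  # illustrative loss L = sum_i y_i^2

# Steps 1-5: dL/dx_j = sum_i (dL/dy_i) W_ij  ->  grad_x = W^T (dL/dy)
dL_dy = 2 * (W @ x)
grad_matrix_form = W.T @ dL_dy

# Brute-force check of the component-wise chain rule (Step 1)
eps = 1e-6
grad_numeric = np.array([
    (loss(x + eps * np.eye(n)[j]) - loss(x - eps * np.eye(n)[j])) / (2 * eps)
    for j in range(n)
])

# Step 6: the shape matches x, and the values agree
assert grad_matrix_form.shape == x.shape
assert np.allclose(grad_matrix_form, grad_numeric, atol=1e-5)
```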
Core Simplification Tools
Kronecker Delta ($\delta_{ij}$)
Definition:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

Power: it selects terms from sums and eliminates irrelevant indices.

Key property:

$$\sum_j \delta_{ij}\, a_j = a_i$$

The sum collapses to just the $i$-th term!
- notice how the summed index $j$ turns into the free index $i$
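A quick NumPy sketch of the key property, using the identity matrix to play the role of the Kronecker delta (the array values are made-up examples):

```python
import numpy as np

a = np.array([10.0, 20.0, 30.0])
delta = np.eye(3)   # delta[i, j] is the Kronecker delta

# sum_j delta_ij * a_j == a_i for every i
for i in range(3):
    assert np.sum(delta[i] * a) == a[i]

# In matrix form the Kronecker delta is the identity matrix: I a = a
assert np.allclose(delta @ a, a)
```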
Pattern 1: Derivatives of Parameters w.r.t. Themselves
For matrices, each entry is an independent parameter:

$$\frac{\partial W_{ij}}{\partial W_{kl}} = \delta_{ik}\,\delta_{jl}$$

Why it matters: when you see $\delta_{ik}\,\delta_{jl}$ in a sum, it kills all terms except those where $i = k$ and $j = l$.

Example: linear layer $y = Wx$. Component form: $y_i = \sum_j W_{ij}\, x_j$.

Derivative:

$$\frac{\partial y_i}{\partial W_{kl}} = \sum_j \frac{\partial W_{ij}}{\partial W_{kl}}\, x_j = \sum_j \delta_{ik}\,\delta_{jl}\, x_j = \delta_{ik}\, x_l$$
The double Kronecker delta collapses the sum instantly!
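To see the collapsed result $\frac{\partial y_i}{\partial W_{kl}} = \delta_{ik}\, x_l$ numerically, here is a short NumPy check (the shapes and names are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

eps = 1e-6
for i in range(m):
    for k in range(m):
        for l in range(n):
            # Perturb the single entry W_kl and watch its effect on y_i
            Wp = W.copy()
            Wp[k, l] += eps
            numeric = ((Wp @ x)[i] - (W @ x)[i]) / eps
            predicted = x[l] if i == k else 0.0   # delta_ik * x_l
            assert np.isclose(numeric, predicted, atol=1e-5)
```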
Pattern 2: Identity Functions
For the identity function $f(x) = x$:

$$\frac{\partial f_i}{\partial x_j} = \delta_{ij}$$

In the chain rule, this delta just renames the index:

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial f_j}\,\frac{\partial f_j}{\partial x_i} = \sum_j \frac{\partial L}{\partial f_j}\,\delta_{ji} = \frac{\partial L}{\partial f_i}$$

The same structure shows up in element-wise activations like ReLU, whose Jacobian is diagonal: $\frac{\partial f_i}{\partial x_j} = \delta_{ij}\, f'(x_j)$.
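Because that Jacobian is diagonal, the chain-rule sum collapses to an element-wise product, which is why backprop through ReLU is just a mask. A minimal sketch (the input and upstream gradient are made-up examples):

```python
import numpy as np

x = np.array([-1.0, 2.0, -0.5, 3.0])

# Full Jacobian of ReLU: df_i/dx_j = delta_ij * 1[x_j > 0]  (diagonal)
J = np.diag((x > 0).astype(float))

dL_df = np.array([0.1, 0.2, 0.3, 0.4])   # some upstream gradient

# Chain rule as a full matrix-vector product vs. the collapsed form
full = J.T @ dL_df
collapsed = dL_df * (x > 0)
assert np.allclose(full, collapsed)
```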
Recognizing Matrix Products
I collected most of these in Matrix and Component Forms; here is the core set as a reference table.
| Component Pattern | Matrix Form | Fundamental Pattern |
|---|---|---|
| $C_{ij} = a_i b_j$ | $C = ab^\top$ | Outer product |
| $y_i = \sum_j A_{ij} x_j$ | $y = Ax$ | Matrix-vector product |
| $C_{ij} = \sum_k A_{ik} B_{kj}$ | $C = AB$ | Matrix-matrix product |
| $y_i = f(x_i)$ | $y = f(x)$ | Element-wise unary operation |
| $c_i = a_i b_i$ | $c = a \odot b$ | Element-wise binary operation |
| $s = \sum_i x_i$ | $s = \mathbf{1}^\top x$ | Reduction operation |
| $B_{ij} = A_{ji}$ | $B = A^\top$ | Transpose |
| $\sum_j \delta_{ij} a_j = a_i$ | $Ia = a$ | Kronecker delta collapse |
All of the gradient component simplifications come from this core set!
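Each row of the table also maps directly onto a NumPy call. A quick sketch (all arrays here are made-up examples, and `np.tanh` stands in for an arbitrary element-wise $f$):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, x = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

outer = np.outer(a, b)       # C_ij = a_i b_j
matvec = A @ x               # y_i = sum_j A_ij x_j
matmat = A @ B               # C_ij = sum_k A_ik B_kj
unary = np.tanh(x)           # y_i = f(x_i)
hadamard = a * b             # c_i = a_i b_i
reduction = np.sum(x)        # s = sum_i x_i
transpose = A.T              # B_ij = A_ji
collapsed = np.eye(3) @ a    # sum_j delta_ij a_j = a_i
assert np.allclose(collapsed, a)
```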
For example, when I was deriving the weight gradient of a linear layer, I got to

$$\frac{\partial L}{\partial W_{kl}} = \frac{\partial L}{\partial y_k}\, x_l$$

This is an outer product! In matrix form: $\nabla_W L = \frac{\partial L}{\partial y}\, x^\top$.
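A quick numerical confirmation of that outer-product form (the quadratic loss here is an assumption of mine, just to get a concrete $\frac{\partial L}{\partial y}$):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def loss(W):
    return np.sum((W @ x) ** 2)    # illustrative loss L = sum_i y_i^2

dL_dy = 2 * (W @ x)                # dL/dy_k for this loss
grad_outer = np.outer(dL_dy, x)    # dL/dW = (dL/dy) x^T

# Finite-difference check over every component W_kl
eps = 1e-6
for k in range(m):
    for l in range(n):
        Wp = W.copy()
        Wp[k, l] += eps
        Wm = W.copy()
        Wm[k, l] -= eps
        numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
        assert np.isclose(grad_outer[k, l], numeric, atol=1e-4)
```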
