Once you’ve written out the component-wise chain rule, you’re often left with nasty sums full of indices. The art of gradient derivation is recognizing patterns that collapse into clean matrix operations.

Systematic Simplification Process

When faced with a component-wise gradient, follow these steps:

Step 1: Write Out the Component Form

Start with the chain rule in component form (here $\theta$ stands for whatever parameter you're differentiating with respect to):

$$\frac{\partial L}{\partial \theta_i} = \sum_j \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial \theta_i}$$

Step 2: Identify Kronecker Deltas

Look for terms like $\frac{\partial x_i}{\partial x_j}$ and replace them with $\delta_{ij}$.

Step 3: Collapse Sums Using Deltas

Use $\sum_j \delta_{ij} a_j = a_i$ to eliminate summation indices.

Step 4: Recognize Matrix Operation Patterns

Look at the structure of remaining indices:

  • Two free indices with no shared sum ($a_i b_j$) → outer product
  • Sum over a shared middle index ($\sum_k A_{ik} B_{kj}$) → matrix multiply
  • Matching indices with no sum ($a_i b_i$) → element-wise (Hadamard) product

Step 5: Write Matrix Form

Translate the simplified component form into matrix notation.

Step 6: Verify Shapes

Check that all matrix dimensions are compatible!
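To see the whole process end to end, here's a minimal numpy sketch (toy shapes and variable names of my choosing) that runs the six steps on the weight gradient of a linear layer $y = Wx$:

```python
import numpy as np

# A minimal sketch of the six steps for the weight gradient of a toy
# linear layer y = W x (shapes and names here are illustrative assumptions).
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
dL_dy = rng.standard_normal(m)           # upstream gradient, shape (m,)

# Steps 1-2: component form with the Kronecker delta written out explicitly:
# dL/dW_kl = sum_i dL/dy_i * dy_i/dW_kl,  with  dy_i/dW_kl = delta_ik * x_l.
delta = np.eye(m)                        # delta_ik
dL_dW_component = np.einsum('i,ik,l->kl', dL_dy, delta, x)

# Steps 3-5: the delta collapses the sum over i, leaving dL/dW_kl = dL/dy_k * x_l,
# which is exactly the outer product (dL/dy) x^T.
dL_dW_matrix = np.outer(dL_dy, x)

# Step 6: verify shapes and values agree.
assert dL_dW_matrix.shape == W.shape == (m, n)
assert np.allclose(dL_dW_component, dL_dW_matrix)
```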

Core Simplification Tools

Kronecker Delta ($\delta_{ij}$)

Definition:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

Power: It selects terms from sums and eliminates irrelevant indices.

Key Property:

$$\sum_j \delta_{ij}\, a_j = a_i$$

The sum collapses to just the $i$-th term!

  • notice how the full sum $\sum_j \delta_{ij} a_j$ turns into the single term $a_i$
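A quick numerical sanity check of that collapse, using the identity matrix to play the role of the Kronecker delta (just a sketch):

```python
import numpy as np

# Sketch: the identity matrix acts as the Kronecker delta,
# so contracting it against a vector just selects components.
a = np.array([2.0, 5.0, -1.0])
delta = np.eye(len(a))                       # delta[i, j] = 1 if i == j else 0

collapsed = np.einsum('ij,j->i', delta, a)   # sum_j delta_ij a_j
assert np.allclose(collapsed, a)             # ... = a_i for every i
```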

Pattern 1: Derivatives of Parameters w.r.t. Themselves

For matrices:

$$\frac{\partial W_{ij}}{\partial W_{kl}} = \delta_{ik}\,\delta_{jl}$$

Why it matters: When you see $\delta_{ik}\delta_{jl}$ in a sum, it kills all terms except where $k = i$ and $l = j$.

Example: Linear layer $y = Wx + b$. Component form:

$$y_i = \sum_j W_{ij}\, x_j + b_i$$

Derivative:

$$\frac{\partial y_i}{\partial W_{kl}} = \sum_j \delta_{ik}\,\delta_{jl}\, x_j = \delta_{ik}\, x_l$$

The double Kronecker delta collapses the sum instantly!
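Here's a finite-difference sketch (toy shapes, my own naming) confirming that Jacobian:

```python
import numpy as np

# Sketch: numerically check dy_i/dW_kl = delta_ik * x_l for y = W x + b.
rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

# Analytic Jacobian: J[i, k, l] = delta_ik * x_l
J_analytic = np.einsum('ik,l->ikl', np.eye(m), x)

# Finite-difference Jacobian: perturb one entry W_kl at a time.
eps = 1e-6
J_numeric = np.zeros((m, m, n))
for k in range(m):
    for l in range(n):
        W_plus = W.copy()
        W_plus[k, l] += eps
        J_numeric[:, k, l] = ((W_plus @ x + b) - (W @ x + b)) / eps

assert np.allclose(J_analytic, J_numeric, atol=1e-4)
```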

Pattern 2: Identity Functions

For the identity function $f(x) = x$:

$$\frac{\partial f_i}{\partial x_j} = \delta_{ij}$$

In the chain rule:

$$\sum_j \frac{\partial L}{\partial f_j}\,\frac{\partial f_j}{\partial x_i} = \sum_j \frac{\partial L}{\partial f_j}\,\delta_{ji} = \frac{\partial L}{\partial f_i}$$

This is useful for piecewise-identity activations like ReLU, which behaves as the identity wherever its input is positive.
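For instance, a minimal sketch of how this shows up in a ReLU backward pass (a made-up toy example, not any particular framework's code):

```python
import numpy as np

# Sketch: ReLU is the identity where its input is positive, so its local
# Jacobian is a diagonal of 0/1 deltas and the backward pass is just a mask.
x = np.array([-1.0, 2.0, 0.5, -3.0])
grad_out = np.array([0.1, 0.2, 0.3, 0.4])    # upstream gradient dL/df

mask = (x > 0).astype(float)                 # diagonal of the local Jacobian
grad_in = grad_out * mask                    # dL/dx_i = dL/df_i * [x_i > 0]
print(grad_in)                               # [0.  0.2 0.3 0. ]
```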

Recognizing Matrix Products

I collected a good number of them in Matrix and Component Forms:

| Component Pattern | Matrix Form | Fundamental Pattern |
| --- | --- | --- |
| $a_i b_j$ | $ab^\top$ | Outer product |
| $\sum_j A_{ij} x_j$ | $Ax$ | Matrix-vector product |
| $\sum_k A_{ik} B_{kj}$ | $AB$ | Matrix-matrix product |
| $f(x_i)$ | $f(x)$ applied element-wise | Element-wise unary operation |
| $a_i b_i$ | $a \odot b$ | Element-wise binary operation |
| $\sum_i x_i$ | $\mathbf{1}^\top x$ | Reduction operation |
| $A_{ji}$ | $A^\top$ | Transpose |
| $\sum_j \delta_{ij} a_j = a_i$ | index selection (identity) | Kronecker delta collapse |

All of the gradient component simplifications come from this core set!
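As a sanity check on the table, here's a sketch translating each component pattern into a numpy einsum and comparing it against the corresponding matrix operation (toy arrays, my own naming):

```python
import numpy as np

# Sketch: each component pattern, written as an einsum whose subscripts
# mirror the index sums, matches its matrix form.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
a, b = rng.standard_normal(3), rng.standard_normal(3)
x = rng.standard_normal(4)

assert np.allclose(np.einsum('i,j->ij', a, b), np.outer(a, b))    # a_i b_j
assert np.allclose(np.einsum('ij,j->i', A, x), A @ x)             # sum_j A_ij x_j
assert np.allclose(np.einsum('ik,kj->ij', A, B), A @ B)           # sum_k A_ik B_kj
assert np.allclose(np.einsum('i,i->i', a, b), a * b)              # a_i b_i
assert np.allclose(np.einsum('i->', x), x.sum())                  # sum_i x_i
assert np.allclose(np.einsum('ij->ji', A), A.T)                   # A_ji
```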

For example, when I was deriving the weight gradient of a linear layer, I got to

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial y_i}\, x_j$$

This is an outer product: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^\top$!
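And a quick numerical check of that outer-product formula, using a made-up quadratic loss just for the test:

```python
import numpy as np

# Sketch: verify dL/dW = (dL/dy) x^T numerically for a toy loss
# L = 0.5 * ||W x - t||^2 (my own choice of loss, purely for the check).
rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
t = rng.standard_normal(m)

def loss(W):
    return 0.5 * np.sum((W @ x - t) ** 2)

dL_dy = W @ x - t                       # upstream gradient for this loss
grad_matrix = np.outer(dL_dy, x)        # the outer-product formula

# Finite-difference gradient, entry by entry.
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        W_plus = W.copy()
        W_plus[i, j] += eps
        grad_numeric[i, j] = (loss(W_plus) - loss(W)) / eps

assert np.allclose(grad_matrix, grad_numeric, atol=1e-4)
```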