Once you’ve written out the component-wise chain rule, you’re often left with nasty sums full of indices. The art of gradient derivation is recognizing patterns that collapse into clean matrix operations.
Systematic Simplification Process
When faced with a component-wise gradient, follow these steps (a worked numerical sketch follows Step 6):
Step 1: Write Out the Component Form
Start with the chain rule in component form:

$$\frac{\partial L}{\partial \theta_{ij}} = \sum_k \frac{\partial L}{\partial y_k}\,\frac{\partial y_k}{\partial \theta_{ij}}$$
Step 2: Identify Kronecker Deltas
Look for terms like $\frac{\partial x_i}{\partial x_j}$ and replace them with $\delta_{ij}$.
Step 3: Collapse Sums Using Deltas
Use $\sum_j \delta_{ij}\, a_j = a_i$ to eliminate the summed index.
Step 4: Recognize Matrix Operation Patterns
Look at the structure of remaining indices:
- Two free indices with no sum ($a_i b_j$) → outer product
- Sum over a shared middle index ($\sum_k A_{ik} B_{kj}$) → matrix multiply
- Matching indices with no sum ($a_i b_i$) → element-wise (Hadamard) product
Step 5: Write Matrix Form
Translate the simplified component form into matrix notation.
Step 6: Verify Shapes
Check that all matrix dimensions are compatible!
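To make the six steps concrete, here is a minimal NumPy sketch (the quadratic loss, shapes, and variable names are my own illustrative assumptions, not from the derivation above). It takes the input gradient of a linear layer, $\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} W_{ij}$, whose matrix form is $\nabla_x L = W^\top \nabla_y L$, and verifies Step 6 plus the values against brute-force finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def loss(x):
    y = W @ x              # linear layer y = Wx
    return np.sum(y ** 2)  # illustrative loss L = sum_i y_i^2

# Steps 1-5: dL/dx_j = sum_i (dL/dy_i) W_ij  ->  grad_x = W^T (dL/dy)
dL_dy = 2 * (W @ x)
grad_matrix_form = W.T @ dL_dy

# Brute-force check of the component-wise chain rule (Step 1)
eps = 1e-6
grad_numeric = np.array([
    (loss(x + eps * np.eye(n)[j]) - loss(x - eps * np.eye(n)[j])) / (2 * eps)
    for j in range(n)
])

# Step 6: the shape matches x, and the values agree
assert grad_matrix_form.shape == x.shape
assert np.allclose(grad_matrix_form, grad_numeric, atol=1e-5)
```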
Core Simplification Tools
Kronecker Delta ($\delta_{ij}$)
Definition:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

Power: it selects terms from sums and eliminates irrelevant indices.

Key property:

$$\sum_j \delta_{ij}\, a_j = a_i$$

The sum collapses to just the $i$-th term!
- notice how the summed index $j$ turns into the free index $i$
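A quick NumPy sketch of the key property, using the identity matrix to play the role of the Kronecker delta (the array values are made-up examples):

```python
import numpy as np

a = np.array([10.0, 20.0, 30.0])
delta = np.eye(3)   # delta[i, j] is the Kronecker delta

# sum_j delta_ij * a_j == a_i for every i
for i in range(3):
    assert np.sum(delta[i] * a) == a[i]

# In matrix form the Kronecker delta is the identity matrix: I a = a
assert np.allclose(delta @ a, a)
```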
Pattern 1: Derivatives of Parameters w.r.t. Themselves
For matrices, each entry is an independent parameter:

$$\frac{\partial W_{ij}}{\partial W_{kl}} = \delta_{ik}\,\delta_{jl}$$

Why it matters: when you see $\delta_{ik}\,\delta_{jl}$ in a sum, it kills all terms except those where $i = k$ and $j = l$.

Example: linear layer $y = Wx$. Component form: $y_i = \sum_j W_{ij}\, x_j$.

Derivative:

$$\frac{\partial y_i}{\partial W_{kl}} = \sum_j \frac{\partial W_{ij}}{\partial W_{kl}}\, x_j = \sum_j \delta_{ik}\,\delta_{jl}\, x_j = \delta_{ik}\, x_l$$
The double Kronecker delta collapses the sum instantly!
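To see the collapsed result $\frac{\partial y_i}{\partial W_{kl}} = \delta_{ik}\, x_l$ numerically, here is a short NumPy check (the shapes and names are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

eps = 1e-6
for i in range(m):
    for k in range(m):
        for l in range(n):
            # Perturb the single entry W_kl and watch its effect on y_i
            Wp = W.copy()
            Wp[k, l] += eps
            numeric = ((Wp @ x)[i] - (W @ x)[i]) / eps
            predicted = x[l] if i == k else 0.0   # delta_ik * x_l
            assert np.isclose(numeric, predicted, atol=1e-5)
```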
Pattern 2: Identity Functions
For the identity function $f(x) = x$:

$$\frac{\partial f_i}{\partial x_j} = \delta_{ij}$$

In the chain rule, this delta just renames the index:

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial f_j}\,\frac{\partial f_j}{\partial x_i} = \sum_j \frac{\partial L}{\partial f_j}\,\delta_{ji} = \frac{\partial L}{\partial f_i}$$

The same structure shows up in element-wise activations like ReLU, whose Jacobian is diagonal: $\frac{\partial f_i}{\partial x_j} = \delta_{ij}\, f'(x_j)$.
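Because that Jacobian is diagonal, the chain-rule sum collapses to an element-wise product, which is why backprop through ReLU is just a mask. A minimal sketch (the input and upstream gradient are made-up examples):

```python
import numpy as np

x = np.array([-1.0, 2.0, -0.5, 3.0])

# Full Jacobian of ReLU: df_i/dx_j = delta_ij * 1[x_j > 0]  (diagonal)
J = np.diag((x > 0).astype(float))

dL_df = np.array([0.1, 0.2, 0.3, 0.4])   # some upstream gradient

# Chain rule as a full matrix-vector product vs. the collapsed form
full = J.T @ dL_df
collapsed = dL_df * (x > 0)
assert np.allclose(full, collapsed)
```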
Recognizing Matrix Products
I collected most of these in Matrix and Component Forms; here is the core set as a reference table.
| Component Pattern | Matrix Form | Fundamental Pattern |
|---|---|---|
| $C_{ij} = a_i b_j$ | $C = ab^\top$ | Outer product |
| $y_i = \sum_j A_{ij} x_j$ | $y = Ax$ | Matrix-vector product |
| $C_{ij} = \sum_k A_{ik} B_{kj}$ | $C = AB$ | Matrix-matrix product |
| $y_i = f(x_i)$ | $y = f(x)$ | Element-wise unary operation |
| $c_i = a_i b_i$ | $c = a \odot b$ | Element-wise binary operation |
| $s = \sum_i x_i$ | $s = \mathbf{1}^\top x$ | Reduction operation |
| $B_{ij} = A_{ji}$ | $B = A^\top$ | Transpose |
| $\sum_j \delta_{ij} a_j = a_i$ | $Ia = a$ | Kronecker delta collapse |
All of the gradient component simplifications come from this core set!
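Each row of the table also maps directly onto a NumPy call. A quick sketch (all arrays here are made-up examples, and `np.tanh` stands in for an arbitrary element-wise $f$):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, x = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

outer = np.outer(a, b)       # C_ij = a_i b_j
matvec = A @ x               # y_i = sum_j A_ij x_j
matmat = A @ B               # C_ij = sum_k A_ik B_kj
unary = np.tanh(x)           # y_i = f(x_i)
hadamard = a * b             # c_i = a_i b_i
reduction = np.sum(x)        # s = sum_i x_i
transpose = A.T              # B_ij = A_ji
collapsed = np.eye(3) @ a    # sum_j delta_ij a_j = a_i
assert np.allclose(collapsed, a)
```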
For example, when I was deriving the weight gradient of a linear layer, I got to

$$\frac{\partial L}{\partial W_{kl}} = \frac{\partial L}{\partial y_k}\, x_l$$

This is an outer product! In matrix form: $\nabla_W L = \frac{\partial L}{\partial y}\, x^\top$.
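A quick numerical confirmation of that outer-product form (the quadratic loss here is an assumption of mine, just to get a concrete $\frac{\partial L}{\partial y}$):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def loss(W):
    return np.sum((W @ x) ** 2)    # illustrative loss L = sum_i y_i^2

dL_dy = 2 * (W @ x)                # dL/dy_k for this loss
grad_outer = np.outer(dL_dy, x)    # dL/dW = (dL/dy) x^T

# Finite-difference check over every component W_kl
eps = 1e-6
for k in range(m):
    for l in range(n):
        Wp = W.copy()
        Wp[k, l] += eps
        Wm = W.copy()
        Wm[k, l] -= eps
        numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
        assert np.isclose(grad_outer[k, l], numeric, atol=1e-4)
```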
