Linear Layer (Fully Connected Layer)
For a linear layer $y = Wx + b$ with upstream gradient $\frac{\partial L}{\partial y}$:

Gradient w.r.t. weights:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^\top$$

Gradient w.r.t. bias:

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}$$

Gradient w.r.t. the input of the linear layer (to be passed backwards to earlier layers):

$$\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$$
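As a sanity check, here is a minimal NumPy sketch of the forward pass and these three gradients for a single (unbatched) input vector; the function and variable names (`linear_backward`, `dL_dy`, etc.) are mine, not from any particular library:

```python
import numpy as np

def linear_forward(x, W, b):
    # Forward pass: y = W x + b, with x of shape (in_dim,), W of shape (out_dim, in_dim).
    return W @ x + b

def linear_backward(x, W, dL_dy):
    # dL_dy: upstream gradient w.r.t. y, shape (out_dim,).
    dL_dW = np.outer(dL_dy, x)   # dL/dy * x^T, shape (out_dim, in_dim)
    dL_db = dL_dy                # bias gradient equals the upstream gradient
    dL_dx = W.T @ dL_dy          # passed backwards to the previous layer
    return dL_dW, dL_db, dL_dx
```

For a batched input of shape (batch, in_dim) the same formulas apply per sample; the weight and bias gradients are then summed over the batch dimension.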
Convolutional Layer (1D Case)
For a 1D convolution with input length $N$, kernel length $K$, and output length $M = N - K + 1$ (stride 1, no padding):

$$y_i = \sum_{k=0}^{K-1} w_k \, x_{i+k} + b$$

Bias gradient: Given that

$$\frac{\partial y_i}{\partial b} = 1$$

And so

$$\frac{\partial L}{\partial b} = \sum_{i=0}^{M-1} \frac{\partial L}{\partial y_i}$$

Kernel gradient:

$$\frac{\partial L}{\partial w_k} = \sum_{i=0}^{M-1} \frac{\partial L}{\partial y_i}\, x_{i+k}$$

Input gradient:

$$\frac{\partial L}{\partial x_j} = \sum_{k=0}^{K-1} \frac{\partial L}{\partial y_{j-k}}\, w_k$$

(terms where $j-k$ falls outside $[0, M-1]$ are dropped)
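A minimal NumPy sketch of these 1D gradients, under the same stride-1, no-padding assumptions (all names are mine):

```python
import numpy as np

def conv1d_forward(x, w, b):
    # x: (N,), w: (K,), output y: (M,) with M = N - K + 1.
    N, K = len(x), len(w)
    M = N - K + 1
    return np.array([np.dot(w, x[i:i + K]) for i in range(M)]) + b

def conv1d_backward(x, w, dL_dy):
    # dL_dy: upstream gradient w.r.t. y, shape (M,).
    N, K, M = len(x), len(w), len(dL_dy)
    dL_db = dL_dy.sum()                                        # sum over output positions
    dL_dw = np.array([np.dot(dL_dy, x[k:k + M]) for k in range(K)])
    dL_dx = np.zeros(N)
    for i in range(M):
        dL_dx[i:i + K] += dL_dy[i] * w                         # scatter each output grad back onto its input window
    return dL_dw, dL_db, dL_dx
```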
Convolutional Layer (2D Case)
Here, I’m going to generalize the 1D case to 2D rather than to arbitrary dimensions, since 2D is where convolution is most commonly used. The input $x$ is a $(C_{in}, H, W)$ tensor, the output $y$ is a $(C_{out}, H_{out}, W_{out})$ tensor, and $w$ is the set of $(C_{out}, C_{in}, K_h, K_w)$ weight matrices we use for convolutions AKA kernels.

$$y_o = b_o + \sum_{c} w_{o,c} * x_c, \qquad (w_{o,c} * x_c)_{i,j} = \sum_{m} \sum_{n} w_{o,c,m,n}\, x_{c,\,i+m,\,j+n}$$

where $*$ is the convolution operator, $i, j$ are spatial indices of the output, $m, n$ are spatial indices of the kernel, $c$ is the input channel index, and $o$ is the output channel index. Basically, we shift a kernel around over the input and compute the sum of the element-wise multiplication of the kernel and the corresponding area in the input.

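A direct, loop-based NumPy sketch of that forward sum, assuming stride 1, no padding, and the shapes given above (all names are mine):

```python
import numpy as np

def conv2d_forward(x, w, b):
    # x: (C_in, H, W), w: (C_out, C_in, Kh, Kw), b: (C_out,)
    C_in, H, W = x.shape
    C_out, _, Kh, Kw = w.shape
    H_out, W_out = H - Kh + 1, W - Kw + 1
    y = np.zeros((C_out, H_out, W_out))
    for o in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                # element-wise multiply the kernel with the input window, then sum
                y[o, i, j] = np.sum(w[o] * x[:, i:i + Kh, j:j + Kw]) + b[o]
    return y
```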
Gradient w.r.t. kernel weights. We convolve the incoming gradient over the input:

$$\frac{\partial L}{\partial w_{o,c,m,n}} = \sum_{i} \sum_{j} \frac{\partial L}{\partial y_{o,i,j}}\, x_{c,\,i+m,\,j+n}$$

Gradient w.r.t. input. We convolve the incoming gradient with the kernel rotated 180 degrees (i.e. flipped along both spatial axes):

$$\frac{\partial L}{\partial x_{c,i,j}} = \sum_{o} \sum_{m} \sum_{n} \frac{\partial L}{\partial y_{o,\,i-m,\,j-n}}\, w_{o,c,m,n}$$

(terms where $i-m$ or $j-n$ fall outside the output are dropped)

Gradient w.r.t. bias. We sum the incoming gradient over the spatial positions of that output channel:

$$\frac{\partial L}{\partial b_o} = \sum_{i} \sum_{j} \frac{\partial L}{\partial y_{o,i,j}}$$
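And a matching backward sketch for the three gradients above, under the same stride-1, no-padding assumptions (all names are mine):

```python
import numpy as np

def conv2d_backward(x, w, dL_dy):
    # x: (C_in, H, W), w: (C_out, C_in, Kh, Kw), dL_dy: (C_out, H_out, W_out)
    C_out, C_in, Kh, Kw = w.shape
    _, H_out, W_out = dL_dy.shape
    dL_dw = np.zeros_like(w)
    dL_dx = np.zeros_like(x)
    dL_db = dL_dy.sum(axis=(1, 2))       # sum over spatial positions, per output channel
    for o in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                g = dL_dy[o, i, j]
                # kernel gradient: accumulate upstream grad times the input window
                dL_dw[o] += g * x[:, i:i + Kh, j:j + Kw]
                # input gradient: scatter upstream grad back through the kernel
                dL_dx[:, i:i + Kh, j:j + Kw] += g * w[o]
    return dL_dw, dL_db, dL_dx
```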
Scaled Dot-Product Attention
Forward:

$$O = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where:

- $Q = XW_Q$ (queries)
- $K = XW_K$ (keys)
- $V = XW_V$ (values)
- $A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$ (attention weights)
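A minimal NumPy sketch of this forward pass for a single unbatched, single-head sequence, with a row-wise softmax (all names are mine):

```python
import numpy as np

def softmax(s):
    # Numerically stabilized row-wise softmax.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_forward(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); W_Q, W_K: (d_model, d_k); W_V: (d_model, d_v).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)   # pre-softmax scores
    A = softmax(S)               # attention weights
    O = A @ V                    # attention output
    return O, Q, K, V, A
```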
Backward:
This is the most complex. Given the upstream gradient $\frac{\partial L}{\partial O}$:

Gradient w.r.t. $V$:

$$\frac{\partial L}{\partial V} = A^\top \frac{\partial L}{\partial O}$$

Gradient w.r.t. attention weights $A$:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial O}\, V^\top$$

Gradient through softmax: Let $S = \frac{QK^\top}{\sqrt{d_k}}$ (pre-softmax scores). Since the softmax is applied row-wise,

$$\frac{\partial L}{\partial S_{ij}} = A_{ij}\left(\frac{\partial L}{\partial A_{ij}} - \sum_{k} A_{ik}\,\frac{\partial L}{\partial A_{ik}}\right)$$

(This comes from the softmax Jacobian - messy but necessary)

Gradient w.r.t. $Q$ and $K$:

$$\frac{\partial L}{\partial Q} = \frac{1}{\sqrt{d_k}}\,\frac{\partial L}{\partial S}\, K, \qquad \frac{\partial L}{\partial K} = \frac{1}{\sqrt{d_k}}\,\frac{\partial L}{\partial S}^{\!\top} Q$$

Finally, gradients w.r.t. the weight matrices:

$$\frac{\partial L}{\partial W_Q} = X^\top \frac{\partial L}{\partial Q}, \qquad \frac{\partial L}{\partial W_K} = X^\top \frac{\partial L}{\partial K}, \qquad \frac{\partial L}{\partial W_V} = X^\top \frac{\partial L}{\partial V}$$
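Putting the whole backward pass together in a self-contained NumPy sketch, matching the forward sketch above (all names are mine):

```python
import numpy as np

def attention_backward(dL_dO, X, Q, K, V, A):
    # dL_dO: upstream gradient w.r.t. the attention output O = A V.
    d_k = Q.shape[-1]
    dL_dV = A.T @ dL_dO                       # gradient w.r.t. V
    dL_dA = dL_dO @ V.T                       # gradient w.r.t. attention weights A
    # Row-wise softmax Jacobian: dL/dS_ij = A_ij * (dL/dA_ij - sum_k A_ik dL/dA_ik)
    dL_dS = A * (dL_dA - (A * dL_dA).sum(axis=-1, keepdims=True))
    dL_dQ = dL_dS @ K / np.sqrt(d_k)          # gradient w.r.t. Q
    dL_dK = dL_dS.T @ Q / np.sqrt(d_k)        # gradient w.r.t. K
    # Gradients w.r.t. the projection matrices (Q = X W_Q, etc.)
    dL_dWQ = X.T @ dL_dQ
    dL_dWK = X.T @ dL_dK
    dL_dWV = X.T @ dL_dV
    return dL_dWQ, dL_dWK, dL_dWV
```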
