Linear Layer (Fully Connected Layer)
For a linear layer $y = Wx + b$ with upstream gradient $\frac{\partial L}{\partial y}$:

Gradient w.r.t. weights:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^\top$$

Gradient w.r.t. bias:

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}$$

Gradient w.r.t. the input of the linear layer (to be passed backwards to earlier layers):

$$\frac{\partial L}{\partial x} = W^\top \frac{\partial L}{\partial y}$$
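As a sanity check, here is a minimal NumPy sketch of the forward pass and these three gradients for a single (unbatched) input vector; the function and variable names (`linear_backward`, `dL_dy`, etc.) are mine, not from any particular library:

```python
import numpy as np

def linear_forward(x, W, b):
    # Forward pass: y = W x + b, with x of shape (in_dim,), W of shape (out_dim, in_dim).
    return W @ x + b

def linear_backward(x, W, dL_dy):
    # dL_dy: upstream gradient w.r.t. y, shape (out_dim,).
    dL_dW = np.outer(dL_dy, x)   # dL/dy * x^T, shape (out_dim, in_dim)
    dL_db = dL_dy                # bias gradient equals the upstream gradient
    dL_dx = W.T @ dL_dy          # passed backwards to the previous layer
    return dL_dW, dL_db, dL_dx
```

For a batched input of shape (batch, in_dim) the same formulas apply per sample; the weight and bias gradients are then summed over the batch dimension.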
Convolutional Layer (1D Case)
For a 1D convolution with input length $N$, kernel length $K$, and output length $M = N - K + 1$ (stride 1, no padding):

$$y_i = \sum_{k=0}^{K-1} w_k \, x_{i+k} + b$$

Bias gradient: Given that

$$\frac{\partial y_i}{\partial b} = 1$$

And so

$$\frac{\partial L}{\partial b} = \sum_{i=0}^{M-1} \frac{\partial L}{\partial y_i}$$

Kernel gradient:

$$\frac{\partial L}{\partial w_k} = \sum_{i=0}^{M-1} \frac{\partial L}{\partial y_i}\, x_{i+k}$$

Input gradient:

$$\frac{\partial L}{\partial x_j} = \sum_{k=0}^{K-1} \frac{\partial L}{\partial y_{j-k}}\, w_k$$

(terms where $j-k$ falls outside $[0, M-1]$ are dropped)
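A minimal NumPy sketch of these 1D gradients, under the same stride-1, no-padding assumptions (all names are mine):

```python
import numpy as np

def conv1d_forward(x, w, b):
    # x: (N,), w: (K,), output y: (M,) with M = N - K + 1.
    N, K = len(x), len(w)
    M = N - K + 1
    return np.array([np.dot(w, x[i:i + K]) for i in range(M)]) + b

def conv1d_backward(x, w, dL_dy):
    # dL_dy: upstream gradient w.r.t. y, shape (M,).
    N, K, M = len(x), len(w), len(dL_dy)
    dL_db = dL_dy.sum()                                        # sum over output positions
    dL_dw = np.array([np.dot(dL_dy, x[k:k + M]) for k in range(K)])
    dL_dx = np.zeros(N)
    for i in range(M):
        dL_dx[i:i + K] += dL_dy[i] * w                         # scatter each output grad back onto its input window
    return dL_dw, dL_db, dL_dx
```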
Convolutional Layer (2D Case)
Here, I’m going to generalize the 1D case to 2D rather than to arbitrary dimensions, since 2D is where convolution is most commonly used. The input $x$ is a $(C_{in}, H, W)$ tensor, the output $y$ is a $(C_{out}, H_{out}, W_{out})$ tensor, and $w$ is the set of $(C_{out}, C_{in}, K_h, K_w)$ weight matrices we use for convolutions AKA kernels.

$$y_o = b_o + \sum_{c} w_{o,c} * x_c, \qquad (w_{o,c} * x_c)_{i,j} = \sum_{m} \sum_{n} w_{o,c,m,n}\, x_{c,\,i+m,\,j+n}$$

where $*$ is the convolution operator, $i, j$ are spatial indices of the output, $m, n$ are spatial indices of the kernel, $c$ is the input channel index, and $o$ is the output channel index. Basically, we shift a kernel around over the input and compute the sum of the element-wise multiplication of the kernel and the corresponding area in the input.

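A direct, loop-based NumPy sketch of that forward sum, assuming stride 1, no padding, and the shapes given above (all names are mine):

```python
import numpy as np

def conv2d_forward(x, w, b):
    # x: (C_in, H, W), w: (C_out, C_in, Kh, Kw), b: (C_out,)
    C_in, H, W = x.shape
    C_out, _, Kh, Kw = w.shape
    H_out, W_out = H - Kh + 1, W - Kw + 1
    y = np.zeros((C_out, H_out, W_out))
    for o in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                # element-wise multiply the kernel with the input window, then sum
                y[o, i, j] = np.sum(w[o] * x[:, i:i + Kh, j:j + Kw]) + b[o]
    return y
```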
Gradient w.r.t. kernel weights. We convolve the incoming gradient over the input:

$$\frac{\partial L}{\partial w_{o,c,m,n}} = \sum_{i} \sum_{j} \frac{\partial L}{\partial y_{o,i,j}}\, x_{c,\,i+m,\,j+n}$$

Gradient w.r.t. input. We convolve the incoming gradient with the kernel rotated 180 degrees (i.e. flipped along both spatial axes):

$$\frac{\partial L}{\partial x_{c,i,j}} = \sum_{o} \sum_{m} \sum_{n} \frac{\partial L}{\partial y_{o,\,i-m,\,j-n}}\, w_{o,c,m,n}$$

(terms where $i-m$ or $j-n$ fall outside the output are dropped)

Gradient w.r.t. bias. We sum the incoming gradient over the spatial positions of that output channel:

$$\frac{\partial L}{\partial b_o} = \sum_{i} \sum_{j} \frac{\partial L}{\partial y_{o,i,j}}$$
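And a matching backward sketch for the three gradients above, under the same stride-1, no-padding assumptions (all names are mine):

```python
import numpy as np

def conv2d_backward(x, w, dL_dy):
    # x: (C_in, H, W), w: (C_out, C_in, Kh, Kw), dL_dy: (C_out, H_out, W_out)
    C_out, C_in, Kh, Kw = w.shape
    _, H_out, W_out = dL_dy.shape
    dL_dw = np.zeros_like(w)
    dL_dx = np.zeros_like(x)
    dL_db = dL_dy.sum(axis=(1, 2))       # sum over spatial positions, per output channel
    for o in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                g = dL_dy[o, i, j]
                # kernel gradient: accumulate upstream grad times the input window
                dL_dw[o] += g * x[:, i:i + Kh, j:j + Kw]
                # input gradient: scatter upstream grad back through the kernel
                dL_dx[:, i:i + Kh, j:j + Kw] += g * w[o]
    return dL_dw, dL_db, dL_dx
```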
Scaled Dot-Product Attention
Forward:

$$O = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where:

- $Q = XW_Q$ (queries)
- $K = XW_K$ (keys)
- $V = XW_V$ (values)
- $A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$ (attention weights)
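A minimal NumPy sketch of this forward pass for a single unbatched, single-head sequence, with a row-wise softmax (all names are mine):

```python
import numpy as np

def softmax(s):
    # Numerically stabilized row-wise softmax.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_forward(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); W_Q, W_K: (d_model, d_k); W_V: (d_model, d_v).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)   # pre-softmax scores
    A = softmax(S)               # attention weights
    O = A @ V                    # attention output
    return O, Q, K, V, A
```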
Backward:
This is the most complex. Given the upstream gradient $\frac{\partial L}{\partial O}$:

Gradient w.r.t. $V$:

$$\frac{\partial L}{\partial V} = A^\top \frac{\partial L}{\partial O}$$

Gradient w.r.t. attention weights $A$:

$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial O}\, V^\top$$

Gradient through softmax: Let $S = \frac{QK^\top}{\sqrt{d_k}}$ (pre-softmax scores). Since the softmax is applied row-wise,

$$\frac{\partial L}{\partial S_{ij}} = A_{ij}\left(\frac{\partial L}{\partial A_{ij}} - \sum_{k} A_{ik}\,\frac{\partial L}{\partial A_{ik}}\right)$$

(This comes from the softmax Jacobian - messy but necessary)

Gradient w.r.t. $Q$ and $K$:

$$\frac{\partial L}{\partial Q} = \frac{1}{\sqrt{d_k}}\,\frac{\partial L}{\partial S}\, K, \qquad \frac{\partial L}{\partial K} = \frac{1}{\sqrt{d_k}}\,\frac{\partial L}{\partial S}^{\!\top} Q$$

Finally, gradients w.r.t. the weight matrices:

$$\frac{\partial L}{\partial W_Q} = X^\top \frac{\partial L}{\partial Q}, \qquad \frac{\partial L}{\partial W_K} = X^\top \frac{\partial L}{\partial K}, \qquad \frac{\partial L}{\partial W_V} = X^\top \frac{\partial L}{\partial V}$$
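Putting the whole backward pass together in a self-contained NumPy sketch, matching the forward sketch above (all names are mine):

```python
import numpy as np

def attention_backward(dL_dO, X, Q, K, V, A):
    # dL_dO: upstream gradient w.r.t. the attention output O = A V.
    d_k = Q.shape[-1]
    dL_dV = A.T @ dL_dO                       # gradient w.r.t. V
    dL_dA = dL_dO @ V.T                       # gradient w.r.t. attention weights A
    # Row-wise softmax Jacobian: dL/dS_ij = A_ij * (dL/dA_ij - sum_k A_ik dL/dA_ik)
    dL_dS = A * (dL_dA - (A * dL_dA).sum(axis=-1, keepdims=True))
    dL_dQ = dL_dS @ K / np.sqrt(d_k)          # gradient w.r.t. Q
    dL_dK = dL_dS.T @ Q / np.sqrt(d_k)        # gradient w.r.t. K
    # Gradients w.r.t. the projection matrices (Q = X W_Q, etc.)
    dL_dWQ = X.T @ dL_dQ
    dL_dWK = X.T @ dL_dK
    dL_dWV = X.T @ dL_dV
    return dL_dWQ, dL_dWK, dL_dWV
```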
