ReLU - Rectified Linear Unit

$$\text{ReLU}(x) = \max(0, x)$$

Its gradient is

$$\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$

Remember! This is a vector! So it becomes a vector of 1’s and 0’s
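
A minimal NumPy sketch of the forward pass and gradient (the function names `relu` and `relu_grad` are just labels I'm using here):

```python
import numpy as np

def relu(x):
    # element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, 0 elsewhere -- the vector of 1's and 0's
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```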

Leaky ReLU - Leaky Rectified Linear Unit

$$\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \qquad \text{(with a small slope, e.g. } \alpha = 0.01\text{)}$$

Its gradient is

$$\text{LeakyReLU}'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases}$$

Remember! This is a vector! So it becomes a vector of 1's and $\alpha$'s
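
Same idea as a sketch, assuming a slope of `alpha = 0.01` (the exact value is a hyperparameter):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x where x > 0, alpha * x elsewhere
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 where x > 0, alpha elsewhere -- the vector of 1's and alpha's
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))       # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
```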

Sigmoid

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its gradient is

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

Remember! This is a vector! So we are performing this operation on each of the elements.
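
A sketch of sigmoid and its gradient; note that the gradient can reuse the forward output:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^{-x}), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma(x) * (1 - sigma(x)), reusing the forward output
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))       # ≈ [0.119 0.5   0.881]
print(sigmoid_grad(x))  # ≈ [0.105 0.25  0.105]
```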

Softmax

Usually used on the output layer.

$$s_i = \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$$

Its Jacobian is

$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)$$

Thing is, if we pair Softmax with cross-entropy loss, the gradient of the loss with respect to the pre-softmax values simplifies to:

$$\frac{\partial L}{\partial z} = s - y$$

Which is why you end up seeing that nn.CrossEntropyLoss has a built-in softmax: it expects raw logits, not probabilities.
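
A quick PyTorch check (example values only) that `nn.CrossEntropyLoss` takes raw logits and that the resulting gradient on the logits is exactly $s - y$:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1]], requires_grad=True)  # raw scores, no softmax applied
target = torch.tensor([0])                                     # class index, i.e. one-hot y = [1, 0, 0]

loss = nn.CrossEntropyLoss()(logits, target)  # log-softmax happens inside the loss
loss.backward()

s = torch.softmax(logits.detach(), dim=1)
y = torch.tensor([[1.0, 0.0, 0.0]])
print(logits.grad)  # same as s - y
print(s - y)
```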

We have:

  • $s = \text{softmax}(z)$, so $s_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}}$
  • Loss: $L = -\sum_i y_i \log(s_i)$
  • Goal: find $\dfrac{\partial L}{\partial z_j}$, where $j$ denotes an arbitrary index of $z$ (we need to find all $\dfrac{\partial L}{\partial z_j}$ to form $\nabla_z L$)
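
To make the setup concrete, here is a tiny NumPy example of the forward computation (the particular numbers for $z$ are arbitrary):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])    # pre-softmax logits (arbitrary example values)
y = np.array([1.0, 0.0, 0.0])    # one-hot target

s = np.exp(z) / np.exp(z).sum()  # s_i = e^{z_i} / sum_k e^{z_k}
L = -np.sum(y * np.log(s))       # cross-entropy loss

print(s)  # ≈ [0.659 0.242 0.099]
print(L)  # ≈ 0.417
```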

Chain rule to get the gradient w.r.t. the pre-softmax values:

$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial s_i} \cdot \frac{\partial s_i}{\partial z_j}$$

Substitute what we know:

$$\frac{\partial L}{\partial s_i} = -\frac{y_i}{s_i} \quad\Rightarrow\quad \frac{\partial L}{\partial z_j} = \sum_i \left(-\frac{y_i}{s_i}\right) \frac{\partial s_i}{\partial z_j}$$

Split the sum into two cases, $i = j$ and $i \ne j$:

$$\frac{\partial L}{\partial z_j} = -\frac{y_j}{s_j}\,\frac{\partial s_j}{\partial z_j} \;+\; \sum_{i \ne j} \left(-\frac{y_i}{s_i}\right) \frac{\partial s_i}{\partial z_j}$$

The Jacobian of softmax is:

$$\frac{\partial s_i}{\partial z_j} = \begin{cases} s_j\,(1 - s_j) & i = j \\ -\,s_i\,s_j & i \ne j \end{cases}$$

this is because the derivative is written entirely in terms of the softmax outputs $s$ themselves!!!! All the activation functions work like this (sigmoid's gradient is $\sigma(1-\sigma)$, for example), it's how the whole thing connects together.

so, written compactly with the Kronecker delta,

$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)$$
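
If you want to sanity-check the Jacobian formula numerically, here is a finite-difference sketch (the `eps` value is an assumption, just small enough to be accurate):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = s_i * (delta_ij - s_j) = diag(s) - s s^T
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, 1.0, 0.1])

eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(softmax_jacobian(z), J_num, atol=1e-6))  # True
```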

Anyways, back to it

So:

$$\frac{\partial L}{\partial z_j} = -\frac{y_j}{s_j}\,s_j(1 - s_j) \;+\; \sum_{i \ne j} \left(-\frac{y_i}{s_i}\right)(-\,s_i\,s_j)$$

First term:

$$-\frac{y_j}{s_j}\,s_j(1 - s_j) = -\,y_j(1 - s_j) = -\,y_j + y_j\,s_j$$

Second term:

$$\sum_{i \ne j} \left(-\frac{y_i}{s_i}\right)(-\,s_i\,s_j) = \sum_{i \ne j} y_i\,s_j$$

Combine:

$$\frac{\partial L}{\partial z_j} = -\,y_j + y_j\,s_j + \sum_{i \ne j} y_i\,s_j = -\,y_j + s_j \sum_i y_i$$

here $y_j\,s_j$ is just inserted back into the summation and…

Since $y$ is one-hot encoded, $\sum_i y_i = 1$:

$$\frac{\partial L}{\partial z_j} = s_j - y_j$$

In vector form:

$$\nabla_z L = s - y$$
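
And a final end-to-end check that the derived gradient $s - y$ matches a finite-difference gradient of the loss (again, example values and a hand-picked `eps`):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])

analytic = softmax(z) - y  # the result derived above

# numerically differentiate the loss w.r.t. each z_j
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[j], y) -
     cross_entropy(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```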