ReLU - Rectified Linear Unit

$$\text{ReLU}(x) = \max(0, x)$$

Its gradient is

$$\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$

Remember! This is a vector! So it becomes a vector of 1’s and 0’s
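
A minimal NumPy sketch of the forward pass and gradient (the function names `relu` and `relu_grad` are just labels I'm using here):

```python
import numpy as np

def relu(x):
    # element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, 0 elsewhere -- the vector of 1's and 0's
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```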

Leaky ReLU - Leaky Rectified Linear Unit

$$\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \qquad \text{(with a small slope, e.g. } \alpha = 0.01\text{)}$$

Its gradient is

$$\text{LeakyReLU}'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases}$$

Remember! This is a vector! So it becomes a vector of 1's and $\alpha$'s
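
Same idea as a sketch, assuming a slope of `alpha = 0.01` (the exact value is a hyperparameter):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x where x > 0, alpha * x elsewhere
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 where x > 0, alpha elsewhere -- the vector of 1's and alpha's
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))       # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
```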

Sigmoid

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its gradient is

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

Remember! This is a vector! So we are performing this operation on each of the elements.
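
A sketch of sigmoid and its gradient; note that the gradient can reuse the forward output:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^{-x}), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma(x) * (1 - sigma(x)), reusing the forward output
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))       # ≈ [0.119 0.5   0.881]
print(sigmoid_grad(x))  # ≈ [0.105 0.25  0.105]
```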

Softmax

Usually used on the output layer.

$$s_i = \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$$

Its Jacobian is

$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)$$

Thing is, if we pair Softmax with cross-entropy loss, the gradient of the loss with respect to the pre-softmax values simplifies to:

$$\frac{\partial L}{\partial z} = s - y$$

Which is why you end up seeing that nn.CrossEntropyLoss has a built-in softmax: it expects raw logits, not probabilities.
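
A quick PyTorch check (example values only) that `nn.CrossEntropyLoss` takes raw logits and that the resulting gradient on the logits is exactly $s - y$:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1]], requires_grad=True)  # raw scores, no softmax applied
target = torch.tensor([0])                                     # class index, i.e. one-hot y = [1, 0, 0]

loss = nn.CrossEntropyLoss()(logits, target)  # log-softmax happens inside the loss
loss.backward()

s = torch.softmax(logits.detach(), dim=1)
y = torch.tensor([[1.0, 0.0, 0.0]])
print(logits.grad)  # same as s - y
print(s - y)
```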

We have:

  • $s = \text{softmax}(z)$, so $s_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}}$
  • Loss: $L = -\sum_i y_i \log(s_i)$
  • Goal: find $\dfrac{\partial L}{\partial z_j}$, where $j$ denotes an arbitrary index of $z$ (we need to find all $\dfrac{\partial L}{\partial z_j}$ to form $\nabla_z L$)
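
To make the setup concrete, here is a tiny NumPy example of the forward computation (the particular numbers for $z$ are arbitrary):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])    # pre-softmax logits (arbitrary example values)
y = np.array([1.0, 0.0, 0.0])    # one-hot target

s = np.exp(z) / np.exp(z).sum()  # s_i = e^{z_i} / sum_k e^{z_k}
L = -np.sum(y * np.log(s))       # cross-entropy loss

print(s)  # ≈ [0.659 0.242 0.099]
print(L)  # ≈ 0.417
```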

Chain rule to get the gradient w.r.t. the pre-softmax values:

$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial s_i} \cdot \frac{\partial s_i}{\partial z_j}$$

Substitute what we know:

$$\frac{\partial L}{\partial s_i} = -\frac{y_i}{s_i} \quad\Rightarrow\quad \frac{\partial L}{\partial z_j} = \sum_i \left(-\frac{y_i}{s_i}\right) \frac{\partial s_i}{\partial z_j}$$

Split the sum into two cases, $i = j$ and $i \ne j$:

$$\frac{\partial L}{\partial z_j} = -\frac{y_j}{s_j}\,\frac{\partial s_j}{\partial z_j} \;+\; \sum_{i \ne j} \left(-\frac{y_i}{s_i}\right) \frac{\partial s_i}{\partial z_j}$$

The Jacobian of softmax is:

$$\frac{\partial s_i}{\partial z_j} = \begin{cases} s_j\,(1 - s_j) & i = j \\ -\,s_i\,s_j & i \ne j \end{cases}$$

this is because the derivative is written entirely in terms of the softmax outputs $s$ themselves!!!! All the activation functions work like this (sigmoid's gradient is $\sigma(1-\sigma)$, for example), it's how the whole thing connects together.

so, written compactly with the Kronecker delta,

$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)$$
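
If you want to sanity-check the Jacobian formula numerically, here is a finite-difference sketch (the `eps` value is an assumption, just small enough to be accurate):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = s_i * (delta_ij - s_j) = diag(s) - s s^T
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([2.0, 1.0, 0.1])

eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(softmax_jacobian(z), J_num, atol=1e-6))  # True
```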

Anyways, back to it

So:

$$\frac{\partial L}{\partial z_j} = -\frac{y_j}{s_j}\,s_j(1 - s_j) \;+\; \sum_{i \ne j} \left(-\frac{y_i}{s_i}\right)(-\,s_i\,s_j)$$

First term:

$$-\frac{y_j}{s_j}\,s_j(1 - s_j) = -\,y_j(1 - s_j) = -\,y_j + y_j\,s_j$$

Second term:

$$\sum_{i \ne j} \left(-\frac{y_i}{s_i}\right)(-\,s_i\,s_j) = \sum_{i \ne j} y_i\,s_j$$

Combine:

$$\frac{\partial L}{\partial z_j} = -\,y_j + y_j\,s_j + \sum_{i \ne j} y_i\,s_j = -\,y_j + s_j \sum_i y_i$$

here $y_j\,s_j$ is just inserted back into the summation and…

Since $y$ is one-hot encoded, $\sum_i y_i = 1$:

$$\frac{\partial L}{\partial z_j} = s_j - y_j$$

In vector form:

$$\nabla_z L = s - y$$
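
And a final end-to-end check that the derived gradient $s - y$ matches a finite-difference gradient of the loss (again, example values and a hand-picked `eps`):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])

analytic = softmax(z) - y  # the result derived above

# numerically differentiate the loss w.r.t. each z_j
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[j], y) -
     cross_entropy(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```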