ReLU - Rectified Linear Unit
Its gradient is
$$
\frac{d}{dx}\,\text{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}
$$
Remember! This is a vector! So it becomes a vector of 1’s and 0’s
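A minimal NumPy sketch of that mask (the function and variable names are just for illustration):

```python
import numpy as np

def relu(x):
    # element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient: a vector of 1's where x > 0 and 0's elsewhere
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```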
Leaky ReLU - Leaky Rectified Linear Unit
Its gradient is
$$
\frac{d}{dx}\,\text{LeakyReLU}(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases}
$$
where $\alpha$ is a small constant (e.g. 0.01).
Remember! This is a vector! So it becomes a vector of 1’s and $\alpha$’s
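Same kind of sketch for Leaky ReLU, assuming a negative-side slope of $\alpha = 0.01$ (a common default, not fixed by the math):

```python
import numpy as np

ALPHA = 0.01  # assumed slope for x <= 0

def leaky_relu(x, alpha=ALPHA):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=ALPHA):
    # gradient: a vector of 1's where x > 0 and alpha's elsewhere
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(leaky_relu_grad(x))  # [0.01 0.01 0.01 1.   1.  ]
```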
Sigmoid
Its gradient is
$$
\sigma'(x) = \sigma(x)\,(1 - \sigma(x)), \quad \text{where } \sigma(x) = \frac{1}{1 + e^{-x}}
$$
Remember! This is a vector! So we are performing this operation on each of the elements.
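A small sketch with a finite-difference check that $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ holds element-wise (the values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # element-wise sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(sigmoid_grad(x), numeric))  # True
```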
Softmax
Usually used on the output layer.
Its Jacobian is
$$
\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)
$$
where $s = \text{softmax}(z)$ and $\delta_{ij}$ is the Kronecker delta.
Thing is, if we pair Softmax with cross entropy loss, this gradient simplifies to:
$$
\frac{\partial L}{\partial z} = s - y
$$
Which is why you end up seeing that nn.CrossEntropyLoss has a built-in softmax (it combines LogSoftmax and NLLLoss) and expects raw logits.
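A quick sanity check of that, using the functional form F.cross_entropy (the logits and target here are made up): cross entropy on raw logits gives the same value as log_softmax followed by NLL loss.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw scores, no softmax applied
target = torch.tensor([0])                 # class index

ce  = F.cross_entropy(logits, target)
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(torch.allclose(ce, nll))  # True
```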
We have:
- $s = \text{softmax}(z)$, so $s_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$
- Loss: $L = -\sum_i y_i \log s_i$ (cross entropy, with one-hot target $y$)
- Goal: find $\frac{\partial L}{\partial z_j}$, where $j$ denotes an arbitrary index of $z$ (we need to find all $\frac{\partial L}{\partial z_j}$ to form $\nabla_z L$)
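To make the setup concrete, a tiny NumPy example of $s$ and $L$ (the numbers are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])  # logits
y = np.array([0.0, 1.0, 0.0])  # one-hot target
s = softmax(z)

L = -np.sum(y * np.log(s))     # cross entropy loss
print(s, L)
```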
Chain rule to get gradient w.r.t. pre-softmax $z_j$:
$$
\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial s_i}\,\frac{\partial s_i}{\partial z_j}
$$
Substitute what we know, $\frac{\partial L}{\partial s_i} = -\frac{y_i}{s_i}$:
$$
\frac{\partial L}{\partial z_j} = -\sum_i \frac{y_i}{s_i}\,\frac{\partial s_i}{\partial z_j}
$$
Split the sum into two cases, $i = j$ and $i \neq j$:
$$
\frac{\partial L}{\partial z_j} = -\frac{y_j}{s_j}\,\frac{\partial s_j}{\partial z_j} \;-\; \sum_{i \neq j} \frac{y_i}{s_i}\,\frac{\partial s_i}{\partial z_j}
$$
The Jacobian of softmax is:
$$
\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)
$$
this is because we have $\frac{\partial s_i}{\partial z_i} = s_i(1 - s_i)$ on the diagonal, just like sigmoid's $\sigma(1-\sigma)$!!!! All the activation functions' gradients correlate with their own outputs like this, it's how the whole thing connects together.
so
$$
\frac{\partial s_i}{\partial z_j} = \begin{cases} s_i\,(1 - s_i) & i = j \\ -\,s_i\,s_j & i \neq j \end{cases}
$$
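A NumPy sketch (values arbitrary) that builds the Jacobian from $s_i(\delta_{ij} - s_j)$ and checks it against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    # J[i, j] = s_i * (delta_ij - s_j)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# finite-difference check, column by column
eps = 1e-6
J_num = np.zeros_like(J)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J, J_num))  # True
```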
Anyways, back to it
So:
$$
\frac{\partial L}{\partial z_j} = -\frac{y_j}{s_j}\,s_j(1 - s_j) \;-\; \sum_{i \neq j} \frac{y_i}{s_i}\,(-\,s_i s_j)
$$
First term: $-\frac{y_j}{s_j}\,s_j(1 - s_j) = -y_j(1 - s_j) = -y_j + y_j s_j$
Second term: $-\sum_{i \neq j} \frac{y_i}{s_i}\,(-\,s_i s_j) = \sum_{i \neq j} y_i\,s_j$
Combine:
$$
\frac{\partial L}{\partial z_j} = -y_j + y_j s_j + \sum_{i \neq j} y_i\,s_j = -y_j + s_j \sum_i y_i
$$
here $y_j s_j$ is just inserted back into the summation (so it runs over all $i$) and…
Since $y$ is one-hot encoded, $\sum_i y_i = 1$:
$$
\frac{\partial L}{\partial z_j} = s_j - y_j
$$
In vector form:
$$
\nabla_z L = s - y
$$
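And a quick autograd check of the result (arbitrary values; a single example, so the default mean reduction doesn't rescale the gradient):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)  # logits, batch of 1
target = torch.tensor([1])                               # true class index

loss = F.cross_entropy(z, target)
loss.backward()

s = F.softmax(z.detach(), dim=1)
y = F.one_hot(target, num_classes=3).float()
print(torch.allclose(z.grad, s - y))  # True: the gradient is softmax(z) - y
```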
