10-Class Classification
- You are given many instances $(x_i, y_i)$
- Ex: You are given many pairs of animal images and their names
- An example could be a picture of a dog and the label “dog”
- $x_i$ will be the image of the dog, $y_i$ will be the label “dog”
- If there are 10 classes (10 animals), you can represent each label as a 10-dimensional one-hot vector: a 1 in the position of the correct class and 0 everywhere else
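The label encoding in the last bullet can be sketched in a few lines of plain Python. The class names and the helper `one_hot` are hypothetical; the notes only say there are 10 animals.

```python
# Hypothetical class list; the notes only specify "10 animals".
CLASSES = ["dog", "cat", "bird", "fish", "horse",
           "cow", "sheep", "pig", "duck", "frog"]

def one_hot(label, classes=CLASSES):
    """Return a list with 1.0 at the index of `label` and 0.0 elsewhere."""
    index = classes.index(label)  # raises ValueError for an unknown label
    return [1.0 if j == index else 0.0 for j in range(len(classes))]

y = one_hot("dog")  # "dog" is class 0 in this hypothetical ordering
```

The ordering of the classes is arbitrary, but it has to stay fixed, because position $j$ in the vector is what ties a label to row $j$ of the weight matrix.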

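Equation: $\hat{y} = \mathrm{softmax}(Wx)$, where $x$ is the input written as a vector and $W$ is the weight matrix with one row $w_j$ per class ($j = 1, \dots, 10$).

Goal: Let’s say you input $x_i$, you want $y_i$’s probability to be the highest (here $y_i$ also denotes the index of the correct class).
After the Softmax function, the $j$th element of the output is:
$$\mathrm{softmax}(Wx_i)_j = \frac{e^{w_j^\top x_i}}{\sum_{k=1}^{10} e^{w_k^\top x_i}}$$
Given $(x_i, y_i)$, we want to maximize the value above for $j = y_i$. This maximizes the probability of the correct class, since the Softmax function returns probabilities:
$$\max_W \; \frac{e^{w_{y_i}^\top x_i}}{\sum_{k=1}^{10} e^{w_k^\top x_i}}$$
This is equivalent to minimizing the negative log of that probability:
$$\min_W \; -w_{y_i}^\top x_i + \log \sum_{k=1}^{10} e^{w_k^\top x_i}$$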
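As a rough numerical check of the Softmax probability and its negative-log form, here is a minimal sketch in plain Python. It assumes a score of the form $w_j^\top x$ with no bias term, and uses a toy problem with 3 classes and 2 features instead of 10 classes; the names `class_scores` and `loss` are illustrative.

```python
import math

def class_scores(W, x):
    """One score per class: the dot product of each row of W with x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Turn scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow, result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(W, x, y):
    """Negative log-probability of class y: -s_y + log(sum_k exp(s_k))."""
    s = class_scores(W, x)
    m = max(s)
    return -s[y] + m + math.log(sum(math.exp(v - m) for v in s))

# Toy setup (an assumption for illustration): 3 classes, 2 features.
W = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
p = softmax(class_scores(W, x))
```

Here the scores are 0, 1 and 2, so class 2 gets the largest probability, and `loss(W, x, 2)` agrees with `-math.log(p[2])` up to rounding.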
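What is the variable you optimize over? It’s your entire weight matrix $W$: you want to lower the scores of the incorrect rows while raising the score of the correct row.
Therefore we have to find how the loss $L = -w_{y_i}^\top x_i + \log \sum_k e^{w_k^\top x_i}$ changes with each row $w_j$ of $W$, i.e. $\nabla_{w_j} L$ for every $j$.
Claim: Only $w_{y_i}$ needs a special update. The other rows appear only in the denominator of the Softmax, whereas $w_{y_i}$ appears in both the numerator and the denominator. Proof:
$$\nabla_{w_{y_i}} L = -x_i + \frac{e^{w_{y_i}^\top x_i}}{\sum_k e^{w_k^\top x_i}}\, x_i = (p_{y_i} - 1)\, x_i, \quad \text{where } p_j = \frac{e^{w_j^\top x_i}}{\sum_k e^{w_k^\top x_i}}$$
Now if we were to take the gradient with respect to some $w_j$ where $j$ does not correspond to $y_i$:
$$\nabla_{w_j} L = \frac{e^{w_j^\top x_i}}{\sum_k e^{w_k^\top x_i}}\, x_i = p_j\, x_i$$
Note: Remember the chain rule. Notice that this gradient has the same form for each and every $w_j$ where $j \neq y_i$ (only the value of $p_j$ changes). This is why you can write the gradient as a piecewise function with only two unique cases, $j = y_i$ and $j \neq y_i$.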
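The two-case gradient can be checked against finite differences. The sketch below is plain Python on a toy problem (3 classes, 2 features, made-up numbers), assuming the loss is the negative log of the Softmax probability of the correct class with scores $w_j^\top x$; `grad_rows` and `numeric_grad` are illustrative names.

```python
import math

def class_scores(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # stability shift, does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(W, x, y):
    """-log softmax(Wx)_y, written as -s_y + log(sum_k exp(s_k))."""
    s = class_scores(W, x)
    m = max(s)
    return -s[y] + m + math.log(sum(math.exp(v - m) for v in s))

def grad_rows(W, x, y):
    """Row j is (p_j - 1) * x when j == y, and p_j * x otherwise."""
    p = softmax(class_scores(W, x))
    return [[(p[j] - (1.0 if j == y else 0.0)) * v for v in x]
            for j in range(len(W))]

def numeric_grad(W, x, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    G = []
    for j in range(len(W)):
        row = []
        for k in range(len(x)):
            W_plus = [r[:] for r in W]
            W_minus = [r[:] for r in W]
            W_plus[j][k] += eps
            W_minus[j][k] -= eps
            row.append((loss(W_plus, x, y) - loss(W_minus, x, y)) / (2 * eps))
        G.append(row)
    return G

# Toy values, chosen arbitrarily for the check.
W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.5]]
x = [1.0, 2.0]
y = 1
analytic = grad_rows(W, x, y)
numeric = numeric_grad(W, x, y)
max_diff = max(abs(a - n)
               for ra, rn in zip(analytic, numeric)
               for a, n in zip(ra, rn))
```

`max_diff` should be tiny, which supports the claim that only the row for the correct class needs the extra $-x$ term. The rows of the gradient also sum to zero in each column, because the probabilities sum to 1.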
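What about SGD with multiple examples?
Place the gradients $\nabla_{w_j} L$ as rows in a gradient matrix with the same shape as $W$. With multiple examples, sum or average these matrices over the batch, then take a single step on $W$.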
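One way this could look in code is sketched below in plain Python. It assumes the per-example gradients are averaged over the batch and uses a fixed learning rate `lr`; the batch, the learning rate and the function names are illustrative choices, not something the notes specify.

```python
import math

def class_scores(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # stability shift, does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grad_matrix(W, x, y):
    """Per-example gradient, one row per class, same shape as W."""
    p = softmax(class_scores(W, x))
    return [[(p[j] - (1.0 if j == y else 0.0)) * v for v in x]
            for j in range(len(W))]

def mean_loss(W, batch):
    """Average negative log-probability of the correct class."""
    return sum(-math.log(softmax(class_scores(W, x))[y])
               for x, y in batch) / len(batch)

def sgd_step(W, batch, lr=0.1):
    """Average the per-example gradient matrices, then W <- W - lr * G."""
    G = [[0.0] * len(W[0]) for _ in W]
    for x, y in batch:
        g = grad_matrix(W, x, y)
        for j in range(len(W)):
            for k in range(len(W[0])):
                G[j][k] += g[j][k] / len(batch)
    return [[W[j][k] - lr * G[j][k] for k in range(len(W[0]))]
            for j in range(len(W))]

# Toy batch (made up): 3 classes, 2 features, labels given as class indices.
batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
W = [[0.0, 0.0] for _ in range(3)]
loss_before = mean_loss(W, batch)
for _ in range(100):
    W = sgd_step(W, batch)
loss_after = mean_loss(W, batch)
```

With all-zero weights every class has probability 1/3, so `loss_before` equals `log(3)`, and the loss should drop as the steps are applied. In practice each step would use a different random batch; a single fixed batch is used here only to keep the sketch short.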