10-Class Classification

  • You are given many instances
    • Ex: You are given many pairs $(x, y)$: images of animals and their names
    • An example could be a picture of a dog and the label “dog”
    • $x$ will be the image of the dog, $y$ will be the key “dog”
    • If there are 10 classes (10 animals), you can represent each label as a 10-dimensional vector, for example a one-hot vector with 1 in the position of the correct class and 0 everywhere else
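As a small illustration of the 10-dimensional label vector, here is a minimal sketch of a one-hot encoding. The class names and the helper `one_hot` are hypothetical; the notes only say there are 10 animals.

```python
import numpy as np

# Hypothetical class names; the notes only specify "10 animals".
CLASSES = ["dog", "cat", "bird", "fish", "horse",
           "cow", "sheep", "pig", "frog", "deer"]

def one_hot(label, classes=CLASSES):
    """Return a vector with 1.0 at the label's index and 0.0 elsewhere."""
    vec = np.zeros(len(classes))
    vec[classes.index(label)] = 1.0
    return vec

y = one_hot("dog")  # 10-dimensional vector with a single 1 at index 0
```

In practice the label is often stored as the integer index alone, since the one-hot vector only selects one row or one entry.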

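Equation:

$$\hat{y} = \mathrm{softmax}(Wx), \qquad W \in \mathbb{R}^{10 \times d}$$

where $x \in \mathbb{R}^d$ is the input, $w_i^\top$ is the $i$th row of $W$, and $\hat{y}$ is the vector of 10 predicted class probabilities.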

Goal: Let’s say you input $x$; you want the probability of its label $y$ to be the highest.

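After the Softmax function, the $i$th element of the output is:

$$p_i = \frac{e^{w_i^\top x}}{\sum_{j=1}^{10} e^{w_j^\top x}}$$

where $w_i^\top$ is the $i$th row of the weight matrix $W$.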
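The Softmax output described above can be sketched in a few lines of NumPy. The input size (4 features) and the random values are assumptions made only for illustration; subtracting the maximum score is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Map a vector of scores to positive values that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow; cancels in the ratio
    exps = np.exp(shifted)
    return exps / np.sum(exps)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))  # 10 classes, 4 input features (assumed size)
x = rng.normal(size=4)        # one input vector

p = softmax(W @ x)            # p[i] is the probability assigned to class i
```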

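Given a pair $(x, y)$, we want to maximize the value above for $i = y$. Since the Softmax function returns probabilities, this maximizes the probability of the correct class:

$$\max_W \; \frac{e^{w_y^\top x}}{\sum_j e^{w_j^\top x}}$$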

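This is equivalent to minimizing the negative log-probability, since the logarithm is monotonically increasing:

$$\min_W \; \ell(W) = -\log \frac{e^{w_y^\top x}}{\sum_j e^{w_j^\top x}} = -w_y^\top x + \log \sum_j e^{w_j^\top x}$$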
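To make the equivalence concrete, here is a minimal sketch of the negative log-probability written as "minus the correct score plus the log of the sum of exponentials". The function name `neg_log_prob` and the array sizes are assumptions for illustration.

```python
import numpy as np

def neg_log_prob(W, x, y):
    """Negative log of the Softmax probability of class y for input x."""
    scores = W @ x                       # scores[i] = w_i . x
    m = np.max(scores)                   # shift for numerical stability
    log_sum_exp = m + np.log(np.sum(np.exp(scores - m)))
    return -scores[y] + log_sum_exp

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))  # 10 classes, 4 features (assumed size)
x = rng.normal(size=4)
loss = neg_log_prob(W, x, y=3)  # small when class 3 has high probability
```

Raising the probability of the correct class lowers this value, so maximizing one is the same as minimizing the other.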

What do you optimize over? Your entire weight matrix $W$: you want to lower the scores $w_j^\top x$ of the incorrect rows while raising the score $w_y^\top x$ of the correct row.

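Therefore we have to find how the loss changes with respect to each row $w_i$ of $W$, that is, its gradient with respect to $w_i$.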

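Claim: Only $w_y$ needs a special update. Every other row appears only in the denominator, while $w_y$ appears in both the numerator and the denominator. Proof:

$$\nabla_{w_y} \ell = \nabla_{w_y}\left(-w_y^\top x + \log \sum_j e^{w_j^\top x}\right) = -x + \frac{e^{w_y^\top x}}{\sum_j e^{w_j^\top x}}\, x$$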

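Now if we were to take the gradient with respect to some $w_i$ where $i$ does not correspond to $y$:

$$\nabla_{w_i}\left(-w_y^\top x + \log \sum_j e^{w_j^\top x}\right) = \frac{e^{w_i^\top x}}{\sum_j e^{w_j^\top x}}\, x$$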

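Note: Remember the chain rule. Notice that this gradient has the same form for every $w_i$ where $i \neq y$ (only the probability in front of $x$ changes). This is why you can write the gradient as a piecewise function with only two unique cases, where $i = y$ and where $i \neq y$:

$$\nabla_{w_i} \ell = \begin{cases} (p_i - 1)\, x & \text{if } i = y \\ p_i\, x & \text{if } i \neq y \end{cases} \qquad \text{with } p_i = \frac{e^{w_i^\top x}}{\sum_j e^{w_j^\top x}}$$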
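The two-case gradient can be sketched as follows: every row gets "its probability times the input", and the correct row additionally has the input subtracted. The function name `loss_gradient` and the array sizes are assumptions for illustration.

```python
import numpy as np

def loss_gradient(W, x, y):
    """Gradient of the negative log-probability of class y w.r.t. W.

    Row i of the result is the gradient with respect to row i of W.
    """
    scores = W @ x
    p = np.exp(scores - np.max(scores))
    p /= np.sum(p)            # Softmax probabilities
    grad = np.outer(p, x)     # case i != y: row i is p[i] * x
    grad[y] -= x              # case i == y: row y is (p[y] - 1) * x
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))  # 10 classes, 4 features (assumed size)
x = rng.normal(size=4)
G = loss_gradient(W, x, y=3)  # same shape as W
```

A quick sanity check is to compare `G` against finite differences of the loss; the rows of `G` also sum to the zero vector because the probabilities sum to 1.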

What about SGD with multiple examples?

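Place the gradients as rows in $\nabla_W \ell$, a matrix with the same shape as $W$ whose $i$th row is the gradient with respect to $w_i$.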
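The notes do not spell out the multi-example case, so the following is only a sketch under one common assumption: the minibatch gradient is the average of the per-example gradients, and one SGD step subtracts it scaled by a learning rate. The names `batch_gradient`, `sgd_step`, and `lr` are hypothetical.

```python
import numpy as np

def batch_gradient(W, X, labels):
    """Average gradient over a minibatch (assumed convention).

    X has one example per row, labels holds one class index per example.
    The result has the same shape as W; row i is the gradient for row i of W.
    """
    scores = X @ W.T                                # (n, classes)
    scores -= scores.max(axis=1, keepdims=True)     # stability shift per example
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)               # Softmax per example
    P[np.arange(len(labels)), labels] -= 1.0        # subtract 1 at the true class
    return P.T @ X / len(labels)                    # (classes, features)

def sgd_step(W, X, labels, lr=0.1):
    """One gradient step; lr is an assumed learning rate."""
    return W - lr * batch_gradient(W, X, labels)

rng = np.random.default_rng(1)
W = rng.normal(size=(10, 4))            # 10 classes, 4 features (assumed size)
X = rng.normal(size=(6, 4))             # minibatch of 6 examples
labels = rng.integers(0, 10, size=6)    # one class index per example
W_new = sgd_step(W, X, labels)
```

Writing the batch as a matrix product keeps the "gradients as rows" layout while avoiding a Python loop over examples.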