Multi-Class Classification

The Probabilistic Interpretation

  • Models the probability of each class: $p(y = k \mid x) \propto \exp(w_k^\top x + b_k)$
  • And thus $p(y = k \mid x) = \dfrac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{K} \exp(w_j^\top x + b_j)}$
  • …or, more compactly: $p(y \mid x) = \mathrm{softmax}(Wx + b)$

Maximizing the Data Likelihood

  • How to measure the likelihood of the observed data?
  • Assume: the samples $(x_i, y_i)$, $i = 1, \dots, n$, are independently generated
  • The probability of getting the observed data $\{(x_i, y_i)\}_{i=1}^{n}$:
    • $\prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(y_i \mid x_i)\, p(x_i)$ (derived from the definition of conditional probability)
  • Note that the $p(x_i)$ are constants determined by the data distribution (so we do not need to model them)
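Since the $p(x_i)$ terms do not depend on the model parameters, they can be pulled out of the maximization; a short derivation (under the notation above) makes this explicit:

```latex
\max_{W,b} \prod_{i=1}^{n} p(y_i \mid x_i)\, p(x_i)
= \underbrace{\Bigl(\prod_{i=1}^{n} p(x_i)\Bigr)}_{\text{constant in } W, b}
  \cdot \max_{W,b} \prod_{i=1}^{n} p(y_i \mid x_i)
\;\;\Longleftrightarrow\;\;
\max_{W,b} \prod_{i=1}^{n} p(y_i \mid x_i)
```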

Maximizing the Data Likelihood

  • Maximum likelihood: $\max_{W, b} \prod_{i=1}^{n} p(y_i \mid x_i)$, or equivalently $\max_{W, b} \sum_{i=1}^{n} \log p(y_i \mid x_i)$
  • How can we tailor this objective function for logistic regression?

Tailoring

Softmax Function

Given output logits $z = Wx + b \in \mathbb{R}^{K}$, the softmax function returns output probabilities $\mathrm{softmax}(z)_k = \dfrac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$ (for a batch of inputs, a matrix of probabilities with one row per example).
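A minimal sketch of the softmax function in numpy (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution.

    Subtracting max(z) before exponentiating avoids overflow without
    changing the result, since softmax is invariant to shifting all
    logits by a constant.
    """
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Probabilities sum to 1 and preserve the ordering of the logits
probs = softmax([2.0, 1.0, 0.1])
```

The max-subtraction trick is standard practice: without it, a large logit such as 1000 would overflow `np.exp`.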

Softmax vs. Sigmoid

The softmax function is equivalent to the sigmoid function in the binary case: with two logits, $\mathrm{softmax}(z)_1 = \dfrac{e^{z_1}}{e^{z_1} + e^{z_2}} = \dfrac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$.
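This equivalence can be checked numerically; a small sketch (all names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# With two logits z1, z2, softmax assigns class 1 the probability
# sigmoid(z1 - z2): binary softmax is sigmoid of the logit gap.
z1, z2 = 1.3, -0.4
p_softmax = softmax([z1, z2])[0]
p_sigmoid = sigmoid(z1 - z2)
```

This is why binary logistic regression needs only a single logit: the second one can be fixed at zero without loss of generality.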

Training Objective for Multi-Class Classification

  • Maximum likelihood: $\max_{W, b} \sum_{i=1}^{n} \log \mathrm{softmax}(Wx_i + b)_{y_i}$
  • This is equivalent to minimizing the cross-entropy loss (often used in PyTorch, e.g. `torch.nn.CrossEntropyLoss`)
  • Cross entropy between distributions $P$ and $Q$: $H(P, Q) = -\sum_{x} P(x) \log Q(x)$
  • Given the estimated distribution $\hat{y}$ (returned by the softmax function) and the empirical one-hot distribution of the true label, their cross entropy serves as the loss: it reduces to $-\log \hat{y}_{y_i}$, so minimizing the cross entropy is equivalent to maximizing the likelihood.

SGD with Cross Entropy Loss

Loss function (for one sample $(x, y)$, with $\hat{y} = \mathrm{softmax}(Wx + b)$): $L(W, b) = -\log \hat{y}_{y}$

Gradient w.r.t. $W$: $\nabla_W L = (\hat{y} - e_y)\, x^\top$, where $e_y$ is the one-hot vector of the true label $y$. Gradient w.r.t. $b$: $\nabla_b L = \hat{y} - e_y$
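A sketch verifying these gradient formulas against a finite-difference estimate (dimensions, seed, and names are arbitrary choices for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(W, b, x, y):
    # Negative log-likelihood of the true class y
    return -np.log(softmax(W @ x + b)[y])

rng = np.random.default_rng(0)
K, D = 3, 4                       # classes, input features
W = rng.normal(size=(K, D))
b = rng.normal(size=K)
x = rng.normal(size=D)
y = 1                             # true class index

# Analytic gradients: with y_hat = softmax(Wx + b) and one-hot e_y,
#   dL/dW = (y_hat - e_y) x^T,   dL/db = y_hat - e_y
y_hat = softmax(W @ x + b)
delta = y_hat.copy()
delta[y] -= 1.0
grad_W = np.outer(delta, x)
grad_b = delta

# Finite-difference check on one entry of W
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = (loss(W_pert, b, x, y) - loss(W, b, x, y)) / eps
```

Since $\hat{y}$ sums to 1 and $e_y$ sums to 1, the entries of $\nabla_b L$ always sum to zero, which is a quick sanity check on an implementation.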

Pseudocode

Initialization: $W \leftarrow 0$, $b \leftarrow 0$
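The pseudocode on this slide is truncated in the extraction; what follows is a hedged end-to-end sketch of the training loop it describes, on synthetic data (the blob centers, learning rate, epoch count, and zero initialization are assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 3, 2, 300               # classes, features, samples

# Synthetic data: three Gaussian blobs, one per class
centers = np.array([[0, 2], [2, -1], [-2, -1]], dtype=float)
X = np.vstack([centers[k] + rng.normal(size=(N // K, D)) for k in range(K)])
y = np.repeat(np.arange(K), N // K)

# Initialization: zero weights and biases
W = np.zeros((K, D))
b = np.zeros(K)
lr = 0.1

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# SGD: visit samples in random order, apply the gradients derived above
for epoch in range(20):
    for i in rng.permutation(N):
        y_hat = softmax(W @ X[i] + b)
        delta = y_hat.copy()
        delta[y[i]] -= 1.0                 # y_hat - one_hot(y_i)
        W -= lr * np.outer(delta, X[i])    # step on W
        b -= lr * delta                    # step on b

# Training accuracy of the fitted classifier
accuracy = np.mean([np.argmax(W @ X[i] + b) == y[i] for i in range(N)])
```

Because the cross-entropy objective is convex in $(W, b)$, zero initialization is safe here; random initialization matters only once hidden layers are added.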