Multi-Class Classification

The Probabilistic Interpretation

  • Models the probability of each class: $p(y = k \mid x) \propto \exp(w_k^\top x + b_k)$
  • And thus $p(y = k \mid x) = \dfrac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{K} \exp(w_j^\top x + b_j)}$
  • …or, more compactly: $p(y \mid x) = \mathrm{softmax}(Wx + b)$

Maximizing the Data Likelihood

  • How to measure the likelihood of the observed data?
  • Assume: the samples $(x_i, y_i)$, $i = 1, \dots, n$, are independently generated
  • The probability of getting the observed data $\{(x_i, y_i)\}_{i=1}^{n}$:
    • $\prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(y_i \mid x_i)\, p(x_i)$ (derived from the definition of conditional probability)
  • Note that the $p(x_i)$ are constants determined by the data distribution (so we do not need to model them)
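Since the $p(x_i)$ terms do not depend on the model parameters, they can be pulled out of the maximization; a short derivation (under the notation above) makes this explicit:

```latex
\max_{W,b} \prod_{i=1}^{n} p(y_i \mid x_i)\, p(x_i)
= \underbrace{\Bigl(\prod_{i=1}^{n} p(x_i)\Bigr)}_{\text{constant in } W, b}
  \cdot \max_{W,b} \prod_{i=1}^{n} p(y_i \mid x_i)
\;\;\Longleftrightarrow\;\;
\max_{W,b} \prod_{i=1}^{n} p(y_i \mid x_i)
```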

Maximizing the Data Likelihood

  • Maximum likelihood: $\max_{W, b} \prod_{i=1}^{n} p(y_i \mid x_i)$, or equivalently $\max_{W, b} \sum_{i=1}^{n} \log p(y_i \mid x_i)$
  • How can we tailor this objective function for logistic regression?

Tailoring

Softmax Function

Given output logits $z = Wx + b \in \mathbb{R}^{K}$, the softmax function returns output probabilities $\mathrm{softmax}(z)_k = \dfrac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$ (for a batch of inputs, a matrix of probabilities with one row per example).
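A minimal sketch of the softmax function in numpy (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into a probability distribution.

    Subtracting max(z) before exponentiating avoids overflow without
    changing the result, since softmax is invariant to shifting all
    logits by a constant.
    """
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Probabilities sum to 1 and preserve the ordering of the logits
probs = softmax([2.0, 1.0, 0.1])
```

The max-subtraction trick is standard practice: without it, a large logit such as 1000 would overflow `np.exp`.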

Softmax vs. Sigmoid

The softmax function is equivalent to the sigmoid function in the binary case: with two logits, $\mathrm{softmax}(z)_1 = \dfrac{e^{z_1}}{e^{z_1} + e^{z_2}} = \dfrac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$.
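This equivalence can be checked numerically; a small sketch (all names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# With two logits z1, z2, softmax assigns class 1 the probability
# sigmoid(z1 - z2): binary softmax is sigmoid of the logit gap.
z1, z2 = 1.3, -0.4
p_softmax = softmax([z1, z2])[0]
p_sigmoid = sigmoid(z1 - z2)
```

This is why binary logistic regression needs only a single logit: the second one can be fixed at zero without loss of generality.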

Training Objective for Multi-Class Classification

  • Maximum likelihood: $\max_{W, b} \sum_{i=1}^{n} \log \mathrm{softmax}(Wx_i + b)_{y_i}$
  • This is equivalent to minimizing the cross-entropy loss (often used in PyTorch, e.g. `torch.nn.CrossEntropyLoss`)
  • Cross entropy between distributions $P$ and $Q$: $H(P, Q) = -\sum_{x} P(x) \log Q(x)$
  • Given the estimated distribution $\hat{y}$ (returned by the softmax function) and the empirical one-hot distribution of the true label, their cross entropy serves as the loss: it reduces to $-\log \hat{y}_{y_i}$, so minimizing the cross entropy is equivalent to maximizing the likelihood.

SGD with Cross Entropy Loss

Loss function (for one sample $(x, y)$, with $\hat{y} = \mathrm{softmax}(Wx + b)$): $L(W, b) = -\log \hat{y}_{y}$

Gradient w.r.t. $W$: $\nabla_W L = (\hat{y} - e_y)\, x^\top$, where $e_y$ is the one-hot vector of the true label $y$. Gradient w.r.t. $b$: $\nabla_b L = \hat{y} - e_y$
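A sketch verifying these gradient formulas against a finite-difference estimate (dimensions, seed, and names are arbitrary choices for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(W, b, x, y):
    # Negative log-likelihood of the true class y
    return -np.log(softmax(W @ x + b)[y])

rng = np.random.default_rng(0)
K, D = 3, 4                       # classes, input features
W = rng.normal(size=(K, D))
b = rng.normal(size=K)
x = rng.normal(size=D)
y = 1                             # true class index

# Analytic gradients: with y_hat = softmax(Wx + b) and one-hot e_y,
#   dL/dW = (y_hat - e_y) x^T,   dL/db = y_hat - e_y
y_hat = softmax(W @ x + b)
delta = y_hat.copy()
delta[y] -= 1.0
grad_W = np.outer(delta, x)
grad_b = delta

# Finite-difference check on one entry of W
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = (loss(W_pert, b, x, y) - loss(W, b, x, y)) / eps
```

Since $\hat{y}$ sums to 1 and $e_y$ sums to 1, the entries of $\nabla_b L$ always sum to zero, which is a quick sanity check on an implementation.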

Pseudocode

Initialization: $W \leftarrow 0$, $b \leftarrow 0$
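The pseudocode on this slide is truncated in the extraction; what follows is a hedged end-to-end sketch of the training loop it describes, on synthetic data (the blob centers, learning rate, epoch count, and zero initialization are assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 3, 2, 300               # classes, features, samples

# Synthetic data: three Gaussian blobs, one per class
centers = np.array([[0, 2], [2, -1], [-2, -1]], dtype=float)
X = np.vstack([centers[k] + rng.normal(size=(N // K, D)) for k in range(K)])
y = np.repeat(np.arange(K), N // K)

# Initialization: zero weights and biases
W = np.zeros((K, D))
b = np.zeros(K)
lr = 0.1

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# SGD: visit samples in random order, apply the gradients derived above
for epoch in range(20):
    for i in rng.permutation(N):
        y_hat = softmax(W @ X[i] + b)
        delta = y_hat.copy()
        delta[y[i]] -= 1.0                 # y_hat - one_hot(y_i)
        W -= lr * np.outer(delta, X[i])    # step on W
        b -= lr * delta                    # step on b

# Training accuracy of the fitted classifier
accuracy = np.mean([np.argmax(W @ X[i] + b) == y[i] for i in range(N)])
```

Because the cross-entropy objective is convex in $(W, b)$, zero initialization is safe here; random initialization matters only once hidden layers are added.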