Lecture 5

Regression vs. Classification

Regression: observe a real-valued input x and predict a real-valued target y.

Classification: observe a real-valued input x and predict a categorical (discrete) target y.

Examples of Classification Problems

Text Classification

Classify the sentiment of an online movie review (positive, neutral, or negative).

Classification Example

MNIST Dataset (10 classes total): classify handwritten digits 0 through 9.

To begin, let’s try a simple binary case: distinguish the digit 1 from the digit 5.

First, represent each image as a real-valued input x (feature extraction).
Possible Features:
- Raw number of pixels
- Strokes
- Symmetry
After extracting features, you can draw a line in feature space to separate the two classes.
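As a rough illustration of feature extraction, the sketch below turns a tiny binary image into a two-dimensional input x. The `intensity` and `symmetry` functions and the 5x5 toy images are assumptions made for this example; they are not the lecture's feature definitions and the images are not MNIST data.

```python
def intensity(image):
    """Fraction of pixels that are switched on (a stand-in for the amount of ink)."""
    total = sum(len(row) for row in image)
    return sum(sum(row) for row in image) / total


def symmetry(image):
    """1.0 for a perfectly left-right symmetric image, lower otherwise."""
    total = sum(len(row) for row in image)
    mismatches = sum(
        abs(a - b) for row in image for a, b in zip(row, reversed(row))
    )
    return 1.0 - mismatches / total


# Toy 5x5 binary images (hypothetical data, not taken from MNIST).
one = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
five = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

# Each image becomes a real-valued feature vector x = (intensity, symmetry).
x_one = [intensity(one), symmetry(one)]
x_five = [intensity(five), symmetry(five)]
print(x_one, x_five)
```

On these toy images the "1" has less ink and is more symmetric than the "5", so the two points land in different regions of the feature plane.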

$$f(x) = \mathrm{Sigmoid}(w^{T} x) \implies \begin{cases} s_{1} & \text{score for the } +1 \text{ class} \\ s_{-1} & \text{score for the } -1 \text{ class} \end{cases}$$

Note: $s_{1}$ represents $P(y = +1 \mid x)$, $s_{-1}$ represents $P(y = -1 \mid x)$, and $s_{-1} = 1 - s_{1}$.

Sigmoid Function

Sigmoid is defined as:

$$\mathrm{Sigmoid}(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \in (0, 1)$$

Note how Sigmoid(-z) relates to Sigmoid(z):

$$\mathrm{Sigmoid}(-z) = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{z}} = 1 - \mathrm{Sigmoid}(z) \in (0, 1)$$
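The two equivalent forms of the Sigmoid are also useful numerically. The sketch below is one common way to evaluate it without overflow, and it prints values showing that Sigmoid(z) and Sigmoid(-z) sum to 1. The function name `sigmoid` and the sample inputs are choices made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable Sigmoid(z) = 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^{z} / (1 + e^{z})
    # so that math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-3.0, 0.0, 3.0):
    # The last column is Sigmoid(z) + Sigmoid(-z), which should be 1.
    print(z, sigmoid(z), sigmoid(-z), sigmoid(z) + sigmoid(-z))
```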

What makes Sigmoid Function Good?

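Our labels are binary: $y_{i} \in \{+1, -1\}$

$$D = \{(x_{1}, y_{1}), \ldots, (x_{N}, y_{N})\}$$

Our model is good if $f(x_{i}) = \mathrm{Sigmoid}(w^{T} x_{i})$ is close to 1 when $y_{i} = +1$ and close to 0 when $y_{i} = -1$. The Sigmoid never reaches 0 or 1 exactly, so these are values to approach rather than to hit.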

The Logistic Loss Function

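Logistic regression uses the logistic loss as its objective function:

$$\min_{w} L(w) = \frac{1}{N} \sum_{i=1}^{N} \ln\left(1 + e^{-y_{i} (w^{T} x_{i})}\right)$$

It looks complicated, but it has an intuitive probability interpretation and is easy to calculate: the model assigns probability $\mathrm{Sigmoid}(y_{i} w^{T} x_{i})$ to the correct label $y_{i}$, and each term of the sum is the negative log of that probability.

Minimizing the loss therefore pushes the Sigmoid output toward the correct label for every training point.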
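To see how the loss rewards correct outputs, the sketch below evaluates the logistic loss on a two-point toy dataset for a weight vector that classifies both points correctly and for one that gets both wrong. The helper `log1p_exp`, the dataset, and the two weight vectors are assumptions made for this illustration.

```python
import math


def log1p_exp(a):
    """Stable evaluation of ln(1 + e^{a})."""
    if a > 0:
        return a + math.log1p(math.exp(-a))
    return math.log1p(math.exp(a))


def logistic_loss(w, data):
    """Average of ln(1 + exp(-y * w^T x)) over all (x, y) pairs."""
    total = 0.0
    for x, y in data:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        total += log1p_exp(-margin)
    return total / len(data)


# Hypothetical toy data: labels are +1 or -1.
data = [([1.0, 2.0], 1), ([1.0, -2.0], -1)]
good_w = [0.0, 3.0]   # gives both points a positive margin
bad_w = [0.0, -3.0]   # gives both points a negative margin

print("loss with good w:", logistic_loss(good_w, data))
print("loss with bad w: ", logistic_loss(bad_w, data))
```

The loss is near zero when every margin $y_{i} w^{T} x_{i}$ is large and positive, and it grows roughly linearly as margins become more negative.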

The Decision Boundary

In 2-d space, the equation $w^{T} x = 0$ defines a line that separates the space into a $+1$ side ($w^{T} x > 0$) and a $-1$ side ($w^{T} x < 0$). Minimizing the loss function finds the optimal w*.
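A small sketch of how a learned w is used at prediction time: the sign of $w^{T} x$ says which side of the boundary a point falls on. The constant first feature (so the line need not pass through the origin), the example weights, and the tie-breaking rule at exactly zero are assumptions of this sketch.

```python
def score(w, x):
    """Inner product w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def predict(w, x):
    """Return +1 on the non-negative side of the line w^T x = 0, else -1."""
    return 1 if score(w, x) >= 0 else -1


# Assumption: x = (1, x1, x2), where the leading 1 acts as a bias feature.
# With these hypothetical weights the boundary is the line x1 + x2 = 1.
w = [-1.0, 1.0, 1.0]

print(predict(w, [1.0, 2.0, 2.0]))   # point above the line
print(predict(w, [1.0, 0.0, 0.0]))   # point below the line
print(score(w, [1.0, 0.5, 0.5]))     # point exactly on the line
```

A score of zero corresponds to a Sigmoid output of 0.5, so the boundary is where the model considers both classes equally likely.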

Example: Find the Optimal w*

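There is no closed-form (analytical) solution to this problem:

$$\min_{w} L(w) = \frac{1}{N} \sum_{i=1}^{N} \ln\left(1 + e^{-y_{i} (w^{T} x_{i})}\right)$$

We must use gradient descent. The gradient of the $i$-th term of the sum is:

$$\frac{\partial}{\partial w} \ln\left(1 + e^{-y_{i} (w^{T} x_{i})}\right) = \frac{1}{1 + e^{-y_{i} (w^{T} x_{i})}} \cdot e^{-y_{i} (w^{T} x_{i})} (-y_{i} x_{i}) = \frac{-y_{i}}{1 + e^{y_{i} (w^{T} x_{i})}} x_{i} = -y_{i} \, \mathrm{Sigmoid}(-y_{i} w^{T} x_{i}) \, x_{i}$$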
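One way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this for a single loss term, using the closed form $-y_{i} \, \mathrm{Sigmoid}(-y_{i} w^{T} x_{i}) \, x_{i}$. The function names, the test point, and the step `eps` are assumptions chosen for this example.

```python
import math


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_i(w, x, y):
    """Single-example logistic loss ln(1 + exp(-y * w^T x))."""
    return math.log(1.0 + math.exp(-y * dot(w, x)))


def grad_i(w, x, y):
    """Analytic gradient: -y * Sigmoid(-y * w^T x) * x."""
    coeff = -y * sigmoid(-y * dot(w, x))
    return [coeff * xj for xj in x]


def numeric_grad(w, x, y, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((loss_i(w_plus, x, y) - loss_i(w_minus, x, y)) / (2 * eps))
    return grad


# Hypothetical test point.
w, x, y = [0.5, -1.0], [2.0, 1.0], -1
analytic = grad_i(w, x, y)
numeric = numeric_grad(w, x, y)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two vectors should agree to several decimal places; a sign error in the exponent of the denominator would show up as a clear mismatch.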

SGD for Logistic Regression

Initialize w(0) at step t = 0 (for example, w(0) = 0)
for t = 0, 1, 2, ...:
	Sample a batch of K data points
	Let gradient = 0
	for each sampled data point (x, y):
		gradient += -y * x * sigmoid(-y * (w(t) transpose * x))
	w(t+1) = w(t) - step size * gradient
	stop if the stopping criterion is met (e.g. a maximum number of steps)
end for loop
return the final parameters
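The loop above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name, the default step size, batch size, and step count, the fixed number of steps as the stopping rule, and the toy dataset are all choices made for this example. The sketch also divides the summed gradient by the batch size, which only rescales the step size relative to the pseudocode.

```python
import math
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def sigmoid(z):
    """Numerically stable Sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_logistic_regression(data, step_size=0.1, batch_size=4,
                            num_steps=500, seed=0):
    """Mini-batch SGD on the logistic loss; labels must be +1 or -1."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim                      # w(0) = 0
    for _ in range(num_steps):           # stopping rule: fixed step budget
        batch = rng.sample(data, min(batch_size, len(data)))
        gradient = [0.0] * dim
        for x, y in batch:
            coeff = -y * sigmoid(-y * dot(w, x))
            for j in range(dim):
                gradient[j] += coeff * x[j]
        for j in range(dim):
            # Average over the batch, then take a descent step.
            w[j] -= step_size * gradient[j] / len(batch)
    return w


# Hypothetical, linearly separable toy data with 2-d features.
data = [
    ([2.0, 1.0], 1), ([3.0, 2.0], 1), ([2.5, 3.0], 1),
    ([-2.0, -1.0], -1), ([-1.0, -3.0], -1), ([-3.0, -2.0], -1),
]
w = sgd_logistic_regression(data)
predictions = [1 if dot(w, x) >= 0 else -1 for x, _ in data]
print("w =", w)
print("predictions:", predictions)
```

On separable data like this the weights keep growing as training continues, because larger margins always reduce the loss; a step budget or a loss threshold is what ends the loop in practice.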

edison zhang
