Simple Nonlinear Prediction
The “shallow” approach: nonlinear feature transformation (often by hand), followed by a linear classifier
Cons:
- The design of a good feature transformation can be tricky
- The number of new features can grow very large
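As a concrete illustration of the shallow approach, here is a minimal sketch using scikit-learn's `PolynomialFeatures` followed by a linear classifier; the toy dataset and the degree-2 feature map are illustrative assumptions, not choices made in these notes:

```python
# Hypothetical illustration of the "shallow" approach:
# hand-chosen nonlinear features followed by a linear classifier.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: points inside a circle are class 1, outside are class 0
# (not linearly separable in the raw features).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# Degree-2 polynomial features (x1, x2, x1^2, x1*x2, x2^2) make the
# problem separable for the linear classifier that follows.
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```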
Multi-layer Neural Network
The deep approach: stack multiple layers of linear transformations interspersed with nonlinearity
Two-layer neural network: $f(x) = W_2\,\sigma(W_1 x + b_1) + b_2$, where $\sigma$ is an elementwise nonlinearity
For an $N$-layer neural network: $f(x) = W_N\,\sigma\bigl(W_{N-1}\,\sigma(\cdots\,\sigma(W_1 x + b_1)\cdots) + b_{N-1}\bigr) + b_N$
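A minimal NumPy sketch of this forward pass; the layer sizes, the ReLU nonlinearity, and the random weights are illustrative assumptions:

```python
import numpy as np

def relu(z):
    # Elementwise nonlinearity applied between the linear layers.
    return np.maximum(0, z)

def mlp_forward(x, weights, biases):
    """Forward pass of an N-layer network: alternate linear maps and nonlinearity.
    No nonlinearity is applied after the final layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

# Example: a 3-layer network mapping R^4 -> R^2 (sizes are arbitrary).
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(mlp_forward(rng.standard_normal(4), weights, biases))
```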
Nonlinearity
Sigmoid:
- Historically very popular
- Squashes numbers to the range $(0, 1)$
- Saturated neurons “kill” the gradients
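For reference, the standard sigmoid and its gradient; the gradient is at most 1/4 and vanishes for large $|x|$, which is the saturation problem noted above:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}$$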
Tanh:
- Squashes numbers to the range $(-1, 1)$
- Still kills gradients when saturated
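For reference, tanh is a rescaled and shifted sigmoid, which is why it saturates in the same way:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1$$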
ReLU:
- Does not saturate (in the positive region)
- Very little computation
- What is the gradient when $x < 0$? Zero, so those neurons stop learning
- Most popular activation in practice
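The definition and its gradient (zero for negative inputs, which is what can make a ReLU neuron "die"):

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \frac{d}{dx}\,\mathrm{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$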
Leaky ReLU:
- Neurons will not die (small nonzero gradient even for $x < 0$)
- In practice, use ReLU on your first try
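A small NumPy sketch of these activations; the 0.01 leak slope is just a common default, not something specified above:

```python
import numpy as np

def sigmoid(x):
    # Squashes to (0, 1); gradients vanish for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes to (-1, 1); still saturates for large |x|.
    return np.tanh(x)

def relu(x):
    # No saturation for x > 0; gradient is exactly 0 for x < 0.
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Small slope for x < 0 keeps a nonzero gradient, so neurons don't "die".
    return np.where(x > 0, x, slope * x)

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```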
Why We Need a Nonlinear Function
If there is no nonlinearity, no matter how many layers you stack, you will still be modeling a linear relationship.
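A one-line worked version of this claim for two layers: composing affine maps just gives another affine map.

$$W_2(W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$$

The same collapse applies no matter how many layers are stacked.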
Proof by Example (XOR)

| $x_1$ | $x_2$ | label |
|---|---|---|
| 0 | 0 | -1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | -1 |
Can we find a two-layer MLP to solve this problem?
Let $\sigma$ denote the nonlinearity for the following derivations
MLP: $f(x) = w_2^{\top}\,\sigma(W_1 x + b_1) + b_2$, where $W_1, b_1, w_2, b_2$ are the parameters we need to choose
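One possible hand-constructed solution, assuming $\sigma = \mathrm{ReLU}$; these particular weights are just one choice that works, and the notes' own derivation may use different values:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-picked parameters for a two-layer MLP that computes XOR with +/-1 labels.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([2.0, -4.0])
b2 = -1.0

def mlp(x):
    h = relu(W1 @ x + b1)      # hidden layer: relu(x1+x2), relu(x1+x2-1)
    return w2 @ h + b2         # output: 2*h1 - 4*h2 - 1

for x1 in (0, 1):
    for x2 in (0, 1):
        y = mlp(np.array([x1, x2], dtype=float))
        print(x1, x2, "->", y)  # prints -1, 1, 1, -1, matching the table
```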
Training Neural Networks
Steps:
- Collect / clean data and labels
- Specify model: select model class / architecture and loss function
- Train model: find the parameters of the model that minimize the empirical loss on the training data
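A minimal sketch of these three steps in PyTorch; the synthetic data, architecture, loss, and optimizer below are illustrative assumptions, not choices made in these notes. It also monitors the validation loss, which the next section discusses:

```python
import torch
from torch import nn

# Step 1 (stand-in): synthetic data split into training and validation sets.
torch.manual_seed(0)
X = torch.randn(1000, 2)
y = (X[:, 0] * X[:, 1] > 0).long()          # a simple nonlinear labeling rule
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

# Step 2: specify the model class/architecture and the loss function.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step 3: train by minimizing the empirical loss on the training data,
# while also monitoring the loss on held-out validation data.
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val)
    if epoch % 20 == 0:
        print(f"epoch {epoch}: train loss {loss.item():.3f}, val loss {val_loss.item():.3f}")
```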
How to Best Find Hyperparameters
Your goal is always a model that generalizes, not one that overfits.
- For training, there are typically two losses you should monitor:
  - Loss on the training data: how well the model fits the data it was trained on
  - Loss on the validation data: how well the model generalizes to unseen data
|        | Simple | Complex |
|---|---|---|
| Low    | Ok     | Underfitting |
| Higher | Overfitting | Ok |
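A tiny illustration of reading the two monitored losses as a diagnostic; the numeric thresholds are arbitrary placeholders, and in practice you compare the two loss curves rather than single numbers:

```python
def diagnose(train_loss: float, val_loss: float,
             high: float = 1.0, gap: float = 0.2) -> str:
    """Rough fit diagnostic from the two monitored losses.
    `high` and `gap` are arbitrary illustrative thresholds."""
    if train_loss > high:
        return "underfitting: the model does not even fit the training data"
    if val_loss - train_loss > gap:
        return "overfitting: fits the training data but does not generalize"
    return "ok: low training loss and a small train/validation gap"

print(diagnose(train_loss=1.5, val_loss=1.6))   # underfitting
print(diagnose(train_loss=0.1, val_loss=0.9))   # overfitting
print(diagnose(train_loss=0.1, val_loss=0.15))  # ok
```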