Language Modeling
During training, given the ground-truth words $w_1, \dots, w_{t-1}$, predict the next word $w_t$.
Example: given "Harry went back and saw", predict "Hermione".
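Equivalently, the model learns the usual chain-rule factorization of the sequence probability, predicting each word from its prefix:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$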

During inference, generate a new word each time and feed it back to the model.
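A minimal sketch of this feedback loop, assuming a hypothetical `model` callable that returns a probability distribution over the next token given the tokens so far:

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens=20, eos_id=None):
    """Autoregressive decoding: predict one token, append it, repeat."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = model(tokens)            # hypothetical: P(next token | tokens so far)
        next_id = int(np.argmax(probs))  # greedy pick for simplicity
        tokens.append(next_id)           # feed the prediction back as input
        if eos_id is not None and next_id == eos_id:
            break
    return tokens
```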
Improved RNN Architecture
Recall the vanilla RNN recurrence formula.
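In its usual form, with input $x_t$, hidden state $h_t$, learned weights $W_{hh}$, $W_{xh}$, and bias $b_h$:

$$h_t = \tanh\left(W_{hh} h_{t-1} + W_{xh} x_t + b_h\right)$$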
Problems:
- vanishing / exploding gradients
- cannot model long-range dependencies
- difficulty with parallelization
There are architectures to alleviate these issues:
- Long short-term memory (LSTM)
- Gated recurrent unit (GRU)
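As one illustration of gating, here are the GRU update equations in one common parameterization (biases omitted); the final gated convex combination gives gradients a more direct path through time:

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$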
Even so, these gated architectures do not fully solve the problems:
- LSTM memory is still effectively short-term!
- They still struggle to model very long sequences
Attention is All You Need
A revolutionary paper introducing the Transformer architecture: 100,000 citations in 7 years! Its title has also, amusingly, become a naming template for other papers ("X is all you need").
Sequence to Sequence Generation
Seq2Seq generation has a wide variety of applications.
It can be used for machine translation, e.g., English to Spanish, or even code generation, e.g., English to Python.
Encoder-Decoder Paradigm
The encoder-decoder paradigm takes in some input, encodes it into an intermediate representation, and then decodes that representation into the output.
Step by Step Process:
- The encoder takes in the input and outputs a context vector. This context vector could be, for example, the final hidden state or the mean of all hidden states; the choice is flexible.
- The decoder works like a language model: it takes the context vector as input and generates the output sequence.
Problem: the fixed-size context vector creates an information bottleneck, since the entire input must be compressed into it.
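A toy numpy sketch of the paradigm, assuming a vanilla-RNN cell for both encoder and decoder (names like `rnn_step` and `encode` are illustrative, not from the lecture); note how the whole input is squeezed into the single vector returned by `encode`:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # toy hidden / embedding size
Wx, Wh = rng.normal(size=(D, D)), rng.normal(size=(D, D))

def rnn_step(x, h):
    # vanilla RNN cell, shared here for simplicity
    return np.tanh(Wx @ x + Wh @ h)

def encode(inputs):
    h = np.zeros(D)
    for x in inputs:                    # read the whole input sequence
        h = rnn_step(x, h)
    return h                            # fixed-size context vector (the bottleneck)

def decode(context, steps):
    h, y = context, np.zeros(D)
    outputs = []
    for _ in range(steps):              # generate conditioned only on the context
        h = rnn_step(y, h)
        y = h                           # toy "output = hidden state"
        outputs.append(y)
    return outputs

src = [rng.normal(size=D) for _ in range(5)]
out = decode(encode(src), steps=3)
```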
Note
This paradigm is already somewhat outdated; newer language models are decoder-only.
Seq2Seq with RNN and Attention
Generation at position $t$ (decoder state $s_{t-1}$, encoder hidden states $h_1, \dots, h_T$):
- Compute a scalar importance score $e_{t,i} = f_{\text{att}}(s_{t-1}, h_i)$ for each encoder hidden state
- Normalize the importance scores to get attention weights $\alpha_{t,i} = \operatorname{softmax}_i(e_{t,i})$
- Compute the context vector as a linear combination of hidden states: $c_t = \sum_i \alpha_{t,i} h_i$
- A new context vector $c_t$ is computed at each generation step
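A minimal numpy sketch of one such attention step, with a simple dot-product score standing in for a learned scoring function (variable names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H):
    """Decoder state s_prev attends over encoder hidden states H (T x D)."""
    scores = H @ s_prev          # e_{t,i}: scalar importance of each encoder state
    alphas = softmax(scores)     # attention weights, sum to 1
    context = alphas @ H         # c_t: weighted average of encoder hidden states
    return context, alphas

H = np.random.randn(4, 8)        # 4 encoder hidden states of size 8
s = np.random.randn(8)           # current decoder state
c, a = attention_step(s, H)
```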
Transformer
Scaled Dot-Product Attention
Transformers use a different attention mechanism.
- The query token attends to "we", "are", "eating", and "bread"
Example:
- Query: the attending token
- Keys: "we", "are", "eating", and "bread"
- Values: "we", "are", "eating", and "bread"
Each word is represented by a vector $x_i$, so the query attends to four vectors $x_1, x_2, x_3, x_4$.
Instead of directly computing the similarity score between the query and the keys, the transformer applies a projection operation first.
Given the input vectors stacked into a matrix $X$:
- Project to query/key/value space
  - Query: $Q = X W_Q$
  - Key: $K = X W_K$
  - Value: $V = X W_V$
- Compute the similarity scores using the queries and keys: $E = \dfrac{Q K^\top}{\sqrt{d_k}}$
Difference to RNN: the similarity score is calculated after projection
- Output the "context vector": $Y = \operatorname{softmax}(E)\, V$
Difference to RNN: the context vector is also computed after projection
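A compact numpy sketch of scaled dot-product attention, with the projections $W_Q$, $W_K$, $W_V$ applied before computing similarities (shapes and variable names are toy assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X_q, X_kv, W_Q, W_K, W_V):
    Q = X_q  @ W_Q                       # project queries
    K = X_kv @ W_K                       # project keys
    V = X_kv @ W_V                       # project values
    d_k = K.shape[-1]
    E = Q @ K.T / np.sqrt(d_k)           # similarity scores computed after projection
    A = softmax(E, axis=-1)              # attention weights
    return A @ V                         # context vectors, also after projection

rng = np.random.default_rng(0)
D, d = 8, 4                              # input dim, head dim (toy)
X_kv = rng.normal(size=(4, D))           # "we", "are", "eating", "bread"
X_q  = rng.normal(size=(1, D))           # the query token
W_Q, W_K, W_V = (rng.normal(size=(D, d)) for _ in range(3))
context = scaled_dot_product_attention(X_q, X_kv, W_Q, W_K, W_V)
```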