Self-Attention Layer
Inputs:
- Input Vectors: X (shape N × D_X)
- Query Matrix: W_Q (shape D_X × D_Q)
- Key Matrix: W_K (shape D_X × D_Q)
- Value Matrix: W_V (shape D_X × D_V)
Computation:
- Query Vectors: Q = X W_Q (shape N × D_Q)
- Key Vectors: K = X W_K (shape N × D_Q)
- Similarities: E = Q K^T / sqrt(D_Q) (shape N × N); E_{i,j} = (Q_i · K_j) / sqrt(D_Q)
- Attention Weights: A = softmax(E, dim=1) (shape N × N); each row sums to 1
- Value Vectors: V = X W_V (shape N × D_V)
- Output Vectors: Y = A V (shape N × D_V); Y_i = sum_j A_{i,j} V_j (see the sketch below)
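The computation above is just a few matrix multiplies and a softmax. Below is a minimal NumPy sketch of one self-attention layer; the names X, W_Q, W_K, W_V, E, A, Y follow the definitions above, and the sizes in the example are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q                               # (N, D_Q) query vectors
    K = X @ W_K                               # (N, D_Q) key vectors
    V = X @ W_V                               # (N, D_V) value vectors
    E = Q @ K.T / np.sqrt(Q.shape[-1])        # (N, N) scaled similarities
    A = softmax(E, axis=-1)                   # (N, N) attention weights, rows sum to 1
    return A @ V                              # (N, D_V) output vectors

# Example with arbitrary sizes
N, D_X, D_Q, D_V = 4, 8, 16, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D_X))
Y = self_attention(X,
                   rng.normal(size=(D_X, D_Q)),   # W_Q
                   rng.normal(size=(D_X, D_Q)),   # W_K
                   rng.normal(size=(D_X, D_V)))   # W_V
print(Y.shape)  # (4, 16)
```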

Permuting
If you permute the input vectors, the queries, keys, values, and outputs are the same vectors, just permuted in the same way; nothing else in the layer changes. For instance, if the inputs arrive in the order (X_3, X_1, X_2), the query vectors come out in the order (Q_3, Q_1, Q_2), and the same holds for the output vectors. The self-attention layer is therefore permutation equivariant (see the check below).
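A quick check of this property with a throwaway NumPy self-attention function and random weights (all sizes here are arbitrary): permuting the rows of X permutes the rows of Y in exactly the same way.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

perm = rng.permutation(N)                    # some reordering of the inputs
Y      = self_attention(X, W_Q, W_K, W_V)
Y_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Permuting the inputs permutes the outputs in the same way
print(np.allclose(Y_perm, Y[perm]))  # True
```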
Positional Encoding
The self-attention layer doesn't know the order of the vectors it processes.
To make it position-aware, concatenate the input with a positional encoding. The encoding can be a learnable lookup table or a fixed function of the position (e.g., sinusoids), as in the sketch below.
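A minimal sketch of both options, assuming inputs X of shape (N, D) and a hypothetical maximum sequence length max_len; the sinusoidal variant follows the fixed-function idea from the original Transformer, and the lookup table stands in for a learnable embedding.

```python
import numpy as np

def positional_encoding_lookup(max_len, D_pos, rng):
    # Learnable variant: a table of shape (max_len, D_pos), initialized
    # randomly here; in a real model it would be trained with everything else.
    return rng.normal(size=(max_len, D_pos))

def positional_encoding_sinusoid(max_len, D_pos):
    # Fixed variant: sinusoids of different frequencies.
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(D_pos // 2)[None, :]             # (1, D_pos / 2)
    angles = pos / (10000 ** (2 * i / D_pos))
    enc = np.zeros((max_len, D_pos))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

rng = np.random.default_rng(0)
N, D, D_pos = 6, 8, 4
X = rng.normal(size=(N, D))
E_fixed = positional_encoding_sinusoid(max_len=N, D_pos=D_pos)
E_learn = positional_encoding_lookup(max_len=N, D_pos=D_pos, rng=rng)

# Concatenate the encoding with the input to make it position-aware
X_pos = np.concatenate([X, E_fixed[:N]], axis=1)   # (N, D + D_pos)
print(X_pos.shape)  # (6, 12)
```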
Masked Self-Attention Layer
Problem: For language modeling, the model should predict the next word using only the previous words.
Solution: Set the similarity scores for future positions to negative infinity; after the softmax their attention weights are zero, so each output vector depends only on the current and earlier inputs.
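A minimal sketch of the mask, assuming a similarity matrix E of shape (N, N) where row i attends over positions j: entries with j > i are set to negative infinity before the softmax, so their attention weights come out exactly zero.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

N = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(N, N))                        # raw similarity scores
mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal (future positions)
E_masked = np.where(mask, -np.inf, E)              # block attention to the future

A = softmax(E_masked, axis=-1)
print(np.round(A, 2))  # upper-triangular entries are exactly 0
```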
Transformer Block
The transformer block chains self-attention, a residual connection followed by layer normalization, an MLP applied to each vector independently, and a second residual connection with layer normalization (see the sketch below). Properties:
- Self-attention is the only interaction between vectors
- Highly scalable, highly parallelizable
A transformer is a sequence of transformer blocks. In “Attention Is All You Need” the model uses 12 blocks (6 in the encoder and 6 in the decoder) with model dimension D = 512.
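A minimal NumPy sketch of one such block under these assumptions: post-norm ordering (attention, residual, layer norm, per-vector MLP, residual, layer norm), random placeholder weights, and a layer norm without learned scale and shift.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each vector independently (no learned scale/shift in this sketch)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V

def transformer_block(X, params):
    # Self-attention is the only place where vectors interact
    attn = self_attention(X, params["W_Q"], params["W_K"], params["W_V"])
    X = layer_norm(X + attn)                                  # residual + layer norm
    # The MLP is applied to each vector independently
    h = np.maximum(0, X @ params["W1"] + params["b1"])        # ReLU
    mlp = h @ params["W2"] + params["b2"]
    return layer_norm(X + mlp)                                # residual + layer norm

N, D, H = 4, 16, 64
rng = np.random.default_rng(0)
params = {
    "W_Q": rng.normal(size=(D, D)), "W_K": rng.normal(size=(D, D)),
    "W_V": rng.normal(size=(D, D)),
    "W1": rng.normal(size=(D, H)), "b1": np.zeros(H),
    "W2": rng.normal(size=(H, D)), "b2": np.zeros(D),
}
X = rng.normal(size=(N, D))
print(transformer_block(X, params).shape)  # (4, 16)
```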

Multi-Head Attention
In real-world applications you don't compute self-attention just once with a single set of query, key, and value matrices. Instead you run several attention heads in parallel, each with its own W_Q, W_K, and W_V, and concatenate their outputs. This is called Multi-Head Attention.
The goal is to let the model attend to different parts of the input in different ways at the same time.
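A minimal sketch, assuming H heads that each get their own (hypothetical) projection matrices of size D × D/H; the head outputs are concatenated and mixed by an output projection W_O. Sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    outputs = []
    for W_Q, W_K, W_V in heads:                      # one set of matrices per head
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
        outputs.append(A @ V)                        # (N, D_head) per head
    return np.concatenate(outputs, axis=-1) @ W_O    # (N, D) after mixing heads

N, D, H = 4, 16, 4
D_head = D // H
rng = np.random.default_rng(0)
heads = [tuple(rng.normal(size=(D, D_head)) for _ in range(3)) for _ in range(H)]
W_O = rng.normal(size=(H * D_head, D))
X = rng.normal(size=(N, D))
print(multi_head_attention(X, heads, W_O).shape)  # (4, 16)
```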