Self-Attention Layer
Inputs:
- Input Vectors: X (shape N × D_X)
- Query Matrix: W_Q (shape D_X × D_Q)
- Key Matrix: W_K (shape D_X × D_Q)
- Value Matrix: W_V (shape D_X × D_V)
Computation:
- Query Vectors: Q = X W_Q (shape N × D_Q)
- Key Vectors: K = X W_K (shape N × D_Q)
- Similarities: E = Q K^T / sqrt(D_Q) (shape N × N); E_{i,j} = (Q_i · K_j) / sqrt(D_Q)
- Attention Weights: A = softmax(E, dim=1) (shape N × N); each row sums to 1
- Value Vectors: V = X W_V (shape N × D_V)
- Output Vectors: Y = A V (shape N × D_V); Y_i = sum_j A_{i,j} V_j (see the sketch below)
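The computation above is just a few matrix multiplies and a softmax. Below is a minimal NumPy sketch of one self-attention layer; the names X, W_Q, W_K, W_V, E, A, Y follow the definitions above, and the sizes in the example are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q                               # (N, D_Q) query vectors
    K = X @ W_K                               # (N, D_Q) key vectors
    V = X @ W_V                               # (N, D_V) value vectors
    E = Q @ K.T / np.sqrt(Q.shape[-1])        # (N, N) scaled similarities
    A = softmax(E, axis=-1)                   # (N, N) attention weights, rows sum to 1
    return A @ V                              # (N, D_V) output vectors

# Example with arbitrary sizes
N, D_X, D_Q, D_V = 4, 8, 16, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D_X))
Y = self_attention(X,
                   rng.normal(size=(D_X, D_Q)),   # W_Q
                   rng.normal(size=(D_X, D_Q)),   # W_K
                   rng.normal(size=(D_X, D_V)))   # W_V
print(Y.shape)  # (4, 16)
```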

Permuting
If you permute the input vectors, the queries, keys, values, and outputs are the same vectors, just permuted in the same way; nothing else in the layer changes. For instance, if the inputs arrive in the order (X_3, X_1, X_2), the query vectors come out in the order (Q_3, Q_1, Q_2), and the same holds for the output vectors. The self-attention layer is therefore permutation equivariant (see the check below).
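A quick check of this property with a throwaway NumPy self-attention function and random weights (all sizes here are arbitrary): permuting the rows of X permutes the rows of Y in exactly the same way.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

perm = rng.permutation(N)                    # some reordering of the inputs
Y      = self_attention(X, W_Q, W_K, W_V)
Y_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Permuting the inputs permutes the outputs in the same way
print(np.allclose(Y_perm, Y[perm]))  # True
```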
Positional Encoding
The self-attention layer doesn't know the order of the vectors it processes.
To make it position-aware, concatenate the input with a positional encoding. The encoding can be a learnable lookup table or a fixed function of the position (e.g., sinusoids), as in the sketch below.
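A minimal sketch of both options, assuming inputs X of shape (N, D) and a hypothetical maximum sequence length max_len; the sinusoidal variant follows the fixed-function idea from the original Transformer, and the lookup table stands in for a learnable embedding.

```python
import numpy as np

def positional_encoding_lookup(max_len, D_pos, rng):
    # Learnable variant: a table of shape (max_len, D_pos), initialized
    # randomly here; in a real model it would be trained with everything else.
    return rng.normal(size=(max_len, D_pos))

def positional_encoding_sinusoid(max_len, D_pos):
    # Fixed variant: sinusoids of different frequencies.
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(D_pos // 2)[None, :]             # (1, D_pos / 2)
    angles = pos / (10000 ** (2 * i / D_pos))
    enc = np.zeros((max_len, D_pos))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

rng = np.random.default_rng(0)
N, D, D_pos = 6, 8, 4
X = rng.normal(size=(N, D))
E_fixed = positional_encoding_sinusoid(max_len=N, D_pos=D_pos)
E_learn = positional_encoding_lookup(max_len=N, D_pos=D_pos, rng=rng)

# Concatenate the encoding with the input to make it position-aware
X_pos = np.concatenate([X, E_fixed[:N]], axis=1)   # (N, D + D_pos)
print(X_pos.shape)  # (6, 12)
```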
Masked Self-Attention Layer
Problem: For language modeling, the model should predict the next word using only the previous words.
Solution: Set the similarity scores for future positions to negative infinity; after the softmax their attention weights are zero, so each output vector depends only on the current and earlier inputs.
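A minimal sketch of the mask, assuming a similarity matrix E of shape (N, N) where row i attends over positions j: entries with j > i are set to negative infinity before the softmax, so their attention weights come out exactly zero.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

N = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(N, N))                        # raw similarity scores
mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal (future positions)
E_masked = np.where(mask, -np.inf, E)              # block attention to the future

A = softmax(E_masked, axis=-1)
print(np.round(A, 2))  # upper-triangular entries are exactly 0
```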
Transformer Block
The transformer block chains self-attention, a residual connection followed by layer normalization, an MLP applied to each vector independently, and a second residual connection with layer normalization (see the sketch below). Properties:
- Self-attention is the only interaction between vectors
- Highly scalable, highly parallelizable
A transformer is a sequence of transformer blocks. In “Attention Is All You Need” the model uses 12 blocks (6 in the encoder and 6 in the decoder) with model dimension D = 512.
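A minimal NumPy sketch of one such block under these assumptions: post-norm ordering (attention, residual, layer norm, per-vector MLP, residual, layer norm), random placeholder weights, and a layer norm without learned scale and shift.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each vector independently (no learned scale/shift in this sketch)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V

def transformer_block(X, params):
    # Self-attention is the only place where vectors interact
    attn = self_attention(X, params["W_Q"], params["W_K"], params["W_V"])
    X = layer_norm(X + attn)                                  # residual + layer norm
    # The MLP is applied to each vector independently
    h = np.maximum(0, X @ params["W1"] + params["b1"])        # ReLU
    mlp = h @ params["W2"] + params["b2"]
    return layer_norm(X + mlp)                                # residual + layer norm

N, D, H = 4, 16, 64
rng = np.random.default_rng(0)
params = {
    "W_Q": rng.normal(size=(D, D)), "W_K": rng.normal(size=(D, D)),
    "W_V": rng.normal(size=(D, D)),
    "W1": rng.normal(size=(D, H)), "b1": np.zeros(H),
    "W2": rng.normal(size=(H, D)), "b2": np.zeros(D),
}
X = rng.normal(size=(N, D))
print(transformer_block(X, params).shape)  # (4, 16)
```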

Multi-Head Attention
In real-world applications you don't compute self-attention just once with a single set of query, key, and value matrices. Instead you run several attention heads in parallel, each with its own W_Q, W_K, and W_V, and concatenate their outputs. This is called Multi-Head Attention.
The goal is to let the model attend to different parts of the input in different ways at the same time.
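A minimal sketch, assuming H heads that each get their own (hypothetical) projection matrices of size D × D/H; the head outputs are concatenated and mixed by an output projection W_O. Sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    outputs = []
    for W_Q, W_K, W_V in heads:                      # one set of matrices per head
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
        outputs.append(A @ V)                        # (N, D_head) per head
    return np.concatenate(outputs, axis=-1) @ W_O    # (N, D) after mixing heads

N, D, H = 4, 16, 4
D_head = D // H
rng = np.random.default_rng(0)
heads = [tuple(rng.normal(size=(D, D_head)) for _ in range(3)) for _ in range(H)]
W_O = rng.normal(size=(H * D_head, D))
X = rng.normal(size=(N, D))
print(multi_head_attention(X, heads, W_O).shape)  # (4, 16)
```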