Positional Encoding

Why Positional Encoding?

The Transformer architecture relies entirely on attention mechanisms, which are inherently permutation-equivariant: if we shuffle the order of the input tokens, the outputs of the self-attention layer are simply shuffled in the same way. The layer itself carries no information about token order.

However, the order of words in a sentence is crucial for understanding its meaning. To address this limitation, Transformers incorporate positional information through positional encodings.

Key Properties

  • Unique for Each Position: Each position in the sequence gets a unique encoding vector.
  • Deterministic: In the original Transformer, the positional encodings are not learned but computed from fixed sine and cosine functions.
  • Fixed Dimensionality: The encoding vectors have the same dimensionality as the token embeddings, allowing them to be added directly.
  • Relative Position Information: For any fixed offset k, PE(pos + k) can be written as a linear function of PE(pos), which makes it easy for the model to reason about relative positions between tokens.

Sinusoidal Positional Encoding

The original Transformer paper uses sinusoidal positional encodings, which are calculated using sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Where:

  • pos: Position in the sequence (0, 1, 2, ...)
  • i: Index of the sine/cosine dimension pair (0, 1, ..., d_model/2 - 1); pair i fills dimensions 2i and 2i+1
  • d_model: Dimensionality of the model (and of each encoding vector)
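
As a quick worked example, take d_model = 64 (an assumed model size for illustration), position pos = 5, and the even dimension 2i = 10. Then PE(5, 10) = sin(5 / 10000^(10/64)) ≈ sin(5 / 4.217) ≈ sin(1.186) ≈ 0.927.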

This approach was chosen because:

  • It allows the model to easily learn to attend to relative positions
  • It can handle sequences of arbitrary length, even those longer than any seen during training
  • The pattern of values creates a unique fingerprint for each position
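
To make the formulas concrete, here is a minimal NumPy sketch that builds the full encoding matrix (the function name and example sizes are illustrative, not taken from any particular implementation):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]   # shape (max_len, 1)
    # One frequency per sine/cosine pair: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    angles = positions * angle_rates                # shape (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Example: encodings for 32 positions of an assumed 64-dimensional model
pe = sinusoidal_positional_encoding(max_len=32, d_model=64)
print(pe.shape)     # (32, 64)
print(pe[5, 10])    # ~0.927, matching the worked example above

Each row of the returned matrix is the encoding vector for one position; plotting the matrix as a heatmap gives the kind of picture described in the next section.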

Interactive Visualization

Explore how positional encodings vary with position and dimension. The heatmap below shows the values of the positional encoding vectors, with colors representing values from -1 (blue) to 1 (red).

Positional Encoding Heatmap (interactive figure): positions 0–31 along one axis and encoding dimensions 0–19 along the other. The selected cell at position 5 has the value 0.9268, computed as sin(pos / 10000^(10/d_model)).

Note: In practice, these positional encodings are added to the token embeddings before being fed into the Transformer layers.

Adding Positional Encodings to Embeddings

In the Transformer model, positional encodings are added directly to the token embeddings:

Input = TokenEmbedding + PositionalEncoding

This combined representation carries both the semantic content of each token and its position in the sequence, so the subsequent attention layers can take word order into account.
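
As a sketch of this step (the vocabulary size, random embedding table, and token IDs below are made-up placeholders, not from any real model):

import numpy as np

# Toy sizes; the random embedding table stands in for a learned one.
vocab_size, seq_len, d_model = 1000, 4, 20
rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))

# Sinusoidal encodings for the first seq_len positions (same formula as above)
positions = np.arange(seq_len)[:, np.newaxis]
angle_rates = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(positions * angle_rates)
pe[:, 1::2] = np.cos(positions * angle_rates)

token_ids = np.array([12, 7, 513, 3])    # a made-up input sequence of token IDs
x = token_embedding[token_ids] + pe      # Input = TokenEmbedding + PositionalEncoding
print(x.shape)                           # (4, 20)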

Alternative Approaches

While sinusoidal positional encodings are the most common, there are alternative approaches:

  • Learned Positional Embeddings: Instead of using fixed functions, the position vectors can be learned during training.
  • Relative Positional Encodings: Some models directly encode relative positions between tokens rather than absolute positions.
  • Rotary Position Embeddings (RoPE): A more recent approach that encodes positions by rotating the query and key vectors by position-dependent angles (equivalently, a rotation in the complex plane); a minimal sketch follows below.
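
For intuition, here is a minimal NumPy sketch of the rotation idea behind RoPE (the frequency schedule mirrors the sinusoidal encoding above; this is a simplified illustration, not the reference implementation):

import numpy as np

def apply_rope(x):
    """Rotate each (even, odd) dimension pair of x by a position-dependent angle.

    x has shape (seq_len, d_model) with d_model even. Simplified sketch of the
    idea behind RoPE, not a production implementation.
    """
    seq_len, d_model = x.shape
    positions = np.arange(seq_len)[:, np.newaxis]     # (seq_len, 1)
    # One rotation frequency per dimension pair, same schedule as the sinusoidal encoding
    freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    angles = positions * freqs                        # (seq_len, d_model/2)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * np.cos(angles) - x_odd * np.sin(angles)
    rotated[:, 1::2] = x_even * np.sin(angles) + x_odd * np.cos(angles)
    return rotated

# RoPE is applied to the query and key vectors inside each attention layer;
# here we just rotate a random (4, 64) matrix to show the transformation.
q = np.random.default_rng(1).normal(size=(4, 64))
print(apply_rope(q).shape)   # (4, 64)

Because the rotation angle grows linearly with position, the dot product between a rotated query and a rotated key depends only on their relative offset, which is the property RoPE is designed to provide.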