Attention Mechanism
Understanding Attention
The attention mechanism is the core innovation behind the Transformer architecture. It allows the model to focus on different parts of the input sequence when producing each element of the output.
In traditional sequence models like RNNs, information flows sequentially through the network, making it difficult to capture long-range dependencies. Attention solves this by creating direct connections between all positions in the sequence.
Key Concepts
- Query, Key, Value: The three main components of the attention mechanism. For each position we compute a query vector, which is compared against all key vectors to produce attention weights; those weights are then applied to the value vectors.
- Attention Weights: Determine how much focus to put on other parts of the input sequence. Higher weights mean more attention is paid to that part.
- Self-Attention: A specific form of attention in which the queries, keys, and values all come from the same sequence, allowing the model to relate different positions within a single sequence (see the sketch after this list).
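To make the Query/Key/Value idea concrete, here is a minimal self-attention sketch in NumPy. It is only an illustration: the projection matrices W_q, W_k, W_v, the random input, and the dimensions are made-up values, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8          # illustrative sizes, not from any real model
X = rng.normal(size=(seq_len, d_model))  # one input sequence (e.g. 4 token embeddings)

# In self-attention, queries, keys, and values are all projections of the same sequence X.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # query vector for each position
K = X @ W_k   # key vector for each position
V = X @ W_v   # value vector for each position

# Each query is compared against every key to produce one row of attention weights.
scores = Q @ K.T                                   # (seq_len, seq_len) compatibility scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
output = weights @ V                               # weighted sum of the value vectors
print(weights.shape, output.shape)                 # (4, 4) (4, 8)
```

Each row of `weights` sums to 1 and says how much the corresponding position attends to every other position in the same sequence.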
Interactive Attention Visualization
[Interactive demo: hover over a word to see how strongly it attends to every other word in the sentence; darker color indicates a higher attention weight.]
Note: the demo is a simplified visualization. In a real Transformer model, attention weights are computed from dot products between query and key vectors, followed by scaling and a softmax.
Mathematical Formulation
The attention function can be described as mapping a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Where:
- Q: Query matrix
- K: Key matrix
- V: Value matrix
- dₖ: Dimension of the key vectors
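The formula translates almost line for line into code. The sketch below is a minimal NumPy version for plain 2-D matrices; the function and variable names are my own, not from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaled compatibility scores
    # Numerically stable softmax over each row (each query attends over all keys).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of the values
```

Here Q has shape (n_queries, dₖ), K has shape (n_keys, dₖ), and V has shape (n_keys, d_v); the result has shape (n_queries, d_v).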
Step-by-Step Process
- Compute the dot products of the query with all keys (QKᵀ)
- Scale the dot products by 1/√dₖ, so that large dot products do not push the softmax into regions with extremely small gradients
- Apply softmax to obtain the weights on the values
- Compute the weighted sum of the values
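To see the four steps with concrete numbers, here is a tiny hand-sized example; the 2×2 matrices are arbitrary illustrative values, and each statement corresponds to one step above.

```python
import numpy as np

# Tiny illustrative matrices: 2 positions, d_k = 2 (values chosen arbitrarily).
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 1.0],
              [0.0, 1.0]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

scores = Q @ K.T                        # step 1: dot products QKᵀ
print(scores)                           # [[1. 0.] [1. 1.]]

scaled = scores / np.sqrt(K.shape[-1])  # step 2: scale by 1/√dₖ
print(scaled)                           # [[0.707 0.   ] [0.707 0.707]]

weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)  # step 3: softmax
print(weights)                          # rows sum to 1: [[0.67 0.33] [0.5 0.5]]

output = weights @ V                    # step 4: weighted sum of the values
print(output)                           # [[1.66 2.66] [2.   3.  ]]
```

Reading the first row: position 1 puts about two thirds of its attention on position 1 and one third on position 2, so its output is a correspondingly weighted blend of the two value vectors.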