Attention Mechanism
Understanding Attention
The attention mechanism is the core innovation behind the Transformer architecture. It allows the model to focus on different parts of the input sequence when producing each element of the output.
In traditional sequence models like RNNs, information flows sequentially through the network, making it difficult to capture long-range dependencies. Attention solves this by creating direct connections between all positions in the sequence.
Key Concepts
- Query, Key, Value: The three main components of the attention mechanism. For each position we compute a query vector, which is compared against all key vectors to produce attention weights; those weights are then applied to the value vectors.
- Attention Weights: Determine how much focus to put on other parts of the input sequence. Higher weights mean more attention is paid to that part.
- Self-Attention: A specific form of attention in which the queries, keys, and values all come from the same sequence, allowing the model to relate different positions within a single sequence (see the sketch after this list).
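To make the Query/Key/Value idea concrete, here is a minimal self-attention sketch in NumPy. It is only an illustration: the projection matrices W_q, W_k, W_v, the random input, and the dimensions are made-up values, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8          # illustrative sizes, not from any real model
X = rng.normal(size=(seq_len, d_model))  # one input sequence (e.g. 4 token embeddings)

# In self-attention, queries, keys, and values are all projections of the same sequence X.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # query vector for each position
K = X @ W_k   # key vector for each position
V = X @ W_v   # value vector for each position

# Each query is compared against every key to produce one row of attention weights.
scores = Q @ K.T                                   # (seq_len, seq_len) compatibility scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
output = weights @ V                               # weighted sum of the value vectors
print(weights.shape, output.shape)                 # (4, 4) (4, 8)
```

Each row of `weights` sums to 1 and says how much the corresponding position attends to every other position in the same sequence.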
Interactive Attention Visualization
[Interactive demo: hover over a word to see how strongly it attends to every other word in the sentence; darker color indicates a higher attention weight.]
Note: the demo is a simplified visualization. In a real Transformer model, attention weights are computed from dot products between query and key vectors, followed by scaling and a softmax.
Mathematical Formulation
The attention function can be described as mapping a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Where:
- Q: Query matrix
- K: Key matrix
- V: Value matrix
- dₖ: Dimension of the key vectors
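The formula translates almost line for line into code. The sketch below is a minimal NumPy version for plain 2-D matrices; the function and variable names are my own, not from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scaled compatibility scores
    # Numerically stable softmax over each row (each query attends over all keys).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of the values
```

Here Q has shape (n_queries, dₖ), K has shape (n_keys, dₖ), and V has shape (n_keys, d_v); the result has shape (n_queries, d_v).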
Step-by-Step Process
- Compute the dot products of the query with all keys (QKᵀ)
- Scale the dot products by 1/√dₖ, so that large dot products do not push the softmax into regions with extremely small gradients
- Apply softmax to obtain the weights on the values
- Compute the weighted sum of the values
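To see the four steps with concrete numbers, here is a tiny hand-sized example; the 2×2 matrices are arbitrary illustrative values, and each statement corresponds to one step above.

```python
import numpy as np

# Tiny illustrative matrices: 2 positions, d_k = 2 (values chosen arbitrarily).
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 1.0],
              [0.0, 1.0]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

scores = Q @ K.T                        # step 1: dot products QKᵀ
print(scores)                           # [[1. 0.] [1. 1.]]

scaled = scores / np.sqrt(K.shape[-1])  # step 2: scale by 1/√dₖ
print(scaled)                           # [[0.707 0.   ] [0.707 0.707]]

weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)  # step 3: softmax
print(weights)                          # rows sum to 1: [[0.67 0.33] [0.5 0.5]]

output = weights @ V                    # step 4: weighted sum of the values
print(output)                           # [[1.66 2.66] [2.   3.  ]]
```

Reading the first row: position 1 puts about two thirds of its attention on position 1 and one third on position 2, so its output is a correspondingly weighted blend of the two value vectors.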