Multi-Head Attention

Why Multiple Attention Heads?

Instead of performing a single attention function, the Transformer uses multiple attention heads in parallel. This allows the model to jointly attend to information from different representation subspaces at different positions.

With multi-head attention, the model can capture various types of relationships:

  • Syntactic relationships: Grammatical structures and dependencies
  • Semantic relationships: Meaning-based connections between words
  • Positional relationships: Proximity-based attention patterns
  • Global context: Sentence-wide or document-wide dependencies
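In tensor terms, "different representation subspaces" means the model's embedding dimension is divided among the heads, so each head operates on a lower-dimensional slice of the representation. Below is a minimal NumPy sketch of that split; the dimensions d_model = 512 and h = 8 follow the original Transformer paper, and the random tensor is a placeholder rather than real embeddings:

```python
import numpy as np

d_model, h = 512, 8        # model width and number of heads (original Transformer values)
d_k = d_model // h         # each head works in a 64-dimensional subspace

# Ten token embeddings (random placeholders, not learned values)
x = np.random.randn(10, d_model)

# Reshape so every head sees its own d_k-dimensional slice of each token
x_heads = x.reshape(10, h, d_k).transpose(1, 0, 2)   # shape: (heads, tokens, d_k)
print(x_heads.shape)                                 # (8, 10, 64)
```

In a real model each head's subspace comes from a learned projection (see the mechanism section below) rather than a plain slice, but the shape bookkeeping is the same.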

Interactive Multi-Head Visualization

Explore how different attention heads focus on different aspects of the input text. Each head specializes in capturing specific types of relationships between words.

Head 1: Syntactic

Focuses on grammatical relationships

Example input: "The transformer model processes all tokens in parallel using multi-head attention."

Attention Patterns:

  From Word      To Word      Attention Strength
  transformer    model        82%
  processes      tokens       94%
  using          attention    95%

Note: This is a simplified visualization. In a real Transformer model, each head would have its own set of learned query, key, and value projection matrices.
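To make the percentages above concrete, here is a minimal sketch of scaled dot-product attention for a single head over the example sentence. The query, key, and value vectors are random placeholders rather than trained projections, so the resulting weights are illustrative only:

```python
import numpy as np

tokens = ["The", "transformer", "model", "processes", "all", "tokens",
          "in", "parallel", "using", "multi-head", "attention."]
d_k = 64

rng = np.random.default_rng(0)
Q = rng.standard_normal((len(tokens), d_k))   # queries (placeholders, not learned)
K = rng.standard_normal((len(tokens), d_k))   # keys
V = rng.standard_normal((len(tokens), d_k))   # values

# Scaled dot-product attention weights: softmax(Q K^T / sqrt(d_k))
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1

# "Attention strength" from one word to another is the corresponding matrix entry
i, j = tokens.index("processes"), tokens.index("tokens")
print(f"{tokens[i]} -> {tokens[j]}: {weights[i, j]:.0%}")
```

Each row of `weights` is one token's attention distribution over all tokens; a trained head would concentrate that distribution on the relationships it has specialized in.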

Multi-Head Attention Mechanism

Multi-head attention consists of several attention layers running in parallel. Each head has its own set of learned projection matrices.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Where:

  • W_i^Q, W_i^K, W_i^V: Learned projection matrices for head i
  • W^O: Output projection matrix
  • h: Number of attention heads
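
Putting the formula into code, here is a compact NumPy sketch. Randomly initialized matrices stand in for the learned projections W_i^Q, W_i^K, W_i^V and W^O, so the output values are meaningless, but the shapes and the concatenate-then-project structure match the equations above:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h
    rng = np.random.default_rng(0)

    heads = []
    for _ in range(h):
        # Per-head projections W_i^Q, W_i^K, W_i^V (random stand-ins for learned weights)
        W_q = rng.standard_normal((d_model, d_k))
        W_k = rng.standard_normal((d_model, d_k))
        W_v = rng.standard_normal((d_model, d_k))
        heads.append(attention(Q @ W_q, K @ W_k, V @ W_v))

    # Output projection W^O applied to the concatenated heads
    W_o = rng.standard_normal((h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ W_o

# Self-attention over 10 tokens with d_model = 512: Q = K = V = x
x = np.random.randn(10, 512)
print(multi_head_attention(x, x, x).shape)   # (10, 512)
```

Production implementations typically compute all heads with one batched projection and split the result, which is mathematically equivalent but faster.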

Benefits of Multi-Head Attention

  • Diverse Representations: Different heads can focus on different aspects of the input, capturing a richer set of relationships.
  • Improved Performance: Running several attention functions in parallel lets the model attend to different positions and subspaces at once, which in practice improves results over a single full-dimensional attention head.
  • Interpretability: Different heads often learn to attend to specific linguistic phenomena, providing insights into how the model processes language.