Multi-Head Attention

Why Multiple Attention Heads?

Instead of performing a single attention function, the Transformer uses multiple attention heads in parallel. This allows the model to jointly attend to information from different representation subspaces at different positions.

With multi-head attention, the model can capture various types of relationships:

  • Syntactic relationships: Grammatical structures and dependencies
  • Semantic relationships: Meaning-based connections between words
  • Positional relationships: Proximity-based attention patterns
  • Global context: Sentence-wide or document-wide dependencies
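In tensor terms, "different representation subspaces" means the model's embedding dimension is divided among the heads, so each head operates on a lower-dimensional slice of the representation. Below is a minimal NumPy sketch of that split; the dimensions d_model = 512 and h = 8 follow the original Transformer paper, and the random tensor is a placeholder rather than real embeddings:

```python
import numpy as np

d_model, h = 512, 8        # model width and number of heads (original Transformer values)
d_k = d_model // h         # each head works in a 64-dimensional subspace

# Ten token embeddings (random placeholders, not learned values)
x = np.random.randn(10, d_model)

# Reshape so every head sees its own d_k-dimensional slice of each token
x_heads = x.reshape(10, h, d_k).transpose(1, 0, 2)   # shape: (heads, tokens, d_k)
print(x_heads.shape)                                 # (8, 10, 64)
```

In a real model each head's subspace comes from a learned projection (see the mechanism section below) rather than a plain slice, but the shape bookkeeping is the same.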

Interactive Multi-Head Visualization

Explore how different attention heads focus on different aspects of the input text. Each head specializes in capturing specific types of relationships between words.

Head 1: Syntactic

Focuses on grammatical relationships

Example input: "The transformer model processes all tokens in parallel using multi-head attention."

Attention Patterns:

  From Word      To Word      Attention Strength
  transformer    model        82%
  processes      tokens       94%
  using          attention    95%

Note: This is a simplified visualization. In a real Transformer model, each head would have its own set of learned query, key, and value projection matrices.
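To make the percentages above concrete, here is a minimal sketch of scaled dot-product attention for a single head over the example sentence. The query, key, and value vectors are random placeholders rather than trained projections, so the resulting weights are illustrative only:

```python
import numpy as np

tokens = ["The", "transformer", "model", "processes", "all", "tokens",
          "in", "parallel", "using", "multi-head", "attention."]
d_k = 64

rng = np.random.default_rng(0)
Q = rng.standard_normal((len(tokens), d_k))   # queries (placeholders, not learned)
K = rng.standard_normal((len(tokens), d_k))   # keys
V = rng.standard_normal((len(tokens), d_k))   # values

# Scaled dot-product attention weights: softmax(Q K^T / sqrt(d_k))
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1

# "Attention strength" from one word to another is the corresponding matrix entry
i, j = tokens.index("processes"), tokens.index("tokens")
print(f"{tokens[i]} -> {tokens[j]}: {weights[i, j]:.0%}")
```

Each row of `weights` is one token's attention distribution over all tokens; a trained head would concentrate that distribution on the relationships it has specialized in.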

Multi-Head Attention Mechanism

Multi-head attention consists of several attention layers running in parallel. Each head has its own set of learned projection matrices.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Where:

  • W_i^Q, W_i^K, W_i^V: Learned projection matrices for head i
  • W^O: Output projection matrix
  • h: Number of attention heads
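
Putting the formula into code, here is a compact NumPy sketch. Randomly initialized matrices stand in for the learned projections W_i^Q, W_i^K, W_i^V and W^O, so the output values are meaningless, but the shapes and the concatenate-then-project structure match the equations above:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h
    rng = np.random.default_rng(0)

    heads = []
    for _ in range(h):
        # Per-head projections W_i^Q, W_i^K, W_i^V (random stand-ins for learned weights)
        W_q = rng.standard_normal((d_model, d_k))
        W_k = rng.standard_normal((d_model, d_k))
        W_v = rng.standard_normal((d_model, d_k))
        heads.append(attention(Q @ W_q, K @ W_k, V @ W_v))

    # Output projection W^O applied to the concatenated heads
    W_o = rng.standard_normal((h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ W_o

# Self-attention over 10 tokens with d_model = 512: Q = K = V = x
x = np.random.randn(10, 512)
print(multi_head_attention(x, x, x).shape)   # (10, 512)
```

Production implementations typically compute all heads with one batched projection and split the result, which is mathematically equivalent but faster.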

Benefits of Multi-Head Attention

  • Diverse Representations: Different heads can focus on different aspects of the input, capturing a richer set of relationships.
  • Improved Performance: Running several attention functions in parallel lets the model attend to different positions and subspaces at once, which in practice improves results over a single full-dimensional attention head.
  • Interpretability: Different heads often learn to attend to specific linguistic phenomena, providing insights into how the model processes language.