Full Transformer Architecture
The Complete Picture
Now that we've explored the key components (the attention mechanism, multi-head attention, and positional encoding), let's see how they fit together in the complete architecture. The Transformer consists of an encoder and a decoder, each composed of a stack of identical layers: the encoder builds a representation of the input sequence, while the decoder uses that representation to generate the output sequence.
Data Flow Through the Transformer
Let's walk through how data flows through the Transformer architecture (code sketches of these stages follow the list):
- Input Processing: The input tokens are converted to embeddings and combined with positional encodings.
- Encoder Processing: The embedded input passes through a stack of encoder layers, each containing:
  - a multi-head self-attention mechanism
  - a position-wise feed-forward network
  - residual connections and layer normalization around each sub-layer
- Decoder Input: During training the decoder receives the target sequence shifted right (teacher forcing); during inference it starts with a special start token followed by the tokens generated so far.
- Decoder Processing: Each decoder layer processes the data through:
  - masked multi-head self-attention (to prevent attending to future tokens)
  - multi-head cross-attention over the encoder output
  - a position-wise feed-forward network
  - residual connections and layer normalization around each sub-layer
- Output Generation: The decoder output is passed through a linear layer and softmax to produce a probability distribution over the next token.
- Autoregressive Generation: During inference, the predicted token is appended to the decoder input, and the process repeats until an end token is generated or a maximum length is reached.
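To make this flow concrete, here is a minimal PyTorch sketch of the encoder-decoder pipeline. It is an illustration, not a reference implementation: the hyperparameters and vocabulary sizes are arbitrary assumptions, and it leans on torch.nn.Transformer (which bundles the encoder and decoder stacks, multi-head attention, feed-forward networks, residual connections, and layer normalization) rather than building each sub-layer by hand.

```python
# Minimal encoder-decoder sketch built on torch.nn.Transformer.
# All hyperparameters and vocabulary sizes below are illustrative assumptions.
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds the fixed sine/cosine positional encodings to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                                      # fixed, not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:                     # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]


class Seq2SeqTransformer(nn.Module):
    """Embeddings + positional encoding -> encoder/decoder stacks -> linear projection to vocab logits."""

    def __init__(self, src_vocab: int, tgt_vocab: int,
                 d_model: int = 512, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.generator = nn.Linear(d_model, tgt_vocab)                       # next-token logits

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # src, tgt: (batch, seq_len) tensors of token ids
        scale = math.sqrt(self.transformer.d_model)
        src_emb = self.pos_enc(self.src_embed(src) * scale)
        tgt_emb = self.pos_enc(self.tgt_embed(tgt) * scale)
        # Causal mask so each decoder position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        out = self.transformer(src_emb, tgt_emb, tgt_mask=tgt_mask)
        return self.generator(out)                                           # (batch, tgt_len, tgt_vocab)


# Shape check with random token ids (training-style call with teacher forcing):
model = Seq2SeqTransformer(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))     # two source sequences of length 7
tgt = torch.randint(0, 1000, (2, 5))     # decoder inputs: target sequence shifted right
print(model(src, tgt).shape)             # torch.Size([2, 5, 1000])
```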
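The output-generation and autoregressive-generation steps can then be sketched as a greedy decoding loop. This assumes the Seq2SeqTransformer sketch above; bos_id, eos_id, and max_len are placeholder values, and real systems typically use beam search or sampling rather than the plain argmax shown here.

```python
import torch


@torch.no_grad()
def greedy_decode(model, src: torch.Tensor, bos_id: int, eos_id: int, max_len: int = 50) -> torch.Tensor:
    """Generates output tokens one at a time, feeding each prediction back into the decoder."""
    model.eval()
    ys = torch.full((src.size(0), 1), bos_id, dtype=torch.long)     # start with the special start token
    for _ in range(max_len - 1):
        logits = model(src, ys)                                     # (batch, current_len, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)     # argmax of logits == argmax of softmax
        ys = torch.cat([ys, next_token], dim=1)                     # append prediction to the decoder input
        if (next_token == eos_id).all():                            # stop once every sequence emits the end token
            break
    return ys
```

Note that this sketch re-runs the encoder on src at every step for simplicity; practical implementations encode the source once and cache decoder key/value states.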
Transformer Variants
Since the original Transformer paper, "Attention Is All You Need" (Vaswani et al., 2017), many variants have been developed for different applications:
BERT (Bidirectional Encoder Representations from Transformers)
Uses only the encoder stack of the Transformer to build bidirectional contextual representations of the input. Pre-trained on a large corpus, primarily with a masked language modeling objective, and fine-tuned for specific tasks.
GPT (Generative Pre-trained Transformer)
Uses only the decoder stack of the Transformer for autoregressive text generation. Pre-trained on a large corpus with a next-token prediction objective and fine-tuned for specific tasks.
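As a hedged illustration of the encoder-only vs. decoder-only split, the snippet below uses the Hugging Face transformers library with the bert-base-uncased and gpt2 checkpoints; the library and checkpoints are assumptions of this example, not something the descriptions above depend on.

```python
# Contrast an encoder-only model (BERT) with a decoder-only model (GPT-2).
# Assumes the Hugging Face `transformers` package is installed; checkpoints are illustrative.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# BERT: encoder-only, returns a bidirectional contextual representation for every input token.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
enc_out = bert(**bert_tok("Transformers are versatile.", return_tensors="pt"))
print(enc_out.last_hidden_state.shape)        # (1, num_tokens, hidden_size)

# GPT-2: decoder-only, generates text autoregressively, one token at a time.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("Transformers are", return_tensors="pt").input_ids
generated = gpt.generate(ids, max_new_tokens=10)
print(gpt_tok.decode(generated[0]))
```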
T5 (Text-to-Text Transfer Transformer)
Uses the full encoder-decoder architecture but frames every NLP task as a text-to-text problem, so translation, summarization, and classification are all cast as generating a target string from an input string. Pre-trained on a large web corpus with a span-corruption objective and fine-tuned on a diverse set of downstream tasks.
Vision Transformer (ViT)
Adapts the Transformer to computer vision by splitting an image into fixed-size patches and treating the sequence of patch embeddings as tokens for a standard encoder. When pre-trained on large datasets, it matches or exceeds state-of-the-art convolutional networks on image classification.
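To make "image patches as tokens" concrete, here is a minimal sketch of the patch-embedding step, assuming a 224x224 RGB input, 16x16 patches, and ViT-Base-like dimensions; the resulting token sequence is what gets fed to a standard Transformer encoder stack.

```python
# Illustrative patch embedding for a Vision Transformer.
# Image size, patch size, and embedding dimension are assumptions for the example.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and projects each patch to a token embedding."""

    def __init__(self, img_size: int = 224, patch_size: int = 16, in_chans: int = 3, d_model: int = 768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size is equivalent to a per-patch linear projection.
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))                      # prepended class token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))   # learned positions

    def forward(self, images: torch.Tensor) -> torch.Tensor:       # images: (batch, 3, 224, 224)
        x = self.proj(images)                                      # (batch, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)                           # (batch, 196, d_model) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)             # one class token per image
        x = torch.cat([cls, x], dim=1) + self.pos_embed            # (batch, 197, d_model)
        return x                                                   # ready for a standard encoder stack


tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                                # torch.Size([2, 197, 768])
```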
These variants demonstrate the versatility and power of the Transformer architecture across different domains and tasks.