Full Transformer Architecture

The Complete Picture

Now that we've explored the key components (the attention mechanism, multi-head attention, and positional encoding), let's see how they all fit together in the complete architecture. The Transformer consists of an encoder and a decoder, each composed of a stack of identical layers: the encoder processes the input sequence, while the decoder generates the output sequence.

Architecture Overview

At a high level, the encoder side consists of the input embedding combined with positional encoding, followed by a stack of identical encoder layers. The decoder side mirrors this with its own stack of decoder layers, followed by a final linear projection and softmax that turn the decoder output into a probability distribution over the vocabulary.

Data Flow Through the Transformer

Let's walk through how data flows through the Transformer architecture (short code sketches illustrating the key steps follow this list):

  1. Input Processing: The input tokens are converted to embeddings and combined with positional encodings.
  2. Encoder Processing: The embedded input passes through multiple encoder layers, each containing:
    • Multi-head self-attention mechanism
    • Feed-forward neural network
    • Layer normalization and residual connections
  3. Decoder Input: The decoder starts from a special start-of-sequence token. During inference it is fed the tokens it has generated so far; during training it receives the target sequence shifted one position to the right.
  4. Decoder Processing: Each decoder layer processes the data through:
    • Masked multi-head self-attention (to prevent looking at future tokens)
    • Multi-head cross-attention over the encoder output
    • Feed-forward neural network
    • Layer normalization and residual connections
  5. Output Generation: The decoder output is passed through a linear layer and softmax to produce probabilities for the next token.
  6. Autoregressive Generation: During inference, the predicted token is added to the decoder input, and the process repeats until an end token is generated or a maximum length is reached.
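
As a concrete illustration of step 1, here is a minimal PyTorch sketch of the input processing stage: token embedding plus sinusoidal positional encoding. The module name and the dimensions (vocab_size, d_model, max_len) are illustrative assumptions rather than part of any particular implementation.

```python
import math
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Token embedding plus sinusoidal positional encoding (step 1)."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int = 5000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

        # Precompute the sinusoidal positional encoding table once.
        position = torch.arange(max_len).unsqueeze(1)              # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) of token ids
        x = self.embed(tokens) * math.sqrt(self.d_model)  # scale embeddings as in the original paper
        return x + self.pe[: tokens.size(1)]              # add position information
```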
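
Steps 2 and 4 can be sketched with one encoder layer and one decoder layer; the full model simply stacks several of each. This is a simplified post-norm variant that leans on PyTorch's nn.MultiheadAttention and omits dropout; d_model, n_heads, and d_ff are assumed hyperparameters.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention + feed-forward, each with a residual connection and layer norm (step 2)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # bidirectional self-attention (no mask)
        x = self.norm1(x + attn_out)            # residual + layer norm
        x = self.norm2(x + self.ff(x))          # residual + layer norm
        return x

class DecoderLayer(nn.Module):
    """Masked self-attention, cross-attention over the encoder output, then feed-forward (step 4)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory):
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()

        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)
        attn_out, _ = self.cross_attn(x, memory, memory)  # queries from decoder, keys/values from encoder
        x = self.norm2(x + attn_out)
        x = self.norm3(x + self.ff(x))
        return x
```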
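
Steps 5 and 6 correspond to a generation loop like the one below. The model is assumed to expose encode(), decode(), and a final projection to_logits() matching the sketches above; these names, the greedy argmax decoding, and the bos_id/eos_id token ids are illustrative assumptions, not a fixed API.

```python
import torch

@torch.no_grad()
def greedy_generate(model, src_tokens, bos_id: int, eos_id: int, max_len: int = 50):
    """Greedy autoregressive decoding (steps 5-6). Assumes a hypothetical `model`
    with encode(), decode(), and to_logits() methods as in the sketches above."""
    memory = model.encode(src_tokens)                  # run the encoder once
    generated = torch.tensor([[bos_id]], device=src_tokens.device)

    for _ in range(max_len):
        dec_out = model.decode(generated, memory)      # (1, cur_len, d_model)
        logits = model.to_logits(dec_out[:, -1])       # linear layer over the last position
        probs = logits.softmax(dim=-1)                 # softmax gives next-token probabilities
        next_token = probs.argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == eos_id:                # stop at the end token
            break
    return generated
```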

Transformer Variants

Since the original Transformer paper ("Attention Is All You Need", Vaswani et al., 2017), many variants have been developed for different applications:

BERT (Bidirectional Encoder Representations from Transformers)

Uses only the encoder part of the Transformer for bidirectional context understanding. Pre-trained on a large corpus and fine-tuned for specific tasks.

GPT (Generative Pre-trained Transformer)

Uses only the decoder part of the Transformer for autoregressive text generation. Pre-trained on a large corpus and fine-tuned for specific tasks.
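
In practice, much of the difference between these two families comes down to the attention pattern: an encoder-only model such as BERT lets every token attend to every other token, while a decoder-only model such as GPT applies a causal mask so each token only sees its predecessors. A minimal sketch of the two masks (the sequence length is illustrative):

```python
import torch

seq_len = 5

# Encoder-only (BERT-style): no mask, every position sees the full sequence.
bidirectional_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

# Decoder-only (GPT-style): causal mask, position i sees only positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

print(causal_mask.int())
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]])
```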

T5 (Text-to-Text Transfer Transformer)

Uses the full encoder-decoder architecture but frames all NLP tasks as text-to-text problems. Pre-trained on a diverse set of tasks.

Vision Transformer (ViT)

Adapts the Transformer architecture to computer vision by treating image patches as tokens. When pre-trained on large datasets, it matches or exceeds state-of-the-art convolutional networks on image classification.
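
The key idea, turning image patches into a token sequence, can be sketched with a single strided convolution; the patch size and embedding dimension below follow common ViT configurations but are only illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token embedding."""

    def __init__(self, in_channels: int = 3, patch_size: int = 16, d_model: int = 768):
        super().__init__()
        # A convolution with kernel_size == stride == patch_size extracts and projects each patch.
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width)
        patches = self.proj(images)                # (batch, d_model, h/patch, w/patch)
        return patches.flatten(2).transpose(1, 2)  # (batch, num_patches, d_model) -- a token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

From here, the resulting token sequence is processed by a standard encoder stack, just like the word embeddings in the text case.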

These variants demonstrate the versatility and power of the Transformer architecture across different domains and tasks.