Transformers underpin modern AI — GPT, BERT, ChatGPT, AlphaFold.
They are neural networks built on attention, enabling models to link distant tokens in parallel.
This post explains how.
1. The Problem: Reading with Blindfolds
You’re reading a detective novel, solving a mystery.
The author drops clues in chapter 3, then reveals the answer in chapter 20.
Your brain connects them: chapter 3’s clue jumps to the present.
Before Transformers, neural networks couldn’t do this. They read like a person with blindfolds on, processing words strictly left-to-right, forgetting earlier words by the end.
Recurrent Neural Networks (RNNs) tried to remember by maintaining a hidden state — a tiny notebook summarising what they’d seen.
But it was too small. In a long sentence, early words faded.
This vanishing memory problem was fundamental: to understand the word at position 50, the network first had to process positions 1–49 one by one.
Processing couldn’t be parallelized, and the effective context stayed local.
In 2017, a Google team proposed something new in the paper “Attention Is All You Need”: rather than processing words sequentially, let every word attend to every other word at once.
That is the Transformer.
2. From Fixed Weights to Dynamic Attention
Traditional networks use fixed weights learned during training and applied to all inputs.
Transformers compute input-dependent attention weights on the fly for each example, enabling context-aware mixing.
2.1 The Naive Approach
Simple learning: one weight matrix fixed across examples.
Example: predict tomorrow’s temperature from today’s: $y = w \cdot x + b$, with a single weight $w$ and bias $b$ shared by every example.
Learning: adjust $w$ and $b$ via gradient descent.
Limitation: no signal about which parts of the input matter.
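To make this concrete, here is a minimal sketch of the fixed-weight approach fit by gradient descent; the temperature data, learning rate, and step count are made-up illustrations, not values from any real dataset.

```python
import numpy as np

# Made-up data: today's temperature (x) -> tomorrow's temperature (y).
x = np.array([12.0, 15.0, 18.0, 21.0, 24.0])
y = np.array([13.0, 16.0, 18.5, 22.0, 24.5])

w, b = 0.0, 0.0   # one fixed weight and bias, shared by every example
lr = 0.001        # illustrative learning rate

for step in range(5000):
    pred = w * x + b                        # y_hat = w * x + b
    grad_w = np.mean(2 * (pred - y) * x)    # d(MSE)/dw
    grad_b = np.mean(2 * (pred - y))        # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

# The same (w, b) is applied to every input, regardless of context.
print(f"learned w = {w:.2f}, b = {b:.2f}")
```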
2.2 The Transformer Innovation
Transformers build new attention weights for every token and example:
Naive: $y = W x$ (fixed $W$)
Transformer: $y_i = \sum_j \alpha_{ij} \, x_j$, where
$\alpha_{ij}$ is recomputed per input; $\alpha_{ij}$ indicates how much token $i$ attends to token $j$.
Example: “The animal didn’t cross the street because it was too tired.”
- Naive network: uniform mixing; blurs roles.
- Transformer: $\alpha_{\text{it},\text{animal}}$ is high, $\alpha_{\text{it},\text{street}}$ is low; the model identifies “it = animal”.
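A tiny sketch of the difference, using made-up 2-D embeddings: the “fixed” coefficients never change, while the attention weights are recomputed from the current input via a softmax over dot-product similarities.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

# Made-up 2-D embeddings, purely for illustration.
animal = np.array([1.0, 0.2])
street = np.array([0.1, 1.0])
it     = np.array([0.9, 0.3])   # "it" happens to lie closer to "animal"

# Fixed weights: the same mixing coefficients no matter what the sentence says.
fixed = {"animal": 0.5, "street": 0.5}

# Dynamic attention: alpha recomputed from the current input's similarities.
scores = np.array([it @ animal, it @ street])
alpha = softmax(scores)
dynamic = {"animal": round(float(alpha[0]), 2), "street": round(float(alpha[1]), 2)}

print("fixed  :", fixed)
print("dynamic:", dynamic)   # "animal" receives the larger weight for this input
```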
3. Attention: The Mechanism
3.1 The Core Concept
Attention computes relevance-weighted sums of inputs.
For each position, score all others, normalize to probabilities, and blend.
Consider translating “The cat sat on the mat” to French.
When producing “chat”, we need gender and context.
Attention gathers that information.
3.2 Query, Key, Value
Each word uses three vectors:
- Query (Q): what I’m searching for
- Key (K): what I’m offering
- Value (V): my content
For word $i$:
- Dot-product similarities (scaled): $e_{ij} = q_i \cdot k_j / \sqrt{d_k}$
- Softmax: $\alpha_{ij} = \exp(e_{ij}) / \sum_{j'} \exp(e_{ij'})$
- Weighted sum: $z_i = \sum_j \alpha_{ij} \, v_j$
Each output is a context-aware blend.
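A minimal NumPy sketch of these three steps; the projection matrices are random stand-ins for learned parameters, and the dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention over one sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v         # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # e_ij = q_i . k_j / sqrt(d_k)
    alpha = softmax(scores, axis=-1)            # each row sums to 1
    return alpha @ V, alpha                     # z_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4                       # 6 tokens, toy dimensions
X = rng.normal(size=(n, d_model))               # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z, alpha = attention(X, W_q, W_k, W_v)
print(Z.shape, alpha.shape)                     # (6, 4) (6, 6)
```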
4. Self-Attention: Words Looking at Words
In self-attention, the same sequence supplies Q, K, V.
Each word attends to all others within the sequence.
Consider “The animal didn’t cross the street because it was too tired.”
Self-attention lets “it” focus on “animal” (high weight) and not “street” (low weight).
It captures local syntax and long-range dependencies in one pass.
4.1 Multi-Head Attention
Use multiple attention heads in parallel to capture distinct patterns.
One head may focus on syntax, another on semantics, another on long-range links.
The model learns these roles.
Stack 6–12 layers; each refines patterns from the previous.
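Continuing the sketch from section 3.2 (it reuses the `attention` helper, `rng`, and the toy `X`), here is a hypothetical two-head version: each head has its own projections, the head outputs are concatenated, and an output projection mixes them back to the model dimension.

```python
def multi_head_attention(X, heads, W_o):
    """heads: one (W_q, W_k, W_v) tuple per head; W_o: output projection."""
    outputs = [attention(X, W_q, W_k, W_v)[0] for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1) @ W_o   # concatenate heads, then mix

num_heads, d_head = 2, 4
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))

print(multi_head_attention(X, heads, W_o).shape)    # (6, 8): back to d_model
```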
5. Encoder-Decoder Architecture and Cross-Attention
Machine translation uses two stacks: encoder (read) and decoder (write).
5.1 The Encoder
The encoder reads the source with self-attention across the input.
Stacked layers accumulate richer representations.
5.2 The Decoder
The decoder combines two attention mechanisms:
- Self-attention over generated tokens (with masking)
- Cross-attention to the encoder
Cross-attention lets the decoder query the encoder.
The decoder’s Q comes from its outputs; K and V come from the encoder’s outputs.
Each generation step attends to the full source.
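Continuing the same sketch, a hypothetical cross-attention step: queries are projected from the decoder’s states, while keys and values are projected from the encoder’s outputs, so each decoder position blends source-side information.

```python
def cross_attention(decoder_states, encoder_out, W_q, W_k, W_v):
    """Queries from the decoder; keys and values from the encoder outputs."""
    Q = decoder_states @ W_q
    K = encoder_out @ W_k
    V = encoder_out @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    alpha = softmax(scores, axis=-1)    # one row of source weights per decoder position
    return alpha @ V                    # each decoder position blends source information

encoder_out = rng.normal(size=(6, d_model))      # 6 source tokens (stand-in encodings)
decoder_states = rng.normal(size=(3, d_model))   # 3 target tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

print(cross_attention(decoder_states, encoder_out, W_q, W_k, W_v).shape)  # (3, 4)
```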
5.3 Why Two Types Matter
- Self-attention: internal consistency within a sequence
- Cross-attention: alignment across sequences
For translation, self-attention handles surface form; cross-attention handles correspondence.
6. Language Translation: A Concrete Example
For “The cat sat on the mat” → “Le chat s’assit sur le tapis”:
Encoder Phase:
- Ingest English tokens
- 6 layers of self-attention
- Produce context vectors
Decoder Phase:
- Start with <start>
- Self-attention over the current output prefix
- Cross-attention to encoder outputs
- Predict the next token
- Append; repeat until <end>
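In code, this loop looks roughly like the greedy-decoding sketch below; `encoder`, `decoder`, and the token IDs are hypothetical stand-ins for whatever trained model and vocabulary are in use.

```python
def greedy_translate(encoder, decoder, source_ids, start_id, end_id, max_len=50):
    """Greedy decoding: encode the source once, then generate one token at a time."""
    memory = encoder(source_ids)            # context vectors for the source sentence
    output = [start_id]                     # start with <start>
    for _ in range(max_len):
        logits = decoder(output, memory)    # self-attention over prefix + cross-attention
        next_id = int(logits[-1].argmax())  # most likely next token
        output.append(next_id)
        if next_id == end_id:               # stop at <end>
            break
    return output
```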
Cross-Attention in Action: When generating “chat”, the decoder attends across the encoder: “cat” receives high weight; “le” appears; grammatical features align. Cross-attention forges these links.
Similarly, self-attention within the French sequence ensures agreement: “chat” → “s’assit” (third-person singular) and “Le” → “chat” (masculine article), with the gender information flowing through the decoder.
7. Positional Encoding
Attention by itself is permutation-invariant, yet word order clearly matters.
Positional encoding injects that ordering information.
Common choices: fixed sinusoidal patterns or learned embeddings.
Add them to word embeddings before the first layer.
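A minimal NumPy sketch of the fixed sinusoidal variant from the original Transformer paper: each position gets sines and cosines at geometrically spaced frequencies, which are then added to the word embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# embeddings = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(4, 8).shape)      # (4, 8)
```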
8. Why Transformers Work
| Feature | Previous Approaches | Transformers |
|---|---|---|
| Processing | Sequential | Fully parallel |
| Context | Limited by memory | Full sequence |
| Weights | Fixed | Dynamic attention |
| Depth | Degrades after ~10 layers | 100+ layers |
| Scalability | Sequential bottleneck | GPU-friendly |
Together, these support modeling complex, long-range dependencies at scale.
9. Context Windows and Limitations
Limits:
- Fixed-size positional embeddings
- Quadratic attention cost in length
- Memory pressure on long contexts
Prompts beyond the window are truncated.
Extending context windows is an active research area: sparse attention, sliding windows, and hierarchical patterns.
In the meantime, these costs bound practical context lengths.
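A quick back-of-the-envelope illustration of the quadratic cost, assuming float32 scores for a single head in a single layer (which understates real memory use):

```python
# Attention materializes an n x n score matrix: memory grows quadratically with length.
for n in (1_000, 10_000, 100_000):
    score_bytes = n * n * 4                      # float32 scores, one head, one layer
    print(f"{n:>7} tokens -> {score_bytes / 1e6:>8,.0f} MB of attention scores")
```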
10. Transformers: Revolutionizing NLP and Beyond
Transformers changed NLP and are now widely applied elsewhere. The main application areas are:
10.1 NLP Applications
Language Translation: Strong results in machine translation (e.g., Google Translate).
Text Generation: Coherent, context-aware text for chat, content, and translation.
Text Classification: Sentiment analysis, spam detection, topic modeling. Often superior to traditional methods.
Question Answering: Answer complex questions via context, supporting assistants and chatbots.
Summarization: Summarize long documents and articles by extracting main points.
10.2 Beyond NLP
Computer Vision: Image classification, detection, and generation. Vision Transformers (ViTs) match or beat CNNs.
Speech Recognition: Speech-to-text for assistants and voice interfaces.
Time Series Analysis: Forecasting (stock, weather, traffic).
10.3 What Transformers Enable
- Accuracy: strong performance across NLP and vision tasks
- Efficiency: parallel processing vs. sequential RNNs
- Flexibility: NLP, vision, audio, time series
- Scalability: scales to very large models, datasets, and output lengths
11. Summary
Transformers rely on attention: self-attention for context, cross-attention to bridge sequences.
Parallel processing of long sequences underpins modern language models and broader AI.