Transformers underpin modern AI — GPT, BERT, ChatGPT, AlphaFold.
They are neural networks built on attention, enabling models to link distant tokens in parallel.
This post explains how.
1. The Problem: Reading with Blindfolds
You’re reading a detective novel, solving a mystery.
The author drops clues in chapter 3, then reveals the answer in chapter 20.
Your brain connects them: chapter 3’s clue jumps to the present.
Before Transformers, neural networks couldn’t do this. They read like a person with blindfolds on, processing words strictly left-to-right, forgetting earlier words by the end.
Recurrent Neural Networks (RNNs) tried to remember by maintaining a hidden state — a tiny notebook summarising what they’d seen.
But it was too small. In a long sentence, early words faded.
This vanishing memory problem was fundamental: to understand the word at position 50, the network first had to process positions 1–49 one by one.
Processing couldn’t be parallelized, and the effective context stayed local.
In 2017, a Google team proposed something new in the paper “Attention Is All You Need”: rather than processing words sequentially, let every word attend to every other word at once.
That is the Transformer.
2. From Fixed Weights to Dynamic Attention
Traditional networks use fixed weights learned during training and applied to all inputs.
Transformers compute input-dependent attention weights on the fly for each example, enabling context-aware mixing.
2.1 The Naive Approach
Simple learning: one weight matrix fixed across examples.
Example: predict tomorrow’s temperature from today’s: $y = w \cdot x + b$, with a single weight $w$ and bias $b$ shared by every example.
Learning: adjust $w$ and $b$ via gradient descent.
Limitation: no signal about which parts of the input matter.
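To make this concrete, here is a minimal sketch of the fixed-weight approach fit by gradient descent; the temperature data, learning rate, and step count are made-up illustrations, not values from any real dataset.

```python
import numpy as np

# Made-up data: today's temperature (x) -> tomorrow's temperature (y).
x = np.array([12.0, 15.0, 18.0, 21.0, 24.0])
y = np.array([13.0, 16.0, 18.5, 22.0, 24.5])

w, b = 0.0, 0.0   # one fixed weight and bias, shared by every example
lr = 0.001        # illustrative learning rate

for step in range(5000):
    pred = w * x + b                        # y_hat = w * x + b
    grad_w = np.mean(2 * (pred - y) * x)    # d(MSE)/dw
    grad_b = np.mean(2 * (pred - y))        # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

# The same (w, b) is applied to every input, regardless of context.
print(f"learned w = {w:.2f}, b = {b:.2f}")
```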
2.2 The Transformer Innovation
Transformers build new attention weights for every token and example:
Naive: $y = W x$ (fixed $W$)
Transformer: $y_i = \sum_j \alpha_{ij} \, x_j$, where
$\alpha_{ij}$ is recomputed per input; $\alpha_{ij}$ indicates how much token $i$ attends to token $j$.
Example: “The animal didn’t cross the street because it was too tired.”
- Naive network: uniform mixing; blurs roles.
- Transformer: $\alpha_{\text{it},\text{animal}}$ is high, $\alpha_{\text{it},\text{street}}$ is low; the model identifies “it = animal”.
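A tiny sketch of the difference, using made-up 2-D embeddings: the “fixed” coefficients never change, while the attention weights are recomputed from the current input via a softmax over dot-product similarities.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

# Made-up 2-D embeddings, purely for illustration.
animal = np.array([1.0, 0.2])
street = np.array([0.1, 1.0])
it     = np.array([0.9, 0.3])   # "it" happens to lie closer to "animal"

# Fixed weights: the same mixing coefficients no matter what the sentence says.
fixed = {"animal": 0.5, "street": 0.5}

# Dynamic attention: alpha recomputed from the current input's similarities.
scores = np.array([it @ animal, it @ street])
alpha = softmax(scores)
dynamic = {"animal": round(float(alpha[0]), 2), "street": round(float(alpha[1]), 2)}

print("fixed  :", fixed)
print("dynamic:", dynamic)   # "animal" receives the larger weight for this input
```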
3. Attention: The Mechanism
3.1 The Core Concept
Attention computes relevance-weighted sums of inputs.
For each position, score all others, normalize to probabilities, and blend.
Consider translating “The cat sat on the mat” to French.
When producing “chat”, we need gender and context.
Attention gathers that information.
3.2 Query, Key, Value
Each word uses three vectors:
- Query (Q): what I’m searching for
- Key (K): what I’m offering
- Value (V): my content
For word $i$:
- Dot-product similarities (scaled): $e_{ij} = q_i \cdot k_j / \sqrt{d_k}$
- Softmax: $\alpha_{ij} = \exp(e_{ij}) / \sum_{j'} \exp(e_{ij'})$
- Weighted sum: $z_i = \sum_j \alpha_{ij} \, v_j$
Each output is a context-aware blend.
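A minimal NumPy sketch of these three steps; the projection matrices are random stand-ins for learned parameters, and the dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention over one sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v         # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # e_ij = q_i . k_j / sqrt(d_k)
    alpha = softmax(scores, axis=-1)            # each row sums to 1
    return alpha @ V, alpha                     # z_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4                       # 6 tokens, toy dimensions
X = rng.normal(size=(n, d_model))               # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z, alpha = attention(X, W_q, W_k, W_v)
print(Z.shape, alpha.shape)                     # (6, 4) (6, 6)
```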
4. Self-Attention: Words Looking at Words
In self-attention, the same sequence supplies Q, K, V.
Each word attends to all others within the sequence.
Consider “The animal didn’t cross the street because it was too tired.”
Self-attention lets “it” focus on “animal” (high weight) and not “street” (low weight).
It captures local syntax and long-range dependencies in one pass.
4.1 Multi-Head Attention
Use multiple attention heads in parallel to capture distinct patterns.
One head may focus on syntax, another on semantics, another on long-range links.
The model learns these roles.
Stack 6–12 layers; each refines patterns from the previous.
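Continuing the sketch from section 3.2 (it reuses the `attention` helper, `rng`, and the toy `X`), here is a hypothetical two-head version: each head has its own projections, the head outputs are concatenated, and an output projection mixes them back to the model dimension.

```python
def multi_head_attention(X, heads, W_o):
    """heads: one (W_q, W_k, W_v) tuple per head; W_o: output projection."""
    outputs = [attention(X, W_q, W_k, W_v)[0] for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1) @ W_o   # concatenate heads, then mix

num_heads, d_head = 2, 4
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))

print(multi_head_attention(X, heads, W_o).shape)    # (6, 8): back to d_model
```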
5. Encoder-Decoder Architecture and Cross-Attention
Machine translation uses two stacks: encoder (read) and decoder (write).
5.1 The Encoder
The encoder reads the source with self-attention across the input.
Stacked layers accumulate richer representations.
5.2 The Decoder
The decoder combines two attention mechanisms:
- Self-attention over generated tokens (with masking)
- Cross-attention to the encoder
Cross-attention lets the decoder query the encoder.
The decoder’s Q comes from its outputs; K and V come from the encoder’s outputs.
Each generation step attends to the full source.
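Continuing the same sketch, a hypothetical cross-attention step: queries are projected from the decoder’s states, while keys and values are projected from the encoder’s outputs, so each decoder position blends source-side information.

```python
def cross_attention(decoder_states, encoder_out, W_q, W_k, W_v):
    """Queries from the decoder; keys and values from the encoder outputs."""
    Q = decoder_states @ W_q
    K = encoder_out @ W_k
    V = encoder_out @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    alpha = softmax(scores, axis=-1)    # one row of source weights per decoder position
    return alpha @ V                    # each decoder position blends source information

encoder_out = rng.normal(size=(6, d_model))      # 6 source tokens (stand-in encodings)
decoder_states = rng.normal(size=(3, d_model))   # 3 target tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

print(cross_attention(decoder_states, encoder_out, W_q, W_k, W_v).shape)  # (3, 4)
```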
5.3 Why Two Types Matter
- Self-attention: internal consistency within a sequence
- Cross-attention: alignment across sequences
For translation, self-attention handles surface form; cross-attention handles correspondence.
6. Language Translation: A Concrete Example
For “The cat sat on the mat” → “Le chat s’assit sur le tapis”:
Encoder Phase:
- Ingest English tokens
- 6 layers of self-attention
- Produce context vectors
Decoder Phase:
- Start with <start>
- Self-attention over the current output prefix
- Cross-attention to encoder outputs
- Predict the next token
- Append; repeat until <end>
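In code, this loop looks roughly like the greedy-decoding sketch below; `encoder`, `decoder`, and the token IDs are hypothetical stand-ins for whatever trained model and vocabulary are in use.

```python
def greedy_translate(encoder, decoder, source_ids, start_id, end_id, max_len=50):
    """Greedy decoding: encode the source once, then generate one token at a time."""
    memory = encoder(source_ids)            # context vectors for the source sentence
    output = [start_id]                     # start with <start>
    for _ in range(max_len):
        logits = decoder(output, memory)    # self-attention over prefix + cross-attention
        next_id = int(logits[-1].argmax())  # most likely next token
        output.append(next_id)
        if next_id == end_id:               # stop at <end>
            break
    return output
```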
Cross-Attention in Action: When generating “chat”, the decoder attends across the encoder: “cat” receives high weight; “le” appears; grammatical features align. Cross-attention forges these links.
Similarly, self-attention within the French sequence ensures agreement: “chat” → “s’assit” (third-person singular) and “Le” → “chat” (masculine article), with the gender information flowing through the decoder.
7. Positional Encoding
Attention by itself is permutation-invariant, yet word order clearly matters.
Positional encoding injects that ordering information.
Common choices: fixed sinusoidal patterns or learned embeddings.
Add them to word embeddings before the first layer.
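A minimal NumPy sketch of the fixed sinusoidal variant from the original Transformer paper: each position gets sines and cosines at geometrically spaced frequencies, which are then added to the word embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# embeddings = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(4, 8).shape)      # (4, 8)
```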
8. Why Transformers Work
| Feature | Previous Approaches | Transformers |
|---|---|---|
| Processing | Sequential | Fully parallel |
| Context | Limited by memory | Full sequence |
| Weights | Fixed | Dynamic attention |
| Depth | Degrades after ~10 layers | 100+ layers |
| Scalability | Sequential bottleneck | GPU-friendly |
Together, these support modeling complex, long-range dependencies at scale.
9. Context Windows and Limitations
Limits:
- Fixed-size positional embeddings
- Quadratic attention cost in length
- Memory pressure on long contexts
Prompts beyond the window are truncated.
Extending context windows is an active research area: sparse attention, sliding windows, and hierarchical patterns.
In the meantime, these costs bound practical context lengths.
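A quick back-of-the-envelope illustration of the quadratic cost, assuming float32 scores for a single head in a single layer (which understates real memory use):

```python
# Attention materializes an n x n score matrix: memory grows quadratically with length.
for n in (1_000, 10_000, 100_000):
    score_bytes = n * n * 4                      # float32 scores, one head, one layer
    print(f"{n:>7} tokens -> {score_bytes / 1e6:>8,.0f} MB of attention scores")
```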
10. Transformers: Revolutionizing NLP and Beyond
Transformers changed NLP and are now widely applied elsewhere. The main application areas are:
10.1 NLP Applications
Language Translation: Strong results in machine translation (e.g., Google Translate).
Text Generation: Coherent, context-aware text for chat, content, and translation.
Text Classification: Sentiment analysis, spam detection, topic modeling. Often superior to traditional methods.
Question Answering: Answer complex questions via context, supporting assistants and chatbots.
Summarization: Summarize long documents and articles by extracting main points.
10.2 Beyond NLP
Computer Vision: Image classification, detection, and generation. Vision Transformers (ViTs) match or beat CNNs.
Speech Recognition: Speech-to-text for assistants and voice interfaces.
Time Series Analysis: Forecasting (stock, weather, traffic).
10.3 What Transformers Enable
- Accuracy: strong performance across NLP and vision tasks
- Efficiency: parallel processing vs. sequential RNNs
- Flexibility: NLP, vision, audio, time series
- Scalability: scales to very large models, datasets, and output lengths
11. Summary
Transformers rely on attention: self-attention for context, cross-attention to bridge sequences.
Parallel processing of long sequences underpins modern language models and broader AI.