Neural networks learn by adjusting weights to minimize error.
But how do we know which weight to change and by how much?
The answer is backpropagation — the chain rule applied recursively through layers.
This post derives backpropagation from scratch, then walks through a concrete example where you can trace every number.
1. The Problem: Credit Assignment
Imagine a factory assembly line with 10 workers. A defective product comes out. Who is responsible?
In a neural network:
- Workers = neurons
- Assembly line = layers
- Defective product = wrong prediction
- Responsibility = gradient
The credit assignment problem: how much did each weight contribute to the error?
Backpropagation solves this by propagating error signals backward, layer by layer, using the chain rule.
2. Notation and Setup
Consider a network with $L$ layers. For layer $l$:
| Symbol | Meaning |
|---|---|
| $w_{ij}^{l}$ | Weight from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$ |
| $b_i^{l}$ | Bias of neuron $i$ in layer $l$ |
| $z_i^{l}$ | Pre-activation (weighted sum) of neuron $i$ in layer $l$ |
| $a_i^{l}$ | Activation (output) of neuron $i$ in layer $l$ |
| $f$ | Activation function (e.g., sigmoid, ReLU) |
Layer structure:
```mermaid
graph LR
    subgraph "Layer l-1"
        A1["a₁ˡ⁻¹"]
        A2["a₂ˡ⁻¹"]
        A3["a₃ˡ⁻¹"]
    end
    subgraph "Layer l"
        Z["zᵢˡ = Σ wᵢⱼˡ·aⱼˡ⁻¹ + bᵢˡ"]
        ACT["aᵢˡ = f(zᵢˡ)"]
    end
    A1 -->|"wᵢ₁ˡ"| Z
    A2 -->|"wᵢ₂ˡ"| Z
    A3 -->|"wᵢ₃ˡ"| Z
    Z --> ACT
```
3. Forward Pass
The forward pass computes predictions layer by layer.
Step 1: Weighted Sum (Pre-activation)
$$z_i^l = \sum_j w_{ij}^l \cdot a_j^{l-1} + b_i^l$$
Step 2: Apply Activation
$$a_i^l = f(z_i^l)$$
Common Activation Functions:
| Function | Formula | Derivative |
|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1+e^{-z}}$ | $\sigma(z)(1-\sigma(z))$ |
| Tanh | $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - \tanh^2(z)$ |
| ReLU | $\max(0, z)$ | $\begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}$ |
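The table above translates into one-liners in NumPy. A minimal sketch (the function names are my own, not a standard API):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    # d/dz tanh(z) = 1 - tanh^2(z)
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # Subgradient convention: derivative taken as 0 at z = 0.
    return (np.asarray(z) > 0).astype(float)
```

Note how each derivative reuses the forward value — this is why backpropagation can cache activations instead of recomputing them.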
4. The Loss Function
The loss measures how wrong the prediction is.
For regression (MSE):
$$E = \frac{1}{2}(y - o)^2$$
where $y$ is the target and $o$ is the output.
For binary classification (Binary Cross-Entropy):
$$E = -\left[y\log(o) + (1-y)\log(1-o)\right]$$
For multi-class (Cross-Entropy):
$$E = -\sum_k y_k \log(o_k)$$
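The three losses, sketched in NumPy (function names are illustrative; clipping is added so the logarithms stay finite near 0 and 1):

```python
import numpy as np

def mse(y, o):
    # E = 1/2 * (y - o)^2, averaged over outputs
    return 0.5 * np.mean((y - o) ** 2)

def binary_cross_entropy(y, o, eps=1e-12):
    # Clip o away from {0, 1} so log() stays finite.
    o = np.clip(o, eps, 1.0 - eps)
    return -np.mean(y * np.log(o) + (1.0 - y) * np.log(1.0 - o))

def cross_entropy(y, o, eps=1e-12):
    # y is a one-hot target, o a probability vector (e.g. softmax output).
    return -np.sum(y * np.log(np.clip(o, eps, 1.0)))
```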
5. Backward Pass: The Chain Rule
The key insight: $E$ depends on $w_{ij}^l$ only through $z_i^l$.
By the chain rule:
$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial z_i^l} \cdot \frac{\partial z_i^l}{\partial w_{ij}^l}$$
The second term is simple:
$$\frac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$$
Define the delta term:
$$\delta_i^l = \frac{\partial E}{\partial z_i^l}$$
Then:
$$\frac{\partial E}{\partial w_{ij}^l} = \delta_i^l \cdot a_j^{l-1}$$
The gradient is just delta times the incoming activation. Simple!
6. The Delta Rule
The magic of backpropagation: computing deltas recursively from output to input.
6.1 Delta at Output Layer
For the output layer L, the delta depends on the loss function and activation.
Sigmoid activation + MSE loss:
$$\delta_i^L = (o_i - y_i)\cdot\sigma'(z_i^L) = (o_i - y_i)\cdot o_i(1 - o_i)$$
Sigmoid activation + Binary Cross-Entropy:
$$\delta_i^L = o_i - y_i$$
Softmax + Cross-Entropy:
$$\delta_i^L = o_i - y_i$$
6.2 Delta at Hidden Layers
For hidden layer $l$, the delta depends on deltas from layer $l+1$:
$$\delta_i^l = f'(z_i^l)\cdot\sum_k w_{ki}^{l+1}\cdot\delta_k^{l+1}$$
Interpretation:
- $\delta_k^{l+1}$: Error signal from neuron $k$ in the next layer
- $w_{ki}^{l+1}$: How much neuron $i$ contributed to neuron $k$
- $f'(z_i^l)$: How sensitive the activation was to the input
- Sum: Total error attributed to neuron $i$
```mermaid
graph RL
    subgraph "Layer l+1"
        D1["δ₁ˡ⁺¹"]
        D2["δ₂ˡ⁺¹"]
    end
    subgraph "Layer l"
        DI["δᵢˡ = f'(zᵢˡ) · Σ wₖᵢˡ⁺¹·δₖˡ⁺¹"]
    end
    D1 -->|"w₁ᵢˡ⁺¹"| DI
    D2 -->|"w₂ᵢˡ⁺¹"| DI
```
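The recursion in the diagram is a single line of linear algebra. A sketch, with `hidden_delta` as an illustrative name:

```python
import numpy as np

def hidden_delta(z_l, W_next, delta_next, f_prime):
    # delta^l = f'(z^l) ⊙ (W^{l+1}ᵀ · delta^{l+1}):
    # pull each next-layer delta back through its weight, sum per neuron,
    # then scale by the local activation sensitivity.
    return f_prime(z_l) * (W_next.T @ delta_next)
```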
7. Complete Algorithm
```
BACKPROPAGATION(network, x, y, α)
────────────────────────────────────────
1. FORWARD PASS
   For l = 1 to L:
       z[l] = W[l] · a[l-1] + b[l]
       a[l] = f(z[l])

2. COMPUTE OUTPUT DELTA
   δ[L] = ∂E/∂z[L]    (depends on loss + activation)

3. BACKWARD PASS
   For l = L-1 down to 1:
       δ[l] = f'(z[l]) ⊙ (W[l+1]ᵀ · δ[l+1])

4. GRADIENT DESCENT
   For l = 1 to L:
       W[l] ← W[l] - α · δ[l] · a[l-1]ᵀ
       b[l] ← b[l] - α · δ[l]
────────────────────────────────────────
```
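The pseudocode maps almost line-for-line onto NumPy. Below is a minimal sketch assuming sigmoid activations at every layer and MSE loss; `backprop_step` is a name introduced here, not a standard API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W, b, x, y, alpha):
    """One forward + backward pass with sigmoid layers and MSE loss.

    W, b are lists of per-layer weight matrices and bias vectors.
    Returns updated copies; the inputs are not mutated.
    """
    # 1. Forward pass: cache activations (a) and implicitly pre-activations.
    a = [x]
    for Wl, bl in zip(W, b):
        a.append(sigmoid(Wl @ a[-1] + bl))

    # 2. Output delta for sigmoid + MSE: (o - y) * o * (1 - o).
    o = a[-1]
    delta = (o - y) * o * (1.0 - o)

    # 3 + 4. Backward pass and gradient-descent update, output to input.
    W_new, b_new = list(W), list(b)
    for l in reversed(range(len(W))):
        W_new[l] = W[l] - alpha * np.outer(delta, a[l])
        b_new[l] = b[l] - alpha * delta
        if l > 0:
            s = a[l]  # cached sigmoid activation of layer l
            delta = (W[l].T @ delta) * s * (1.0 - s)
    return W_new, b_new
```

Running one step on the two-layer network of the worked example below reproduces its hand-computed updates to within rounding.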
8. Worked Example: Numbers You Can Trace
Let’s trace backpropagation through a concrete 2-layer network.
Network Setup
```mermaid
graph LR
    subgraph "Input"
        X1["x₁ = 1"]
        X2["x₂ = 0"]
    end
    subgraph "Hidden Layer"
        H1["h₁"]
        H2["h₂"]
    end
    subgraph "Output"
        O["o"]
    end
    X1 -->|"w₁₁=0.5"| H1
    X1 -->|"w₂₁=0.3"| H2
    X2 -->|"w₁₂=0.4"| H1
    X2 -->|"w₂₂=0.6"| H2
    H1 -->|"v₁=0.7"| O
    H2 -->|"v₂=0.8"| O
```
Parameters:
- Input: $x_1 = 1$, $x_2 = 0$
- Hidden weights: $w_{11} = 0.5$, $w_{12} = 0.4$, $w_{21} = 0.3$, $w_{22} = 0.6$
- Output weights: $v_1 = 0.7$, $v_2 = 0.8$
- Biases: all $= 0$ (for simplicity)
- Activation: Sigmoid $\sigma(z) = \frac{1}{1+e^{-z}}$
- Target: $y = 1$
- Learning rate: $\alpha = 0.5$
Step 1: Forward Pass
Hidden layer:
$$z_1^{(1)} = w_{11}x_1 + w_{12}x_2 = 0.5 \times 1 + 0.4 \times 0 = 0.5$$
$$h_1 = \sigma(0.5) = \frac{1}{1+e^{-0.5}} = 0.6225$$
$$z_2^{(1)} = w_{21}x_1 + w_{22}x_2 = 0.3 \times 1 + 0.6 \times 0 = 0.3$$
$$h_2 = \sigma(0.3) = \frac{1}{1+e^{-0.3}} = 0.5744$$
Output layer:
$$z^{(2)} = v_1 h_1 + v_2 h_2 = 0.7 \times 0.6225 + 0.8 \times 0.5744 = 0.8953$$
$$o = \sigma(0.8953) = 0.7101$$
Error: $E = \frac{1}{2}(y-o)^2 = \frac{1}{2}(1 - 0.7101)^2 = 0.0420$
Step 2: Output Delta
Using sigmoid + MSE:
$$\delta^{(2)} = (o - y)\cdot o(1-o) = (0.7101 - 1)\cdot 0.7101\cdot(1 - 0.7101)$$
$$\delta^{(2)} = -0.2899 \times 0.2058 = -0.0596$$
Step 3: Hidden Deltas (Backpropagation)
For $h_1$:
$$\delta_1^{(1)} = h_1(1-h_1)\cdot v_1\cdot\delta^{(2)}$$
$$\delta_1^{(1)} = 0.6225 \times 0.3775 \times 0.7 \times (-0.0596) = -0.0098$$
For $h_2$:
$$\delta_2^{(1)} = h_2(1-h_2)\cdot v_2\cdot\delta^{(2)}$$
$$\delta_2^{(1)} = 0.5744 \times 0.4256 \times 0.8 \times (-0.0596) = -0.0117$$
Step 4: Gradient Computation
Output weights:
$$\frac{\partial E}{\partial v_1} = \delta^{(2)}\cdot h_1 = -0.0596 \times 0.6225 = -0.0371$$
$$\frac{\partial E}{\partial v_2} = \delta^{(2)}\cdot h_2 = -0.0596 \times 0.5744 = -0.0342$$
Hidden weights:
$$\frac{\partial E}{\partial w_{11}} = \delta_1^{(1)}\cdot x_1 = -0.0098 \times 1 = -0.0098$$
$$\frac{\partial E}{\partial w_{12}} = \delta_1^{(1)}\cdot x_2 = -0.0098 \times 0 = 0$$
$$\frac{\partial E}{\partial w_{21}} = \delta_2^{(1)}\cdot x_1 = -0.0117 \times 1 = -0.0117$$
$$\frac{\partial E}{\partial w_{22}} = \delta_2^{(1)}\cdot x_2 = -0.0117 \times 0 = 0$$
Step 5: Weight Update
Using $w_{\text{new}} = w_{\text{old}} - \alpha\cdot\nabla_w E$:
$$v_1^{\text{new}} = 0.7 - 0.5 \times (-0.0371) = 0.7186$$
$$v_2^{\text{new}} = 0.8 - 0.5 \times (-0.0342) = 0.8171$$
$$w_{11}^{\text{new}} = 0.5 - 0.5 \times (-0.0098) = 0.5049$$
$$w_{21}^{\text{new}} = 0.3 - 0.5 \times (-0.0117) = 0.3059$$
($w_{12}$ and $w_{22}$ are unchanged because their gradients are zero for this input.)
Verification: New Forward Pass
With updated weights:
$$h_1^{\text{new}} = \sigma(0.5049 \times 1) = 0.6236$$
$$h_2^{\text{new}} = \sigma(0.3059 \times 1) = 0.5759$$
$$o^{\text{new}} = \sigma(0.7186 \times 0.6236 + 0.8171 \times 0.5759) = \sigma(0.9187) = 0.7148$$
Output increased: $0.7101 \to 0.7148$ (closer to target $y = 1$) ✓
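The verification arithmetic can be replayed in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Updated weights from Step 5 (w12 = 0.4 and w22 = 0.6 are unchanged
# because their gradients were zero for this input).
h1_new = sigmoid(0.5049 * 1 + 0.4 * 0)              # ≈ 0.6236
h2_new = sigmoid(0.3059 * 1 + 0.6 * 0)              # ≈ 0.5759
o_new = sigmoid(0.7186 * h1_new + 0.8171 * h2_new)  # ≈ 0.7148
```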
9. Why It Works: Intuition
The Chain of Blame
Think of backpropagation as assigning blame:
- Output error: “The prediction was too low by 0.29”
- Output weights: "I contributed $h_1 \times 0.29$ of that error"
- Hidden neurons: "I received blame proportional to my output weight"
- Hidden weights: "I contributed $x_i\,\times$ my neuron's blame"
```mermaid
graph RL
    subgraph "Error Flow"
        E["Loss: 0.042"]
        DO["δ_output = -0.0596"]
        DH1["δ_h1 = -0.0098"]
        DH2["δ_h2 = -0.0117"]
    end
    E -->|"∂E/∂o"| DO
    DO -->|"× v₁ × σ'"| DH1
    DO -->|"× v₂ × σ'"| DH2
```
Why Gradients Vanish
Notice: $\delta_1^{(1)} = -0.0098$ but $\delta^{(2)} = -0.0596$.
The hidden delta is 6x smaller because:
- Sigmoid derivative: max $0.25$ at $z = 0$
- Each layer multiplies by $\sigma'(z) \le 0.25$
With 10 layers: $0.25^{10} \approx 0.00000095$
This is the vanishing gradient problem — why ReLU and residual connections matter.
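A quick computation confirms the shrinkage:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The sigmoid derivative peaks at z = 0 with value 0.25, so an error
# signal shrinks by at least 4x for every sigmoid layer it crosses.
best_case = sigmoid_prime(0.0)     # 0.25
after_10_layers = best_case ** 10  # ≈ 9.5e-7
```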
10. Common Pitfalls
10.1 Numerical Instability
Problem: Sigmoid saturates for $|z| > 5$, making $\sigma'(z) \approx 0$.
Solution: Use ReLU, batch normalization, or careful weight initialization.
10.2 Learning Rate
| Too Large | Too Small |
|---|---|
| Overshoots minimum | Converges slowly |
| Loss oscillates | Gets stuck |
| May diverge | Wastes compute |
Rule of thumb: Start with $\alpha = 0.001$, use learning rate schedulers.
10.3 Gradient Checking
Always verify your implementation with numerical gradients:
$$\frac{\partial E}{\partial w} \approx \frac{E(w+\epsilon) - E(w-\epsilon)}{2\epsilon}$$
Compare with your analytical gradient. They should match to $\sim 10^{-7}$.
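A minimal central-difference checker, here applied to the MSE output gradient $\partial E/\partial o = o - y$, whose closed form we derived above (`numerical_grad` is an illustrative name):

```python
def numerical_grad(loss_fn, w, eps=1e-5):
    # Central difference: (E(w + eps) - E(w - eps)) / (2 * eps)
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

# Check dE/do = o - y at the worked example's output o = 0.7101, y = 1.
loss = lambda o: 0.5 * (1.0 - o) ** 2
analytic = 0.7101 - 1.0
numeric = numerical_grad(loss, 0.7101)
```

In a real network you would loop this over every weight, perturbing one at a time while keeping the rest fixed.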
Summary
| Concept | Formula |
|---|---|
| Pre-activation | $z_i^l = \sum_j w_{ij}^l a_j^{l-1} + b_i^l$ |
| Activation | $a_i^l = f(z_i^l)$ |
| Delta (output) | $\delta_i^L = \frac{\partial E}{\partial o_i}\cdot f'(z_i^L)$ |
| Delta (hidden) | $\delta_i^l = f'(z_i^l)\sum_k w_{ki}^{l+1}\delta_k^{l+1}$ |
| Weight gradient | $\frac{\partial E}{\partial w_{ij}^l} = \delta_i^l \cdot a_j^{l-1}$ |
| Update rule | $w_{ij}^l \leftarrow w_{ij}^l - \alpha\cdot\delta_i^l\cdot a_j^{l-1}$ |
Further Reading
References:
- Rumelhart, Hinton, Williams. “Learning representations by back-propagating errors.” Nature, 1986.
- Goodfellow, Bengio, Courville. “Deep Learning.” MIT Press, 2016.