© 2026 Sarath Tharayil

Vanishing Gradient Problem

Why deep networks and recurrent models struggle to learn when the error signal fades out before reaching the early layers.

Deep Learning · Training · RNN
2 MIN READ · May 11, 2025
definition.md
// gradient → 0 as depth increases, early layers stop learning
The vanishing gradient problem occurs when gradients shrink to near-zero as they travel backward through deep networks, causing early layers to receive almost no learning signal and effectively stop training.

Backpropagation works by multiplying gradients together as it moves backward through each layer. If each of those gradients is a small number, the product shrinks very fast. Across ten layers with gradients of 0.1: 0.1 multiplied by itself ten times is 0.0000000001. The signal essentially disappears before it reaches the early layers.
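That arithmetic can be checked directly. A minimal sketch in plain Python, using the article's illustrative per-layer gradient of 0.1:

```python
# Backprop multiplies one local gradient per layer into the signal
# that reaches the early layers.
def backprop_signal(local_grad, depth):
    signal = 1.0
    for _ in range(depth):
        signal *= local_grad
    return signal

print(backprop_signal(0.1, 10))   # ~1e-10: early layers see almost nothing
print(backprop_signal(0.9, 100))  # even mild shrinkage compounds with depth
```

Even a per-layer factor of 0.9, which sounds harmless, leaves roughly 0.00003 of the signal after a hundred layers.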

The result is a network that is deep on paper but shallow in practice. Only the last few layers actually update. The early layers, which are supposed to learn basic features that everything else builds on, are effectively frozen.

Why It Blocked Deep Learning for Years

This problem is the reason deep networks were largely abandoned through the 1990s and early 2000s. Researchers could design architectures with many layers, but they simply would not train. Adding depth made things worse, not better. The prevailing view became that shallow networks with hand-engineered features were more practical.

The breakthrough came from multiple directions at once. ReLU activation functions replaced sigmoid: sigmoid's gradient collapses toward zero at extreme values, and even at its best it peaks at 0.25, so every sigmoid layer shrinks the gradient by at least a factor of four, while ReLU keeps the gradient at 1 for positive inputs. Better weight initialization kept starting gradients in a stable range. Batch normalization prevented activations from drifting into saturating ranges during training.
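The sigmoid-versus-ReLU contrast is easy to see numerically. A small sketch with hand-rolled derivatives, not tied to any framework:

```python
import math

def sigmoid_grad(x):
    # Derivative of sigmoid is s * (1 - s); it peaks at 0.25 when x == 0.
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # ReLU's derivative is a constant 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

print(sigmoid_grad(0.0))  # 0.25, sigmoid's best case
print(sigmoid_grad(5.0))  # ~0.0066, near saturation
print(relu_grad(5.0))     # 1.0, no shrinkage
```

Stacking sigmoid layers multiplies in at most 0.25 per layer; stacking ReLU layers can pass the gradient through untouched wherever the input is positive.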

Residual connections, introduced in ResNet in 2015, were the biggest single step forward. They add a shortcut from each layer's input directly to its output, creating a path for gradients to travel without passing through every intermediate multiplication. With residual connections, networks with hundreds of layers became trainable.
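Why the shortcut helps follows from the derivative: for y = x + f(x), dy/dx = 1 + f'(x), so the gradient always has a multiplication-free path through the added 1. A toy comparison, assuming an illustrative weak per-layer local gradient of 0.01 (not a number from the text):

```python
# Plain stack: backprop multiplies the local gradient at every layer.
def plain_grad(local, depth):
    g = 1.0
    for _ in range(depth):
        g *= local
    return g

# Residual stack: each layer contributes (1 + local), because the
# shortcut adds the identity's gradient of 1 to the layer's own f'(x).
def residual_grad(local, depth):
    g = 1.0
    for _ in range(depth):
        g *= (1.0 + local)
    return g

print(plain_grad(0.01, 50))     # vanishes far below float precision's reach
print(residual_grad(0.01, 50))  # stays above 1: the shortcut keeps the path open
```

The point is not the exact numbers but the structure: even if f'(x) were exactly zero at every layer, the residual gradient would still be 1, never 0.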

The problem also explains why recurrent networks struggled with long sequences. To learn something from the beginning of a long sequence, the gradient had to survive many time steps of backpropagation. It usually did not. LSTMs added gating mechanisms to help. Transformers sidestep the issue more completely by not being sequential at all, so there is no long chain of multiplications to collapse.
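The same compounding applies across time steps. A sketch where a scalar w stands in for the repeated recurrent Jacobian factor (a simplification of the real matrix case):

```python
# Backprop through time: the gradient reaching step 0 is proportional
# to the recurrent factor raised to the sequence length.
def bptt_grad(w, steps):
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

print(bptt_grad(0.95, 100))  # ~0.006: the signal from the start is nearly gone
print(bptt_grad(1.0, 100))   # 1.0: an LSTM gate held fully open behaves like w == 1
```

This is the intuition behind LSTM gating: when the forget gate sits near 1, the gradient along the cell state is multiplied by roughly 1 at each step instead of by a shrinking factor.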

/ RELATED CONCEPTS
Backpropagation: The algorithm that calculates how much each weight in a network contributed to the error, making gradient descent possible at scale.
Gradient Descent: The algorithm that teaches a neural network to get better over time by nudging its weights in the right direction.
Activation Functions: The non-linear functions applied after each layer that give neural networks the ability to learn complex patterns.
Transformers: The architecture behind almost every modern AI model, from ChatGPT to translation to image generation.