© 2026 Sarath Tharayil

Vanishing Gradient Problem

Why deep networks and recurrent models struggle to learn when the error signal fades out before reaching the early layers.

Deep Learning · Training · RNN
2 MIN READ · May 11, 2025
definition.md
// gradient → 0 as depth increases, early layers stop learning
The vanishing gradient problem occurs when gradients shrink to near-zero as they travel backward through deep networks, causing early layers to receive almost no learning signal and effectively stop training.

Backpropagation works by multiplying gradients together as it moves backward through each layer. If each of those gradients is a small number, the product shrinks very fast. Across ten layers with gradients of 0.1: 0.1 multiplied by itself ten times is 0.0000000001. The signal essentially disappears before it reaches the early layers.
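That arithmetic can be checked directly. A minimal sketch in plain Python, using the article's illustrative per-layer gradient of 0.1:

```python
# Backprop multiplies one local gradient per layer into the signal
# that reaches the early layers.
def backprop_signal(local_grad, depth):
    signal = 1.0
    for _ in range(depth):
        signal *= local_grad
    return signal

print(backprop_signal(0.1, 10))   # ~1e-10: early layers see almost nothing
print(backprop_signal(0.9, 100))  # even mild shrinkage compounds with depth
```

Even a per-layer factor of 0.9, which sounds harmless, leaves roughly 0.00003 of the signal after a hundred layers.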

The result is a network that is deep on paper but shallow in practice. Only the last few layers actually update. The early layers, which are supposed to learn basic features that everything else builds on, are effectively frozen.

Why It Blocked Deep Learning for Years

This problem is the reason deep networks were largely abandoned through the 1990s and early 2000s. Researchers could design architectures with many layers, but they simply would not train. Adding depth made things worse, not better. The prevailing view became that shallow networks with hand-engineered features were more practical.

The breakthrough came from multiple directions at once. ReLU activation functions replaced sigmoid: sigmoid's gradient collapses toward zero at extreme values, and even at its best it peaks at 0.25, so every sigmoid layer shrinks the gradient by at least a factor of four, while ReLU keeps the gradient at 1 for positive inputs. Better weight initialization kept starting gradients in a stable range. Batch normalization prevented activations from drifting into saturating ranges during training.
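The sigmoid-versus-ReLU contrast is easy to see numerically. A small sketch with hand-rolled derivatives, not tied to any framework:

```python
import math

def sigmoid_grad(x):
    # Derivative of sigmoid is s * (1 - s); it peaks at 0.25 when x == 0.
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # ReLU's derivative is a constant 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

print(sigmoid_grad(0.0))  # 0.25, sigmoid's best case
print(sigmoid_grad(5.0))  # ~0.0066, near saturation
print(relu_grad(5.0))     # 1.0, no shrinkage
```

Stacking sigmoid layers multiplies in at most 0.25 per layer; stacking ReLU layers can pass the gradient through untouched wherever the input is positive.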

Residual connections, introduced in ResNet in 2015, were the biggest single step forward. They add a shortcut from each layer's input directly to its output, creating a path for gradients to travel without passing through every intermediate multiplication. With residual connections, networks with hundreds of layers became trainable.
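Why the shortcut helps follows from the derivative: for y = x + f(x), dy/dx = 1 + f'(x), so the gradient always has a multiplication-free path through the added 1. A toy comparison, assuming an illustrative weak per-layer local gradient of 0.01 (not a number from the text):

```python
# Plain stack: backprop multiplies the local gradient at every layer.
def plain_grad(local, depth):
    g = 1.0
    for _ in range(depth):
        g *= local
    return g

# Residual stack: each layer contributes (1 + local), because the
# shortcut adds the identity's gradient of 1 to the layer's own f'(x).
def residual_grad(local, depth):
    g = 1.0
    for _ in range(depth):
        g *= (1.0 + local)
    return g

print(plain_grad(0.01, 50))     # vanishes far below float precision's reach
print(residual_grad(0.01, 50))  # stays above 1: the shortcut keeps the path open
```

The point is not the exact numbers but the structure: even if f'(x) were exactly zero at every layer, the residual gradient would still be 1, never 0.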

The problem also explains why recurrent networks struggled with long sequences. To learn something from the beginning of a long sequence, the gradient had to survive many time steps of backpropagation. It usually did not. LSTMs added gating mechanisms to help. Transformers sidestep the issue more completely by not being sequential at all, so there is no long chain of multiplications to collapse.
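The same compounding applies across time steps. A sketch where a scalar w stands in for the repeated recurrent Jacobian factor (a simplification of the real matrix case):

```python
# Backprop through time: the gradient reaching step 0 is proportional
# to the recurrent factor raised to the sequence length.
def bptt_grad(w, steps):
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

print(bptt_grad(0.95, 100))  # ~0.006: the signal from the start is nearly gone
print(bptt_grad(1.0, 100))   # 1.0: an LSTM gate held fully open behaves like w == 1
```

This is the intuition behind LSTM gating: when the forget gate sits near 1, the gradient along the cell state is multiplied by roughly 1 at each step instead of by a shrinking factor.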

/ RELATED CONCEPTS
Backpropagation: The algorithm that calculates how much each weight in a network contributed to the error, making gradient descent possible at scale.
Gradient Descent: The algorithm that teaches a neural network to get better over time by nudging its weights in the right direction.
Activation Functions: The non-linear functions applied after each layer that give neural networks the ability to learn complex patterns.
Transformers: The architecture behind almost every modern AI model, from ChatGPT to translation to image generation.