© 2026 Sarath Tharayil

Gradient Descent

The algorithm that teaches a neural network to get better over time by nudging its weights in the right direction.

Deep Learning · Optimization · Training
2 MIN READ · May 11, 2025
definition.md
// iterative weight optimization via loss minimization
Gradient descent is an optimization algorithm that adjusts a model's weights by repeatedly moving in the direction that reduces the loss, one small step at a time.

Think of the loss as a hilly landscape. Every possible combination of weights corresponds to a point in that landscape, and the height at each point is the loss. You want to find the lowest valley. Gradient descent works by asking: which direction is downhill from here? Then it takes a small step that way.

The gradient is the mathematical answer to that question. It tells you the slope at your current position: how much the loss changes if you nudge each weight slightly. By moving in the opposite direction of the gradient, you move downhill.

The size of each step is controlled by the learning rate. Too large and you overshoot the bottom and bounce around. Too small and training takes forever.
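The whole idea fits in a few lines. Here is a minimal sketch on a toy one-dimensional loss, f(w) = (w − 3)², whose gradient is 2(w − 3) and whose minimum sits at w = 3 (the function and parameter values are illustrative, not from the article):

```python
# Gradient descent on f(w) = (w - 3)**2.
# The gradient is 2 * (w - 3); the minimum is at w = 3.
def gradient_descent(w=0.0, learning_rate=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)         # slope at the current position
        w -= learning_rate * grad  # step downhill, opposite the gradient
    return w

print(gradient_descent())  # converges close to 3.0
```

Try a learning rate above 1.0 in this example and each step overshoots by more than it corrects, so w bounces away from the minimum instead of toward it; try 0.001 and a hundred steps barely move it.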

How It Became the Engine of Deep Learning

Before neural networks had a practical way to improve their weights, training deep models was essentially impossible. A network could make predictions, and you could measure how wrong they were, but there was no efficient way to translate that wrongness into specific weight changes across millions of parameters.

Gradient descent, combined with backpropagation, solved this. Every training step follows the same loop: the network makes a prediction, the error is measured, gradients are computed for every weight, and the weights are updated. Then it repeats, millions of times. This loop is what training is.
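That loop can be sketched end to end on a tiny model. The example below fits a single weight and bias to data generated from y = 2x + 1, computing the mean-squared-error gradients by hand (the data, names, and hyperparameters are my own illustration, not from the article):

```python
import numpy as np

# Toy "network": pred = w * x + b, trained with the
# predict -> measure -> gradient -> update loop.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1                        # target relationship to recover

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    pred = w * x + b                 # 1. make a prediction
    error = pred - y                 # 2. measure how wrong it is
    grad_w = 2 * np.mean(error * x)  # 3. gradients of the MSE loss
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w                 # 4. update the weights
    b -= lr * grad_b

print(w, b)  # approaches w = 2, b = 1
```

In a real network the only step that changes is step 3: backpropagation computes those gradients automatically for every weight, however many layers deep.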

In practice, computing the gradient over the entire dataset at once is slow. Modern training uses mini-batches, small random subsets of data per step. The gradient is noisier but training is much faster. This variant is called stochastic gradient descent, and nearly everything in deep learning runs on it.
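The mini-batch variant is a one-line change to the loop above: sample a small random subset each step and compute the gradient on that subset only. A sketch, with the same illustrative toy data:

```python
import numpy as np

# Stochastic (mini-batch) gradient descent: each step uses a small
# random batch, so the gradient is noisier but each update is cheap.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
y = 2 * x + 1

w, b, lr, batch_size = 0.0, 0.0, 0.1, 32
for step in range(2000):
    idx = rng.integers(0, len(x), size=batch_size)  # random mini-batch
    xb, yb = x[idx], y[idx]
    error = w * xb + b - yb
    w -= lr * 2 * np.mean(error * xb)  # gradient estimated from the batch
    b -= lr * 2 * np.mean(error)

print(w, b)  # still approaches w = 2, b = 1
```

Each update here touches 32 points instead of 10,000, which is the whole trade: many cheap, noisy steps instead of few expensive, exact ones.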

One honest limitation: gradient descent does not guarantee finding the absolute best set of weights. The loss landscape has many hills and valleys, and you might settle into a local minimum rather than the global one. In practice this matters less than you would expect, because large networks have enough redundancy that most local minima are good enough.

/ RELATED CONCEPTS
Backpropagation: The algorithm that calculates how much each weight in a network contributed to the error, making gradient descent possible at scale.
Vanishing Gradient Problem: Why deep networks and recurrent models struggle to learn when the error signal fades out before reaching the early layers.
Activation Functions: The non-linear functions applied after each layer that give neural networks the ability to learn complex patterns.
Transformers: The architecture behind almost every modern AI model, from ChatGPT to translation to image generation.
← BACK TO CONCEPTS