Gradient Descent
The algorithm that teaches a neural network to get better over time by nudging its weights in the right direction.
Think of the loss as a hilly landscape. Every possible combination of weights corresponds to a point in that landscape, and the height at each point is the loss. You want to find the lowest valley. Gradient descent works by asking: which direction is downhill from here? Then it takes a small step that way.
The gradient is the mathematical answer to that question. It tells you the slope at your current position: how much the loss changes if you nudge each weight slightly. Moving in the opposite direction of the gradient moves you downhill.
The size of each step is controlled by the learning rate. Too large and you overshoot the bottom and bounce around. Too small and training takes forever.
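To make the update rule concrete, here is a minimal sketch in plain Python. The one-dimensional loss (w - 3)^2 is made up purely for illustration; the point is the single line that subtracts the learning rate times the gradient from the weight.

```python
def loss(w):
    # A toy one-dimensional loss: a parabola with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the loss above: d/dw (w - 3)^2 = 2 * (w - 3).
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # step size

for step in range(50):
    # Move against the gradient: that is the downhill direction.
    w = w - learning_rate * gradient(w)

print(w)  # ends up close to 3.0, the bottom of the valley
```

Try changing the learning rate: at 1.1 the weight overshoots and diverges, at 0.001 it crawls toward the minimum and 50 steps are nowhere near enough.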
How It Became the Engine of Deep Learning
Before neural networks had a practical way to improve their weights, training deep models was essentially impossible. A network could make predictions, and you could measure how wrong they were, but there was no efficient way to translate that wrongness into specific weight changes across millions of parameters.
Gradient descent, combined with backpropagation, solved this. Every training step follows the same loop: the network makes a prediction, the error is measured, gradients are computed for every weight, and the weights are updated. Then it repeats, millions of times. This loop is what training is.
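Here is a sketch of that loop using PyTorch. The tiny linear model, random data, and hyperparameters are placeholders chosen only to make the four steps visible, not a training recipe.

```python
import torch

# Toy data: 100 examples with 10 features each, plus the targets to predict.
x = torch.randn(100, 10)
y = torch.randn(100, 1)

model = torch.nn.Linear(10, 1)  # a minimal one-layer "network"
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    prediction = model(x)           # 1. the network makes a prediction
    loss = loss_fn(prediction, y)   # 2. the error is measured
    optimizer.zero_grad()           #    clear gradients from the previous step
    loss.backward()                 # 3. gradients are computed for every weight (backpropagation)
    optimizer.step()                # 4. the weights are updated
```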
In practice, computing the gradient over the entire dataset at once is slow. Modern training uses mini-batches, small random subsets of data per step. The gradient is noisier, but training is much faster. This mini-batch variant is called stochastic gradient descent (SGD), and nearly everything in deep learning runs on it.
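The same loop adapts to mini-batches by sampling a random subset of examples at each step. The dataset size and the batch size of 32 below are arbitrary; the only change from the full-batch version is the random index selection.

```python
import torch

x = torch.randn(10_000, 10)   # pretend this is a large dataset
y = torch.randn(10_000, 1)

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch_size = 32

for step in range(1000):
    # Each step sees only a small random subset of the data, not all 10,000 examples.
    idx = torch.randint(0, x.shape[0], (batch_size,))
    prediction = model(x[idx])
    loss = loss_fn(prediction, y[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```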
One honest limitation: gradient descent does not guarantee finding the absolute best set of weights. The loss landscape has many hills and valleys, and you might settle into a local minimum rather than the global one. In practice this matters less than you would expect, because large networks have enough redundancy that most local minima are good enough.