Activation Functions
The non-linear functions applied after each layer that give neural networks the ability to learn complex patterns.
Without activation functions, a neural network is just a sequence of matrix multiplications. No matter how many layers you stack, the math collapses into a single linear transformation. The model could only ever learn straight-line relationships, which rules out almost everything interesting in the real world.
After each layer's computation, you apply a non-linear function to the result. This breaks the linearity and gives the network the ability to learn curves, edges, patterns, and structure. Without it, depth would be pointless.
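As a concrete illustration, here is a minimal NumPy sketch (the weight matrices are random and purely hypothetical): two linear layers with no activation give exactly the same output as a single merged layer, and inserting ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # toy input vector
W1 = rng.normal(size=(4, 3))     # first layer's weights (made up for illustration)
W2 = rng.normal(size=(2, 4))     # second layer's weights

# Two linear layers with no activation collapse into one matrix multiply.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))        # True: the depth added nothing

# A non-linearity between the layers (here ReLU) breaks the collapse.
relu = lambda z: np.maximum(z, 0.0)
with_activation = W2 @ relu(W1 @ x)
print(np.allclose(with_activation, one_layer))   # almost always False
```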
The Search for the Right Function
The original choice was the sigmoid function, a smooth S-curve that squashes any input to a value between 0 and 1. It made intuitive sense as a way to model whether a neuron is on or off. But it had a serious flaw: when inputs are far from zero in either direction, the sigmoid's gradient shrinks toward zero. This directly fed the vanishing gradient problem and made deep networks slow to train.
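A small NumPy sketch makes the problem visible: the sigmoid's derivative peaks at 0.25 at zero and decays rapidly as the input moves away from zero in either direction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of sigmoid; maximum value is 0.25 at z = 0

for z in [0.0, 2.0, 5.0, 10.0, -10.0]:
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.5f}  gradient={sigmoid_grad(z):.6f}")
# Far from zero the gradient is nearly 0, so almost no learning
# signal flows back through the layer during training.
```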
ReLU replaced it and became the dominant choice for most of the past decade. The function is almost embarrassingly simple: if the input is positive, pass it through unchanged; if negative, output zero. That is the whole thing. But the gradient for positive values is always exactly 1, not shrinking, which made deep networks far more practical to train. Something that simple should not have made such a large difference. It did.
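A minimal NumPy version of ReLU and its gradient, just to make the definition concrete:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)        # positives pass through unchanged, negatives become 0

def relu_grad(z):
    return (z > 0).astype(float)     # exactly 1 for positive inputs, 0 otherwise

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # negatives zeroed out, positives unchanged
print(relu_grad(z))   # gradient does not shrink for positive inputs
```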
Variants followed. Leaky ReLU keeps a tiny gradient for negative inputs instead of cutting them to zero, which helps when large portions of a network stop activating (the "dying ReLU" problem). GELU produces a smooth curve similar to ReLU but with a gentler transition around zero and is now the standard in modern transformers. SiLU, also known as Swish, is another smooth variant used in several recent architectures. Sketches of all three follow below.
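For reference, here are minimal NumPy sketches of the three variants. The GELU shown uses the common tanh approximation rather than the exact Gaussian CDF form, and the slope for Leaky ReLU is a typical default, not a fixed standard.

```python
import numpy as np

def leaky_relu(z, negative_slope=0.01):
    # Keeps a small slope for negative inputs instead of zeroing them out.
    return np.where(z > 0, z, negative_slope * z)

def gelu(z):
    # Tanh approximation of GELU, widely used in transformer implementations.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def silu(z):
    # SiLU / Swish: x * sigmoid(x); smooth, with a gentle dip for negative inputs.
    return z / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
for name, f in [("leaky_relu", leaky_relu), ("gelu", gelu), ("silu", silu)]:
    print(name, np.round(f(z), 3))
```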
The choice rarely gets much attention outside of research papers, but it shapes what a network can represent and how fast it learns. Most practitioners default to ReLU for general use and GELU for transformer-based models, and those defaults hold up well in practice.