
Transformers

The architecture behind almost every modern AI model, from ChatGPT to translation to image generation.

Deep Learning · NLP · Attention
1 MIN READ · May 11, 2025
definition.md
// parallel sequence processing via self-attention
A transformer is a neural network architecture that processes sequences by letting every element attend directly to every other element at the same time, rather than reading through them one by one.

The key mechanism is called self-attention. For each word in a sentence, the model asks: which other words are most relevant to understanding this one? It computes a relationship score between every pair of words and uses those scores to build a richer representation of each word in context.

This happens for every word simultaneously, in a single step. There is no sequential chain. Every position can influence every other position directly.
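To make the mechanism concrete, here is a minimal sketch of single-head self-attention in Python with NumPy. It is illustrative only: the projection matrices are random rather than learned, and real transformers add multiple heads, masking, and positional information on top of this.

attention_sketch.py

import numpy as np

def self_attention(X):
    """Single-head self-attention over a sequence of embeddings X, shape (n_words, d)."""
    d = X.shape[-1]
    # Learned projections in a real model; random here purely for illustration.
    Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Relationship score between every pair of positions: an (n_words, n_words) matrix.
    scores = Q @ K.T / np.sqrt(d)

    # Softmax turns each row into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output is a weighted mix of all value vectors: a context-aware representation.
    return weights @ V

# Toy example: 4 "words", each an 8-dimensional embedding.
X = np.random.randn(4, 8)
print(self_attention(X).shape)  # (4, 8): one enriched vector per word

The score matrix covers all pairs in a single matrix multiplication, which is exactly why every position can attend to every other position in one fully parallel step.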

Why It Changed Everything

Before transformers, the dominant models were recurrent networks. They read text word by word, carrying a compressed memory forward at each step. The problem was that this memory had limits. By the time the model reached word 50, it had largely forgotten word 5. Long-range relationships were hard to capture, training was slow because nothing could run in parallel, and making networks deeper made the problems worse.

The transformer, introduced in a 2017 paper called Attention Is All You Need, replaced all of that. Because everything runs in parallel and every word can directly attend to every other word, long-range dependencies stopped being a problem. Training became dramatically faster. And the architecture scaled: as you add more data and compute, performance keeps improving in a predictable way.

The result is what we have now. GPT, BERT, Claude, Gemini, the models behind translation, image generation, protein folding: all of them are built on the transformer. The core idea turned out to be general enough to apply far beyond language, and we are still discovering where else it works.

/ RELATED CONCEPTS
Activation Functions: The non-linear functions applied after each layer that give neural networks the ability to learn complex patterns.
Backpropagation: The algorithm that calculates how much each weight in a network contributed to the error, making gradient descent possible at scale.
Gradient Descent: The algorithm that teaches a neural network to get better over time by nudging its weights in the right direction.
Vanishing Gradient Problem: Why deep networks and recurrent models struggle to learn when the error signal fades out before reaching the early layers.