
Transformers

The architecture behind almost every modern AI model, from ChatGPT to translation to image generation.

Deep Learning · NLP · Attention
1 MIN READ · May 11, 2025
definition.md
// parallel sequence processing via self-attention
A transformer is a neural network architecture that processes sequences by letting every element attend directly to every other element at the same time, rather than reading through them one by one.

The key mechanism is called self-attention. For each word in a sentence, the model asks: which other words are most relevant to understanding this one? It computes a relationship score between every pair of words and uses those scores to build a richer representation of each word in context.

This happens for every word simultaneously, in a single step. There is no sequential chain. Every position can influence every other position directly.
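To make the mechanism concrete, here is a minimal sketch of single-head self-attention in Python with NumPy. It is illustrative only: the projection matrices are random rather than learned, and real transformers add multiple heads, masking, and positional information on top of this.

attention_sketch.py

import numpy as np

def self_attention(X):
    """Single-head self-attention over a sequence of embeddings X, shape (n_words, d)."""
    d = X.shape[-1]
    # Learned projections in a real model; random here purely for illustration.
    Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Relationship score between every pair of positions: an (n_words, n_words) matrix.
    scores = Q @ K.T / np.sqrt(d)

    # Softmax turns each row into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output is a weighted mix of all value vectors: a context-aware representation.
    return weights @ V

# Toy example: 4 "words", each an 8-dimensional embedding.
X = np.random.randn(4, 8)
print(self_attention(X).shape)  # (4, 8): one enriched vector per word

The score matrix covers all pairs in a single matrix multiplication, which is exactly why every position can attend to every other position in one fully parallel step.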

Why It Changed Everything

Before transformers, the dominant models were recurrent networks. They read text word by word, carrying a compressed memory forward at each step. The problem was that this memory had limits. By the time the model reached word 50, it had largely forgotten word 5. Long-range relationships were hard to capture, training was slow because nothing could run in parallel, and making networks deeper made the problems worse.

The transformer, introduced in a 2017 paper called Attention Is All You Need, replaced all of that. Because everything runs in parallel and every word can directly attend to every other word, long-range dependencies stopped being a problem. Training became dramatically faster. And the architecture scaled: as you add more data and compute, performance keeps improving in a predictable way.

The result is what we have now. GPT, BERT, Claude, Gemini, the models behind translation, image generation, protein folding: all of them are built on the transformer. The core idea turned out to be general enough to apply far beyond language, and we are still discovering where else it works.

/ RELATED CONCEPTS
Activation Functions: The non-linear functions applied after each layer that give neural networks the ability to learn complex patterns.
Backpropagation: The algorithm that calculates how much each weight in a network contributed to the error, making gradient descent possible at scale.
Gradient Descent: The algorithm that teaches a neural network to get better over time by nudging its weights in the right direction.
Vanishing Gradient Problem: Why deep networks and recurrent models struggle to learn when the error signal fades out before reaching the early layers.