Transformers — The Architecture Behind Everything
The Transformer (Vaswani et al., 2017 — Attention Is All You Need) is the foundation of every modern LLM. GPT-4, Claude, LLaMA, Gemini — all of them are Transformer variants. You do not need to train one from scratch, but you need to understand every component at a conceptual and mathematical level.
Why Transformers? The Problem They Solved
Before Transformers, sequence models were Recurrent Neural Networks (RNNs). RNNs read tokens one by one, left to right, compressing everything seen so far into a single fixed-size hidden state vector.
This created two crippling problems:
- The bottleneck problem — all information had to squeeze through one vector. A 10,000-word document reduced to ~512 numbers. Long-range dependencies evaporated.
- The speed problem — sequential processing cannot be parallelised. You had to wait for token t−1 to finish before processing token t. Training on billions of tokens was not practical.
Transformers solved both by abandoning sequential processing entirely. Every token attends to every other token simultaneously and in parallel. This is the core insight.
The Attention Mechanism
Attention lets each token decide which other tokens are most relevant to understanding its meaning.
Consider: “The cat sat on the mat because it was tired.”
To understand “it”, you need to know it refers to “cat” and not “mat”. A human reading the sentence has no trouble — but an RNN loses “cat” to the bottleneck by the time it processes “it”. Attention solves this by letting “it” directly query every other token and discover that “cat” is the most relevant.
Intuition (The search engine analogy)
Think of attention as a tiny search engine inside the model. For each token (your query), you have a database of all other tokens. Each database entry has a key (a summary of what it offers) and a value (the actual information it carries). You compute how relevant your query is to each key, turn those scores into weights with softmax, then retrieve a weighted blend of values. High relevance = more of that token’s meaning flows into yours.
For each token, three vectors are computed by multiplying the token embedding by three separate learned weight matrices:
- Q (Query) — “What am I looking for?”
- K (Key) — “What do I have to offer?”
- V (Value) — “What information do I actually carry?”
The attention weights are computed and applied in one formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The √d_k scaling is crucial: in high dimensions (e.g., d_k = 64), raw dot products grow proportionally to √d_k, pushing softmax into a near-zero-gradient region and killing learning. Dividing by √d_k keeps the scores in a numerically stable range.
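A quick numerical sanity check (a minimal NumPy sketch, not from the original text): dot products of random vectors with unit-variance components have standard deviation roughly √d_k, and dividing by √d_k brings it back to roughly 1 regardless of dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    # 1,000 random query/key pairs with unit-variance components
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    raw = (q * k).sum(axis=-1)         # raw dot products: std grows with sqrt(d_k)
    scaled = raw / np.sqrt(d_k)        # scaled as in the attention formula: std stays ~1
    print(f"d_k={d_k:4d}  raw std={raw.std():6.2f}  scaled std={scaled.std():5.2f}")
```

Without scaling, the softmax inputs at d_k = 256 are ~16× larger than at d_k = 1, which is exactly the saturation the formula guards against.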
Inspecting attention weights for this sentence shows "it" attending most strongly to "cat" (≈38% of its attention weight) — the model correctly resolves the pronoun. This is coreference resolution emerging from training, not from explicit rules.
```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # mask == 0 → set to -1e9 so softmax gives ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights
```

Multi-Head Attention
A single attention head learns one type of relationship. Run h heads in parallel, each with its own W_Q, W_K, W_V matrices — each head specialises in something different.
```python
# MultiHeadAttention intuition:
# 1. Project Q, K, V into h subspaces (each of dim d_model / h)
# 2. Run scaled dot-product attention independently in each head
# 3. Concatenate all h outputs → shape (seq_len, h * d_v)
# 4. Apply linear projection W_O → back to (seq_len, d_model)

# Different heads learn different relationship types:
# Head 1: syntactic (subject–verb agreement)
# Head 2: coreference (pronouns → nouns)
# Head 3: positional (adjacent token patterns)
# Head 4+: semantic clusters, domain-specific patterns
```

Tip (GPT-2 small numbers to have ready)
d_model = 768 | n_heads = 12 (each head is 64-dim) | n_layers = 12 | vocab_size = 50,257 | context = 1,024 tokens. These are the classic reference dimensions interviewers expect you to know.
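The four steps above can be sketched end to end. A minimal NumPy implementation (names and initialisation are illustrative, not a production module), using the GPT-2-small dimensions for concreteness:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) learned projections."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # 1. Project, then split into h subspaces of dim d_model / h
    def project(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)      # each (h, seq, d_head)
    # 2. Scaled dot-product attention independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq, seq)
    out = softmax(scores) @ V                            # (h, seq, d_head)
    # 3. Concatenate heads → (seq, h * d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    # 4. Final linear projection back to d_model
    return out @ Wo

# GPT-2-small shapes: d_model=768, 12 heads of 64 dims each
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 768, 12, 10
Ws = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4)]
y = multi_head_attention(rng.standard_normal((seq_len, d_model)), *Ws, n_heads)
print(y.shape)  # (10, 768)
```

Note that splitting into heads does not add parameters over a single wide head — it partitions the same d_model dimensions into h independent attention computations.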
Full Transformer Architecture
Encoder block (BERT-style — bidirectional):
Input → Tokenisation → Token Embeddings + Positional Encoding → Multi-Head Self-Attention (attends to ALL positions) → Add & Norm (residual connection + LayerNorm) → Feed-Forward Network (2-layer MLP, GELU activation, 4× wider hidden dim) → Add & Norm → Repeat N times (N=12 for BERT-base, N=24 for BERT-large)

Decoder block (GPT-style — causal/autoregressive):
Same as the encoder, but with CAUSAL (masked) self-attention. The mask sets all future positions to −∞ before softmax. Token i can only attend to positions 0 … i — never future tokens. This enables autoregressive generation: predict the next token given all previous ones.

Positional Encoding — Transformers have no built-in sense of order. Attention is set-like, not sequence-like. Positional encodings are added to token embeddings to inject position information. Early models used fixed sinusoidal functions. Modern LLMs use Rotary Position Embedding (RoPE) (LLaMA, Qwen, Mistral) or ALiBi for better length generalisation beyond the training context window.
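The causal mask itself is just a lower-triangular matrix. A minimal NumPy sketch of how it zeroes out attention to future positions (the PyTorch equivalent uses torch.tril plus masked_fill, as in the attention function earlier):

```python
import numpy as np

seq_len = 5
# Lower-triangular matrix: row i has 1s at columns 0..i (allowed positions)
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# Applied before softmax: masked (future) positions get -1e9 → weight ≈ 0
scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))
scores = np.where(causal_mask == 0, -1e9, scores)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)

print(np.round(weights, 2))
# Row 0 (first token) puts all weight on itself; each later row spreads
# weight only over positions ≤ its own index.
```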
BERT vs GPT — The Two Paradigms
| | BERT (Encoder-only) | GPT (Decoder-only) |
|---|---|---|
| Context | Bidirectional — sees all tokens simultaneously | Causal — only sees tokens to the left |
| Pre-training | MLM: mask 15% of tokens, predict the masked ones | CLM: predict the next token at every position |
| Output | Contextualised embeddings for every token | Probability distribution over the next token |
| Best for | Classification, NER, extractive QA, sentence embeddings | Text generation, chat, summarisation, code |
| Examples | BERT, RoBERTa, DeBERTa, sentence-transformers | GPT-4, Claude, LLaMA, Gemini, Qwen |
| Can generate freely? | No — not trained for next-token prediction | Yes — autoregressive decoding |
Note (The third family: Encoder-Decoder (T5, BART, Flan-T5))
Encoder reads the input bidirectionally; decoder generates output autoregressively, attending to encoder outputs via cross-attention. Best for translation, abstractive summarisation, and multi-task instruction following (Flan-T5). Less dominant in the LLM era but important to recognise.
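Cross-attention reuses the same attention formula with one twist: queries come from the decoder, while keys and values come from the encoder output. A minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 64
encoder_out = rng.standard_normal((12, d))    # 12 source tokens, read bidirectionally
decoder_hidden = rng.standard_normal((4, d))  # 4 target tokens generated so far

# Cross-attention: Q from the decoder, K and V from the encoder output.
# Each target token retrieves a weighted blend of source-token information.
out = attention(decoder_hidden, encoder_out, encoder_out)
print(out.shape)  # (4, 64)
```

This is why the output length can differ freely from the input length — the decoder brings its own queries.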
LLM Landscape — Know Your Models
You must be able to compare models fluently. Interviewers will ask: “Which model would you choose for X and why?” Always structure your answer around four axes: context window, cost, open vs closed, and specialisation.
| Model | Provider | Context | Strengths | Best Use Cases |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | Multimodal, fast, SOTA reasoning | Enterprise apps, vision+text, coding, function calling |
| Claude 3.7 | Anthropic | 200K | Long context, safety, nuanced instruction following | Document analysis, compliance, long-form writing |
| LLaMA 3.1 | Meta (OSS) | 128K | Open weights, customisable, private deployment | Fine-tuning, on-prem RAG, cost-sensitive production |
| Gemini 1.5 Pro | Google | 1M | Massive context window, multimodal, Google ecosystem | Video/audio analysis, processing enormous documents |
| Mixtral 8x7B | Mistral (OSS) | 32K | MoE architecture — 8 experts, activates 2 per token | Efficient serving, edge deployment, open fine-tuning |
| Qwen2.5 | Alibaba (OSS) | 128K | Strong multilingual and code, open weights | Multilingual apps, Asian markets, coding assistants |
| Nova Pro | Amazon | 300K | AWS native, multimodal, Bedrock integration | AWS-native pipelines, enterprise AWS shops |
Tip (The interview answer formula)
When asked which model to use: (1) identify the key constraint (latency / cost / accuracy / data privacy), (2) eliminate models that fail it, (3) name your choice with a one-sentence justification. Example: “For on-premise deployment with fine-tuning on proprietary data, I’d use LLaMA 3.1 — open weights, competitive performance, no data leaving our infrastructure.”
Prompt Engineering — From Basics to Expert
Prompt engineering is the practice of designing inputs to reliably elicit the best outputs from LLMs. It is heavily tested at senior DS interviews — not because it replaces engineering, but because it reveals how deeply you understand the model.
The Technique Hierarchy
| Technique | When to use | The core mechanism |
|---|---|---|
| Zero-shot | Simple, well-defined tasks the model knows well | Clear instruction only, no examples |
| Few-shot | Custom formats, domain-specific classification | 3–5 input/output examples prime in-context learning |
| Chain-of-Thought | Maths, multi-step logic, complex reasoning | Forces the model to allocate compute to intermediate steps |
| ReAct | Tool-using agents, multi-hop information gathering | Interleaves reasoning (Thought) and actions (Act/Observation) |
| Self-consistency | High-stakes reasoning where accuracy matters most | Sample multiple CoT paths, take the majority answer |
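Self-consistency is simple to implement: sample several CoT completions at temperature > 0, extract each final answer, and majority-vote. A sketch — the `sample_answer` callable stands in for a real LLM call and is hypothetical:

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """sample_answer() should run the CoT prompt once (temperature ~0.7)
    and return the extracted final answer as a string."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for real LLM calls: five sampled reasoning paths, one majority answer
sampled = iter(['42', '42', '41', '42', '45'])
print(self_consistency(lambda: next(sampled)))  # → '42'
```

Individual reasoning paths can each be wrong in different ways; the majority answer is markedly more reliable, at n× the token cost.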
Zero-shot — instruction only:
Classify the sentiment of this review as Positive, Negative, or Neutral:
Review: "The battery life is amazing but the camera is disappointing."
Sentiment:

Few-shot — examples prime the model’s in-context learning:
Review: "Best product I've ever bought!" → Positive
Review: "Arrived damaged, very disappointed." → Negative
Review: "It works, nothing special." → Neutral
Review: "Incredible speed but the interface is confusing." →

Note (Recency bias in few-shot prompting)
LLMs give more weight to examples that appear later in the prompt. Always put your most representative and cleanest examples last. Shuffle edge cases to the middle.
Chain-of-Thought — force the model to reason before answering:
Q: A store starts with 48 apples. They sell 15, then receive 20% of their original stock as a restock. How many apples do they have now?
Think step by step:
1. Start: 48 apples
2. After selling: 48 − 15 = 33 apples
3. Restock: 20% of 48 = 9.6 ≈ 10 apples received
4. Final: 33 + 10 = 43 apples

Answer: 43 apples

ReAct — interleave Thought, Action, and Observation:
Thought: I need the current AAPL price to calculate market cap.
Action: search("AAPL stock price today")
Obs: AAPL is trading at $192.50.

Thought: Apple has ~15.5B shares outstanding. I can calculate now.
Action: calculate(192.50 * 15_500_000_000)
Obs: $2,983,750,000,000

Answer: Apple's current market cap is approximately $2.98 trillion.

System Prompts & Prompt Structure
Production prompts follow a consistent structure. Every missing element is a potential failure mode:
```python
SYSTEM_PROMPT = """You are a customer support specialist for Acme Corp.

Role: Answer customer questions accurately and helpfully.

Rules:
- Only answer questions about Acme Corp products.
- If unsure, say "I don't have that information" — never fabricate.
- Keep responses under 150 words.
- End every response with: "Is there anything else I can help you with?"

Tone: Professional, warm, empathetic.
Format: Plain text. Use bullet points only for multi-step instructions."""

USER_MSG = """Context: {retrieved_docs}

Customer question: {question}

Response:"""

# The five elements of a strong system prompt:
# 1. Persona — who the model IS
# 2. Task — what it should do
# 3. Rules — what it must / must not do
# 4. Format — how to structure output
# 5. Examples — few-shot if format is non-standard
```

Structured Outputs & Function Calling
Function calling forces the LLM to return structured JSON — critical for production integrations where you need reliable parsing:
```python
import openai, json

client = openai.OpenAI()

tools = [{
    'type': 'function',
    'function': {
        'name': 'extract_order_info',
        'description': 'Extract order details from a customer message',
        'parameters': {
            'type': 'object',
            'properties': {
                'order_id': {'type': 'string', 'description': 'Order ID if mentioned'},
                'issue_type': {'type': 'string', 'enum': ['delay', 'wrong_item', 'damaged', 'missing', 'other']},
                'urgency': {'type': 'string', 'enum': ['low', 'medium', 'high']},
                'customer_emotion': {'type': 'string', 'enum': ['neutral', 'frustrated', 'angry', 'happy']}
            },
            'required': ['issue_type', 'urgency', 'customer_emotion']
        }
    }
}]

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'My order #12345 is 2 weeks late and I need it ASAP!'}],
    tools=tools,
    tool_choice={'type': 'function', 'function': {'name': 'extract_order_info'}}
)

result = json.loads(
    response.choices[0].message.tool_calls[0].function.arguments
)
# {'order_id': '12345', 'issue_type': 'delay', 'urgency': 'high', 'customer_emotion': 'frustrated'}
```

Tip (Function calling vs output parsers)
Use native function calling when your provider supports it (OpenAI, Anthropic) — it is more reliable than instructing the model to output JSON. For models without native tool support, use a Pydantic output parser with retry logic in LangChain.
Retrieval-Augmented Generation (RAG)
RAG is the most important production pattern for LLM applications. Every senior GenAI DS must be able to build, evaluate, and optimise a RAG system from scratch.
Why RAG? The Three Problems It Solves
LLMs have three fatal flaws for production use:
- Knowledge cutoff — they know nothing after their training date. GPT-4’s world ends in April 2023.
- Hallucination — when they don’t know something, they confidently make it up. They cannot say “I don’t know” unless trained to.
- No private data — they cannot access your internal documents, databases, or real-time information.
RAG solves all three: retrieve relevant documents at query time and inject them into the context window as grounding evidence. The LLM answers from evidence, not from memory.
The Architecture
The pipeline runs in two phases at different times — indexing is done once, ahead of time; querying happens per request.
Important (The same-model rule — most common RAG bug)
Always embed documents and queries with the exact same embedding model. Using different models puts your vectors in different spaces — similarity search returns garbage. This is the single most common cause of “my RAG retrieves nonsense” in production.
Build a Full RAG Pipeline
```python
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# ── Step 1: LOAD ──────────────────────────────────────────────
loader = DirectoryLoader('./docs/', glob='**/*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()

# ── Step 2: CHUNK ─────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,                        # 10% overlap prevents info loss at boundaries
    separators=['\n\n', '\n', '.', ' ', '']  # Tries each in order (prefers natural breaks)
)
chunks = splitter.split_documents(documents)

# ── Step 3: EMBED + INDEX ─────────────────────────────────────
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vectordb = FAISS.from_documents(chunks, embeddings)
vectordb.save_local('./faiss_index')  # Persist for reuse

# ── Step 4: BUILD RETRIEVAL CHAIN ─────────────────────────────
llm = ChatOpenAI(model='gpt-4o', temperature=0)

PROMPT = PromptTemplate(
    input_variables=['context', 'question'],
    template="""You are a helpful assistant. Answer using ONLY the context below.
If the answer is not in the context, say "I don't have that information."

Context: {context}

Question: {question}

Answer:"""
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',  # 'map_reduce' when retrieved chunks overflow context
    retriever=vectordb.as_retriever(search_kwargs={'k': 5}),
    chain_type_kwargs={'prompt': PROMPT},
    return_source_documents=True
)

# ── Step 5: QUERY ─────────────────────────────────────────────
result = qa_chain({'query': 'What are the return policies?'})
print(result['result'])            # The grounded answer
print(result['source_documents'])  # Evidence chunks — show these as citations
```

Advanced RAG — What Separates Good from Great
| Technique | What it does | When to add it |
|---|---|---|
| Hybrid Search | Combine vector (dense) + BM25 (sparse) via Reciprocal Rank Fusion | When exact keyword matches matter (product codes, names, IDs) |
| Reranking | Retrieve top-20 with ANN, rerank with cross-encoder, keep top-5 | High-accuracy apps where 50–100ms extra latency is acceptable |
| HyDE | Generate a hypothetical ideal answer first, embed it, then search | Factoid queries where the question and answer embeddings naturally differ |
| Query Expansion | Break a complex query into sub-queries, answer each, synthesise | Multi-hop questions spanning multiple documents |
| Parent-Child Chunking | Index small chunks for precision, return larger parent chunks | When retrieval is precise but generation needs more surrounding context |
| Metadata Filtering | Pre-filter by date/source/category before semantic search | Large knowledge bases with clear categorical structure |
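Reciprocal Rank Fusion, the merging step used by hybrid search above, is only a few lines. A sketch (the constant k = 60 is the value commonly used with RRF; document IDs are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs. RRF score: sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense (vector) and sparse (BM25) retrievers disagree; RRF rewards documents
# that rank high in either list without needing comparable raw scores.
vector_hits = ['doc_a', 'doc_b', 'doc_c', 'doc_d']
bm25_hits   = ['doc_c', 'doc_a', 'doc_e']
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_e', 'doc_d']
```

Working on ranks rather than raw scores is the point: cosine similarities and BM25 scores live on incompatible scales, but ranks always combine cleanly.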
Tip (Chunking is the most underrated RAG decision)
Chunk size dramatically affects quality. Small chunks (128 tokens) — precise retrieval, thin context. Large chunks (1024 tokens) — rich context, imprecise retrieval. Good default: 512 tokens with 50-token overlap. For single-sentence factoid QA, go smaller. For summarisation, go larger. Measure retrieval recall first — most RAG failures are retrieval failures, not generation failures.
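To make the size/overlap tradeoff concrete, here is a minimal character-level chunker (a sketch: the production splitter in the pipeline above also respects separator boundaries; this only shows the overlap mechanics):

```python
def chunk_text(text, chunk_size=512, overlap=50):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` characters of the previous one, so a fact straddling a
    boundary survives intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text('x' * 1200, chunk_size=512, overlap=50)
print([len(c) for c in chunks])  # [512, 512, 276]
```

Larger overlap costs index size and retrieval redundancy; zero overlap risks silently splitting the exact sentence your query needed.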
Vector Databases & Embeddings
Embeddings — Converting Meaning to Numbers
An embedding model converts text into a dense vector — typically 768 to 3072 numbers. The remarkable property: texts with similar meanings land near each other in this high-dimensional space, regardless of exact wording.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model='text-embedding-3-small',  # 1536 dimensions, $0.02 per 1M tokens
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = embed('machine learning')
v2 = embed('deep learning')
v3 = embed('cooking recipes')

print(cosine_similarity(v1, v2))  # ~0.88 — high similarity
print(cosine_similarity(v1, v3))  # ~0.42 — low similarity
```

Intuition (The city map analogy)
Think of embedding space as a city. “Machine learning” and “deep learning” are in the same neighbourhood — a short walk apart. “Cooking recipes” is across the city. When you embed a query, you drop a pin on this map, then find all documents within a certain radius. Cosine similarity measures the angle between vectors (not distance) — this works better for high-dimensional text because all vectors sit on a high-dimensional sphere.
FAISS — Fast Approximate Nearest-Neighbour Search
```python
import faiss
import numpy as np

dimension = 1536  # Must match your embedding model's output

# ── IndexFlatL2 ───────────────────────────────────────────────
# Exact L2 (Euclidean) search — brute force, 100% recall
# Good for < 100K vectors or when perfect accuracy is required
index_flat = faiss.IndexFlatL2(dimension)

# ── IndexIVFFlat ──────────────────────────────────────────────
# Approximate — clusters vectors into nlist cells, searches nprobe cells at query time
# Faster but ~1% recall loss. Must train before adding vectors.
quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, 100)  # nlist=100
index_ivf.train(embeddings_np)
index_ivf.nprobe = 10  # Higher = more accurate but slower

# ── IndexHNSWFlat ─────────────────────────────────────────────
# Graph-based — best speed/recall tradeoff for millions of vectors
# No training required, very fast at query time
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)  # M=32 graph connectivity

# Add and search
index_flat.add(embeddings_np)              # shape: (n_documents, dimension)
D, I = index_flat.search(query_emb, 5)     # k=5; D=distances, I=indices
retrieved_chunks = [all_chunks[i] for i in I[0]]
```

Vector Database Comparison
| Database | Type | Best for | Strengths | Weaknesses |
|---|---|---|---|---|
| FAISS | Open-source | Research, offline batch | Fastest raw search, flexible index types | No metadata filtering, no HTTP API |
| ChromaDB | Open-source | Prototyping, local RAG | Simplest setup, embedded Python | Slower at scale |
| Pinecone | Managed SaaS | Production startups | Fully managed, excellent filtering, hybrid search | Cost scales with usage |
| Weaviate | OSS + managed | Complex data models | GraphQL API, built-in vectorisers, multi-modal | Steeper learning curve |
| Qdrant | OSS (Rust) | Self-hosted production | Very fast, excellent filtering, rich payloads | Less managed-service ecosystem |
| pgvector | PostgreSQL ext. | Existing Postgres stacks | No new infrastructure | Slower than dedicated DBs at scale |
Tip (Interview rule of thumb)
Prototype → ChromaDB. Production managed → Pinecone. Production self-hosted → Qdrant. Already on PostgreSQL → pgvector. Complex data model or multi-modal → Weaviate.
LangChain & LlamaIndex — Building LLM Applications
LangChain: LCEL Chains
LangChain’s modern interface is LCEL (LangChain Expression Language) — a composable pipe syntax that makes chains readable and streamable:
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful data scientist.'),
    ('user', '{question}')
])
llm = ChatOpenAI(model='gpt-4o', temperature=0)
parser = StrOutputParser()

# Pipe operator: left → right. Each component's output feeds the next.
chain = prompt | llm | parser
response = chain.invoke({'question': 'What is overfitting?'})
```

Memory — maintain conversation history across turns:
```python
from langchain.memory import ConversationBufferWindowMemory

# Keeps only the last k exchanges — prevents unbounded context growth
memory = ConversationBufferWindowMemory(k=5)
```

Structured extraction with Pydantic:
```python
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel

class SentimentResult(BaseModel):
    sentiment: str
    confidence: float
    key_phrases: list[str]

parser = PydanticOutputParser(pydantic_object=SentimentResult)
# Add parser.get_format_instructions() to your prompt for reliable extraction
```

LangChain Agents — The Power Feature
Agents let LLMs decide which tools to call and in what order — the model becomes an autonomous decision-maker rather than a single-shot responder:
```python
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain import hub

@tool
def get_current_date() -> str:
    """Returns today's date in YYYY-MM-DD format."""
    from datetime import date
    return str(date.today())

@tool
def calculate_metrics(data_json: str) -> str:
    """Calculate mean, median, and std from a JSON array of numbers.

    Input must be a JSON string like: [1, 2, 3, 4, 5]
    """
    import json, statistics
    data = json.loads(data_json)
    return json.dumps({
        'mean': statistics.mean(data),
        'median': statistics.median(data),
        'std': statistics.stdev(data)
    })

tools = [get_current_date, calculate_metrics]
llm = ChatOpenAI(model='gpt-4o', temperature=0)
prompt = hub.pull('hwchase17/react')  # Standard ReAct prompt template

agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10  # Always cap — prevents infinite loops
)
result = executor.invoke({'input': 'What is the mean of [5, 10, 15, 20, 25]?'})
print(result['output'])
```

Warning (Agents are non-deterministic — plan for failure)
Agents call tools based on the LLM’s reasoning chain, which can fail, loop, or take unexpected paths. Always set max_iterations, add defensive error handling to every tool, and test with adversarial inputs. Never give agents write access to production systems without a human-in-the-loop confirmation step.
LlamaIndex — When RAG Is the Core Product
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model='gpt-4o', temperature=0)
Settings.embed_model = OpenAIEmbedding(model='text-embedding-3-small')

# Load, embed, and index in three lines
documents = SimpleDirectoryReader('./docs/').load_data()
index = VectorStoreIndex.from_documents(documents)

# Simple query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query('What is the Q3 revenue?')
print(response)
print(response.source_nodes)  # Retrieved chunks used as evidence

# Advanced: SubQuestion Query Engine
# Automatically decomposes "Compare Q2 vs Q3 revenue and profit"
# into sub-questions, answers each, then synthesises
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

tools = [QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name='company_docs',
    description='Financial and operational documents for the company'
)]
sub_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
```

Tip (LangChain vs LlamaIndex — when to use which)
LangChain: general LLM orchestration, multi-tool agents, complex conditional workflows, broad ecosystem. Choose when building flexible agent systems or complex pipelines. LlamaIndex: specialised for data indexing and querying — better out-of-the-box RAG, richer indexing strategies (knowledge graphs, tree indices), cleaner query abstractions. If your primary use case is querying a large document collection, start with LlamaIndex.
LLM Fine-Tuning — When and How
Fine-tuning adapts a pre-trained LLM to a specific task, domain, or style by continuing training on a curated dataset. It is expensive and often unnecessary. A great senior DS knows when not to fine-tune.
The Decision Matrix
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| Prompting only | General tasks, rapid prototyping | Zero cost, instant, flexible | Limited customisation, token cost per call |
| RAG | Up-to-date info, private data, citations | No training, updatable knowledge | Retrieval quality dependent, adds latency |
| Full fine-tuning | Deep domain adaptation, strong style changes | Best task performance, no retrieval overhead | Extremely expensive, needs thousands of examples, static |
| LoRA / QLoRA | Efficient adaptation on limited GPU budget | 10–100× less compute, weights can be merged | Less powerful than full fine-tuning |
| RLHF / DPO | Aligning the model to human preferences | Best for chat quality and helpfulness | Most complex pipeline, needs preference pairs |
Important (Exhaust these options before fine-tuning)
Before committing to a fine-tuning run: (1) exhaust prompt engineering with 10+ few-shot examples, (2) check if RAG solves the knowledge gap, (3) test system prompt variations. Fine-tuning is best at changing format and style — if the model lacks factual knowledge, fine-tuning often just makes the hallucinations more confident and fluent.
LoRA — Low-Rank Adaptation
LoRA is the dominant fine-tuning technique. Instead of updating all parameters (billions of numbers), it injects small trainable matrices into specific layers, leaving the base model frozen.
The core idea: the original weight matrix W ∈ ℝ^(d×k) is frozen. LoRA adds a low-rank update:

W′ = W + BA

where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with the rank r ≪ min(d, k). Only B and A are trained. With r = 8 on a 7B model, trainable parameters drop from 7 billion to roughly 8 million — about 0.1%.

After training, merge back: W′ = W + BA — zero inference overhead, same model size.
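The parameter arithmetic is worth being able to reproduce on a whiteboard. A sketch using a LLaMA-style 4096×4096 attention projection (the specific dimensions are illustrative):

```python
# Parameter arithmetic for one LoRA-adapted attention projection
d, k, r = 4096, 4096, 8          # LLaMA-7B-scale projection, rank 8

full = d * k                     # ~16.8M params if trained directly
lora = d * r + r * k             # B (d×r) plus A (r×k)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
# full: 16,777,216  lora: 65,536  ratio: 0.3906%

# Adapting only a handful of such projections per layer is how the
# model-wide trainable fraction lands near 0.1% of 7B.
```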
QLoRA adds 4-bit quantisation of the base model before applying LoRA, making fine-tuning possible on a single consumer GPU (e.g., an RTX 4090 can fine-tune a 13B model).
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# QLoRA: 4-bit quantise the base model first
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    quantization_config=bnb_config,
    device_map='auto'
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,           # Rank — higher r = more capacity but more params
    lora_alpha=32,  # Scaling factor — rule of thumb: lora_alpha = 2 × r
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],  # Attention layers
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,044,941,312 || trainable%: 0.1695%
```

Training Data Format
```python
# ── Alpaca-style instruction format ───────────────────────────
training_example = {
    'instruction': 'Classify the sentiment of this customer review.',
    'input': 'The product quality is excellent but delivery was very slow.',
    'output': 'Mixed sentiment: Positive about product quality, Negative about delivery.'
}

# ── ChatML format (OpenAI-compatible, LLaMA 3 native) ─────────
chatml_example = """<|im_start|>system
You are a helpful data scientist assistant.<|im_end|>
<|im_start|>user
What is precision in machine learning?<|im_end|>
<|im_start|>assistant
Precision is TP/(TP+FP) — the fraction of positive predictions that are actually positive.<|im_end|>"""

# ── Data quantity guidelines ──────────────────────────────────
# Style / format changes:  500–2,000 examples
# Domain adaptation:       1,000–10,000 examples
# Full task fine-tuning:   10,000+ examples
#
# Quality >> Quantity. 100 clean, diverse examples consistently beat
# 1,000 noisy or repetitive ones.
```

Tip (Quality beats quantity — always)
The three most common data quality issues: (1) duplicate or near-duplicate examples, (2) answers that are too generic and do not demonstrate the target behaviour, (3) inconsistent formatting across examples. Always do a manual review pass of at least 50 random samples before launching a training run.