Understanding the Technology Behind Large Language Models

March 4, 2026 · 20 min read

Transformers — The Architecture Behind Everything

The Transformer (Vaswani et al., 2017 — Attention Is All You Need) is the foundation of every modern LLM. GPT-4, Claude, LLaMA, Gemini — all of them are Transformer variants. You do not need to train one from scratch, but you need to understand every component at a conceptual and mathematical level.

Why Transformers? The Problem They Solved

Before Transformers, sequence models were Recurrent Neural Networks (RNNs). RNNs read tokens one by one, left to right, compressing everything seen so far into a single fixed-size hidden state vector.

This created two crippling problems:

  • The bottleneck problem — all information had to squeeze through one vector. A 10,000-word document reduced to ~512 numbers. Long-range dependencies evaporated.
  • The speed problem — sequential processing cannot be parallelised. You had to wait for token t to finish before processing token t+1. Training on billions of tokens was not practical.

Transformers solved both by abandoning sequential processing entirely. Every token attends to every other token simultaneously and in parallel. This is the core insight.

The Attention Mechanism

Attention lets each token decide which other tokens are most relevant to understanding its meaning.

Consider: “The cat sat on the mat because it was tired.”

To understand “it”, you need to know it refers to “cat” and not “mat”. A human reading the sentence has no trouble — but an RNN loses “cat” to the bottleneck by the time it processes “it”. Attention solves this by letting “it” directly query every other token and discover that “cat” is the most relevant.

Intuition (The search engine analogy)

Think of attention as a tiny search engine inside the model. For each token (your query), you have a database of all other tokens. Each database entry has a key (a summary of what it offers) and a value (the actual information it carries). You compute how relevant your query is to each key, turn those scores into weights with softmax, then retrieve a weighted blend of values. High relevance = more of that token’s meaning flows into yours.

For each token, three vectors are computed by multiplying the token embedding by three separate learned weight matrices:

Q = W_Q \cdot x, \quad K = W_K \cdot x, \quad V = W_V \cdot x

  • Q (Query) — “What am I looking for?”
  • K (Key) — “What do I have to offer?”
  • V (Value) — “What information do I actually carry?”

The attention weights are computed and applied in one formula:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)\!V

The \sqrt{d_k} scaling is crucial: in high dimensions (e.g., d_k = 64), the magnitude of raw dot products grows proportionally to \sqrt{d_k}, pushing softmax into a saturated, near-zero-gradient region and killing learning. Dividing by \sqrt{d_k} keeps the scores in a numerically stable range.
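A quick sanity check you can run yourself (random vectors, so the exact numbers vary, but scaling always softens the distribution):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64
q = torch.randn(d_k)
keys = torch.randn(10, d_k)      # 10 candidate tokens

raw = keys @ q                   # unscaled scores: magnitude grows with sqrt(d_k)
scaled = raw / (d_k ** 0.5)      # rescaled to roughly unit variance

w_raw = F.softmax(raw, dim=-1)
w_scaled = F.softmax(scaled, dim=-1)

# Scaling strictly softens the attention distribution, so gradients can flow
print(w_raw.max().item(), w_scaled.max().item())
```

The unscaled weights are far more peaked: almost all the probability mass piles onto one token, which is exactly the saturation regime where softmax gradients vanish.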

In a trained model, the attention weights for "it" in this sentence concentrate on "cat" (roughly 38%), with smaller weight on "mat" (14%) and near-zero weight on the remaining tokens. The model resolves the pronoun correctly — this is coreference resolution emerging from training, not explicit rules.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # mask == 0 → set to -1e9 so softmax gives ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights

Multi-Head Attention

A single attention head learns one type of relationship. Run h heads in parallel, each with its own W_Q, W_K, W_V matrices — each head specialises in something different.

# MultiHeadAttention intuition:
# 1. Project Q, K, V into h subspaces (each of dim d_model / h)
# 2. Run scaled dot-product attention independently in each head
# 3. Concatenate all h outputs → shape (seq_len, h * d_v)
# 4. Apply linear projection W_O → back to (seq_len, d_model)
# Different heads learn different relationship types:
# Head 1: syntactic (subject–verb agreement)
# Head 2: coreference (pronouns → nouns)
# Head 3: positional (adjacent token patterns)
# Head 4+: semantic clusters, domain-specific patterns
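Those four steps can be sketched as a minimal implementation (batch-first, no masking or dropout, and using one fused projection per Q/K/V, which is equivalent to h separate ones):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One d_model × d_model projection per Q/K/V covers all h heads at once
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        # (B, T, d_model) → (B, n_heads, T, d_head): split into subspaces
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Scaled dot-product attention independently in each head
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        out = F.softmax(scores, dim=-1) @ v          # (B, n_heads, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, -1)  # concatenate heads
        return self.w_o(out)                          # final W_O projection

mha = MultiHeadAttention(d_model=768, n_heads=12)  # GPT-2 small dimensions
x = torch.randn(2, 10, 768)
print(mha(x).shape)  # torch.Size([2, 10, 768])
```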
Tip (GPT-2 small numbers to have ready)

d_model = 768 | n_heads = 12 (each head is 64-dim) | n_layers = 12 | vocab_size = 50,257 | context = 1,024 tokens. These are the classic reference dimensions interviewers expect you to know.

Full Transformer Architecture

Encoder block (BERT-style — bidirectional):

Input → Tokenisation → Token Embeddings + Positional Encoding
→ Multi-Head Self-Attention (attends to ALL positions)
→ Add & Norm (residual connection + LayerNorm)
→ Feed-Forward Network (2-layer MLP, GELU activation, 4× wider hidden dim)
→ Add & Norm
→ Repeat N times (N=12 for BERT-base, N=24 for BERT-large)

Decoder block (GPT-style — causal/autoregressive):

Same as encoder, but with CAUSAL (masked) self-attention.
The mask sets all future positions to -∞ before softmax.
Token i can only attend to positions 0 … i — never future tokens.
This enables autoregressive generation: predict next token given all previous.
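The causal mask itself is one line of torch.tril; the resulting lower-triangular matrix is what gets passed as the mask argument of the scaled_dot_product_attention function shown earlier (zeros above the diagonal become -1e9 before softmax):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular ones: position i may attend to positions 0..i only.
    # Zeros above the diagonal mark the future positions to be masked out.
    return torch.tril(torch.ones(seq_len, seq_len))

print(causal_mask(4))
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```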

Positional Encoding — Transformers have no built-in sense of order. Attention is set-like, not sequence-like. Positional encodings are added to token embeddings to inject position information. Early models used fixed sinusoidal functions. Modern LLMs use Rotary Position Embedding (RoPE) (LLaMA, Qwen, Mistral) or ALiBi for better length generalisation beyond the training context window.
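For reference, the original fixed sinusoidal encodings are only a few lines — a NumPy sketch of the Vaswani et al. formulation (sin on even dimensions, cos on odd):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal position encodings from the original Transformer paper."""
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model/2) frequency index
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe

# GPT-2 small shape: one encoding vector per position, added to token embeddings
pe = sinusoidal_positional_encoding(seq_len=1024, d_model=768)
print(pe.shape)  # (1024, 768)
```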

BERT vs GPT — The Two Paradigms

| | BERT (Encoder-only) | GPT (Decoder-only) |
|---|---|---|
| Context | Bidirectional — sees all tokens simultaneously | Causal — only sees tokens to the left |
| Pre-training | MLM: mask 15% of tokens, predict the masked ones | CLM: predict the next token at every position |
| Output | Contextualised embeddings for every token | Probability distribution over the next token |
| Best for | Classification, NER, extractive QA, sentence embeddings | Text generation, chat, summarisation, code |
| Examples | BERT, RoBERTa, DeBERTa, sentence-transformers | GPT-4, Claude, LLaMA, Gemini, Qwen |
| Can generate freely? | No — no causal masking | Yes — autoregressive decoding |
Note (The third family: Encoder-Decoder (T5, BART, Flan-T5))

Encoder reads the input bidirectionally; decoder generates output autoregressively, attending to encoder outputs via cross-attention. Best for translation, abstractive summarisation, and multi-task instruction following (Flan-T5). Less dominant in the LLM era but important to recognise.

LLM Landscape — Know Your Models

You must be able to compare models fluently. Interviewers will ask: “Which model would you choose for X and why?” Always structure your answer around four axes: context window, cost, open vs closed, and specialisation.

| Model | Provider | Context | Strengths | Best Use Cases |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | Multimodal, fast, SOTA reasoning | Enterprise apps, vision+text, coding, function calling |
| Claude 3.7 | Anthropic | 200K | Long context, safety, nuanced instruction following | Document analysis, compliance, long-form writing |
| LLaMA 3.1 | Meta (OSS) | 128K | Open weights, customisable, private deployment | Fine-tuning, on-prem RAG, cost-sensitive production |
| Gemini 1.5 Pro | Google | 1M | Massive context window, multimodal, Google ecosystem | Video/audio analysis, processing enormous documents |
| Mixtral 8x7B | Mistral (OSS) | 32K | MoE architecture — 8 experts, activates 2 per token | Efficient serving, edge deployment, open fine-tuning |
| Qwen2.5 | Alibaba (OSS) | 128K | Strong multilingual and code, open weights | Multilingual apps, Asian markets, coding assistants |
| Nova Pro | Amazon | 300K | AWS native, multimodal, Bedrock integration | AWS-native pipelines, enterprise AWS shops |
Tip (The interview answer formula)

When asked which model to use: (1) identify the key constraint (latency / cost / accuracy / data privacy), (2) eliminate models that fail it, (3) name your choice with a one-sentence justification. Example: “For on-premise deployment with fine-tuning on proprietary data, I’d use LLaMA 3.1 — open weights, competitive performance, no data leaving our infrastructure.”

Prompt Engineering — From Basics to Expert

Prompt engineering is the practice of designing inputs to reliably elicit the best outputs from LLMs. It is heavily tested at senior DS interviews — not because it replaces engineering, but because it reveals how deeply you understand the model.

The Technique Hierarchy

| Technique | When to use | The core mechanism |
|---|---|---|
| Zero-shot | Simple, well-defined tasks the model knows well | Clear instruction only, no examples |
| Few-shot | Custom formats, domain-specific classification | 3–5 input/output examples prime in-context learning |
| Chain-of-Thought | Maths, multi-step logic, complex reasoning | Forces the model to allocate compute to intermediate steps |
| ReAct | Tool-using agents, multi-hop information gathering | Interleaves reasoning (Thought) and actions (Act/Observation) |
| Self-consistency | High-stakes reasoning where accuracy matters most | Sample multiple CoT paths, take the majority answer |

Zero-shot — instruction only:

Classify the sentiment of this review as Positive, Negative, or Neutral:
Review: "The battery life is amazing but the camera is disappointing."
Sentiment:

Few-shot — examples prime the model’s in-context learning:

Review: "Best product I've ever bought!" → Positive
Review: "Arrived damaged, very disappointed." → Negative
Review: "It works, nothing special." → Neutral
Review: "Incredible speed but the interface is confusing." →
Note (Recency bias in few-shot prompting)

LLMs give more weight to examples that appear later in the prompt. Always put your most representative and cleanest examples last. Shuffle edge cases to the middle.

Chain-of-Thought — force the model to reason before answering:

Q: A store starts with 48 apples. They sell 15, then receive 20% of their
original stock as a restock. How many apples do they have now?
Think step by step:
1. Start: 48 apples
2. After selling: 48 − 15 = 33 apples
3. Restock: 20% of 48 = 9.6, i.e. 9 whole apples received
4. Final: 33 + 9 = 42 apples
Answer: 42 apples

ReAct — interleave Thought, Action, and Observation:

Thought: I need the current AAPL price to calculate market cap.
Action: search("AAPL stock price today")
Obs: AAPL is trading at $192.50.
Thought: Apple has ~15.5B shares outstanding. I can calculate now.
Action: calculate(192.50 * 15_500_000_000)
Obs: $2,983,750,000,000
Answer: Apple's current market cap is approximately $2.98 trillion.

System Prompts & Prompt Structure

Production prompts follow a consistent structure. Every missing element is a potential failure mode:

SYSTEM_PROMPT = """
You are a customer support specialist for Acme Corp.
Role: Answer customer questions accurately and helpfully.
Rules:
- Only answer questions about Acme Corp products.
- If unsure, say "I don't have that information" — never fabricate.
- Keep responses under 150 words.
- End every response with: "Is there anything else I can help you with?"
Tone: Professional, warm, empathetic.
Format: Plain text. Use bullet points only for multi-step instructions.
"""
USER_MSG = """
Context: {retrieved_docs}
Customer question: {question}
Response:"""
# The five elements of a strong system prompt:
# 1. Persona — who the model IS
# 2. Task — what it should do
# 3. Rules — what it must / must not do
# 4. Format — how to structure output
# 5. Examples — few-shot if format is non-standard

Structured Outputs & Function Calling

Function calling forces the LLM to return structured JSON — critical for production integrations where you need reliable parsing:

import openai, json

client = openai.OpenAI()

tools = [{
    'type': 'function',
    'function': {
        'name': 'extract_order_info',
        'description': 'Extract order details from a customer message',
        'parameters': {
            'type': 'object',
            'properties': {
                'order_id': {
                    'type': 'string',
                    'description': 'Order ID if mentioned'
                },
                'issue_type': {
                    'type': 'string',
                    'enum': ['delay', 'wrong_item', 'damaged', 'missing', 'other']
                },
                'urgency': {
                    'type': 'string',
                    'enum': ['low', 'medium', 'high']
                },
                'customer_emotion': {
                    'type': 'string',
                    'enum': ['neutral', 'frustrated', 'angry', 'happy']
                }
            },
            'required': ['issue_type', 'urgency', 'customer_emotion']
        }
    }
}]

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': 'My order #12345 is 2 weeks late and I need it ASAP!'
    }],
    tools=tools,
    tool_choice={'type': 'function', 'function': {'name': 'extract_order_info'}}
)

result = json.loads(
    response.choices[0].message.tool_calls[0].function.arguments
)
# {'order_id': '12345', 'issue_type': 'delay', 'urgency': 'high', 'customer_emotion': 'frustrated'}
Tip (Function calling vs output parsers)

Use native function calling when your provider supports it (OpenAI, Anthropic) — it is more reliable than instructing the model to output JSON. For models without native tool support, use a Pydantic output parser with retry logic in LangChain.
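For the no-native-tool-support case, the retry pattern looks roughly like this. This is a minimal sketch: call_llm is a hypothetical stand-in for your actual model client, and the error-feedback wording is illustrative.

```python
import json

def parse_with_retry(call_llm, prompt: str, max_retries: int = 3) -> dict:
    """Ask for JSON; on a parse failure, feed the error back and retry.

    `call_llm` is any callable (prompt: str) -> str — a hypothetical stand-in
    for a real model client.
    """
    last_error = ''
    for _ in range(max_retries):
        raw = call_llm(prompt + last_error)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            # Append the parse error so the model can self-correct next round
            last_error = (f'\nYour previous output was invalid JSON ({e}). '
                          'Return ONLY valid JSON.')
    raise ValueError('Model never returned valid JSON')

# Stubbed model for demonstration: fails once, then succeeds
responses = iter(['not json', '{"sentiment": "positive"}'])
print(parse_with_retry(lambda p: next(responses), 'Classify: great product!'))
# {'sentiment': 'positive'}
```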

Retrieval-Augmented Generation (RAG)

RAG is the most important production pattern for LLM applications. Every senior GenAI DS must be able to build, evaluate, and optimise a RAG system from scratch.

Why RAG? The Three Problems It Solves

LLMs have three fatal flaws for production use:

  1. Knowledge cutoff — they know nothing after their training date. GPT-4’s world ends in April 2023.
  2. Hallucination — when they don’t know something, they confidently make it up. They cannot say “I don’t know” unless trained to.
  3. No private data — they cannot access your internal documents, databases, or real-time information.

RAG solves all three: retrieve relevant documents at query time and inject them into the context window as grounding evidence. The LLM answers from evidence, not from memory.

The Architecture

The pipeline runs in two phases at different times: indexing (load → chunk → embed → store) is a one-time setup, while querying (embed the query → retrieve → augment the prompt → generate) happens once per request.
Important (The same-model rule — most common RAG bug)

Always embed documents and queries with the exact same embedding model. Using different models puts your vectors in different spaces — similarity search returns garbage. This is the single most common cause of “my RAG retrieves nonsense” in production.

Build a Full RAG Pipeline

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# ── Step 1: LOAD ──────────────────────────────────────────────────────────────
loader = DirectoryLoader('./docs/', glob='**/*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()

# ── Step 2: CHUNK ─────────────────────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,  # 10% overlap prevents info loss at boundaries
    separators=['\n\n', '\n', '.', ' ', '']  # Tries each in order (prefers natural breaks)
)
chunks = splitter.split_documents(documents)

# ── Step 3: EMBED + INDEX ─────────────────────────────────────────────────────
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vectordb = FAISS.from_documents(chunks, embeddings)
vectordb.save_local('./faiss_index')  # Persist for reuse

# ── Step 4: BUILD RETRIEVAL CHAIN ─────────────────────────────────────────────
llm = ChatOpenAI(model='gpt-4o', temperature=0)
PROMPT = PromptTemplate(
    input_variables=['context', 'question'],
    template="""
You are a helpful assistant. Answer using ONLY the context below.
If the answer is not in the context, say "I don't have that information."
Context: {context}
Question: {question}
Answer:"""
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',  # 'map_reduce' when retrieved chunks overflow context
    retriever=vectordb.as_retriever(search_kwargs={'k': 5}),
    chain_type_kwargs={'prompt': PROMPT},
    return_source_documents=True
)

# ── Step 5: QUERY ─────────────────────────────────────────────────────────────
result = qa_chain({'query': 'What are the return policies?'})
print(result['result'])            # The grounded answer
print(result['source_documents'])  # Evidence chunks — show these as citations

Advanced RAG — What Separates Good from Great

| Technique | What it does | When to add it |
|---|---|---|
| Hybrid Search | Combine vector (dense) + BM25 (sparse) via Reciprocal Rank Fusion | When exact keyword matches matter (product codes, names, IDs) |
| Reranking | Retrieve top-20 with ANN, rerank with cross-encoder, keep top-5 | High-accuracy apps where 50–100ms extra latency is acceptable |
| HyDE | Generate a hypothetical ideal answer first, embed it, then search | Factoid queries where the question and answer embeddings naturally differ |
| Query Expansion | Break a complex query into sub-queries, answer each, synthesise | Multi-hop questions spanning multiple documents |
| Parent-Child Chunking | Index small chunks for precision, return larger parent chunks | When retrieval is precise but generation needs more surrounding context |
| Metadata Filtering | Pre-filter by date/source/category before semantic search | Large knowledge bases with clear categorical structure |
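Reciprocal Rank Fusion, mentioned in the Hybrid Search row, is simple enough to sketch directly. A document's fused score is the sum of 1/(k + rank) over each ranked list it appears in; k = 60 is the constant commonly used, and the doc IDs here are made up for illustration:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists (doc IDs, best first) into one.

    score(d) = sum over lists of 1 / (k + rank_of_d_in_list)
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ['doc3', 'doc1', 'doc7']   # e.g. from vector search
sparse_results = ['doc1', 'doc9', 'doc3']  # e.g. from BM25
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# ['doc1', 'doc3', 'doc9', 'doc7'] — docs in both lists rise to the top
```

Because only ranks are used, RRF needs no score normalisation between the dense and sparse retrievers, which is why it is the default fusion method in most hybrid-search stacks.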
Tip (Chunking is the most underrated RAG decision)

Chunk size dramatically affects quality. Small chunks (128 tokens) — precise retrieval, thin context. Large chunks (1024 tokens) — rich context, imprecise retrieval. Good default: 512 tokens with 50-token overlap. For single-sentence factoid QA, go smaller. For summarisation, go larger. Measure retrieval recall first — most RAG failures are retrieval failures, not generation failures.

Vector Databases & Embeddings

Embeddings — Converting Meaning to Numbers

An embedding model converts text into a dense vector — typically 768 to 3072 numbers. The remarkable property: texts with similar meanings land near each other in this high-dimensional space, regardless of exact wording.

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model='text-embedding-3-small',  # 1536 dimensions, $0.02 per 1M tokens
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = embed('machine learning')
v2 = embed('deep learning')
v3 = embed('cooking recipes')
print(cosine_similarity(v1, v2))  # ~0.88 — high similarity
print(cosine_similarity(v1, v3))  # ~0.42 — low similarity
Intuition (The city map analogy)

Think of embedding space as a city. “Machine learning” and “deep learning” are in the same neighbourhood — a short walk apart. “Cooking recipes” is across the city. When you embed a query, you drop a pin on this map, then find all documents within a certain radius. Cosine similarity measures the angle between vectors (not distance) — this works better for high-dimensional text because all vectors sit on a high-dimensional sphere.

import faiss
import numpy as np

# embeddings_np: your (n_documents, dimension) float32 array of document vectors
dimension = 1536  # Must match your embedding model's output

# ── IndexFlatL2 ───────────────────────────────────────────────────────────────
# Exact L2 (Euclidean) search — brute force, 100% recall
# Good for < 100K vectors or when perfect accuracy is required
index_flat = faiss.IndexFlatL2(dimension)

# ── IndexIVFFlat ──────────────────────────────────────────────────────────────
# Approximate — clusters vectors into nlist cells, searches nprobe cells at query time
# Faster but ~1% recall loss. Must train before adding vectors.
quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, 100)  # nlist = 100
index_ivf.train(embeddings_np)
index_ivf.nprobe = 10  # Higher = more accurate but slower

# ── IndexHNSWFlat ─────────────────────────────────────────────────────────────
# Graph-based — best speed/recall tradeoff for millions of vectors
# No training required, very fast at query time
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)  # M = 32, graph connectivity

# Add and search
index_flat.add(embeddings_np)             # shape: (n_documents, dimension)
D, I = index_flat.search(query_emb, 5)    # top-5: D = distances, I = indices
retrieved_chunks = [all_chunks[i] for i in I[0]]

Vector Database Comparison

| Database | Type | Best for | Strengths | Weaknesses |
|---|---|---|---|---|
| FAISS | Open-source | Research, offline batch | Fastest raw search, flexible index types | No metadata filtering, no HTTP API |
| ChromaDB | Open-source | Prototyping, local RAG | Simplest setup, embedded Python | Slower at scale |
| Pinecone | Managed SaaS | Production startups | Fully managed, excellent filtering, hybrid search | Cost scales with usage |
| Weaviate | OSS + managed | Complex data models | GraphQL API, built-in vectorisers, multi-modal | Steeper learning curve |
| Qdrant | OSS (Rust) | Self-hosted production | Very fast, excellent filtering, rich payloads | Less managed-service ecosystem |
| pgvector | PostgreSQL ext. | Existing Postgres stacks | No new infrastructure | Slower than dedicated DBs at scale |
Tip (Interview rule of thumb)

Prototype → ChromaDB. Production managed → Pinecone. Production self-hosted → Qdrant. Already on PostgreSQL → pgvector. Complex data model or multi-modal → Weaviate.

LangChain & LlamaIndex — Building LLM Applications

LangChain: LCEL Chains

LangChain’s modern interface is LCEL (LangChain Expression Language) — a composable pipe syntax that makes chains readable and streamable:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful data scientist.'),
    ('user', '{question}')
])
llm = ChatOpenAI(model='gpt-4o', temperature=0)
parser = StrOutputParser()

# Pipe operator: left → right. Each component's output feeds the next.
chain = prompt | llm | parser
response = chain.invoke({'question': 'What is overfitting?'})

Memory — maintain conversation history across turns:

from langchain.memory import ConversationBufferWindowMemory
# Keeps only the last k exchanges — prevents unbounded context growth
memory = ConversationBufferWindowMemory(k=5)

Structured extraction with Pydantic:

from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel

class SentimentResult(BaseModel):
    sentiment: str
    confidence: float
    key_phrases: list[str]

parser = PydanticOutputParser(pydantic_object=SentimentResult)
# Add parser.get_format_instructions() to your prompt for reliable extraction

LangChain Agents — The Power Feature

Agents let LLMs decide which tools to call and in what order — the model becomes an autonomous decision-maker rather than a single-shot responder:

from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain import hub

@tool
def get_current_date() -> str:
    """Returns today's date in YYYY-MM-DD format."""
    from datetime import date
    return str(date.today())

@tool
def calculate_metrics(data_json: str) -> str:
    """Calculate mean, median, and std from a JSON array of numbers.
    Input must be a JSON string like: [1, 2, 3, 4, 5]
    """
    import json, statistics
    data = json.loads(data_json)
    return json.dumps({
        'mean': statistics.mean(data),
        'median': statistics.median(data),
        'std': statistics.stdev(data)
    })

tools = [get_current_date, calculate_metrics]
llm = ChatOpenAI(model='gpt-4o', temperature=0)
prompt = hub.pull('hwchase17/react')  # Standard ReAct prompt template
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10  # Always cap — prevents infinite loops
)
result = executor.invoke({'input': 'What is the mean of [5, 10, 15, 20, 25]?'})
print(result['output'])
Warning (Agents are non-deterministic — plan for failure)

Agents call tools based on the LLM’s reasoning chain, which can fail, loop, or take unexpected paths. Always set max_iterations, add defensive error handling to every tool, and test with adversarial inputs. Never give agents write access to production systems without a human-in-the-loop confirmation step.

LlamaIndex — When RAG Is the Core Product

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model='gpt-4o', temperature=0)
Settings.embed_model = OpenAIEmbedding(model='text-embedding-3-small')

# Load, embed, and index in three lines
documents = SimpleDirectoryReader('./docs/').load_data()
index = VectorStoreIndex.from_documents(documents)

# Simple query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query('What is the Q3 revenue?')
print(response)
print(response.source_nodes)  # Retrieved chunks used as evidence

# Advanced: SubQuestion Query Engine
# Automatically decomposes "Compare Q2 vs Q3 revenue and profit"
# into sub-questions, answers each, then synthesises
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

tools = [QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name='company_docs',
    description='Financial and operational documents for the company'
)]
sub_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
Tip (LangChain vs LlamaIndex — when to use which)

LangChain: general LLM orchestration, multi-tool agents, complex conditional workflows, broad ecosystem. Choose when building flexible agent systems or complex pipelines. LlamaIndex: specialised for data indexing and querying — better out-of-the-box RAG, richer indexing strategies (knowledge graphs, tree indices), cleaner query abstractions. If your primary use case is querying a large document collection, start with LlamaIndex.

LLM Fine-Tuning — When and How

Fine-tuning adapts a pre-trained LLM to a specific task, domain, or style by continuing training on a curated dataset. It is expensive and often unnecessary. A great senior DS knows when not to fine-tune.

The Decision Matrix

| Approach | Best for | Pros | Cons |
|---|---|---|---|
| Prompting only | General tasks, rapid prototyping | Zero cost, instant, flexible | Limited customisation, token cost per call |
| RAG | Up-to-date info, private data, citations | No training, updatable knowledge | Retrieval quality dependent, adds latency |
| Full fine-tuning | Deep domain adaptation, strong style changes | Best task performance, no retrieval overhead | Extremely expensive, needs thousands of examples, static |
| LoRA / QLoRA | Efficient adaptation on limited GPU budget | 10–100× less compute, weights can be merged | Less powerful than full fine-tuning |
| RLHF / DPO | Aligning the model to human preferences | Best for chat quality and helpfulness | Most complex pipeline, needs preference pairs |
Important (Exhaust these options before fine-tuning)

Before committing to a fine-tuning run: (1) exhaust prompt engineering with 10+ few-shot examples, (2) check if RAG solves the knowledge gap, (3) test system prompt variations. Fine-tuning is best at changing format and style — if the model lacks factual knowledge, fine-tuning often just makes the hallucinations more confident and fluent.

LoRA — Low-Rank Adaptation

LoRA is the dominant fine-tuning technique. Instead of updating all parameters (billions of numbers), it injects small trainable matrices into specific layers, leaving the base model frozen.

The core idea: the original weight matrix W \in \mathbb{R}^{d \times d} is frozen. LoRA adds a low-rank update:

W' = W + \Delta W = W + AB

where A \in \mathbb{R}^{d \times r} and B \in \mathbb{R}^{r \times d}, with the rank r \ll d. Only A and B are trained. With r = 8 on a 7B model, trainable parameters drop from 7 billion to roughly 8 million — about 0.1%.

After training, merge back: W_{\text{final}} = W + AB — zero inference overhead, same model size.
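The bookkeeping is easy to verify with NumPy. A toy sketch of one layer with d = 4096 (the per-layer arithmetic, not a full model; B starts at zero so the update ΔW = AB begins as zero, which is the standard LoRA initialisation):

```python
import numpy as np

d, r = 4096, 8                    # hidden dim of a 7B-class layer, LoRA rank
W = np.random.randn(d, d)         # frozen pretrained weight
A = np.random.randn(d, r) * 0.01  # trainable, small random init
B = np.zeros((r, d))              # trainable, zero init → ΔW starts at 0

delta_W = A @ B                   # rank-r update
W_merged = W + delta_W            # merged after training: zero inference overhead

full_params = d * d
lora_params = d * r + r * d
print(f'{lora_params:,} vs {full_params:,} '
      f'({100 * lora_params / full_params:.2f}% of one layer)')
# 65,536 vs 16,777,216 (0.39% of one layer)
```

Over a whole model the fraction is even smaller, because LoRA is typically applied only to the attention projections, which matches the ~0.1% figure above.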

QLoRA adds 4-bit quantisation of the base model before applying LoRA, making fine-tuning possible on a single consumer GPU (e.g., an RTX 4090 can fine-tune a 13B model).

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# QLoRA: 4-bit quantise the base model first
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    quantization_config=bnb_config,
    device_map='auto'
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,           # Rank — higher r = more capacity but more params
    lora_alpha=32,  # Scaling factor — rule of thumb: lora_alpha = 2 × r
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],  # Attention layers
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,044,941,312 || trainable%: 0.1695%

Training Data Format

# ── Alpaca-style instruction format ───────────────────────────────────────────
training_example = {
    'instruction': 'Classify the sentiment of this customer review.',
    'input': 'The product quality is excellent but delivery was very slow.',
    'output': 'Mixed sentiment: Positive about product quality, Negative about delivery.'
}

# ── ChatML format (OpenAI-compatible; used natively by Qwen and others) ───────
chatml_example = """<|im_start|>system
You are a helpful data scientist assistant.<|im_end|>
<|im_start|>user
What is precision in machine learning?<|im_end|>
<|im_start|>assistant
Precision is TP/(TP+FP) — the fraction of positive predictions that are actually positive.<|im_end|>"""

# ── Data quantity guidelines ──────────────────────────────────────────────────
# Style / format changes: 500–2,000 examples
# Domain adaptation: 1,000–10,000 examples
# Full task fine-tuning: 10,000+ examples
#
# Quality >> Quantity. 100 clean, diverse examples consistently beat
# 1,000 noisy or repetitive ones.
Tip (Quality beats quantity — always)

The three most common data quality issues: (1) duplicate or near-duplicate examples, (2) answers that are too generic and do not demonstrate the target behaviour, (3) inconsistent formatting across examples. Always do a manual review pass of at least 50 random samples before launching a training run.


Liked this article? Share it with a friend. Have a question, feedback, or simply wish to contact me privately? Shoot me a DM and I'll do my best to get back to you.

Have a wonderful day.

– Sarath