Transformers — The Architecture Behind Everything
The Transformer (Vaswani et al., 2017 — Attention Is All You Need) is the foundation of every modern LLM. GPT-4, Claude, LLaMA, Gemini — all of them are Transformer variants. You do not need to train one from scratch, but you need to understand every component at a conceptual and mathematical level.
Why Transformers? The Problem They Solved
Before Transformers, sequence models were Recurrent Neural Networks (RNNs). RNNs read tokens one by one, left to right, compressing everything seen so far into a single fixed-size hidden state vector.
This created two crippling problems:
- The bottleneck problem — all information had to squeeze through one vector. A 10,000-word document reduced to ~512 numbers. Long-range dependencies evaporated.
- The speed problem — sequential processing cannot be parallelised. You had to wait for token t−1 to finish before processing token t. Training on billions of tokens was not practical.
Transformers solved both by abandoning sequential processing entirely. Every token attends to every other token simultaneously and in parallel. This is the core insight.
The Attention Mechanism
Attention lets each token decide which other tokens are most relevant to understanding its meaning.
Consider: “The cat sat on the mat because it was tired.”
To understand “it”, you need to know it refers to “cat” and not “mat”. A human reading the sentence has no trouble — but an RNN loses “cat” to the bottleneck by the time it processes “it”. Attention solves this by letting “it” directly query every other token and discover that “cat” is the most relevant.
Intuition (The search engine analogy)
Think of attention as a tiny search engine inside the model. For each token (your query), you have a database of all other tokens. Each database entry has a key (a summary of what it offers) and a value (the actual information it carries). You compute how relevant your query is to each key, turn those scores into weights with softmax, then retrieve a weighted blend of values. High relevance = more of that token’s meaning flows into yours.
For each token, three vectors are computed by multiplying the token embedding by three separate learned weight matrices:
- Q (Query) — “What am I looking for?”
- K (Key) — “What do I have to offer?”
- V (Value) — “What information do I actually carry?”
The attention weights are computed and applied in one formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The √d_k scaling is crucial: in high dimensions (e.g., d_k = 64), raw dot products grow proportionally to √d_k, pushing softmax into a near-zero-gradient region and killing learning. Dividing by √d_k keeps the scores in a numerically stable range.
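A quick numerical sanity check (a minimal NumPy sketch, not from the original text): dot products of random vectors with unit-variance components have standard deviation roughly √d_k, and dividing by √d_k brings it back to roughly 1 regardless of dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    # 1,000 random query/key pairs with unit-variance components
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    raw = (q * k).sum(axis=-1)         # raw dot products: std grows with sqrt(d_k)
    scaled = raw / np.sqrt(d_k)        # scaled as in the attention formula: std stays ~1
    print(f"d_k={d_k:4d}  raw std={raw.std():6.2f}  scaled std={scaled.std():5.2f}")
```

Without scaling, the softmax inputs at d_k = 256 are ~16× larger than at d_k = 1, which is exactly the saturation the formula guards against.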
Inspecting attention weights for this sentence shows "it" attending most strongly to "cat" (≈38% of its attention weight) — the model correctly resolves the pronoun. This is coreference resolution emerging from training, not from explicit rules.
```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # mask == 0 → set to -1e9 so softmax gives ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights
```

Multi-Head Attention
A single attention head learns one type of relationship. Run h heads in parallel, each with its own W_Q, W_K, W_V matrices — each head specialises in something different.
```python
# MultiHeadAttention intuition:
# 1. Project Q, K, V into h subspaces (each of dim d_model / h)
# 2. Run scaled dot-product attention independently in each head
# 3. Concatenate all h outputs → shape (seq_len, h * d_v)
# 4. Apply linear projection W_O → back to (seq_len, d_model)

# Different heads learn different relationship types:
# Head 1: syntactic (subject–verb agreement)
# Head 2: coreference (pronouns → nouns)
# Head 3: positional (adjacent token patterns)
# Head 4+: semantic clusters, domain-specific patterns
```

Tip (GPT-2 small numbers to have ready)
d_model = 768 | n_heads = 12 (each head is 64-dim) | n_layers = 12 | vocab_size = 50,257 | context = 1,024 tokens. These are the classic reference dimensions interviewers expect you to know.
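The four steps above can be sketched end to end. A minimal NumPy implementation (names and initialisation are illustrative, not a production module), using the GPT-2-small dimensions for concreteness:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) learned projections."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # 1. Project, then split into h subspaces of dim d_model / h
    def project(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)      # each (h, seq, d_head)
    # 2. Scaled dot-product attention independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq, seq)
    out = softmax(scores) @ V                            # (h, seq, d_head)
    # 3. Concatenate heads → (seq, h * d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    # 4. Final linear projection back to d_model
    return out @ Wo

# GPT-2-small shapes: d_model=768, 12 heads of 64 dims each
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 768, 12, 10
Ws = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4)]
y = multi_head_attention(rng.standard_normal((seq_len, d_model)), *Ws, n_heads)
print(y.shape)  # (10, 768)
```

Note that splitting into heads does not add parameters over a single wide head — it partitions the same d_model dimensions into h independent attention computations.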
Full Transformer Architecture
Encoder block (BERT-style — bidirectional):
Input → Tokenisation → Token Embeddings + Positional Encoding → Multi-Head Self-Attention (attends to ALL positions) → Add & Norm (residual connection + LayerNorm) → Feed-Forward Network (2-layer MLP, GELU activation, 4× wider hidden dim) → Add & Norm → Repeat N times (N=12 for BERT-base, N=24 for BERT-large)

Decoder block (GPT-style — causal/autoregressive):
Same as the encoder, but with CAUSAL (masked) self-attention. The mask sets all future positions to −∞ before softmax. Token i can only attend to positions 0 … i — never future tokens. This enables autoregressive generation: predict the next token given all previous ones.

Positional Encoding — Transformers have no built-in sense of order. Attention is set-like, not sequence-like. Positional encodings are added to token embeddings to inject position information. Early models used fixed sinusoidal functions. Modern LLMs use Rotary Position Embedding (RoPE) (LLaMA, Qwen, Mistral) or ALiBi for better length generalisation beyond the training context window.
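The causal mask itself is just a lower-triangular matrix. A minimal NumPy sketch of how it zeroes out attention to future positions (the PyTorch equivalent uses torch.tril plus masked_fill, as in the attention function earlier):

```python
import numpy as np

seq_len = 5
# Lower-triangular matrix: row i has 1s at columns 0..i (allowed positions)
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# Applied before softmax: masked (future) positions get -1e9 → weight ≈ 0
scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))
scores = np.where(causal_mask == 0, -1e9, scores)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)

print(np.round(weights, 2))
# Row 0 (first token) puts all weight on itself; each later row spreads
# weight only over positions ≤ its own index.
```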
BERT vs GPT — The Two Paradigms
| | BERT (Encoder-only) | GPT (Decoder-only) |
|---|---|---|
| Context | Bidirectional — sees all tokens simultaneously | Causal — only sees tokens to the left |
| Pre-training | MLM: mask 15% of tokens, predict the masked ones | CLM: predict the next token at every position |
| Output | Contextualised embeddings for every token | Probability distribution over the next token |
| Best for | Classification, NER, extractive QA, sentence embeddings | Text generation, chat, summarisation, code |
| Examples | BERT, RoBERTa, DeBERTa, sentence-transformers | GPT-4, Claude, LLaMA, Gemini, Qwen |
| Can generate freely? | No — not trained for next-token prediction | Yes — autoregressive decoding |
Note (The third family: Encoder-Decoder (T5, BART, Flan-T5))
Encoder reads the input bidirectionally; decoder generates output autoregressively, attending to encoder outputs via cross-attention. Best for translation, abstractive summarisation, and multi-task instruction following (Flan-T5). Less dominant in the LLM era but important to recognise.
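Cross-attention reuses the same attention formula with one twist: queries come from the decoder, while keys and values come from the encoder output. A minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 64
encoder_out = rng.standard_normal((12, d))    # 12 source tokens, read bidirectionally
decoder_hidden = rng.standard_normal((4, d))  # 4 target tokens generated so far

# Cross-attention: Q from the decoder, K and V from the encoder output.
# Each target token retrieves a weighted blend of source-token information.
out = attention(decoder_hidden, encoder_out, encoder_out)
print(out.shape)  # (4, 64)
```

This is why the output length can differ freely from the input length — the decoder brings its own queries.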
LLM Landscape — Know Your Models
You must be able to compare models fluently. Interviewers will ask: “Which model would you choose for X and why?” Always structure your answer around four axes: context window, cost, open vs closed, and specialisation.
| Model | Provider | Context | Strengths | Best Use Cases |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | Multimodal, fast, SOTA reasoning | Enterprise apps, vision+text, coding, function calling |
| Claude 3.7 | Anthropic | 200K | Long context, safety, nuanced instruction following | Document analysis, compliance, long-form writing |
| LLaMA 3.1 | Meta (OSS) | 128K | Open weights, customisable, private deployment | Fine-tuning, on-prem RAG, cost-sensitive production |
| Gemini 1.5 Pro | Google | 1M | Massive context window, multimodal, Google ecosystem | Video/audio analysis, processing enormous documents |
| Mixtral 8x7B | Mistral (OSS) | 32K | MoE architecture — 8 experts, activates 2 per token | Efficient serving, edge deployment, open fine-tuning |
| Qwen2.5 | Alibaba (OSS) | 128K | Strong multilingual and code, open weights | Multilingual apps, Asian markets, coding assistants |
| Nova Pro | Amazon | 300K | AWS native, multimodal, Bedrock integration | AWS-native pipelines, enterprise AWS shops |
Tip (The interview answer formula)
When asked which model to use: (1) identify the key constraint (latency / cost / accuracy / data privacy), (2) eliminate models that fail it, (3) name your choice with a one-sentence justification. Example: “For on-premise deployment with fine-tuning on proprietary data, I’d use LLaMA 3.1 — open weights, competitive performance, no data leaving our infrastructure.”
Prompt Engineering — From Basics to Expert
Prompt engineering is the practice of designing inputs to reliably elicit the best outputs from LLMs. It is heavily tested at senior DS interviews — not because it replaces engineering, but because it reveals how deeply you understand the model.
The Technique Hierarchy
| Technique | When to use | The core mechanism |
|---|---|---|
| Zero-shot | Simple, well-defined tasks the model knows well | Clear instruction only, no examples |
| Few-shot | Custom formats, domain-specific classification | 3–5 input/output examples prime in-context learning |
| Chain-of-Thought | Maths, multi-step logic, complex reasoning | Forces the model to allocate compute to intermediate steps |
| ReAct | Tool-using agents, multi-hop information gathering | Interleaves reasoning (Thought) and actions (Act/Observation) |
| Self-consistency | High-stakes reasoning where accuracy matters most | Sample multiple CoT paths, take the majority answer |
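Self-consistency is simple to implement: sample several CoT completions at temperature > 0, extract each final answer, and majority-vote. A sketch — the `sample_answer` callable stands in for a real LLM call and is hypothetical:

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """sample_answer() should run the CoT prompt once (temperature ~0.7)
    and return the extracted final answer as a string."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for real LLM calls: five sampled reasoning paths, one majority answer
sampled = iter(['42', '42', '41', '42', '45'])
print(self_consistency(lambda: next(sampled)))  # → '42'
```

Individual reasoning paths can each be wrong in different ways; the majority answer is markedly more reliable, at n× the token cost.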
Zero-shot — instruction only:
Classify the sentiment of this review as Positive, Negative, or Neutral:
Review: "The battery life is amazing but the camera is disappointing."
Sentiment:

Few-shot — examples prime the model’s in-context learning:
Review: "Best product I've ever bought!" → Positive
Review: "Arrived damaged, very disappointed." → Negative
Review: "It works, nothing special." → Neutral
Review: "Incredible speed but the interface is confusing." →

Note (Recency bias in few-shot prompting)
LLMs give more weight to examples that appear later in the prompt. Always put your most representative and cleanest examples last. Shuffle edge cases to the middle.
Chain-of-Thought — force the model to reason before answering:
Q: A store starts with 48 apples. They sell 15, then receive 20% of their original stock as a restock. How many apples do they have now?
Think step by step:
1. Start: 48 apples
2. After selling: 48 − 15 = 33 apples
3. Restock: 20% of 48 = 9.6 ≈ 10 apples received
4. Final: 33 + 10 = 43 apples

Answer: 43 apples

ReAct — interleave Thought, Action, and Observation:
Thought: I need the current AAPL price to calculate market cap.
Action: search("AAPL stock price today")
Obs: AAPL is trading at $192.50.

Thought: Apple has ~15.5B shares outstanding. I can calculate now.
Action: calculate(192.50 * 15_500_000_000)
Obs: $2,983,750,000,000

Answer: Apple's current market cap is approximately $2.98 trillion.

System Prompts & Prompt Structure
Production prompts follow a consistent structure. Every missing element is a potential failure mode:
```python
SYSTEM_PROMPT = """You are a customer support specialist for Acme Corp.

Role: Answer customer questions accurately and helpfully.

Rules:
- Only answer questions about Acme Corp products.
- If unsure, say "I don't have that information" — never fabricate.
- Keep responses under 150 words.
- End every response with: "Is there anything else I can help you with?"

Tone: Professional, warm, empathetic.
Format: Plain text. Use bullet points only for multi-step instructions."""

USER_MSG = """Context: {retrieved_docs}

Customer question: {question}

Response:"""

# The five elements of a strong system prompt:
# 1. Persona — who the model IS
# 2. Task — what it should do
# 3. Rules — what it must / must not do
# 4. Format — how to structure output
# 5. Examples — few-shot if format is non-standard
```

Structured Outputs & Function Calling
Function calling forces the LLM to return structured JSON — critical for production integrations where you need reliable parsing:
```python
import openai, json

client = openai.OpenAI()

tools = [{
    'type': 'function',
    'function': {
        'name': 'extract_order_info',
        'description': 'Extract order details from a customer message',
        'parameters': {
            'type': 'object',
            'properties': {
                'order_id': {'type': 'string', 'description': 'Order ID if mentioned'},
                'issue_type': {'type': 'string', 'enum': ['delay', 'wrong_item', 'damaged', 'missing', 'other']},
                'urgency': {'type': 'string', 'enum': ['low', 'medium', 'high']},
                'customer_emotion': {'type': 'string', 'enum': ['neutral', 'frustrated', 'angry', 'happy']}
            },
            'required': ['issue_type', 'urgency', 'customer_emotion']
        }
    }
}]

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'My order #12345 is 2 weeks late and I need it ASAP!'}],
    tools=tools,
    tool_choice={'type': 'function', 'function': {'name': 'extract_order_info'}}
)

result = json.loads(
    response.choices[0].message.tool_calls[0].function.arguments
)
# {'order_id': '12345', 'issue_type': 'delay', 'urgency': 'high', 'customer_emotion': 'frustrated'}
```

Tip (Function calling vs output parsers)
Use native function calling when your provider supports it (OpenAI, Anthropic) — it is more reliable than instructing the model to output JSON. For models without native tool support, use a Pydantic output parser with retry logic in LangChain.
Retrieval-Augmented Generation (RAG)
RAG is the most important production pattern for LLM applications. Every senior GenAI DS must be able to build, evaluate, and optimise a RAG system from scratch.
Why RAG? The Three Problems It Solves
LLMs have three fatal flaws for production use:
- Knowledge cutoff — they know nothing after their training date. GPT-4’s world ends in April 2023.
- Hallucination — when they don’t know something, they confidently make it up. They cannot say “I don’t know” unless trained to.
- No private data — they cannot access your internal documents, databases, or real-time information.
RAG solves all three: retrieve relevant documents at query time and inject them into the context window as grounding evidence. The LLM answers from evidence, not from memory.
The Architecture
The pipeline runs in two phases at different times — indexing is done once, ahead of time; querying happens per request.
Important (The same-model rule — most common RAG bug)
Always embed documents and queries with the exact same embedding model. Using different models puts your vectors in different spaces — similarity search returns garbage. This is the single most common cause of “my RAG retrieves nonsense” in production.
Build a Full RAG Pipeline
```python
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# ── Step 1: LOAD ──────────────────────────────────────────────
loader = DirectoryLoader('./docs/', glob='**/*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()

# ── Step 2: CHUNK ─────────────────────────────────────────────
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,                        # 10% overlap prevents info loss at boundaries
    separators=['\n\n', '\n', '.', ' ', '']  # Tries each in order (prefers natural breaks)
)
chunks = splitter.split_documents(documents)

# ── Step 3: EMBED + INDEX ─────────────────────────────────────
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vectordb = FAISS.from_documents(chunks, embeddings)
vectordb.save_local('./faiss_index')  # Persist for reuse

# ── Step 4: BUILD RETRIEVAL CHAIN ─────────────────────────────
llm = ChatOpenAI(model='gpt-4o', temperature=0)

PROMPT = PromptTemplate(
    input_variables=['context', 'question'],
    template="""You are a helpful assistant. Answer using ONLY the context below.
If the answer is not in the context, say "I don't have that information."

Context: {context}

Question: {question}

Answer:"""
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',  # 'map_reduce' when retrieved chunks overflow context
    retriever=vectordb.as_retriever(search_kwargs={'k': 5}),
    chain_type_kwargs={'prompt': PROMPT},
    return_source_documents=True
)

# ── Step 5: QUERY ─────────────────────────────────────────────
result = qa_chain({'query': 'What are the return policies?'})
print(result['result'])            # The grounded answer
print(result['source_documents'])  # Evidence chunks — show these as citations
```

Advanced RAG — What Separates Good from Great
| Technique | What it does | When to add it |
|---|---|---|
| Hybrid Search | Combine vector (dense) + BM25 (sparse) via Reciprocal Rank Fusion | When exact keyword matches matter (product codes, names, IDs) |
| Reranking | Retrieve top-20 with ANN, rerank with cross-encoder, keep top-5 | High-accuracy apps where 50–100ms extra latency is acceptable |
| HyDE | Generate a hypothetical ideal answer first, embed it, then search | Factoid queries where the question and answer embeddings naturally differ |
| Query Expansion | Break a complex query into sub-queries, answer each, synthesise | Multi-hop questions spanning multiple documents |
| Parent-Child Chunking | Index small chunks for precision, return larger parent chunks | When retrieval is precise but generation needs more surrounding context |
| Metadata Filtering | Pre-filter by date/source/category before semantic search | Large knowledge bases with clear categorical structure |
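Reciprocal Rank Fusion, the merging step used by hybrid search above, is only a few lines. A sketch (the constant k = 60 is the value commonly used with RRF; document IDs are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs. RRF score: sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense (vector) and sparse (BM25) retrievers disagree; RRF rewards documents
# that rank high in either list without needing comparable raw scores.
vector_hits = ['doc_a', 'doc_b', 'doc_c', 'doc_d']
bm25_hits   = ['doc_c', 'doc_a', 'doc_e']
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_e', 'doc_d']
```

Working on ranks rather than raw scores is the point: cosine similarities and BM25 scores live on incompatible scales, but ranks always combine cleanly.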
Tip (Chunking is the most underrated RAG decision)
Chunk size dramatically affects quality. Small chunks (128 tokens) — precise retrieval, thin context. Large chunks (1024 tokens) — rich context, imprecise retrieval. Good default: 512 tokens with 50-token overlap. For single-sentence factoid QA, go smaller. For summarisation, go larger. Measure retrieval recall first — most RAG failures are retrieval failures, not generation failures.
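To make the size/overlap tradeoff concrete, here is a minimal character-level chunker (a sketch: the production splitter in the pipeline above also respects separator boundaries; this only shows the overlap mechanics):

```python
def chunk_text(text, chunk_size=512, overlap=50):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` characters of the previous one, so a fact straddling a
    boundary survives intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text('x' * 1200, chunk_size=512, overlap=50)
print([len(c) for c in chunks])  # [512, 512, 276]
```

Larger overlap costs index size and retrieval redundancy; zero overlap risks silently splitting the exact sentence your query needed.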
Vector Databases & Embeddings
Embeddings — Converting Meaning to Numbers
An embedding model converts text into a dense vector — typically 768 to 3072 numbers. The remarkable property: texts with similar meanings land near each other in this high-dimensional space, regardless of exact wording.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model='text-embedding-3-small',  # 1536 dimensions, $0.02 per 1M tokens
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = embed('machine learning')
v2 = embed('deep learning')
v3 = embed('cooking recipes')

print(cosine_similarity(v1, v2))  # ~0.88 — high similarity
print(cosine_similarity(v1, v3))  # ~0.42 — low similarity
```

Intuition (The city map analogy)
Think of embedding space as a city. “Machine learning” and “deep learning” are in the same neighbourhood — a short walk apart. “Cooking recipes” is across the city. When you embed a query, you drop a pin on this map, then find all documents within a certain radius. Cosine similarity measures the angle between vectors (not distance) — this works better for high-dimensional text because all vectors sit on a high-dimensional sphere.
FAISS — Fast Approximate Nearest-Neighbour Search
```python
import faiss
import numpy as np

dimension = 1536  # Must match your embedding model's output

# ── IndexFlatL2 ───────────────────────────────────────────────
# Exact L2 (Euclidean) search — brute force, 100% recall
# Good for < 100K vectors or when perfect accuracy is required
index_flat = faiss.IndexFlatL2(dimension)

# ── IndexIVFFlat ──────────────────────────────────────────────
# Approximate — clusters vectors into nlist cells, searches nprobe cells at query time
# Faster but ~1% recall loss. Must train before adding vectors.
quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, 100)  # nlist=100
index_ivf.train(embeddings_np)
index_ivf.nprobe = 10  # Higher = more accurate but slower

# ── IndexHNSWFlat ─────────────────────────────────────────────
# Graph-based — best speed/recall tradeoff for millions of vectors
# No training required, very fast at query time
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)  # M=32 graph connectivity

# Add and search
index_flat.add(embeddings_np)              # shape: (n_documents, dimension)
D, I = index_flat.search(query_emb, 5)     # k=5; D=distances, I=indices
retrieved_chunks = [all_chunks[i] for i in I[0]]
```

Vector Database Comparison
| Database | Type | Best for | Strengths | Weaknesses |
|---|---|---|---|---|
| FAISS | Open-source | Research, offline batch | Fastest raw search, flexible index types | No metadata filtering, no HTTP API |
| ChromaDB | Open-source | Prototyping, local RAG | Simplest setup, embedded Python | Slower at scale |
| Pinecone | Managed SaaS | Production startups | Fully managed, excellent filtering, hybrid search | Cost scales with usage |
| Weaviate | OSS + managed | Complex data models | GraphQL API, built-in vectorisers, multi-modal | Steeper learning curve |
| Qdrant | OSS (Rust) | Self-hosted production | Very fast, excellent filtering, rich payloads | Less managed-service ecosystem |
| pgvector | PostgreSQL ext. | Existing Postgres stacks | No new infrastructure | Slower than dedicated DBs at scale |
Tip (Interview rule of thumb)
Prototype → ChromaDB. Production managed → Pinecone. Production self-hosted → Qdrant. Already on PostgreSQL → pgvector. Complex data model or multi-modal → Weaviate.
LangChain & LlamaIndex — Building LLM Applications
LangChain: LCEL Chains
LangChain’s modern interface is LCEL (LangChain Expression Language) — a composable pipe syntax that makes chains readable and streamable:
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful data scientist.'),
    ('user', '{question}')
])
llm = ChatOpenAI(model='gpt-4o', temperature=0)
parser = StrOutputParser()

# Pipe operator: left → right. Each component's output feeds the next.
chain = prompt | llm | parser
response = chain.invoke({'question': 'What is overfitting?'})
```

Memory — maintain conversation history across turns:
```python
from langchain.memory import ConversationBufferWindowMemory

# Keeps only the last k exchanges — prevents unbounded context growth
memory = ConversationBufferWindowMemory(k=5)
```

Structured extraction with Pydantic:
```python
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel

class SentimentResult(BaseModel):
    sentiment: str
    confidence: float
    key_phrases: list[str]

parser = PydanticOutputParser(pydantic_object=SentimentResult)
# Add parser.get_format_instructions() to your prompt for reliable extraction
```

LangChain Agents — The Power Feature
Agents let LLMs decide which tools to call and in what order — the model becomes an autonomous decision-maker rather than a single-shot responder:
```python
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain import hub

@tool
def get_current_date() -> str:
    """Returns today's date in YYYY-MM-DD format."""
    from datetime import date
    return str(date.today())

@tool
def calculate_metrics(data_json: str) -> str:
    """Calculate mean, median, and std from a JSON array of numbers.

    Input must be a JSON string like: [1, 2, 3, 4, 5]
    """
    import json, statistics
    data = json.loads(data_json)
    return json.dumps({
        'mean': statistics.mean(data),
        'median': statistics.median(data),
        'std': statistics.stdev(data)
    })

tools = [get_current_date, calculate_metrics]
llm = ChatOpenAI(model='gpt-4o', temperature=0)
prompt = hub.pull('hwchase17/react')  # Standard ReAct prompt template

agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10  # Always cap — prevents infinite loops
)
result = executor.invoke({'input': 'What is the mean of [5, 10, 15, 20, 25]?'})
print(result['output'])
```

Warning (Agents are non-deterministic — plan for failure)
Agents call tools based on the LLM’s reasoning chain, which can fail, loop, or take unexpected paths. Always set max_iterations, add defensive error handling to every tool, and test with adversarial inputs. Never give agents write access to production systems without a human-in-the-loop confirmation step.
LlamaIndex — When RAG Is the Core Product
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model='gpt-4o', temperature=0)
Settings.embed_model = OpenAIEmbedding(model='text-embedding-3-small')

# Load, embed, and index in three lines
documents = SimpleDirectoryReader('./docs/').load_data()
index = VectorStoreIndex.from_documents(documents)

# Simple query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query('What is the Q3 revenue?')
print(response)
print(response.source_nodes)  # Retrieved chunks used as evidence

# Advanced: SubQuestion Query Engine
# Automatically decomposes "Compare Q2 vs Q3 revenue and profit"
# into sub-questions, answers each, then synthesises
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

tools = [QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name='company_docs',
    description='Financial and operational documents for the company'
)]
sub_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
```

Tip (LangChain vs LlamaIndex — when to use which)
LangChain: general LLM orchestration, multi-tool agents, complex conditional workflows, broad ecosystem. Choose when building flexible agent systems or complex pipelines. LlamaIndex: specialised for data indexing and querying — better out-of-the-box RAG, richer indexing strategies (knowledge graphs, tree indices), cleaner query abstractions. If your primary use case is querying a large document collection, start with LlamaIndex.
LLM Fine-Tuning — When and How
Fine-tuning adapts a pre-trained LLM to a specific task, domain, or style by continuing training on a curated dataset. It is expensive and often unnecessary. A great senior DS knows when not to fine-tune.
The Decision Matrix
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| Prompting only | General tasks, rapid prototyping | Zero cost, instant, flexible | Limited customisation, token cost per call |
| RAG | Up-to-date info, private data, citations | No training, updatable knowledge | Retrieval quality dependent, adds latency |
| Full fine-tuning | Deep domain adaptation, strong style changes | Best task performance, no retrieval overhead | Extremely expensive, needs thousands of examples, static |
| LoRA / QLoRA | Efficient adaptation on limited GPU budget | 10–100× less compute, weights can be merged | Less powerful than full fine-tuning |
| RLHF / DPO | Aligning the model to human preferences | Best for chat quality and helpfulness | Most complex pipeline, needs preference pairs |
Important (Exhaust these options before fine-tuning)
Before committing to a fine-tuning run: (1) exhaust prompt engineering with 10+ few-shot examples, (2) check if RAG solves the knowledge gap, (3) test system prompt variations. Fine-tuning is best at changing format and style — if the model lacks factual knowledge, fine-tuning often just makes the hallucinations more confident and fluent.
LoRA — Low-Rank Adaptation
LoRA is the dominant fine-tuning technique. Instead of updating all parameters (billions of numbers), it injects small trainable matrices into specific layers, leaving the base model frozen.
The core idea: the original weight matrix W ∈ ℝ^(d×k) is frozen. LoRA adds a low-rank update:

W′ = W + BA

where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with the rank r ≪ min(d, k). Only B and A are trained. With r = 8 on a 7B model, trainable parameters drop from 7 billion to roughly 8 million — about 0.1%.

After training, merge back: W′ = W + BA — zero inference overhead, same model size.
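The parameter arithmetic is worth being able to reproduce on a whiteboard. A sketch using a LLaMA-style 4096×4096 attention projection (the specific dimensions are illustrative):

```python
# Parameter arithmetic for one LoRA-adapted attention projection
d, k, r = 4096, 4096, 8          # LLaMA-7B-scale projection, rank 8

full = d * k                     # ~16.8M params if trained directly
lora = d * r + r * k             # B (d×r) plus A (r×k)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
# full: 16,777,216  lora: 65,536  ratio: 0.3906%

# Adapting only a handful of such projections per layer is how the
# model-wide trainable fraction lands near 0.1% of 7B.
```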
QLoRA adds 4-bit quantisation of the base model before applying LoRA, making fine-tuning possible on a single consumer GPU (e.g., an RTX 4090 can fine-tune a 13B model).
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# QLoRA: 4-bit quantise the base model first
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    quantization_config=bnb_config,
    device_map='auto'
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,           # Rank — higher r = more capacity but more params
    lora_alpha=32,  # Scaling factor — rule of thumb: lora_alpha = 2 × r
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],  # Attention layers
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,044,941,312 || trainable%: 0.1695%
```

Training Data Format
```python
# ── Alpaca-style instruction format ───────────────────────────
training_example = {
    'instruction': 'Classify the sentiment of this customer review.',
    'input': 'The product quality is excellent but delivery was very slow.',
    'output': 'Mixed sentiment: Positive about product quality, Negative about delivery.'
}

# ── ChatML format (OpenAI-compatible, LLaMA 3 native) ─────────
chatml_example = """<|im_start|>system
You are a helpful data scientist assistant.<|im_end|>
<|im_start|>user
What is precision in machine learning?<|im_end|>
<|im_start|>assistant
Precision is TP/(TP+FP) — the fraction of positive predictions that are actually positive.<|im_end|>"""

# ── Data quantity guidelines ──────────────────────────────────
# Style / format changes:  500–2,000 examples
# Domain adaptation:       1,000–10,000 examples
# Full task fine-tuning:   10,000+ examples
#
# Quality >> Quantity. 100 clean, diverse examples consistently beat
# 1,000 noisy or repetitive ones.
```

Tip (Quality beats quantity — always)
The three most common data quality issues: (1) duplicate or near-duplicate examples, (2) answers that are too generic and do not demonstrate the target behaviour, (3) inconsistent formatting across examples. Always do a manual review pass of at least 50 random samples before launching a training run.