Building Real-World AI Systems with RAG, Agents, and LLM Applications
Overview

March 5, 2026 · 15 min read

AI Agents & Tool Calling

An AI agent is an LLM that can autonomously decide to use tools, reason across multiple steps, and complete complex tasks without a human directing every action. This is the cutting edge of GenAI in 2025.

The ReAct Loop — The Core Pattern

Every agent framework — LangChain, LangGraph, AutoGen, CrewAI — implements some variant of the ReAct pattern (Reason + Act). The LLM reasons in a Thought, takes an Action (tool call), observes the result, and loops until it can give a Final Answer.

ReAct Agent Loop

The core pattern powering all modern AI agents:

  • User Input + Tool Schemas — the task arrives with the available tools defined.
  • Thought — the LLM reads everything: user task, system prompt, tool schemas, and all previous observations. It reasons explicitly before deciding — this inner monologue is what allows multi-step planning and self-correction. Example: Thought: I need the current AAPL stock price. I will call search_stock with ticker="AAPL".
  • Final Answer — the loop ends and the response is returned to the user.

WHILE not done:
1. LLM receives: system prompt + conversation history + tool schemas
2. LLM outputs: Thought (reasoning) + Action (tool call) OR Final Answer
3. If action: execute tool → append Observation → go to step 1
4. If final answer: return to user
Important (max_iterations is not optional)

Without a hard cap on the number of loop iterations, agents can run indefinitely — burning API credits and never returning. Always set max_iterations=10 (or lower). Add a cost budget if you’re running expensive models.
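The loop above can be sketched in plain Python. This is an illustrative skeleton, not any framework's API — `call_llm`, `TOOLS`, and the dictionary-based action format are stand-ins for a real model client and tool schemas:

```python
import json

MAX_ITERATIONS = 10  # hard cap — prevents runaway loops and cost explosions

TOOLS = {  # hypothetical tool registry: name → callable
    'search_stock': lambda ticker: {'ticker': ticker, 'price': 187.42},
}

def call_llm(history):
    """Stand-in for a real model call. A real LLM would emit a Thought plus
    either an Action (tool call) or a Final Answer based on the history."""
    if any(msg['role'] == 'observation' for msg in history):
        return {'final_answer': 'AAPL is trading at $187.42.'}
    return {'action': 'search_stock', 'args': {'ticker': 'AAPL'}}

def react_loop(task: str) -> str:
    history = [{'role': 'user', 'content': task}]
    for _ in range(MAX_ITERATIONS):
        step = call_llm(history)                        # Thought + decision
        if 'final_answer' in step:                      # loop ends
            return step['final_answer']
        result = TOOLS[step['action']](**step['args'])  # execute the tool
        history.append({'role': 'observation', 'content': json.dumps(result)})
    raise RuntimeError('max_iterations reached — returning control to caller')

print(react_loop('What is the current AAPL stock price?'))
```

Note that the cap is enforced by the loop structure itself, not by trusting the model to stop.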

The Four Agent Components

| Component | Role | Common implementations |
|---|---|---|
| Brain / LLM | Reasoning engine — decides what to do and when | GPT-4o, Claude 3.7, LLaMA 3.1 |
| Memory | Short-term (context window) + long-term (vector DB) | ConversationBufferWindowMemory, FAISS retrieval |
| Tools | External capabilities the LLM can invoke | Web search, code exec, SQL, REST APIs, file I/O |
| Orchestrator | Manages the action loop, executes tool calls, maintains state | LangGraph, AgentExecutor, AutoGen, CrewAI |

Multi-Agent Systems

Single-agent systems hit a ceiling — one LLM doing everything runs out of context and makes more errors on long tasks. Multi-agent systems assign specialised roles to different models.

| Framework | Approach | Best for |
|---|---|---|
| AutoGen (Microsoft) | Conversational agents that collaborate: AssistantAgent + UserProxyAgent | Complex research tasks, code review loops |
| CrewAI | Role-based agents with defined goals, tools, and a structured workflow (sequential or parallel) | Business process automation, report generation |
| LangGraph | Graph-based orchestration — nodes (agents/functions) + edges (transitions) + cycles | Complex conditional workflows that need looping and branching |

Key design patterns:

  • Router: central agent routes incoming tasks to specialist agents
  • Supervisor: manager-worker hierarchy — supervisor assigns tasks, checks results
  • Parallel: independent sub-graphs run simultaneously, results merged
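The router pattern can be sketched in plain Python. A keyword check stands in for the LLM-based classifier, and the agent names are illustrative, not a framework API:

```python
# Hypothetical specialist agents — in practice each would wrap its own LLM call.
def sql_agent(task: str) -> str:
    return f'[sql_agent] generated a query for: {task}'

def research_agent(task: str) -> str:
    return f'[research_agent] compiled findings for: {task}'

SPECIALISTS = {'sql': sql_agent, 'research': research_agent}

def route(task: str) -> str:
    """Central router: classify the task, then dispatch to a specialist.
    A real system would use an LLM (or a small classifier) for this step."""
    label = 'sql' if 'database' in task.lower() else 'research'
    return SPECIALISTS[label](task)

print(route('Query the orders database for Q3 revenue'))
print(route('Summarise recent papers on RAG evaluation'))
```

The supervisor pattern is the same shape plus a checking step: the router also inspects each specialist's output before accepting it.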
# LangGraph — stateful agent with review cycle
# Assumes llm and reviewer_llm are already-initialised chat models
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    next: str

def analyst_node(state: AgentState):
    response = llm.invoke(state['messages'])
    return {'messages': [response]}

def reviewer_node(state: AgentState):
    response = reviewer_llm.invoke(state['messages'])
    # Routing decision based on response content
    if 'APPROVED' in response.content:
        return {'next': 'END', 'messages': [response]}
    return {'next': 'analyst', 'messages': [response]}

workflow = StateGraph(AgentState)
workflow.add_node('analyst', analyst_node)
workflow.add_node('reviewer', reviewer_node)
workflow.set_entry_point('analyst')
workflow.add_edge('analyst', 'reviewer')
workflow.add_conditional_edges(
    'reviewer',
    lambda s: s['next'],
    {'analyst': 'analyst', 'END': END}
)
app = workflow.compile()
Warning (Multi-agent failure modes)

Hallucinated tool calls, infinite loops, and cost explosion are the three main risks. Always: set max_iterations, add error handling to every tool, log every agent step for debugging, and never give agents write access to production systems without a human confirmation step.
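Those mitigations can be baked into a small tool wrapper. A sketch in plain Python — the logger name and error-string format here are illustrative choices, not a framework convention:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('agent.tools')

def safe_tool(fn):
    """Wrap a tool so every call is logged and every exception is returned
    as an error string the LLM can see — instead of crashing the agent loop."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info('tool=%s args=%r kwargs=%r', fn.__name__, args, kwargs)
        try:
            result = fn(*args, **kwargs)
            log.info('tool=%s ok', fn.__name__)
            return result
        except Exception as exc:  # surface the failure as an observation
            log.warning('tool=%s failed: %s', fn.__name__, exc)
            return f'ERROR: {fn.__name__} failed with {exc!r} — try another approach.'
    return wrapper

@safe_tool
def lookup_order(order_id: str) -> str:
    if not order_id.isdigit():
        raise ValueError(f'invalid order id {order_id!r}')
    return f'order {order_id}: shipped'

print(lookup_order('1234'))   # normal call
print(lookup_order('abc'))    # failure becomes a recoverable observation
```

Returning the error as text lets the agent self-correct on the next Thought instead of terminating mid-task.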

Responsible AI & Explainability

Enterprise clients care deeply about responsible AI. Showing you can build fair, explainable, and safe systems is a differentiator.

Bias in ML Systems

Bias is not just “the model is unfair” — it has specific technical causes:

| Type | Source | Example |
|---|---|---|
| Historical bias | Training data reflects past discrimination | Loan approval model trained on data from when banks discriminated by race |
| Representation bias | Certain groups underrepresented in training data | Face recognition trained mostly on lighter-skinned faces |
| Measurement bias | Features measured differently across groups | ZIP code as a proxy for race in credit scoring |
| Aggregation bias | Single model applied to groups that behave differently | One churn model for enterprise and consumer customers |
| Evaluation bias | Test set doesn't reflect real deployment distribution | Testing a medical model on hospital data but deploying in clinics |

Fairness Metrics — and Why They Conflict

Demographic Parity — equal positive prediction rates across groups:

P(\hat{Y}=1 \mid A=0) = P(\hat{Y}=1 \mid A=1)

Equalized Odds — equal TPR and FPR across groups (stronger than demographic parity):

\text{TPR}_{A=0} = \text{TPR}_{A=1} \quad \text{and} \quad \text{FPR}_{A=0} = \text{FPR}_{A=1}

Calibration — probability scores should be equally accurate across groups:

P(Y=1 \mid \hat{p}=s,\ A=0) = P(Y=1 \mid \hat{p}=s,\ A=1) \quad \forall s

Important (The impossibility theorem — know this cold)

Chouldechova (2017) proved that when base rates differ between groups, demographic parity, equalized odds, and calibration cannot all hold simultaneously. You must choose which fairness criterion to prioritise based on the business context and the cost of different error types. There is no mathematically perfect answer.
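A quick way to internalise the tension is to compute the metrics on toy data. In the made-up example below the groups have different base rates; demographic parity holds (both groups get P(ŷ=1) = 0.5) while equalized odds fails (TPR and FPR differ):

```python
def rates(y_true, y_pred, group, g):
    """Positive-prediction rate, TPR, and FPR for one group g."""
    idx = [i for i, a in enumerate(group) if a == g]
    pos_rate = sum(y_pred[i] for i in idx) / len(idx)
    tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
    fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
    fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
    tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return pos_rate, tpr, fpr

# Toy data: group 0 has a higher base rate of Y=1 than group 1.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

for g in (0, 1):
    pos, tpr, fpr = rates(y_true, y_pred, group, g)
    print(f'group {g}: P(pred=1)={pos:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}')
```

Equalising the TPRs here would force the positive-prediction rates apart — exactly the trade-off the theorem formalises.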

Explainability — SHAP & LIME

SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. The Shapley value for feature i is the average marginal contribution of that feature across all possible feature orderings:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \Big[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \Big]

In plain English: φ_i measures how much feature i pushes the prediction up or down from the baseline, averaged fairly across every way the features could be ordered.
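That averaging can be made concrete with a brute-force computation over orderings — an illustrative toy, not how the shap library works internally (it uses far faster algorithms). For a purely additive model, each feature's Shapley value comes out as exactly its own contribution:

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values by averaging marginal contributions
    over every ordering of the features."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        included = []
        for f in order:
            before = value_fn(frozenset(included))
            included.append(f)
            after = value_fn(frozenset(included))
            phi[f] += after - before  # marginal contribution of f
    return {f: total / len(orderings) for f, total in phi.items()}

# Hypothetical additive model: prediction = 2*income + 1*age above baseline.
contrib = {'income': 2.0, 'age': 1.0}
model = lambda S: sum(contrib[f] for f in S)
print(shapley_values(['income', 'age'], model))
```

With interactions between features the orderings would disagree, and the averaging is what distributes the interaction credit fairly.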

import shap

# ── TreeExplainer: fast, exact SHAP for tree-based models ──────────────────
explainer = shap.TreeExplainer(xgboost_model)
shap_values = explainer.shap_values(X_test)

# Single prediction — waterfall plot shows each feature's contribution
shap.waterfall_plot(explainer(X_test)[0])

# Global feature importance — beeswarm (shows direction of effect)
shap.summary_plot(shap_values, X_test)

# Global feature importance — bar chart (mean absolute SHAP values)
shap.summary_plot(shap_values, X_test, plot_type='bar')

# Feature interaction — how two features jointly affect prediction
shap.dependence_plot('income', shap_values, X_test, interaction_index='age')

# ── LIME: local linear approximations (model-agnostic) ──────────────────────
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=['negative', 'positive'])
explanation = explainer.explain_instance(
    text,
    classifier.predict_proba,
    num_features=10
)
explanation.show_in_notebook()
Tip (SHAP vs LIME — when to use which)

SHAP: consistent, theoretically grounded, global + local explanations. Preferred for tabular models. Slow for non-tree models (use KernelSHAP or DeepSHAP).
LIME: faster for any model, good for text and image. Local only. Less consistent — small perturbations can give different results. Use LIME as a quick sanity check; use SHAP for reporting to stakeholders.

LLM Guardrails — Defence in Depth

Production LLM systems need layers of defence against harmful outputs, prompt injection, and hallucinations:

import re

# ── Layer 1: Input validation (fast, cheap) ────────────────────────────────
def validate_input(user_input: str) -> tuple[bool, str]:
    injection_patterns = [
        r'ignore previous',
        r'forget instructions',
        r'act as',
        r'jailbreak',
    ]
    for pattern in injection_patterns:
        if re.search(pattern, user_input, re.IGNORECASE):
            return False, 'Potential prompt injection detected'
    if len(user_input) > 2000:
        return False, 'Input exceeds maximum length'
    return True, 'OK'

# ── Layer 2: LLM-based classifier (Llama Guard) ────────────────────────────
from transformers import pipeline

guard = pipeline('text-classification', model='meta-llama/LlamaGuard-7b')
result = guard(f'[INST] {user_message} [/INST] {model_response}')
# Returns 'safe' or 'unsafe' with category

# ── Layer 3: Output validation ─────────────────────────────────────────────
def validate_output(response: str, retrieved_context: str) -> float:
    """Return faithfulness score [0, 1] — does response stay within context?"""
    # Use an NLI model to check if response is entailed by context
    from transformers import pipeline as hf_pipeline
    nli = hf_pipeline('text-classification',
                      model='cross-encoder/nli-deberta-v3-base',
                      top_k=None)  # return scores for every label, not just the top one
    scores = nli(f'{retrieved_context} [SEP] {response}')[0]
    # The 'entailment' score is your faithfulness proxy
    return next(s['score'] for s in scores if s['label'].lower() == 'entailment')
Note (NeMo Guardrails (NVIDIA))

NVIDIA’s open-source framework lets you define rails in Colang — a domain-specific language for specifying what topics a bot should and shouldn’t engage with. Good for production systems where compliance teams need to review and approve the rules independently of the model.

Cloud GenAI Platforms

Tredence clients span all three major clouds. You must be able to speak intelligently about each and recommend the right services for a given architecture.

Platform Comparison

| Capability | Azure | AWS | GCP |
|---|---|---|---|
| Flagship models | GPT-4o, GPT-4, Whisper, DALL-E | Claude, Llama 3, Mistral, Nova Pro | Gemini 1.5 Pro/Flash, Llama 3, Claude |
| Managed RAG | Azure AI Search + Prompt Flow | Knowledge Bases for Bedrock (S3 → OpenSearch) | Vertex AI Search + Data Stores |
| Managed agents | Azure AI Studio / Prompt Flow | Agents for Bedrock (Lambda as tools) | Vertex AI Agent Builder |
| Content safety | Azure Content Safety (multimodal) | Guardrails for Bedrock (built-in) | Vertex AI content filters |
| Fine-tuning | Azure OpenAI Fine-tuning (GPT-3.5/4) | Bedrock fine-tuning (Titan, Llama) | Vertex AI supervised fine-tuning |
| Key advantage | Data stays in tenant, enterprise compliance | Unified API across many model families | Deep BigQuery + Workspace integration |

Azure OpenAI

from openai import AzureOpenAI
import os

client = AzureOpenAI(
    api_key=os.getenv('AZURE_OPENAI_KEY'),
    api_version='2024-02-01',
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT')
)
response = client.chat.completions.create(
    model='gpt-4o',  # This is the deployment name in your Azure resource
    messages=[{'role': 'user', 'content': 'Explain RAG in one paragraph.'}],
    max_tokens=500,
    temperature=0
)
print(response.choices[0].message.content)
Tip (Azure enterprise considerations)

Use Managed Identity (MSI → RBAC → Azure OpenAI) instead of API keys for production — no secrets to rotate or leak. Add a Private Endpoint for strict data residency compliance. Azure Content Safety provides multimodal moderation (hate, violence, sexual, self-harm) that can be chained before and after your LLM call.

AWS Bedrock

import boto3, json

client = boto3.client('bedrock-runtime', region_name='us-east-1')
response = client.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 500,
        'messages': [{'role': 'user', 'content': 'Explain RAG in one paragraph.'}]
    })
)
result = json.loads(response['body'].read())
print(result['content'][0]['text'])
Tip (Bedrock key services to know)

Knowledge Bases for Bedrock: S3 → managed embeddings → OpenSearch Serverless or Aurora pgvector. Zero embedding infrastructure to manage. Agents for Bedrock: define action groups as Lambda functions — Bedrock handles the ReAct loop. Guardrails for Bedrock: content filtering, PII redaction, and topic denial built in — review requirements without writing custom validators. Nova Pro: Amazon’s own flagship multimodal model (300K context), deeply integrated with AWS services.

GCP Vertex AI

GCP’s unique advantage is the 1M-token context of Gemini 1.5 Pro and the deep integration with BigQuery.

-- BigQuery ML: run LLM inference directly in SQL
-- ML.GENERATE_TEXT is a table-valued function: pass the model, a query
-- that produces a `prompt` column, and the generation parameters
SELECT
  customer_id,
  review_text,
  ml_generate_text_result AS sentiment_analysis
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (
    SELECT
      customer_id,
      review_text,
      CONCAT('Classify the sentiment of this review: ', review_text) AS prompt
    FROM `project.dataset.customer_reviews`
    WHERE DATE(created_at) = CURRENT_DATE()
  ),
  STRUCT(0 AS temperature, 100 AS max_output_tokens)
)

Key services: Model Garden (unified API for Gemini, Llama, Mistral, Claude), Vertex AI Search (enterprise RAG with grounding), Vertex AI Pipelines (Kubeflow-based MLOps), Feature Store, Model Registry, Monitoring.

LLM Evaluation — RAGAS & Metrics

Knowing how to measure quality systematically separates engineers from scientists. Poor evaluation is the most common reason LLM projects fail silently in production.

RAGAS — RAG-Specific Evaluation

RAGAS evaluates the full RAG pipeline using LLMs as judges — no labelled dataset required for most metrics:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # Is the answer grounded in retrieved context? (no hallucinations)
    answer_relevancy,    # Does the answer address the question?
    context_precision,   # Are the retrieved chunks relevant to the question?
    context_recall,      # Does the retrieved context cover the ground-truth answer?
    answer_correctness,  # Does the answer match the ground truth?
)
from datasets import Dataset

data = {
    'question': ['What is the return policy?', 'How do I cancel my subscription?'],
    'answer': ['Returns accepted within 30 days', 'Email support to cancel anytime'],
    'contexts': [['Our policy allows returns in 30 days...'], ['Cancellations via support email...']],
    'ground_truth': ['Products can be returned within 30 days', 'Cancel by emailing support']
}
dataset = Dataset.from_dict(data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results.to_pandas())
| Metric | Measures | Production target |
|---|---|---|
| Faithfulness | Is the answer supported by retrieved context? (hallucination rate) | > 0.85 |
| Context Precision | Are the retrieved chunks relevant to the question? | > 0.75 |
| Context Recall | Does the retrieved context cover the ground-truth answer? | > 0.70 |
| Answer Relevancy | Does the answer actually address what was asked? | > 0.80 |
| Answer Correctness | Does the answer match the ground truth? (needs labels) | > 0.75 |
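Targets like these are only useful if they gate releases automatically. A minimal sketch — the threshold values mirror the table above, and the function name is an illustrative choice:

```python
THRESHOLDS = {
    'faithfulness': 0.85,
    'context_precision': 0.75,
    'context_recall': 0.70,
    'answer_relevancy': 0.80,
}

def quality_gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Compare averaged RAGAS scores against targets; return pass/fail
    plus the list of failing metrics for the release report."""
    failures = [
        f'{name}: {scores.get(name, 0.0):.2f} < {target:.2f}'
        for name, target in THRESHOLDS.items()
        if scores.get(name, 0.0) < target
    ]
    return (not failures), failures

ok, failures = quality_gate({
    'faithfulness': 0.91, 'context_precision': 0.78,
    'context_recall': 0.64, 'answer_relevancy': 0.83,
})
print(ok, failures)  # context_recall misses its 0.70 target
```

Wiring this into CI means a retrieval regression blocks the release instead of silently degrading answers.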

Text Generation Metrics

# ── BLEU — precision-based, used for machine translation ──────────────────
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'product', 'works', 'well', 'overall']]
candidate = ['the', 'product', 'performs', 'well', 'overall']
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
# BLEU weakness: 'works' ≠ 'performs' by BLEU — no semantic awareness

# ── ROUGE — recall-based, used for summarisation ──────────────────────────
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(
    'Machine learning is a type of artificial intelligence.',
    'ML is a subset of AI that learns from data.'
)
# rouge1: unigram overlap | rouge2: bigram | rougeL: longest common subsequence

# ── BERTScore — semantic similarity using BERT embeddings ─────────────────
from bert_score import score as bert_score

P, R, F1 = bert_score(
    ['The product works well overall'],
    ['The item performs great in general'],
    lang='en'
)
print(f'BERTScore F1: {F1.mean():.4f}')  # Captures semantic similarity BLEU misses

# ── LLM-as-Judge (G-Eval) — best correlation with human judgement ─────────
# Use GPT-4 to score outputs on coherence, relevance, fluency
# Most expensive but most aligned with human evaluation
| Metric | Captures | Does NOT capture | Best for |
|---|---|---|---|
| BLEU | n-gram precision | Semantic equivalence | Machine translation (legacy) |
| ROUGE | n-gram recall | Paraphrase quality | Summarisation benchmarks |
| BERTScore | Semantic similarity | Factual accuracy | General text quality |
| RAGAS | Full RAG pipeline quality | Non-RAG generation | RAG system evaluation |
| LLM-as-Judge | Nuanced quality | Cost-efficiency | Final production evaluation |
Tip (Evaluation hierarchy for production)

(1) During development: RAGAS on a curated test set of 100–200 queries. (2) Before release: LLM-as-judge on a stratified sample. (3) In production: lightweight automatic metrics + user thumbs up/down + faithfulness monitoring on sampled traffic. Never rely on a single metric.

50 Mock Interview Questions

Practice these out loud. Cover the answer, say yours, then compare. Aim for 90–120 seconds per answer. This is the single highest-impact activity you can do with this material.


System Design Scenarios

Expect at least one system design question. Structure your answer: requirements → architecture → components → tradeoffs → monitoring.

Scenario 1 — Enterprise Customer Service Copilot (100K support tickets/day)

Requirements: Answer customer questions in under 3 seconds, cite policy documents, escalate complex cases to humans.

Architecture:

User → API Gateway → Input Guardrails → Intent Classifier
                                │
              ┌─────────────────┴─────────────────┐
        Simple query                        Complex query
              ↓                                   ↓
        RAG Pipeline                       Agent with tools
        (policy docs)                  (ticket lookup, order
              │                         status, escalation)
              └─────────────────┬─────────────────┘
                                ↓
                   Output Guardrails → User

Key decisions:

  • LLM routing: GPT-4o-mini for intent classification (cheap, fast); GPT-4o only for complex agent tasks
  • RAG: Azure AI Search with hybrid search over policy docs, product manuals, FAQ knowledge base
  • Agent tools: ticket lookup (SQL), order status (REST API), human escalation trigger
  • Cost optimisation: semantic caching handles ~30% of repeat queries; mini model handles ~60% of simple lookups
  • Evaluation: RAGAS weekly + CSAT correlation monthly
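The semantic-caching decision above can be sketched with a cosine-similarity lookup over query embeddings. Pure Python, with a hashing trick standing in for a real embedding model — the 0.9 threshold and class name are illustrative choices:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a real embedding model: hash character trigrams
    into a fixed-size unit vector. Replace with an actual embedding API."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, answer)

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit — skip the LLM call entirely
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put('What is your return policy?', 'Returns accepted within 30 days.')
print(cache.get('What is your return policy'))   # near-duplicate → hit
print(cache.get('How do I reset my password?'))  # unrelated → miss
```

Production versions use a vector store for the lookup and expire entries when the underlying knowledge base changes.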

Scenario 2 — Automated Financial Report Analysis (10K reports/month)

Requirements: Extract KPIs, identify risks, generate executive summaries, compare across quarters.

Architecture decisions:

  • Parsing: Layout-aware parsing (Marker or Unstructured.io) to preserve table structure from PDFs
  • Extraction: Function calling to extract standardised JSON schema (revenue, EBITDA, guidance, risks) — structured data before natural language
  • Multi-document reasoning: LlamaIndex SubQuestion Query Engine decomposes “compare Q3 vs Q4” into sub-queries, answers each, synthesises
  • Long context: Claude 3.7 Sonnet (200K context) for entire 100-page reports when section-level retrieval isn’t sufficient
  • Validation: Calculator tool cross-checks extracted numbers against embedded tables — never trust the LLM’s arithmetic
  • Output: Structured data → database + natural language summary generated from the structured data, not the raw document
Caveat (Never trust LLM arithmetic)

LLMs can misread numbers from complex PDF tables. Always extract numerical data with a structured approach (function calling, regex validation), then generate narrative from the validated structured data. Cross-check extracted numbers against at least two locations in the document.
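That cross-checking step can be sketched as a pure-Python validator: pull candidate figures out of the source with a regex, then confirm each LLM-extracted number literally appears in the document. Function names and the sample KPIs are illustrative:

```python
import re

def numbers_in(text: str) -> set[float]:
    """Pull numeric figures out of raw document text,
    tolerating thousands separators ('1,250.5' → 1250.5)."""
    return {float(m.replace(',', ''))
            for m in re.findall(r'\d[\d,]*\.?\d*', text)}

def validate_extraction(extracted: dict[str, float], source_text: str) -> dict[str, bool]:
    """For each KPI the LLM extracted, check the value is present
    in the document — flag anything that isn't for human review."""
    present = numbers_in(source_text)
    return {kpi: value in present for kpi, value in extracted.items()}

report = 'Q3 revenue was 1,250.5 million USD, up from 1,100.0 million in Q2.'
llm_output = {'q3_revenue': 1250.5, 'q2_revenue': 1100.0, 'ebitda': 300.0}
print(validate_extraction(llm_output, report))
# ebitda is flagged — 300.0 never appears in the source
```

A real pipeline would also check units and scale (millions vs billions), which exact string matching alone cannot catch.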

Scenario 3 — Code Review Assistant for Engineering Teams

Requirements: Review PRs for bugs, security issues, and style violations; suggest improvements; explain changes in plain English.

Architecture decisions:

  • Input: GitHub webhook → diff extraction → retrieve surrounding context via code embedding search
  • Parallel agents: security checker + logic reviewer + style linter run in parallel (LangGraph parallel subgraph), synthesis agent aggregates results
  • Code embedding DB: all internal code indexed with code-specific embeddings (code-search-ada-002) — enables “show me similar functions in the codebase”
  • Agent tools: run_tests (trigger CI), search_codebase, lookup_docs, check_cve (security vulnerability database)
  • Models: fine-tuned CodeLlama-34B on internal codebase for code understanding; GPT-4o for explanation generation
  • Hard rule: never auto-merge — the agent only adds PR comments; a human must approve. All suggestions logged for audit.

Final Preparation

Research Checklist — Tredence

  • Founded 2013, focused on last-mile analytics for large enterprises
  • Key verticals: CPG, retail, manufacturing, healthcare, financial services
  • Known for: AI-powered decision intelligence, supply chain analytics, customer analytics platforms
  • Flagship products: SCOUT (supply chain intelligence), PRISM (pricing analytics)
  • Review recent case studies and blog posts at tredence.com before the interview
  • Understand their delivery model: heavily client-embedded, consulting + product hybrid

Questions to Ask the Interviewer

These show preparation and genuine curiosity — both matter:

  1. “What does the typical project lifecycle look like for the GenAI team? From stakeholder request to deployment?”
  2. “What LLM infrastructure do your enterprise clients typically run on — Azure, AWS, or GCP?”
  3. “What are the biggest challenges the team is solving in LLMOps right now?”
  4. “How does the team balance building reusable frameworks vs custom client solutions?”
  5. “What does the first 90 days look like for someone in this role?”

Day-of Mindset

Summary (Interview execution checklist)

If you don’t know something: say “My understanding is X — let me think through this carefully.” Never fake it. Interviewers can always tell, and intellectual honesty is valued far more than a wrong confident answer.

On tools you haven’t used: “I haven’t used X specifically, but I’ve used Y which solves the same problem — I’d apply the same principles.”

For coding questions: think aloud before typing. Say what you’re building and why. The interviewer wants to see your thought process, not just the solution.

Connect every technical answer to business impact. “This improves precision, which reduces manual review cost by X%.”

STAR for behavioural questions: Situation → Task → Action → Result (always quantified).

Show genuine curiosity: ask clarifying questions before solving (what’s the scale? what’s the latency budget? what’s the failure mode?).


Liked this article? Share it with a friend. Have a question, feedback or simply wish to contact me privately? Shoot me a DM and I'll do my best to get back to you.

Have a wonderful day.

– Sarath