Building Real-World AI Systems with RAG, Agents, and LLM Applications
Overview

March 5, 2026 · 15 min read

AI Agents & Tool Calling

An AI agent is an LLM that can autonomously decide to use tools, reason across multiple steps, and complete complex tasks without a human directing every action. This is the cutting edge of GenAI in 2025.

The ReAct Loop — The Core Pattern

Every agent framework — LangChain, LangGraph, AutoGen, CrewAI — implements some variant of the ReAct pattern (Reason + Act). The LLM reasons in a Thought, takes an Action (tool call), observes the result, and loops until it can give a Final Answer.

ReAct Agent Loop

The core pattern powering all modern AI agents:

  • User Input + Tool Schemas — the task arrives with the available tools defined.
  • Thought — the LLM reads everything: user task, system prompt, tool schemas, and all previous observations. It reasons explicitly before deciding — this inner monologue is what allows multi-step planning and self-correction. Example: Thought: I need the current AAPL stock price. I will call search_stock with ticker="AAPL".
  • Final Answer — the loop ends and the response is returned to the user.

WHILE not done:
1. LLM receives: system prompt + conversation history + tool schemas
2. LLM outputs: Thought (reasoning) + Action (tool call) OR Final Answer
3. If action: execute tool → append Observation → go to step 1
4. If final answer: return to user
Important (max_iterations is not optional)

Without a hard cap on the number of loop iterations, agents can run indefinitely — burning API credits and never returning. Always set max_iterations=10 (or lower). Add a cost budget if you’re running expensive models.
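The loop above can be sketched in plain Python. This is an illustrative skeleton, not any framework's API — `call_llm`, `TOOLS`, and the dictionary-based action format are stand-ins for a real model client and tool schemas:

```python
import json

MAX_ITERATIONS = 10  # hard cap — prevents runaway loops and cost explosions

TOOLS = {  # hypothetical tool registry: name → callable
    'search_stock': lambda ticker: {'ticker': ticker, 'price': 187.42},
}

def call_llm(history):
    """Stand-in for a real model call. A real LLM would emit a Thought plus
    either an Action (tool call) or a Final Answer based on the history."""
    if any(msg['role'] == 'observation' for msg in history):
        return {'final_answer': 'AAPL is trading at $187.42.'}
    return {'action': 'search_stock', 'args': {'ticker': 'AAPL'}}

def react_loop(task: str) -> str:
    history = [{'role': 'user', 'content': task}]
    for _ in range(MAX_ITERATIONS):
        step = call_llm(history)                        # Thought + decision
        if 'final_answer' in step:                      # loop ends
            return step['final_answer']
        result = TOOLS[step['action']](**step['args'])  # execute the tool
        history.append({'role': 'observation', 'content': json.dumps(result)})
    raise RuntimeError('max_iterations reached — returning control to caller')

print(react_loop('What is the current AAPL stock price?'))
```

Note that the cap is enforced by the loop structure itself, not by trusting the model to stop.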

The Four Agent Components

| Component | Role | Common implementations |
|---|---|---|
| Brain / LLM | Reasoning engine — decides what to do and when | GPT-4o, Claude 3.7, LLaMA 3.1 |
| Memory | Short-term (context window) + long-term (vector DB) | ConversationBufferWindowMemory, FAISS retrieval |
| Tools | External capabilities the LLM can invoke | Web search, code exec, SQL, REST APIs, file I/O |
| Orchestrator | Manages the action loop, executes tool calls, maintains state | LangGraph, AgentExecutor, AutoGen, CrewAI |

Multi-Agent Systems

Single-agent systems hit a ceiling — one LLM doing everything runs out of context and makes more errors on long tasks. Multi-agent systems assign specialised roles to different models.

| Framework | Approach | Best for |
|---|---|---|
| AutoGen (Microsoft) | Conversational agents that collaborate: AssistantAgent + UserProxyAgent | Complex research tasks, code review loops |
| CrewAI | Role-based agents with defined goals, tools, and a structured workflow (sequential or parallel) | Business process automation, report generation |
| LangGraph | Graph-based orchestration — nodes (agents/functions) + edges (transitions) + cycles | Complex conditional workflows that need looping and branching |

Key design patterns:

  • Router: central agent routes incoming tasks to specialist agents
  • Supervisor: manager-worker hierarchy — supervisor assigns tasks, checks results
  • Parallel: independent sub-graphs run simultaneously, results merged
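The router pattern can be sketched in plain Python. A keyword check stands in for the LLM-based classifier, and the agent names are illustrative, not a framework API:

```python
# Hypothetical specialist agents — in practice each would wrap its own LLM call.
def sql_agent(task: str) -> str:
    return f'[sql_agent] generated a query for: {task}'

def research_agent(task: str) -> str:
    return f'[research_agent] compiled findings for: {task}'

SPECIALISTS = {'sql': sql_agent, 'research': research_agent}

def route(task: str) -> str:
    """Central router: classify the task, then dispatch to a specialist.
    A real system would use an LLM (or a small classifier) for this step."""
    label = 'sql' if 'database' in task.lower() else 'research'
    return SPECIALISTS[label](task)

print(route('Query the orders database for Q3 revenue'))
print(route('Summarise recent papers on RAG evaluation'))
```

The supervisor pattern is the same shape plus a checking step: the router also inspects each specialist's output before accepting it.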
# LangGraph — stateful agent with review cycle
# Assumes llm and reviewer_llm are already-initialised chat models
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    next: str

def analyst_node(state: AgentState):
    response = llm.invoke(state['messages'])
    return {'messages': [response]}

def reviewer_node(state: AgentState):
    response = reviewer_llm.invoke(state['messages'])
    # Routing decision based on response content
    if 'APPROVED' in response.content:
        return {'next': 'END', 'messages': [response]}
    return {'next': 'analyst', 'messages': [response]}

workflow = StateGraph(AgentState)
workflow.add_node('analyst', analyst_node)
workflow.add_node('reviewer', reviewer_node)
workflow.set_entry_point('analyst')
workflow.add_edge('analyst', 'reviewer')
workflow.add_conditional_edges(
    'reviewer',
    lambda s: s['next'],
    {'analyst': 'analyst', 'END': END}
)
app = workflow.compile()
Warning (Multi-agent failure modes)

Hallucinated tool calls, infinite loops, and cost explosion are the three main risks. Always: set max_iterations, add error handling to every tool, log every agent step for debugging, and never give agents write access to production systems without a human confirmation step.
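Those mitigations can be baked into a small tool wrapper. A sketch in plain Python — the logger name and error-string format here are illustrative choices, not a framework convention:

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('agent.tools')

def safe_tool(fn):
    """Wrap a tool so every call is logged and every exception is returned
    as an error string the LLM can see — instead of crashing the agent loop."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info('tool=%s args=%r kwargs=%r', fn.__name__, args, kwargs)
        try:
            result = fn(*args, **kwargs)
            log.info('tool=%s ok', fn.__name__)
            return result
        except Exception as exc:  # surface the failure as an observation
            log.warning('tool=%s failed: %s', fn.__name__, exc)
            return f'ERROR: {fn.__name__} failed with {exc!r} — try another approach.'
    return wrapper

@safe_tool
def lookup_order(order_id: str) -> str:
    if not order_id.isdigit():
        raise ValueError(f'invalid order id {order_id!r}')
    return f'order {order_id}: shipped'

print(lookup_order('1234'))   # normal call
print(lookup_order('abc'))    # failure becomes a recoverable observation
```

Returning the error as text lets the agent self-correct on the next Thought instead of terminating mid-task.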

Responsible AI & Explainability

Enterprise clients care deeply about responsible AI. Showing you can build fair, explainable, and safe systems is a differentiator.

Bias in ML Systems

Bias is not just “the model is unfair” — it has specific technical causes:

| Type | Source | Example |
|---|---|---|
| Historical bias | Training data reflects past discrimination | Loan approval model trained on data from when banks discriminated by race |
| Representation bias | Certain groups underrepresented in training data | Face recognition trained mostly on lighter-skinned faces |
| Measurement bias | Features measured differently across groups | ZIP code as a proxy for race in credit scoring |
| Aggregation bias | Single model applied to groups that behave differently | One churn model for enterprise and consumer customers |
| Evaluation bias | Test set doesn't reflect real deployment distribution | Testing a medical model on hospital data but deploying in clinics |

Fairness Metrics — and Why They Conflict

Demographic Parity — equal positive prediction rates across groups:

P(\hat{Y}=1 \mid A=0) = P(\hat{Y}=1 \mid A=1)

Equalized Odds — equal TPR and FPR across groups (stronger than demographic parity):

\text{TPR}_{A=0} = \text{TPR}_{A=1} \quad \text{and} \quad \text{FPR}_{A=0} = \text{FPR}_{A=1}

Calibration — probability scores should be equally accurate across groups:

P(Y=1 \mid \hat{p}=s,\ A=0) = P(Y=1 \mid \hat{p}=s,\ A=1) \quad \forall s

Important (The impossibility theorem — know this cold)

Chouldechova (2017) proved that when base rates differ between groups, demographic parity, equalized odds, and calibration cannot all hold simultaneously. You must choose which fairness criterion to prioritise based on the business context and the cost of different error types. There is no mathematically perfect answer.
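A quick way to internalise the tension is to compute the metrics on toy data. In the made-up example below the groups have different base rates; demographic parity holds (both groups get P(ŷ=1) = 0.5) while equalized odds fails (TPR and FPR differ):

```python
def rates(y_true, y_pred, group, g):
    """Positive-prediction rate, TPR, and FPR for one group g."""
    idx = [i for i, a in enumerate(group) if a == g]
    pos_rate = sum(y_pred[i] for i in idx) / len(idx)
    tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
    fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
    fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
    tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return pos_rate, tpr, fpr

# Toy data: group 0 has a higher base rate of Y=1 than group 1.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

for g in (0, 1):
    pos, tpr, fpr = rates(y_true, y_pred, group, g)
    print(f'group {g}: P(pred=1)={pos:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}')
```

Equalising the TPRs here would force the positive-prediction rates apart — exactly the trade-off the theorem formalises.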

Explainability — SHAP & LIME

SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. The Shapley value for feature i is the average marginal contribution of that feature across all possible feature orderings:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \Big[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \Big]

In plain English: φ_i measures how much feature i pushes the prediction up or down from the baseline, averaged fairly across every way the features could be ordered.
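That averaging can be made concrete with a brute-force computation over orderings — an illustrative toy, not how the shap library works internally (it uses far faster algorithms). For a purely additive model, each feature's Shapley value comes out as exactly its own contribution:

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values by averaging marginal contributions
    over every ordering of the features."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        included = []
        for f in order:
            before = value_fn(frozenset(included))
            included.append(f)
            after = value_fn(frozenset(included))
            phi[f] += after - before  # marginal contribution of f
    return {f: total / len(orderings) for f, total in phi.items()}

# Hypothetical additive model: prediction = 2*income + 1*age above baseline.
contrib = {'income': 2.0, 'age': 1.0}
model = lambda S: sum(contrib[f] for f in S)
print(shapley_values(['income', 'age'], model))
```

With interactions between features the orderings would disagree, and the averaging is what distributes the interaction credit fairly.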

import shap

# ── TreeExplainer: fast, exact SHAP for tree-based models ──────────────────
explainer = shap.TreeExplainer(xgboost_model)
shap_values = explainer.shap_values(X_test)

# Single prediction — waterfall plot shows each feature's contribution
shap.waterfall_plot(explainer(X_test)[0])

# Global feature importance — beeswarm (shows direction of effect)
shap.summary_plot(shap_values, X_test)

# Global feature importance — bar chart (mean absolute SHAP values)
shap.summary_plot(shap_values, X_test, plot_type='bar')

# Feature interaction — how two features jointly affect prediction
shap.dependence_plot('income', shap_values, X_test, interaction_index='age')

# ── LIME: local linear approximations (model-agnostic) ──────────────────────
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=['negative', 'positive'])
explanation = explainer.explain_instance(
    text,
    classifier.predict_proba,
    num_features=10
)
explanation.show_in_notebook()
Tip (SHAP vs LIME — when to use which)

SHAP: consistent, theoretically grounded, global + local explanations. Preferred for tabular models. Slow for non-tree models (use KernelSHAP or DeepSHAP).
LIME: faster for any model, good for text and image. Local only. Less consistent — small perturbations can give different results. Use LIME as a quick sanity check; use SHAP for reporting to stakeholders.

LLM Guardrails — Defence in Depth

Production LLM systems need layers of defence against harmful outputs, prompt injection, and hallucinations:

import re

# ── Layer 1: Input validation (fast, cheap) ────────────────────────────────
def validate_input(user_input: str) -> tuple[bool, str]:
    injection_patterns = [
        r'ignore previous',
        r'forget instructions',
        r'act as',
        r'jailbreak',
    ]
    for pattern in injection_patterns:
        if re.search(pattern, user_input, re.IGNORECASE):
            return False, 'Potential prompt injection detected'
    if len(user_input) > 2000:
        return False, 'Input exceeds maximum length'
    return True, 'OK'

# ── Layer 2: LLM-based classifier (Llama Guard) ────────────────────────────
from transformers import pipeline

guard = pipeline('text-classification', model='meta-llama/LlamaGuard-7b')
result = guard(f'[INST] {user_message} [/INST] {model_response}')
# Returns 'safe' or 'unsafe' with category

# ── Layer 3: Output validation ─────────────────────────────────────────────
def validate_output(response: str, retrieved_context: str) -> float:
    """Return faithfulness score [0, 1] — does response stay within context?"""
    # Use an NLI model to check if response is entailed by context
    from transformers import pipeline as hf_pipeline
    nli = hf_pipeline('text-classification',
                      model='cross-encoder/nli-deberta-v3-base',
                      top_k=None)  # return scores for every label, not just the top one
    scores = nli(f'{retrieved_context} [SEP] {response}')[0]
    # The 'entailment' score is your faithfulness proxy
    return next(s['score'] for s in scores if s['label'].lower() == 'entailment')
Note (NeMo Guardrails (NVIDIA))

NVIDIA’s open-source framework lets you define rails in Colang — a domain-specific language for specifying what topics a bot should and shouldn’t engage with. Good for production systems where compliance teams need to review and approve the rules independently of the model.

Cloud GenAI Platforms

Tredence clients span all three major clouds. You must be able to speak intelligently about each and recommend the right services for a given architecture.

Platform Comparison

| Capability | Azure | AWS | GCP |
|---|---|---|---|
| Flagship models | GPT-4o, GPT-4, Whisper, DALL-E | Claude, Llama 3, Mistral, Nova Pro | Gemini 1.5 Pro/Flash, Llama 3, Claude |
| Managed RAG | Azure AI Search + Prompt Flow | Knowledge Bases for Bedrock (S3 → OpenSearch) | Vertex AI Search + Data Stores |
| Managed agents | Azure AI Studio / Prompt Flow | Agents for Bedrock (Lambda as tools) | Vertex AI Agent Builder |
| Content safety | Azure Content Safety (multimodal) | Guardrails for Bedrock (built-in) | Vertex AI content filters |
| Fine-tuning | Azure OpenAI Fine-tuning (GPT-3.5/4) | Bedrock fine-tuning (Titan, Llama) | Vertex AI supervised fine-tuning |
| Key advantage | Data stays in tenant, enterprise compliance | Unified API across many model families | Deep BigQuery + Workspace integration |

Azure OpenAI

from openai import AzureOpenAI
import os

client = AzureOpenAI(
    api_key=os.getenv('AZURE_OPENAI_KEY'),
    api_version='2024-02-01',
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT')
)
response = client.chat.completions.create(
    model='gpt-4o',  # This is the deployment name in your Azure resource
    messages=[{'role': 'user', 'content': 'Explain RAG in one paragraph.'}],
    max_tokens=500,
    temperature=0
)
print(response.choices[0].message.content)
Tip (Azure enterprise considerations)

Use Managed Identity (MSI → RBAC → Azure OpenAI) instead of API keys for production — no secrets to rotate or leak. Add a Private Endpoint for strict data residency compliance. Azure Content Safety provides multimodal moderation (hate, violence, sexual, self-harm) that can be chained before and after your LLM call.

AWS Bedrock

import boto3, json

client = boto3.client('bedrock-runtime', region_name='us-east-1')
response = client.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 500,
        'messages': [{'role': 'user', 'content': 'Explain RAG in one paragraph.'}]
    })
)
result = json.loads(response['body'].read())
print(result['content'][0]['text'])
Tip (Bedrock key services to know)

Knowledge Bases for Bedrock: S3 → managed embeddings → OpenSearch Serverless or Aurora pgvector. Zero embedding infrastructure to manage. Agents for Bedrock: define action groups as Lambda functions — Bedrock handles the ReAct loop. Guardrails for Bedrock: content filtering, PII redaction, and topic denial built in — review requirements without writing custom validators. Nova Pro: Amazon’s own flagship multimodal model (300K context), deeply integrated with AWS services.

GCP Vertex AI

GCP’s unique advantage is the 1M-token context of Gemini 1.5 Pro and the deep integration with BigQuery.

-- BigQuery ML: run LLM inference directly in SQL
-- ML.GENERATE_TEXT is a table-valued function: pass the model, a query
-- that produces a `prompt` column, and the generation parameters
SELECT
  customer_id,
  review_text,
  ml_generate_text_result AS sentiment_analysis
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (
    SELECT
      customer_id,
      review_text,
      CONCAT('Classify the sentiment of this review: ', review_text) AS prompt
    FROM `project.dataset.customer_reviews`
    WHERE DATE(created_at) = CURRENT_DATE()
  ),
  STRUCT(0 AS temperature, 100 AS max_output_tokens)
)

Key services: Model Garden (unified API for Gemini, Llama, Mistral, Claude), Vertex AI Search (enterprise RAG with grounding), Vertex AI Pipelines (Kubeflow-based MLOps), Feature Store, Model Registry, Monitoring.

LLM Evaluation — RAGAS & Metrics

Knowing how to measure quality systematically separates engineers from scientists. Poor evaluation is the most common reason LLM projects fail silently in production.

RAGAS — RAG-Specific Evaluation

RAGAS evaluates the full RAG pipeline using LLMs as judges — no labelled dataset required for most metrics:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # Is the answer grounded in retrieved context? (no hallucinations)
    answer_relevancy,    # Does the answer address the question?
    context_precision,   # Are the retrieved chunks relevant to the question?
    context_recall,      # Does the retrieved context cover the ground-truth answer?
    answer_correctness,  # Does the answer match the ground truth?
)
from datasets import Dataset

data = {
    'question': ['What is the return policy?', 'How do I cancel my subscription?'],
    'answer': ['Returns accepted within 30 days', 'Email support to cancel anytime'],
    'contexts': [['Our policy allows returns in 30 days...'], ['Cancellations via support email...']],
    'ground_truth': ['Products can be returned within 30 days', 'Cancel by emailing support']
}
dataset = Dataset.from_dict(data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results.to_pandas())
| Metric | Measures | Production target |
|---|---|---|
| Faithfulness | Is the answer supported by retrieved context? (hallucination rate) | > 0.85 |
| Context Precision | Are the retrieved chunks relevant to the question? | > 0.75 |
| Context Recall | Does the retrieved context cover the ground-truth answer? | > 0.70 |
| Answer Relevancy | Does the answer actually address what was asked? | > 0.80 |
| Answer Correctness | Does the answer match the ground truth? (needs labels) | > 0.75 |
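Targets like these are only useful if they gate releases automatically. A minimal sketch — the threshold values mirror the table above, and the function name is an illustrative choice:

```python
THRESHOLDS = {
    'faithfulness': 0.85,
    'context_precision': 0.75,
    'context_recall': 0.70,
    'answer_relevancy': 0.80,
}

def quality_gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Compare averaged RAGAS scores against targets; return pass/fail
    plus the list of failing metrics for the release report."""
    failures = [
        f'{name}: {scores.get(name, 0.0):.2f} < {target:.2f}'
        for name, target in THRESHOLDS.items()
        if scores.get(name, 0.0) < target
    ]
    return (not failures), failures

ok, failures = quality_gate({
    'faithfulness': 0.91, 'context_precision': 0.78,
    'context_recall': 0.64, 'answer_relevancy': 0.83,
})
print(ok, failures)  # context_recall misses its 0.70 target
```

Wiring this into CI means a retrieval regression blocks the release instead of silently degrading answers.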

Text Generation Metrics

# ── BLEU — precision-based, used for machine translation ──────────────────
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'product', 'works', 'well', 'overall']]
candidate = ['the', 'product', 'performs', 'well', 'overall']
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
# BLEU weakness: 'works' ≠ 'performs' by BLEU — no semantic awareness

# ── ROUGE — recall-based, used for summarisation ──────────────────────────
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(
    'Machine learning is a type of artificial intelligence.',
    'ML is a subset of AI that learns from data.'
)
# rouge1: unigram overlap | rouge2: bigram | rougeL: longest common subsequence

# ── BERTScore — semantic similarity using BERT embeddings ─────────────────
from bert_score import score as bert_score

P, R, F1 = bert_score(
    ['The product works well overall'],
    ['The item performs great in general'],
    lang='en'
)
print(f'BERTScore F1: {F1.mean():.4f}')  # Captures semantic similarity BLEU misses

# ── LLM-as-Judge (G-Eval) — best correlation with human judgement ─────────
# Use GPT-4 to score outputs on coherence, relevance, fluency
# Most expensive but most aligned with human evaluation
| Metric | Captures | Does NOT capture | Best for |
|---|---|---|---|
| BLEU | n-gram precision | Semantic equivalence | Machine translation (legacy) |
| ROUGE | n-gram recall | Paraphrase quality | Summarisation benchmarks |
| BERTScore | Semantic similarity | Factual accuracy | General text quality |
| RAGAS | Full RAG pipeline quality | Non-RAG generation | RAG system evaluation |
| LLM-as-Judge | Nuanced quality | Cost-efficiency | Final production evaluation |
Tip (Evaluation hierarchy for production)

(1) During development: RAGAS on a curated test set of 100–200 queries. (2) Before release: LLM-as-judge on a stratified sample. (3) In production: lightweight automatic metrics + user thumbs up/down + faithfulness monitoring on sampled traffic. Never rely on a single metric.

50 Mock Interview Questions

Practice these out loud. Cover the answer, say yours, then compare. Aim for 90–120 seconds per answer. This is the single highest-impact activity you can do with this material.


System Design Scenarios

Expect at least one system design question. Structure your answer: requirements → architecture → components → tradeoffs → monitoring.

Scenario 1 — Enterprise Customer Service Copilot (100K support tickets/day)

Requirements: Answer customer questions in under 3 seconds, cite policy documents, escalate complex cases to humans.

Architecture:

User → API Gateway → Input Guardrails → Intent Classifier
                                │
              ┌─────────────────┴─────────────────┐
        Simple query                        Complex query
              ↓                                   ↓
        RAG Pipeline                       Agent with tools
        (policy docs)                  (ticket lookup, order
              │                         status, escalation)
              └─────────────────┬─────────────────┘
                                ↓
                   Output Guardrails → User

Key decisions:

  • LLM routing: GPT-4o-mini for intent classification (cheap, fast); GPT-4o only for complex agent tasks
  • RAG: Azure AI Search with hybrid search over policy docs, product manuals, FAQ knowledge base
  • Agent tools: ticket lookup (SQL), order status (REST API), human escalation trigger
  • Cost optimisation: semantic caching handles ~30% of repeat queries; mini model handles ~60% of simple lookups
  • Evaluation: RAGAS weekly + CSAT correlation monthly
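The semantic-caching decision above can be sketched with a cosine-similarity lookup over query embeddings. Pure Python, with a hashing trick standing in for a real embedding model — the 0.9 threshold and class name are illustrative choices:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a real embedding model: hash character trigrams
    into a fixed-size unit vector. Replace with an actual embedding API."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, answer)

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit — skip the LLM call entirely
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put('What is your return policy?', 'Returns accepted within 30 days.')
print(cache.get('What is your return policy'))   # near-duplicate → hit
print(cache.get('How do I reset my password?'))  # unrelated → miss
```

Production versions use a vector store for the lookup and expire entries when the underlying knowledge base changes.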

Scenario 2 — Automated Financial Report Analysis (10K reports/month)

Requirements: Extract KPIs, identify risks, generate executive summaries, compare across quarters.

Architecture decisions:

  • Parsing: Layout-aware parsing (Marker or Unstructured.io) to preserve table structure from PDFs
  • Extraction: Function calling to extract standardised JSON schema (revenue, EBITDA, guidance, risks) — structured data before natural language
  • Multi-document reasoning: LlamaIndex SubQuestion Query Engine decomposes “compare Q3 vs Q4” into sub-queries, answers each, synthesises
  • Long context: Claude 3.7 Sonnet (200K context) for entire 100-page reports when section-level retrieval isn’t sufficient
  • Validation: Calculator tool cross-checks extracted numbers against embedded tables — never trust the LLM’s arithmetic
  • Output: Structured data → database + natural language summary generated from the structured data, not the raw document
Caveat (Never trust LLM arithmetic)

LLMs can misread numbers from complex PDF tables. Always extract numerical data with a structured approach (function calling, regex validation), then generate narrative from the validated structured data. Cross-check extracted numbers against at least two locations in the document.
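That cross-checking step can be sketched as a pure-Python validator: pull candidate figures out of the source with a regex, then confirm each LLM-extracted number literally appears in the document. Function names and the sample KPIs are illustrative:

```python
import re

def numbers_in(text: str) -> set[float]:
    """Pull numeric figures out of raw document text,
    tolerating thousands separators ('1,250.5' → 1250.5)."""
    return {float(m.replace(',', ''))
            for m in re.findall(r'\d[\d,]*\.?\d*', text)}

def validate_extraction(extracted: dict[str, float], source_text: str) -> dict[str, bool]:
    """For each KPI the LLM extracted, check the value is present
    in the document — flag anything that isn't for human review."""
    present = numbers_in(source_text)
    return {kpi: value in present for kpi, value in extracted.items()}

report = 'Q3 revenue was 1,250.5 million USD, up from 1,100.0 million in Q2.'
llm_output = {'q3_revenue': 1250.5, 'q2_revenue': 1100.0, 'ebitda': 300.0}
print(validate_extraction(llm_output, report))
# ebitda is flagged — 300.0 never appears in the source
```

A real pipeline would also check units and scale (millions vs billions), which exact string matching alone cannot catch.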

Scenario 3 — Code Review Assistant for Engineering Teams

Requirements: Review PRs for bugs, security issues, and style violations; suggest improvements; explain changes in plain English.

Architecture decisions:

  • Input: GitHub webhook → diff extraction → retrieve surrounding context via code embedding search
  • Parallel agents: security checker + logic reviewer + style linter run in parallel (LangGraph parallel subgraph), synthesis agent aggregates results
  • Code embedding DB: all internal code indexed with code-specific embeddings (code-search-ada-002) — enables “show me similar functions in the codebase”
  • Agent tools: run_tests (trigger CI), search_codebase, lookup_docs, check_cve (security vulnerability database)
  • Models: fine-tuned CodeLlama-34B on internal codebase for code understanding; GPT-4o for explanation generation
  • Hard rule: never auto-merge — the agent only adds PR comments; a human must approve. All suggestions logged for audit.

Final Preparation

Research Checklist — Tredence

  • Founded 2013, focused on last-mile analytics for large enterprises
  • Key verticals: CPG, retail, manufacturing, healthcare, financial services
  • Known for: AI-powered decision intelligence, supply chain analytics, customer analytics platforms
  • Flagship products: SCOUT (supply chain intelligence), PRISM (pricing analytics)
  • Review recent case studies and blog posts at tredence.com before the interview
  • Understand their delivery model: heavily client-embedded, consulting + product hybrid

Questions to Ask the Interviewer

These show preparation and genuine curiosity — both matter:

  1. “What does the typical project lifecycle look like for the GenAI team? From stakeholder request to deployment?”
  2. “What LLM infrastructure do your enterprise clients typically run on — Azure, AWS, or GCP?”
  3. “What are the biggest challenges the team is solving in LLMOps right now?”
  4. “How does the team balance building reusable frameworks vs custom client solutions?”
  5. “What does the first 90 days look like for someone in this role?”

Day-of Mindset

Summary (Interview execution checklist)

If you don’t know something: say “My understanding is X — let me think through this carefully.” Never fake it. Interviewers can always tell, and intellectual honesty is valued far more than a wrong confident answer.

On tools you haven’t used: “I haven’t used X specifically, but I’ve used Y which solves the same problem — I’d apply the same principles.”

For coding questions: think aloud before typing. Say what you’re building and why. The interviewer wants to see your thought process, not just the solution.

Connect every technical answer to business impact. “This improves precision, which reduces manual review cost by X%.”

STAR for behavioural questions: Situation → Task → Action → Result (always quantified).

Show genuine curiosity: ask clarifying questions before solving (what’s the scale? what’s the latency budget? what’s the failure mode?).


Liked this article? Share it with a friend. Have a question, feedback or simply wish to contact me privately? Shoot me a DM and I'll do my best to get back to you.

Have a wonderful day.

– Sarath