AI Agents & Tool Calling
An AI agent is an LLM that can autonomously decide to use tools, reason across multiple steps, and complete complex tasks without a human directing every action. This is the cutting edge of GenAI in 2025.
The ReAct Loop — The Core Pattern
Every agent framework — LangChain, LangGraph, AutoGen, CrewAI — implements some variant of the ReAct pattern (Reason + Act). The LLM reasons in a Thought, takes an Action (tool call), observes the result, and loops until it can give a Final Answer.
ReAct Agent Loop
The core pattern powering all modern AI agents.
- User Input + Tool Schemas — the task arrives with the available tools defined.
- Thought — the LLM reads everything: user task, system prompt, tool schemas, and all previous observations. It reasons explicitly before deciding — this inner monologue is what allows multi-step planning and self-correction. Example: `Thought: I need the current AAPL stock price. I will call search_stock with ticker="AAPL".`
- Final Answer — the loop ends and the response is returned to the user.
```
WHILE not done:
  1. LLM receives: system prompt + conversation history + tool schemas
  2. LLM outputs: Thought (reasoning) + Action (tool call) OR Final Answer
  3. If action: execute tool → append Observation → go to step 1
  4. If final answer: return to user
```

Important (max_iterations is not optional)
Without a hard cap on the number of loop iterations, agents can run indefinitely — burning API credits and never returning. Always set max_iterations=10 (or lower). Add a cost budget if you’re running expensive models.
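The loop and the iteration cap fit in a few lines of plain Python. This is an illustrative sketch — `fake_llm` and the single-entry tool registry are stand-in stubs, not a real LLM client:

```python
# Minimal ReAct-style loop with a hard iteration cap (illustrative sketch).
# `fake_llm` and the tool registry below are stand-ins for a real LLM client.

def fake_llm(history):
    # Pretend the model asks for a tool once, then gives a final answer.
    if not any(m.startswith('Observation:') for m in history):
        return {'action': 'search_stock', 'input': 'AAPL'}
    return {'final_answer': 'AAPL is trading at $230.'}

tools = {'search_stock': lambda ticker: f'{ticker}: $230'}

def run_agent(task, max_iterations=10):
    history = [f'Task: {task}']
    for _ in range(max_iterations):          # hard cap: never loop forever
        step = fake_llm(history)
        if 'final_answer' in step:
            return step['final_answer']
        result = tools[step['action']](step['input'])
        history.append(f'Observation: {result}')
    return 'Stopped: iteration budget exhausted'  # fail-safe, not an answer

print(run_agent('What is the AAPL stock price?'))
```

Note the fail-safe return at the end: when the budget runs out, the caller gets an explicit "stopped" signal rather than a hung request or an endless bill.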
The Four Agent Components
| Component | Role | Common implementations |
|---|---|---|
| Brain / LLM | Reasoning engine — decides what to do and when | GPT-4o, Claude 3.7, LLaMA 3.1 |
| Memory | Short-term (context window) + long-term (vector DB) | ConversationBufferWindowMemory, FAISS retrieval |
| Tools | External capabilities the LLM can invoke | Web search, code exec, SQL, REST APIs, file I/O |
| Orchestrator | Manages the action loop, executes tool calls, maintains state | LangGraph, AgentExecutor, AutoGen, CrewAI |
Multi-Agent Systems
Single-agent systems hit a ceiling — one LLM doing everything runs out of context and makes more errors on long tasks. Multi-agent systems assign specialised roles to different models.
| Framework | Model | Best for |
|---|---|---|
| AutoGen (Microsoft) | Conversational agents that collaborate: AssistantAgent + UserProxyAgent | Complex research tasks, code review loops |
| CrewAI | Role-based agents with defined goals, tools, and a structured workflow (sequential or parallel) | Business process automation, report generation |
| LangGraph | Graph-based orchestration — nodes (agents/functions) + edges (transitions) + cycles | Complex conditional workflows that need looping and branching |
Key design patterns:
- Router: central agent routes incoming tasks to specialist agents
- Supervisor: manager-worker hierarchy — supervisor assigns tasks, checks results
- Parallel: independent sub-graphs run simultaneously, results merged
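The router pattern in particular reduces to a classifier plus a dispatch table. A minimal sketch — the keyword classifier and the specialist names here are illustrative stubs; in production the classifier would be a cheap LLM call or a trained model:

```python
# Router pattern sketch: a classifier routes each task to a specialist agent.
# The classifier logic and the specialist stubs are illustrative only.

def classify(task: str) -> str:
    # Stand-in for a cheap LLM call or a trained intent classifier.
    if 'sql' in task.lower() or 'database' in task.lower():
        return 'data_agent'
    if 'deploy' in task.lower() or 'pipeline' in task.lower():
        return 'devops_agent'
    return 'general_agent'

SPECIALISTS = {
    'data_agent':    lambda t: f'[data] handling: {t}',
    'devops_agent':  lambda t: f'[devops] handling: {t}',
    'general_agent': lambda t: f'[general] handling: {t}',
}

def route(task: str) -> str:
    # Central dispatch: pick the specialist, hand the task over.
    return SPECIALISTS[classify(task)](task)

print(route('Write a SQL query for churn'))   # routed to the data specialist
print(route('Fix the deploy pipeline'))       # routed to the devops specialist
```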
```python
# LangGraph — stateful agent with review cycle
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    next: str

def analyst_node(state: AgentState):
    response = llm.invoke(state['messages'])
    return {'messages': [response]}

def reviewer_node(state: AgentState):
    response = reviewer_llm.invoke(state['messages'])
    # Routing decision based on response content
    if 'APPROVED' in response.content:
        return {'next': 'END', 'messages': [response]}
    return {'next': 'analyst', 'messages': [response]}

workflow = StateGraph(AgentState)
workflow.add_node('analyst', analyst_node)
workflow.add_node('reviewer', reviewer_node)
workflow.set_entry_point('analyst')
workflow.add_edge('analyst', 'reviewer')
workflow.add_conditional_edges(
    'reviewer', lambda s: s['next'],
    {'analyst': 'analyst', 'END': END}
)
app = workflow.compile()
```

Warning (Multi-agent failure modes)
Hallucinated tool calls, infinite loops, and cost explosion are the three main risks. Always: set max_iterations, add error handling to every tool, log every agent step for debugging, and never give agents write access to production systems without a human confirmation step.
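One way to apply the "error handling on every tool" rule is a decorator that turns failures into observations the agent can react to, while logging each call. This is a sketch — `lookup_order` and its return shape are hypothetical examples, not part of any framework:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('agent')

def safe_tool(fn):
    """Wrap a tool so failures become observations, not crashes."""
    def wrapper(*args, **kwargs):
        try:
            result = fn(*args, **kwargs)
            logger.info('tool=%s args=%s ok', fn.__name__, args)  # log every step
            return {'ok': True, 'result': result}
        except Exception as exc:
            logger.warning('tool=%s args=%s failed: %s', fn.__name__, args, exc)
            # Feed the error back to the LLM as an observation it can correct for
            return {'ok': False, 'error': f'{type(exc).__name__}: {exc}'}
    return wrapper

@safe_tool
def lookup_order(order_id: str) -> dict:
    # Hypothetical tool: validate input, then pretend to query a system.
    if not order_id.startswith('ORD-'):
        raise ValueError(f'unknown order id format: {order_id}')
    return {'order_id': order_id, 'status': 'shipped'}

print(lookup_order('ORD-123'))   # success envelope
print(lookup_order('bogus'))     # failure envelope, loop keeps running
```

Returning the error as data, rather than raising, is what lets the agent retry with corrected arguments instead of terminating mid-task.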
Responsible AI & Explainability
Enterprise clients care deeply about responsible AI. Showing you can build fair, explainable, and safe systems is a differentiator.
Bias in ML Systems
Bias is not just “the model is unfair” — it has specific technical causes:
| Type | Source | Example |
|---|---|---|
| Historical bias | Training data reflects past discrimination | Loan approval model trained on data from when banks discriminated by race |
| Representation bias | Certain groups underrepresented in training data | Face recognition trained mostly on lighter-skinned faces |
| Measurement bias | Features measured differently across groups | ZIP code as a proxy for race in credit scoring |
| Aggregation bias | Single model applied to groups that behave differently | One churn model for enterprise and consumer customers |
| Evaluation bias | Test set doesn’t reflect real deployment distribution | Testing a medical model on hospital data but deploying in clinics |
Fairness Metrics — and Why They Conflict
Demographic Parity — equal positive prediction rates across groups: P(ŷ = 1 | A = 0) = P(ŷ = 1 | A = 1)

Equalized Odds — equal TPR and FPR across groups (stronger than demographic parity): P(ŷ = 1 | Y = y, A = 0) = P(ŷ = 1 | Y = y, A = 1) for y ∈ {0, 1}

Calibration — probability scores should be equally accurate across groups: P(Y = 1 | ŝ = p, A = 0) = P(Y = 1 | ŝ = p, A = 1) = p
Important (The impossibility theorem — know this cold)
Chouldechova (2017) proved that when base rates differ between groups, demographic parity, equalized odds, and calibration cannot all hold simultaneously. You must choose which fairness criterion to prioritise based on the business context and the cost of different error types. There is no mathematically perfect answer.
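Both criteria are straightforward to compute from predictions. A minimal NumPy sketch on toy data (the arrays are made-up, not a real benchmark), which also happens to show the conflict: demographic parity holds here while equalized odds does not:

```python
import numpy as np

# Toy data: true labels, model predictions, and a binary protected attribute
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def demographic_parity(y_pred, group):
    # P(y_hat = 1 | A = a) for each group; equal rates means parity holds
    return {a: y_pred[group == a].mean() for a in np.unique(group)}

def equalized_odds(y_true, y_pred, group):
    # TPR and FPR per group; both must match for equalized odds to hold
    out = {}
    for a in np.unique(group):
        yt, yp = y_true[group == a], y_pred[group == a]
        out[a] = {'TPR': yp[yt == 1].mean(), 'FPR': yp[yt == 0].mean()}
    return out

print(demographic_parity(y_pred, group))   # equal positive rates (0.5 vs 0.5)
print(equalized_odds(y_true, y_pred, group))  # TPR/FPR differ between groups
```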
Explainability — SHAP & LIME
SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. The Shapley value for feature i is its average marginal contribution across all possible feature orderings:

φᵢ = Σ_{S ⊆ N \ {i}} [ |S|! (|N| − |S| − 1)! / |N|! ] · ( f(S ∪ {i}) − f(S) )

In plain English: φᵢ measures how much feature i pushes the prediction up or down from the baseline, averaged fairly across every way the features could be ordered.
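For intuition, the definition can be brute-forced on a tiny model by enumerating every feature ordering and averaging the marginal contributions. This is exponential in the number of features, which is exactly why SHAP's optimised explainers exist; the additive `model` below is an illustrative stub:

```python
from itertools import permutations

def model(features: dict) -> float:
    # Toy model: prediction = income + 2 * age; absent features default to 0
    return features.get('income', 0) + 2 * features.get('age', 0)

def shapley_values(model, instance: dict) -> dict:
    names = list(instance)
    phi = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        present = {}
        for name in order:
            before = model(present)
            present[name] = instance[name]
            phi[name] += model(present) - before  # marginal contribution
    return {n: v / len(orderings) for n, v in phi.items()}

sv = shapley_values(model, {'income': 3.0, 'age': 5.0})
print(sv)  # for an additive model, each feature gets exactly its own term
# Additivity property: the values sum to f(x) minus the empty baseline f({})
```

For this additive model the result is exact and unsurprising (income contributes 3.0, age contributes 10.0); the averaging over orderings only starts to matter when features interact.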
```python
import shap

# ── TreeExplainer: fast, exact SHAP for tree-based models ──────────────────
explainer = shap.TreeExplainer(xgboost_model)
shap_values = explainer.shap_values(X_test)

# Single prediction — waterfall plot shows each feature's contribution
shap.waterfall_plot(explainer(X_test)[0])

# Global feature importance — beeswarm (shows direction of effect)
shap.summary_plot(shap_values, X_test)

# Global feature importance — bar chart (mean absolute SHAP values)
shap.summary_plot(shap_values, X_test, plot_type='bar')

# Feature interaction — how two features jointly affect prediction
shap.dependence_plot('income', shap_values, X_test, interaction_index='age')

# ── LIME: local linear approximations (model-agnostic) ──────────────────────
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=['negative', 'positive'])
explanation = explainer.explain_instance(
    text, classifier.predict_proba, num_features=10)
explanation.show_in_notebook()
```

Tip (SHAP vs LIME — when to use which)
SHAP: consistent, theoretically grounded, global + local explanations. Preferred for tabular models. Slow for non-tree models (use KernelSHAP or DeepSHAP).
LIME: faster for any model, good for text and image. Local only. Less consistent — small perturbations can give different results. Use LIME as a quick sanity check; use SHAP for reporting to stakeholders.
LLM Guardrails — Defence in Depth
Production LLM systems need layers of defence against harmful outputs, prompt injection, and hallucinations:
```python
import re

# ── Layer 1: Input validation (fast, cheap) ────────────────────────────────
def validate_input(user_input: str) -> tuple[bool, str]:
    injection_patterns = [
        r'ignore previous', r'forget instructions',
        r'act as', r'jailbreak',
    ]
    for pattern in injection_patterns:
        if re.search(pattern, user_input, re.IGNORECASE):
            return False, 'Potential prompt injection detected'
    if len(user_input) > 2000:
        return False, 'Input exceeds maximum length'
    return True, 'OK'

# ── Layer 2: LLM-based classifier (Llama Guard) ────────────────────────────
from transformers import pipeline

guard = pipeline('text-classification', model='meta-llama/LlamaGuard-7b')
result = guard(f'[INST] {user_message} [/INST] {model_response}')
# Returns 'safe' or 'unsafe' with category

# ── Layer 3: Output validation ─────────────────────────────────────────────
def validate_output(response: str, retrieved_context: str) -> float:
    """Return faithfulness score [0, 1] — does response stay within context?"""
    # Use an NLI model to check if response is entailed by context
    from transformers import pipeline as hf_pipeline
    nli = hf_pipeline('text-classification',
                      model='cross-encoder/nli-deberta-v3-base')
    result = nli(f'{retrieved_context} [SEP] {response}')
    # 'entailment' score is your faithfulness proxy
    return next(r['score'] for r in result if r['label'] == 'ENTAILMENT')
```

Note (NeMo Guardrails (NVIDIA))
NVIDIA’s open-source framework lets you define rails in Colang — a domain-specific language for specifying what topics a bot should and shouldn’t engage with. Good for production systems where compliance teams need to review and approve the rules independently of the model.
Cloud GenAI Platforms
Tredence clients span all three major clouds. You must be able to speak intelligently about each and recommend the right services for a given architecture.
Platform Comparison
| Capability | Azure | AWS | GCP |
|---|---|---|---|
| Flagship models | GPT-4o, GPT-4, Whisper, DALL-E | Claude, Llama 3, Mistral, Nova Pro | Gemini 1.5 Pro/Flash, Llama 3, Claude |
| Managed RAG | Azure AI Search + Prompt Flow | Knowledge Bases for Bedrock (S3 → OpenSearch) | Vertex AI Search + Data Stores |
| Managed agents | Azure AI Studio / Prompt Flow | Agents for Bedrock (Lambda as tools) | Vertex AI Agent Builder |
| Content safety | Azure Content Safety (multimodal) | Guardrails for Bedrock (built-in) | Vertex AI content filters |
| Fine-tuning | Azure OpenAI Fine-tuning (GPT-3.5/4) | Bedrock fine-tuning (Titan, Llama) | Vertex AI supervised fine-tuning |
| Key advantage | Data stays in tenant, enterprise compliance | Unified API across many model families | Deep BigQuery + Workspace integration |
Azure OpenAI
```python
from openai import AzureOpenAI
import os

client = AzureOpenAI(
    api_key=os.getenv('AZURE_OPENAI_KEY'),
    api_version='2024-02-01',
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
)

response = client.chat.completions.create(
    model='gpt-4o',  # This is the deployment name in your Azure resource
    messages=[{'role': 'user', 'content': 'Explain RAG in one paragraph.'}],
    max_tokens=500,
    temperature=0,
)

print(response.choices[0].message.content)
```

Tip (Azure enterprise considerations)
Use Managed Identity (MSI → RBAC → Azure OpenAI) instead of API keys for production — no secrets to rotate or leak. Add a Private Endpoint for strict data residency compliance. Azure Content Safety provides multimodal moderation (hate, violence, sexual, self-harm) that can be chained before and after your LLM call.
AWS Bedrock
```python
import json

import boto3

client = boto3.client('bedrock-runtime', region_name='us-east-1')

response = client.invoke_model(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 500,
        'messages': [{'role': 'user', 'content': 'Explain RAG in one paragraph.'}],
    }),
)

result = json.loads(response['body'].read())
print(result['content'][0]['text'])
```

Tip (Bedrock key services to know)
- Knowledge Bases for Bedrock: S3 → managed embeddings → OpenSearch Serverless or Aurora pgvector. Zero embedding infrastructure to manage.
- Agents for Bedrock: define action groups as Lambda functions — Bedrock handles the ReAct loop.
- Guardrails for Bedrock: content filtering, PII redaction, and topic denial built in — satisfies review requirements without writing custom validators.
- Nova Pro: Amazon's own flagship multimodal model (300K context), deeply integrated with AWS services.
GCP Vertex AI
GCP’s unique advantage is the 1M-token context of Gemini 1.5 Pro and the deep integration with BigQuery.
```sql
-- BigQuery ML: run LLM inference directly in SQL
SELECT
  customer_id,
  review_text,
  ML.GENERATE_TEXT(
    MODEL `project.dataset.gemini_model`,
    STRUCT(review_text AS prompt),
    STRUCT(0 AS temperature, 100 AS max_output_tokens)
  ) AS sentiment_analysis
FROM `project.dataset.customer_reviews`
WHERE DATE(created_at) = CURRENT_DATE()
```

Key services: Model Garden (unified API for Gemini, Llama, Mistral, Claude), Vertex AI Search (enterprise RAG with grounding), Vertex AI Pipelines (Kubeflow-based MLOps), Feature Store, Model Registry, Monitoring.
LLM Evaluation — RAGAS & Metrics
Knowing how to measure quality systematically separates engineers from scientists. Poor evaluation is the most common reason LLM projects fail silently in production.
RAGAS — RAG-Specific Evaluation
RAGAS evaluates the full RAG pipeline using LLMs as judges — no labelled dataset required for most metrics:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # Is the answer grounded in retrieved context? (no hallucinations)
    answer_relevancy,    # Does the answer address the question?
    context_precision,   # Are the retrieved chunks relevant to the question?
    context_recall,      # Does the retrieved context cover the ground-truth answer?
    answer_correctness,  # Does the answer match the ground truth?
)
from datasets import Dataset

data = {
    'question': ['What is the return policy?', 'How do I cancel my subscription?'],
    'answer': ['Returns accepted within 30 days', 'Email support to cancel anytime'],
    'contexts': [['Our policy allows returns in 30 days...'],
                 ['Cancellations via support email...']],
    'ground_truth': ['Products can be returned within 30 days', 'Cancel by emailing support'],
}

dataset = Dataset.from_dict(data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results.to_pandas())
```

| Metric | Measures | Production target |
|---|---|---|
| Faithfulness | Is the answer supported by retrieved context? (hallucination rate) | > 0.85 |
| Context Precision | Are the retrieved chunks relevant to the question? | > 0.75 |
| Context Recall | Does the retrieved context cover the ground-truth answer? | > 0.70 |
| Answer Relevancy | Does the answer actually address what was asked? | > 0.80 |
| Answer Correctness | Does the answer match the ground-truth? (needs labels) | > 0.75 |
Text Generation Metrics
```python
# ── BLEU — precision-based, used for machine translation ──────────────────
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'product', 'works', 'well', 'overall']]
candidate = ['the', 'product', 'performs', 'well', 'overall']
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
# BLEU weakness: 'works' ≠ 'performs' by BLEU — no semantic awareness

# ── ROUGE — recall-based, used for summarisation ──────────────────────────
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(
    'Machine learning is a type of artificial intelligence.',
    'ML is a subset of AI that learns from data.')
# rouge1: unigram overlap | rouge2: bigram | rougeL: longest common subsequence

# ── BERTScore — semantic similarity using BERT embeddings ─────────────────
from bert_score import score as bert_score

P, R, F1 = bert_score(
    ['The product works well overall'],
    ['The item performs great in general'],
    lang='en')
print(f'BERTScore F1: {F1.mean():.4f}')  # Captures semantic similarity BLEU misses

# ── LLM-as-Judge (G-Eval) — best correlation with human judgement ─────────
# Use GPT-4 to score outputs on coherence, relevance, fluency
# Most expensive but most aligned with human evaluation
```

| Metric | Captures | Does NOT capture | Best for |
|---|---|---|---|
| BLEU | n-gram precision | Semantic equivalence | Machine translation (legacy) |
| ROUGE | n-gram recall | Paraphrase quality | Summarisation benchmarks |
| BERTScore | Semantic similarity | Factual accuracy | General text quality |
| RAGAS | Full RAG pipeline quality | Non-RAG generation | RAG system evaluation |
| LLM-as-Judge | Nuanced quality | Cost-efficiency | Final production evaluation |
Tip (Evaluation hierarchy for production)
(1) During development: RAGAS on a curated test set of 100–200 queries. (2) Before release: LLM-as-judge on a stratified sample. (3) In production: lightweight automatic metrics + user thumbs up/down + faithfulness monitoring on sampled traffic. Never rely on a single metric.
50 Mock Interview Questions
Practice these out loud. Cover the answer, say yours, then compare. Aim for 90–120 seconds per answer. This is the single highest-impact activity you can do with this material.
System Design Scenarios
Expect at least one system design question. Structure your answer: requirements → architecture → components → tradeoffs → monitoring.
Scenario 1 — Enterprise Customer Service Copilot (100K support tickets/day)
Requirements: Answer customer questions in under 3 seconds, cite policy documents, escalate complex cases to humans.
Architecture:
```
User → API Gateway → Input Guardrails → Intent Classifier
                            ↓
    ┌────────────────────────────────────────┐
    │  Simple query           Complex query  │
    │       ↓                       ↓        │
    │  RAG Pipeline          Agent with tools│
    │  (policy docs)         (ticket lookup, │
    │                         order status,  │
    │                         escalation)    │
    └──────────────────┬─────────────────────┘
                       ↓
          Output Guardrails → User
```

Key decisions:
- LLM routing: GPT-4o-mini for intent classification (cheap, fast); GPT-4o only for complex agent tasks
- RAG: Azure AI Search with hybrid search over policy docs, product manuals, FAQ knowledge base
- Agent tools: ticket lookup (SQL), order status (REST API), human escalation trigger
- Cost optimisation: semantic caching handles ~30% of repeat queries; mini model handles ~60% of simple lookups
- Evaluation: RAGAS weekly + CSAT correlation monthly
Scenario 2 — Automated Financial Report Analysis (10K reports/month)
Requirements: Extract KPIs, identify risks, generate executive summaries, compare across quarters.
Architecture decisions:
- Parsing: Layout-aware parsing (Marker or Unstructured.io) to preserve table structure from PDFs
- Extraction: Function calling to extract standardised JSON schema (revenue, EBITDA, guidance, risks) — structured data before natural language
- Multi-document reasoning: LlamaIndex SubQuestion Query Engine decomposes “compare Q3 vs Q4” into sub-queries, answers each, synthesises
- Long context: Claude 3.7 Sonnet (200K context) for entire 100-page reports when section-level retrieval isn’t sufficient
- Validation: Calculator tool cross-checks extracted numbers against embedded tables — never trust the LLM’s arithmetic
- Output: Structured data → database + natural language summary generated from the structured data, not the raw document
Caveat (Never trust LLM arithmetic)
LLMs can misread numbers from complex PDF tables. Always extract numerical data with a structured approach (function calling, regex validation), then generate narrative from the validated structured data. Cross-check extracted numbers against at least two locations in the document.
Scenario 3 — Code Review Assistant for Engineering Teams
Requirements: Review PRs for bugs, security issues, and style violations; suggest improvements; explain changes in plain English.
Architecture decisions:
- Input: GitHub webhook → diff extraction → retrieve surrounding context via code embedding search
- Parallel agents: security checker + logic reviewer + style linter run in parallel (LangGraph parallel subgraph), synthesis agent aggregates results
- Code embedding DB: all internal code indexed with code-specific embeddings (code-search-ada-002) — enables “show me similar functions in the codebase”
- Agent tools: `run_tests` (trigger CI), `search_codebase`, `lookup_docs`, `check_cve` (security vulnerability database)
- Models: fine-tuned CodeLlama-34B on internal codebase for code understanding; GPT-4o for explanation generation
- Hard rule: never auto-merge — the agent only adds PR comments; a human must approve. All suggestions logged for audit.
Final Preparation
Research Checklist — Tredence
- Founded 2013, focused on last-mile analytics for large enterprises
- Key verticals: CPG, retail, manufacturing, healthcare, financial services
- Known for: AI-powered decision intelligence, supply chain analytics, customer analytics platforms
- Flagship products: SCOUT (supply chain intelligence), PRISM (pricing analytics)
- Review recent case studies and blog posts at tredence.com before the interview
- Understand their delivery model: heavily client-embedded, consulting + product hybrid
Questions to Ask the Interviewer
These show preparation and genuine curiosity — both matter:
- “What does the typical project lifecycle look like for the GenAI team? From stakeholder request to deployment?”
- “What LLM infrastructure do your enterprise clients typically run on — Azure, AWS, or GCP?”
- “What are the biggest challenges the team is solving in LLMOps right now?”
- “How does the team balance building reusable frameworks vs custom client solutions?”
- “What does the first 90 days look like for someone in this role?”
Day-of Mindset
Summary (Interview execution checklist)
If you don’t know something: say “My understanding is X — let me think through this carefully.” Never fake it. Interviewers can always tell, and intellectual honesty is valued far more than a wrong confident answer.
On tools you haven’t used: “I haven’t used X specifically, but I’ve used Y which solves the same problem — I’d apply the same principles.”
For coding questions: think aloud before typing. Say what you’re building and why. The interviewer wants to see your thought process, not just the solution.
Connect every technical answer to business impact. “This improves precision, which reduces manual review cost by X%.”
STAR for behavioural questions: Situation → Task → Action → Result (always quantified).
Show genuine curiosity: ask clarifying questions before solving (what’s the scale? what’s the latency budget? what’s the failure mode?).