Agent fundamentals
Learn how the ReAct loop works, what tool use looks like under the hood, and why compound failure math is the central challenge every production agent team faces.
TL;DR
- The ReAct loop is Thought → Action → Observation, repeated until a stop condition. Every production agent follows this structure in some form.
- Tool use gives agents the ability to act on the world: the agent receives schemas, selects a tool, generates arguments, and the scaffolding executes the call and injects the result as an Observation.
- Agents have four memory types: working (context window), episodic (vector store), semantic (knowledge base via RAG), and procedural (baked-in model weights). Each has different latency, capacity, and update cost.
- At 95% per-step reliability, a 10-step agent chain succeeds only $(0.95)^{10} \approx 0.60$ of the time. Every additional step compounds the failure rate.
- For most production tasks, a carefully designed multi-step pipeline is more reliable and faster to debug than an open-ended agent loop.
- Planning before acting and validating before executing are the two highest-return reliability investments in agent design.
The problem it solves
You want an LLM to research competitor pricing, compare it to your product catalog, and draft an updated spreadsheet. A single LLM call cannot do this: it has no internet access, no file access, and no mechanism to chain multiple operations based on intermediate results.
The naive response is to "add tools." But tool use without a loop gives you a prompt that can call exactly one tool and then stop. The agent needs to see the result of that call, reason about it, and decide what to do next. The loop is not decorative. It is the mechanism that makes multi-step work possible.
Every team I have seen skip the loop structure hit the same wall: the agent could start a task but had no mechanism to continue after it learned something new. You either add the loop or you rebuild a custom one from scratch, poorly.
What is it?
An AI agent is an LLM running in an action-observation loop with access to tools it can call to interact with external systems. The loop continues until the task is complete or a stop condition triggers (a step limit, an explicit "Final Answer" token, or a declared error threshold).
Think of a detective investigating a case. They form a hypothesis (Thought), collect evidence (Action), examine what they find (Observation), revise their hypothesis, and repeat until they reach a conclusion. The detective does not have all the information at the start. They acquire it through iteration. An agent works the same way.
How it works
The ReAct loop
ReAct (Reasoning + Acting) was introduced by Yao et al. in 2022 as a prompting strategy that interleaves reasoning traces with tool actions. Each cycle produces three sections: a Thought (visible reasoning), an Action (tool call), and an Observation (tool return value injected by the scaffolding). The LLM does not generate the Observation. The scaffolding executes the tool and inserts the result.
Here is a complete ReAct trace for a stock price query:
Thought: The user wants the current AAPL price. I will call get_stock_price.
Action: get_stock_price({"ticker": "AAPL"})
Observation: {"ticker": "AAPL", "price": 189.42, "timestamp": "2026-04-11T09:30:00Z"}
Thought: I have the current price. I can answer the user directly.
Final Answer: AAPL is currently trading at $189.42 as of 9:30 AM.
The loop terminates when the LLM generates "Final Answer:" instead of another "Action:". Most frameworks (LangChain AgentExecutor, OpenAI Assistants API, LangGraph) implement this as a while-loop: call the LLM and parse the response; if it contains a tool call, execute it and loop; if it contains "Final Answer:", return.
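To make the while-loop concrete, here is a minimal sketch of the scaffolding side. It is not any framework's actual implementation: `fake_llm` is a hypothetical stand-in for a real model call, `TOOLS` is an assumed registry, and the `Action: name({...})` parsing follows the text format of the trace above.

```python
import json
import re


def get_stock_price(ticker: str) -> dict:
    # Stubbed data so the sketch runs offline; a real tool would call an API.
    return {"ticker": ticker, "price": 189.42}


# Hypothetical tool registry: maps tool names to plain Python callables.
TOOLS = {"get_stock_price": get_stock_price}

# Matches "Action: tool_name({...json args...})" at the end of a reply.
ACTION_RE = re.compile(r"Action:\s*(\w+)\((.*)\)\s*$", re.DOTALL)


def fake_llm(transcript: str) -> str:
    """Stand-in for a model call: acts once, then answers."""
    if "Observation:" not in transcript:
        return 'Thought: I need the price.\nAction: get_stock_price({"ticker": "AAPL"})'
    return "Thought: I have the price.\nFinal Answer: AAPL is trading at $189.42."


def run_agent(question: str, llm=fake_llm, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):  # the step limit is a hard stop condition
        reply = llm(transcript)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            observation = "Error: could not parse an Action."
        else:
            name, raw_args = match.group(1), match.group(2)
            try:
                observation = json.dumps(TOOLS[name](**json.loads(raw_args)))
            except Exception as exc:  # surface tool errors back to the model
                observation = f"Error: {exc!r}"
        # The scaffolding, not the model, writes the Observation.
        transcript += f"\nObservation: {observation}"
    return "Stopped: step limit reached."
```

Calling `run_agent("What is AAPL trading at?")` walks through one Action, one Observation, and a Final Answer. Passing an `llm` that never produces a final answer exercises the step-limit stop condition instead.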
I have seen production systems where engineers tried to handle the loop logic inside complex prompt templates. The result was unparseable output, failed tool calls, and bugs that reproduced differently on each run. Use the framework's loop mechanism.
Tool use mechanics
The agent always receives the full tool schema at the start of each inference call. The schema includes the tool name, a description of what it does and when to use it, and the parameter specification in JSON Schema format.
# Tool registration in the ReAct loop
tools = [
    {
        "name": "get_stock_price",
        "description": "Retrieve the current NYSE or NASDAQ stock price for a given ticker symbol. Returns price in USD and the timestamp of the last trade. Use this when the user asks about current stock prices, not historical data.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "The stock ticker symbol (e.g., AAPL, MSFT, GOOGL)"
                }
            },
            "required": ["ticker"]
        }
    }
]
The description is the highest-leverage variable in tool selection accuracy. Vague descriptions ("get data") lead to wrong tool selection and hallucinated arguments. Precise descriptions that specify the data source, return format, and use cases produce consistently correct calls.
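The scaffolding side of a tool call can be sketched as a validation step that runs before anything executes. This is a simplified illustration, not a full JSON Schema validator: `validate_call` and `JSON_TYPES` are hypothetical names, and only required fields and basic types are checked.

```python
# Maps JSON Schema type names to Python types (only the subset this sketch needs).
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool, "object": dict}

# Trimmed copy of the stock price schema, without the descriptions.
STOCK_TOOL = {
    "name": "get_stock_price",
    "parameters": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}


def validate_call(tool: dict, name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may be executed."""
    if name != tool["name"]:
        return [f"unknown tool: {name}"]
    params = tool["parameters"]
    problems = []
    for key in params.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, JSON_TYPES[spec["type"]]):
            problems.append(f"{key} should be of type {spec['type']}")
    return problems
```

Returning the problems as text, rather than raising, lets the scaffolding inject them as an Observation so the model can correct its own call on the next step.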
The four memory types
Agents can draw on four different kinds of memory. Each has different access speed, capacity, update mechanism, and reliability. Understanding which type to use for which data is the difference between an agent that scales and one that collapses at session 1000.
Working memory is the context window. It holds the current task description, tool schemas, and every Observation produced so far. It is fast but finite. When a long-running agent fills its context, older content either gets summarized (lossy) or truncated (also lossy). Context management for long-running agents is an unsolved problem every production team patches differently.
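One common patch is plain truncation of the oldest observations. The sketch below assumes a word count as a crude proxy for tokens (a real system would use the model's tokenizer), and `trim_context` is a hypothetical helper, not part of any framework.

```python
def approx_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


def trim_context(task: str, observations: list[str], budget: int) -> list[str]:
    """Keep the task plus as many of the newest observations as fit the budget."""
    remaining = budget - approx_tokens(task)
    kept = []
    for obs in reversed(observations):  # walk newest to oldest
        cost = approx_tokens(obs)
        if cost > remaining:
            break  # everything older is dropped (lossy truncation)
        kept.append(obs)
        remaining -= cost
    return [task] + kept[::-1]  # restore chronological order
```

The task description is always kept, since losing it is worse than losing any single observation. Swapping the `break` for a summarization call would turn this into the summarize-instead-of-truncate variant.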
Episodic memory enables continuity across sessions. The agent stores key outcomes in a vector database and retrieves the most semantically similar records at the start of each session. This is how an agent "remembers" that a user prefers TypeScript over JavaScript, or that a particular API requires OAuth 2.0 rather than API keys.
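The store-then-retrieve pattern can be illustrated with a toy in-memory version. Everything here is an assumption made for the sketch: `embed` uses bag-of-words counts in place of a real embedding model, and `EpisodicMemory` stands in for a vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class EpisodicMemory:
    def __init__(self):
        self.records = []  # list of (vector, text) pairs

    def store(self, text: str) -> None:
        self.records.append((embed(text), text))

    def recall(self, query: str, k: int = 1) -> list[str]:
        """Return the k stored records most similar to the query."""
        q = embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

At session start the agent would call `recall` with the new task description and prepend the results to working memory.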
Semantic memory is the knowledge base: factual domain knowledge retrieved via RAG when the agent encounters a question it cannot answer from working memory. Typical contents are domain documentation, product catalogs, and API references. It is not personalized per user.
Procedural memory is the most overlooked type. It is the skills baked into the weights through training or fine-tuning: writing idiomatic Python, applying a company's code review checklist, using domain-specific vocabulary correctly. Unlike the other three types, procedural memory is not retrieved on demand. It is always active, which makes it both reliable and hard to correct when wrong.
Planning and plan validation
A key weakness of basic ReAct is that it picks the next action without a global plan. The agent knows what to do now but not how many steps remain or whether the current approach will work. This causes wasted tool calls, circular reasoning loops, and hitting step limits without completing the task.
Better agents generate a plan before executing. A plan is an ordered list of subtasks. The agent (or a dedicated planner LLM called before the loop starts) produces the plan first, validates it for completeness and feasibility, then executes each step using the ReAct loop.
Validating the plan before execution is the most effective place to catch impossible or harmful actions. It is far cheaper to catch "this task requires deleting a user account, which is out of scope" before 10 tool calls than after step 9.
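A plan validator can be as simple as a pass over the step list before the first tool call. The sketch below is one possible shape, with a hypothetical `Step` type and made-up tool names; what counts as out of scope is a policy decision the sketch hard-codes for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    description: str


# Hypothetical policy for this sketch: which tools exist and which are out of scope.
AVAILABLE_TOOLS = {"search_docs", "read_file", "delete_user", "draft_report"}
FORBIDDEN_TOOLS = {"delete_user"}


def validate_plan(plan: list[Step], max_steps: int = 8) -> list[str]:
    """Return every reason the plan should not run; empty means it may proceed."""
    problems = []
    if not plan:
        problems.append("plan is empty")
    if len(plan) > max_steps:
        problems.append(f"plan has {len(plan)} steps, limit is {max_steps}")
    for i, step in enumerate(plan, start=1):
        if step.tool not in AVAILABLE_TOOLS:
            problems.append(f"step {i}: unknown tool {step.tool}")
        elif step.tool in FORBIDDEN_TOOLS:
            problems.append(f"step {i}: {step.tool} is out of scope")
    return problems
```

A non-empty result can be sent back to the planner for a revised plan or escalated to a human, all before any tool has run.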
The compound reliability problem
Each LLM inference in the loop has some failure rate. Even at high per-step reliability, chains of steps multiply the failure. The formula:
$$P = p^n$$
where $P$ is end-to-end success probability, $p$ is per-step reliability, and $n$ is the step count. The formula assumes steps fail independently and that any single failure sinks the run.
At $p = 0.95$, a 10-step agent with seemingly reliable components succeeds only about 60% of the time. A 20-step agent: about 36%. This math drives every architectural decision: how many tools to give the agent, how many steps the happy path requires, and where to add validation checkpoints.
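The arithmetic is easy to check directly. The helper names below are illustrative; `max_steps` inverts the formula to ask how long a chain can be before it drops below a target success rate, assuming $0 < p < 1$ and $0 < \text{target} \le 1$.

```python
import math


def chain_success(p: float, n: int) -> float:
    """End-to-end success probability P = p^n for n independent steps."""
    return p ** n


def max_steps(p: float, target: float) -> int:
    """Largest n for which p^n still meets the target success rate."""
    return math.floor(math.log(target) / math.log(p))


for n in (1, 5, 10, 20):
    print(f"n={n:>2}  P={chain_success(0.95, n):.2f}")
# n= 1  P=0.95
# n= 5  P=0.77
# n=10  P=0.60
# n=20  P=0.36
```

Under the same assumptions, `max_steps(0.95, 0.85)` returns 3: at 95% per-step reliability, only a three-step chain clears an 85% end-to-end target without additional safeguards.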
Three mitigations: (1) minimize the step count by designing tasks to require fewer hops, (2) add validation checkpoints after high-risk steps to catch failures before they propagate, and (3) use structured output schemas to reduce LLM parsing failures that burn steps on correction loops.
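The effect of the first two mitigations can be estimated with the same formula. This sketch makes two optimistic assumptions that real systems only approximate: the checkpoint detects every failed step, and retry attempts are independent of each other.

```python
def step_with_retries(p: float, retries: int) -> float:
    """Per-step success when a failed step is detected and retried.

    Assumes the checkpoint catches every failure and attempts are independent.
    """
    return 1 - (1 - p) ** (retries + 1)


def chain_success_with_retries(p: float, n: int, retries: int = 0) -> float:
    """End-to-end success for n steps, each allowed the given number of retries."""
    return step_with_retries(p, retries) ** n


baseline = chain_success_with_retries(0.95, 10)            # about 0.60
fewer_steps = chain_success_with_retries(0.95, 5)          # about 0.77
one_retry = chain_success_with_retries(0.95, 10, retries=1)  # about 0.975
```

Halving the step count and adding a single validated retry both help, but the retry figure is an upper bound: a validator that misses failures, or errors that repeat on retry, will land somewhere below it.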
Key variants
| Variant | How It Works | Best For | Tradeoff |
|---|---|---|---|
| ReAct (single-agent loop) | One LLM reasons, selects tools, and loops until done. | Tasks with moderate complexity and clear tool definitions. | Unreliable at 10+ steps. No plan validation. Non-deterministic behavior. |
| Plan-then-Execute | Planner LLM generates an ordered step list. Executor LLM performs each step using the ReAct loop. | Complex multi-step tasks that benefit from upfront planning and human review before execution. | Slower (two LLM stages). Plan can be wrong or underspecified. |
| Deterministic pipeline | Fixed sequence of LLM calls, no autonomous tool selection. Each step takes defined inputs and produces defined outputs. | Repetitive, well-defined tasks with known structure and no variance in step sequence. | No flexibility to adapt. Cannot handle unexpected intermediate results. |
| Multi-agent | Orchestrator dispatches subtasks to specialized agents. Workers execute in parallel or sequence. Results merge at the orchestrator. | Tasks with distinct domains (code review, research, drafting) that benefit from parallel specialization. | Coordination overhead. State sharing complexity. See Multi-agent systems. |
When to use / when to avoid
When to use
- When your task genuinely requires multiple tool calls whose order cannot be determined before execution starts. (If you always call the same tools in the same order, use a pipeline.)
- When intermediate results meaningfully change what the next step should do. An agent that always takes the same steps regardless of inputs is a pipeline wearing an agent costume.
- When the task has enough variance that a fixed pipeline would need dozens of if-else branches to cover the expected cases.
- When tolerating a 10-20% failure rate is acceptable because the task is high-value and the cost of failure is recoverable.
When to avoid
- When your success rate requirement exceeds 85% and the task requires more than 7-8 steps. The compound failure math makes this difficult without expensive validation gates at every step.
- When you can solve the task with a deterministic pipeline. Pipelines are faster, cheaper, and far easier to debug. Default to pipelines and add autonomy only when forced.
- When the agent has access to irreversible write tools (sending emails, executing code, creating database records, deleting data). An unconstrained loop with write access is a blast radius problem. Add human approval gates or scope down the tools.
- When latency requirements are under 10 seconds. Multi-step agent loops with 3-8 LLM calls commonly take 15-60 seconds. Users will not wait.
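The human approval gate for write tools mentioned above can be sketched as a thin wrapper around tool execution. The tool names and the `approve` callback are hypothetical; in practice the callback might post to a review queue and block until someone responds.

```python
from typing import Callable

# Hypothetical tool names; which tools count as irreversible is a policy choice.
WRITE_TOOLS = {"send_email", "delete_record"}


def execute_tool(
    name: str,
    args: dict,
    tools: dict[str, Callable[..., object]],
    approve: Callable[[str, dict], bool],
) -> object:
    """Run a tool, pausing for approval when the tool has side effects."""
    if name in WRITE_TOOLS and not approve(name, args):
        # Rejection is returned as data so the agent can see it as an Observation.
        return {"status": "rejected", "tool": name}
    return tools[name](**args)
```

Read-only tools pass straight through, so the gate adds latency only on the calls that can cause irreversible damage.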
Real-world examples
GitHub Copilot Workspace uses a multi-step agent loop to translate a GitHub issue into working code. The agent reads the issue, explores the codebase with file search tools, generates a change plan, and then produces code matching the plan. GitHub found that the planning step alone (generating a change spec before any code) significantly reduced hallucination in the final output. This validates plan-first architectures: structure before execution reduces errors.
Intercom's Fin handles approximately 50% of customer support tickets without human intervention. Their key lesson from production: limiting Fin to read-only tools for initial deployment reduced engineering escalations by 80% compared to immediately granting write access. They spent months validating read-only behavior before adding tools that could create tickets or issue refunds. Incremental blast radius expansion is the right approach.
Anthropic's Computer Use (Claude) runs a ReAct loop against a browser sandbox, invoking screenshot, click, and type tools. Their published OSWorld benchmark shows a 22% success rate on full web automation tasks. This number tells the compound failure story: each browser interaction has a meaningful failure rate, and complex tasks with 20+ interactions compound quickly. Even at 95% per-action, 20 actions gives $(0.95)^{20} \approx 0.36$ end-to-end success.
Stripe's internal tooling agents handle developer productivity tasks: searching internal documentation, generating code from templates, and proposing infrastructure changes. Their constraint: all write operations require a human to review the proposed change before execution. The agent plans and drafts; the human approves and executes. This is the pattern that makes agents safe for engineering workflows involving shared infrastructure.
Limitations and tradeoffs
| Advantage | Limitation |
|---|---|
| Handles adaptive multi-step decision-making | Failure compounds: 10 steps at 95% = 60% end-to-end success rate |
| Flexible: no hard-coded step sequence | Expensive: 5-10 LLM calls per task vs. 1 for a direct prompt |
| Transparent: every Thought is logged and inspectable | Slow: 10-60 seconds per task with multiple tool calls |
| Generalizes to novel task types without code changes | Blast radius: write-capable agents cause irreversible damage on failure |
| Works with any tool that has a well-defined schema interface | Non-deterministic: same input can produce different tool call sequences |
The fundamental tension here is autonomy vs. reliability. More autonomous agents handle a broader range of inputs without human intervention, but each degree of autonomy sacrifices reliability. Every team shipping production agents makes explicit choices about where to draw the line. Draw it conservatively and expand incrementally.
How this shows up in interviews
When to bring it up
Mention agent fundamentals any time the interview touches on: systems that perform multi-step tasks autonomously, tools for an AI feature, "autonomous" or "agentic" capabilities, or reliability of LLM-based systems. Even when the interviewer does not explicitly ask about agents, naming the failure modes and architectural constraints signals that you think about LLMs as systems with real production costs.
Depth expected by level
- Junior: Know the ReAct loop structure (Thought/Action/Observation), name the four memory types, recognize that tool descriptions matter for correctness.
- Senior: Explain compound failure math ($(0.95)^{10} \approx 0.60$), design blast radius constraints (read-only tools first, scoped permissions), choose between ReAct and Plan-then-Execute based on task complexity, and articulate when to use a pipeline instead of an agent.
- Staff: Design complete agent reliability architectures (validation checkpoints, structured outputs, retry policies), spec HITL gates for irreversible actions, evaluate framework choices (LangGraph vs. raw loops vs. Temporal), and reason about the full memory architecture for a production system serving 100K+ users daily.
Q&A table
| Interviewer Asks | Strong Answer |
|---|---|
| "How does an agent use tools?" | "The agent receives tool schemas with name, description, and parameter spec. It generates a JSON tool call. The scaffolding parses, validates, executes, and returns the result as an Observation that feeds into the next reasoning step." |
| "What's the agent memory model?" | "Four types: working (context window, current session, fast but finite), episodic (vector store, past sessions, semantic search), semantic (RAG knowledge base, shared facts, not personalized), and procedural (baked-in model weights, zero-latency, hard to update)." |
| "How do you handle step failures?" | "Per-step validation (structured output schemas), retry with exponential backoff for transient tool errors, step-limit hard stops to prevent runaway cost, and checkpoint completed steps with LangGraph persistence for multi-session tasks." |
| "My agent keeps running out of context. What do I do?" | "Three options: summarize older observations into compressed working memory, move key findings to episodic memory and retrieve on demand, or reduce the task to require fewer sequential steps by batching where possible." |
| "When should I use an agent vs. a pipeline?" | "Agent when the step sequence genuinely cannot be determined before execution starts. Pipeline when steps are fixed. Pipelines are 3-5x cheaper and dramatically more debuggable when they fit the task." |
| "What is the compound failure problem?" | "At $p$ per-step reliability and $n$ steps, end-to-end success is $p^n$. At 95% and 10 steps, that is 60%. An agent that seems reliable step-by-step can fail 40% of runs, which is usually unacceptable for production workflows." |
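The retry policy from the step-failure answer above can be sketched in a few lines. `TransientToolError` is a hypothetical exception type marking errors worth retrying, and the `sleep` parameter is injectable so the delay schedule can be tested without waiting.

```python
import time
from typing import Callable


class TransientToolError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""


def call_with_backoff(
    tool: Callable[[], object],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call a tool, doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return tool()
        except TransientToolError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the agent loop see the failure
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

Only errors marked transient are retried. A malformed argument or a permissions error would fail the same way every time, so those should go back to the model as an Observation instead.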
Common interview mistakes
| Mistake | Why It's Wrong | Say This Instead |
|---|---|---|
| "Agents can do anything a human can do, just faster" | This conflates capability with reliability. Agents can attempt more tasks but complete them reliably at much lower rates. The compound failure math bounds what agents can do dependably. | "Agents extend what is possible but at a reliability cost. Every additional step is a compounding failure point. I design agents to minimize steps and add checkpoints for high-risk actions." |
| "Tools are just API calls the agent makes" | Tools require precise schemas and well-written descriptions. A bad description breaks tool selection even when the underlying function is technically callable. This is the most common source of production agent failures. | "Tools need accurate descriptions as much as working code. The description is what the agent uses to decide whether to call this tool at all. I treat tool descriptions as first-class artifacts that get tested and iterated." |
| "The agent can remember everything once I increase the context window" | Context windows have hard limits. Long-running agents overflow. Working memory is the smallest of the four memory types by capacity. Large contexts also degrade attention quality. | "Working memory is finite regardless of model. For long-horizon tasks I use episodic memory in a vector store for cross-session continuity, not larger context windows." |
| "More steps means the agent can handle more complex tasks" | More steps compounds failure. $(0.95)^{20} \approx 0.36$: a 20-step agent fails 64% of the time at 95% per-step reliability. Adding steps hurts reliability faster than it adds capability. | "More steps means lower end-to-end reliability. I design agents to accomplish the goal in the minimum steps that handle the expected input variance, not to add steps for edge cases." |
| "Agents work just like RAG with extra steps" | RAG is retrieval-augmented single-call generation. Agents add a loop, tool execution, multi-step state, and a categorically different failure profile. The debugging approaches, cost models, and reliability strategies are completely different. | "RAG grounds one LLM call with retrieved context. Agents add a loop, tool execution, and cumulative state. The failure modes are different: RAG fails on retrieval quality; agents fail on step reliability, tool descriptions, and context management." |
Quick recap
- An AI agent is an LLM running in an action-observation loop (the ReAct pattern), equipped with tools it calls to interact with external systems. The loop continues until a stop condition triggers.
- Tool schemas and descriptions are first-class engineering artifacts. Vague descriptions are the most common cause of incorrect tool selection in production agents.
- Agents have four memory types: working (context window, fast and finite), episodic (vector store, persistent across sessions), semantic (RAG knowledge base, shared facts), and procedural (model weights, always-available baked-in skills).
- The compound failure formula is $p^n$: at 95% per-step reliability and 10 steps, end-to-end success is about 60%. Minimizing step count is one of the highest-return reliability investments.
- Planning before executing and validating plan feasibility before the first tool call are the most effective points in the loop to catch errors cheaply.
- Default to deterministic pipelines. Use agents only when the step sequence genuinely cannot be determined before execution starts.
- For interview depth, lead with compound failure math and blast radius analysis. These signal you have shipped agents to production, not just read about them.
Related concepts
- Stateful agents with LangGraph - The logical next step after understanding the ReAct loop: how to add persistence, conditional branching, and human-in-the-loop to an agent using LangGraph's state graph model.
- Multi-agent systems - When a single ReAct loop is not enough, multi-agent architectures use specialized agents in coordination. Read this after understanding single-agent fundamentals.
- Retrieval-augmented generation - RAG is the implementation of semantic memory: how agents ground LLM calls in retrieved domain knowledge rather than relying solely on training data.
- Production agents - Reliability, observability, and blast radius management at scale. The engineering guide for agents past the prototype stage.
- Human-in-the-loop - Approval gates, interrupt patterns, and when human oversight is worth the latency cost. The natural complement to understanding autonomous agent limits.