AI agents
Understand what AI agents are, how the ReAct loop works, what memory and tool primitives look like, and why production agents fail so often at tasks that demos make look easy.
TL;DR
- An agent is an LLM running in a loop: observe the current state, reason about it, take an action, observe the result, repeat until done or until a step limit is hit.
- The ReAct pattern (Yao et al., 2022) alternates "Thought:" reasoning steps with "Action:" tool calls; observations feed back as input for the next iteration.
- If each step is roughly 95% reliable, a 10-step chain drops to about 60% end-to-end reliability (0.95 to the power of 10 is roughly 0.60). This compounding failure rate is the central production challenge.
- Memory comes in two forms: short-term (conversation context in the window) and long-term (vector-stored experience retrieved on demand).
- Most production "agents" are actually constrained pipelines with LLMs at specific decision nodes, not fully autonomous systems. That is the right design for most use cases.
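The compounding arithmetic from the reliability bullet can be checked directly. This is a minimal sketch, assuming every step succeeds independently with the same probability (a simplification, since real failures are often correlated). The function name `chain_reliability` is made up for illustration.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end success probability when every step must succeed independently."""
    return per_step ** steps


# Print how reliability decays as the chain gets longer.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {chain_reliability(0.95, steps):.1%}")
```

Under these assumptions the numbers fall to roughly 77% at 5 steps, 60% at 10 steps, and 36% at 20 steps.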
The problem it solves
Ask an LLM to write unit tests for a function. It generates the test file, and it looks plausible. But the tests reference a helper method that does not exist. The only way to catch that is to actually run the tests. And when they fail, you need to read the error, fix the code, and try again. A single prompt-response cycle cannot do that.
This is the fundamental gap. LLMs in a single call can generate text, but they cannot take actions in the world, observe results, and iterate. They cannot check a database, run a command, verify their own output, or recover from errors. Every output is just a guess, disconnected from reality.
A large category of valuable tasks (code generation with testing, research across multiple sources, customer support that actually updates tickets) requires acting on external systems, observing what happened, and then deciding what to do next. That is what agents solve.
What is it?
An AI agent is an LLM configured to run in a loop with access to tools and memory. At each step it observes its current state (conversation history, latest tool output), produces either a reasoning step or a tool call, observes the result, and feeds the observation back into the next iteration. The loop continues until the model decides it has a final answer, or until it hits a step limit.
Think of it like a junior developer pair-programming with you over chat. They read the problem, think about an approach, try something (run a command, read a file, write code), look at what happened, and adjust. They do not produce a perfect answer in one shot. They iterate.
The key difference from a standard LLM call is the loop. A chatbot takes one turn at a time, with the user driving each round. An agent takes as many turns as needed to complete a goal, driving itself.
Agents are a programming model, not magic. The LLM is still a text predictor. What makes it an "agent" is the scaffolding around it: the loop, the tools, the memory, and the stopping conditions. The scaffolding code is deterministic. The LLM is the non-deterministic decision-maker inside it.
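To make the "scaffolding" idea concrete, here is a minimal sketch of the loop with the LLM replaced by a scripted stand-in. Everything in it is hypothetical (`TOOLS`, `scripted_model`, `run_agent`, and the step dictionary format); a real agent would call a model API where `scripted_model` is used.

```python
from typing import Callable

# Hypothetical tools: plain Python functions keyed by name.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"<contents of {path}>",
}


def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM call: chooses the next step from how much history exists."""
    if len(history) == 1:  # only the goal so far
        return {"type": "action", "tool": "read_file", "arg": "src/tax.py"}
    return {"type": "final", "answer": "done"}


def run_agent(goal: str, model: Callable[[list[str]], dict], max_steps: int = 5) -> str:
    history = [f"Goal: {goal}"]      # short-term memory
    for _ in range(max_steps):       # stopping condition 1: step limit
        step = model(history)
        if step["type"] == "final":  # stopping condition 2: the model says it is done
            return step["answer"]
        tool = TOOLS.get(step["tool"])
        observation = tool(step["arg"]) if tool else f"Unknown tool: {step['tool']}"
        # Feed the action and its result back into the next iteration.
        history.append(f"Action: {step['tool']}({step['arg']!r})")
        history.append(f"Observation: {observation}")
    return "Stopped: step limit reached"


print(run_agent("write unit tests for calculate_tax()", scripted_model))
```

Note that the loop, the tool lookup, and both stopping conditions are ordinary deterministic code. Only the `model` callable is non-deterministic in a real system.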
How it works
The ReAct loop
The most widely used agent pattern is ReAct (Reason + Act, Yao et al. 2022). The model alternates between "Thought:" steps (internal reasoning visible in traces) and "Action:" steps (tool calls). Each tool call returns an "Observation:" that gets appended to the context. The model reads the full accumulated history at each step to decide what to do next.
A typical agent run looks like this in the trace log:
Thought: The user wants unit tests for calculate_tax(). I should read the function first.
Action: read_file("src/tax.py")
Observation: def calculate_tax(amount, rate): return amount * rate ...
Thought: Simple function. I'll write tests covering normal cases and edge cases.
Action: write_file("tests/test_tax.py", "...")
Observation: File written successfully.
Thought: Now I should run the tests to verify they pass.
Action: run_command("pytest tests/test_tax.py")
Observation: 4 passed, 1 failed - test_fractional_rate expected 0.3 but got 0.30000000000000004
Thought: Float comparison issue. I'll fix the assertion to use approximate matching.
Action: write_file("tests/test_tax.py", "...updated...")
Observation: File written successfully.
Thought: I'll re-run the tests to confirm the fix.
Action: run_command("pytest tests/test_tax.py")
Observation: 5 passed
Thought: All tests pass. I'll return the final result.
Final Answer: Created 5 unit tests in tests/test_tax.py. All passing.
That is six loop iterations: five tool calls plus the final answer. Each one adds to the context window.
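If the model emits a text trace like the one above, the runtime has to turn each line into something it can act on. The following is a rough sketch of such a parser, assuming the exact `Label: text` format shown in the trace and actions written as `tool_name(...)`. The names `parse_line`, `LINE_RE`, and `CALL_RE` are illustrative. Many production systems avoid text parsing by using structured tool-call output instead.

```python
import re

# Assumed line format, matching the trace above: "Label: text".
LINE_RE = re.compile(r"^(Thought|Action|Observation|Final Answer):\s*(.*)$")
# Assumed action format: tool_name(arguments).
CALL_RE = re.compile(r"^(\w+)\((.*)\)$")


def parse_line(line: str) -> dict:
    """Classify one trace line; split Action lines into tool name and raw arguments."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized trace line: {line!r}")
    kind, text = match.groups()
    if kind == "Action":
        call = CALL_RE.match(text)
        if call is None:
            raise ValueError(f"Malformed action: {text!r}")
        return {"kind": kind, "tool": call.group(1), "raw_args": call.group(2)}
    return {"kind": kind, "text": text}


print(parse_line('Action: run_command("pytest tests/test_tax.py")'))
```

Raising on malformed lines matters here: a line the runtime cannot parse is one of the ways a single step in the chain fails.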
Memory systems
Agents need memory at two timescales. Short-term memory is everything in the current context window: the original goal plus every thought, action, and observation from the current run. It requires no infrastructure but is bounded by the context window size (typically 128K tokens or more for frontier models in 2025).
Long-term memory is external storage the agent can query, such as a vector database holding past conversations, documentation, or learned procedures. The agent calls a retrieval tool to pull relevant information into context when needed. This is essentially a RAG system embedded inside the agent loop.
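The retrieval step can be illustrated with a toy in-memory store. This is only a sketch: `embed` uses word counts as a stand-in for a real embedding model, and `MemoryStore` is a hypothetical class, not the API of any vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Hypothetical long-term memory: stores texts and returns the k most similar ones."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self._items.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = MemoryStore()
store.add("refund policy allows returns within 30 days")
store.add("run the release script to deploy")
print(store.search("refund policy", k=1))
```

In an agent, `search` would be exposed as a tool, and its results would be appended to the context as an observation like any other tool output.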
Episodic memory is an emerging pattern where the agent stores compressed traces of past task executions. When it encounters a similar task, it retrieves the successful strategy. Voyager (the Minecraft agent) is a well-known example of a closely related idea: it stores discovered skills as code and retrieves them when facing similar challenges.
In my experience, most production agent systems use only short-term memory for individual task runs, with explicit summarization steps that compress the history before it overflows the context window.
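A summarization step of this kind might look roughly like the sketch below. It is built on loud assumptions: `estimate_tokens` counts words instead of using a real tokenizer, and `summarize` is a placeholder for what would normally be another LLM call. All three function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word. Real tokenizers differ."""
    return len(text.split())


def summarize(entries: list[str]) -> str:
    """Placeholder for an LLM summarization call; keeps only the first words of each entry."""
    return "Summary of earlier steps: " + " | ".join(" ".join(e.split()[:4]) for e in entries)


def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Summarize everything except the goal and the most recent entries when over budget.

    keep_recent must be at least 1.
    """
    if sum(estimate_tokens(e) for e in history) <= budget:
        return history
    goal, middle, recent = history[0], history[1:-keep_recent], history[-keep_recent:]
    if not middle:
        return history  # nothing left to compress
    return [goal, summarize(middle), *recent]


history = ["Goal: write tests"] + [
    f"Observation: step {i} produced a long output line" for i in range(6)
]
print(compact(history, budget=20))
```

Keeping the goal and the latest observations verbatim is a design choice: the goal anchors the task, and the most recent results are usually what the next step depends on.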
Tool integration
Tools are functions the agent can call. Each tool has a name, a description (which the model reads to decide when to call it), and typed parameters via a JSON schema. The model decides which tool to call based on the description, not hard-coded routing.
Common production tools: file read/write, code execution (sandboxed), web search, database queries, API calls, and browser automation. The model generates the tool call as structured JSON, your runtime executes it, and the result is fed back as an observation.
A typical tool schema looks like this:
{
  "name": "search_docs",
  "description": "Search the internal documentation. Returns top 5 results. Use when the user asks about company policies or procedures. Do NOT use for general knowledge questions.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Natural language search query" },
      "max_results": { "type": "integer", "default": 5 }
    },
    "required": ["query"]
  }
}
The description does the heavy lifting. The model reads it to decide whether this tool applies to the current situation.
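On the runtime side, the structured call the model produces still has to be checked before it is executed. The sketch below validates arguments against the `parameters` part of the schema above and then dispatches. It is deliberately not a full JSON Schema validator, and several pieces are assumptions for illustration: the `{"name": ..., "arguments": ...}` envelope, the `validate_args` and `dispatch` helpers, and the fake `search_docs` body.

```python
import json

# The "parameters" object from the schema above (descriptions omitted).
SEARCH_DOCS_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer", "default": 5},
    },
    "required": ["query"],
}

# Only the two JSON types this example needs.
JSON_TYPES = {"string": str, "integer": int}


def validate_args(args: dict, schema: dict) -> dict:
    """Check required keys and basic types, then fill in defaults."""
    for name in schema.get("required", []):
        if name not in args:
            raise ValueError(f"Missing required argument: {name}")
    checked = {}
    for name, spec in schema["properties"].items():
        if name in args:
            if not isinstance(args[name], JSON_TYPES[spec["type"]]):
                raise ValueError(f"Argument {name!r} must be of type {spec['type']}")
            checked[name] = args[name]
        elif "default" in spec:
            checked[name] = spec["default"]
    return checked


def search_docs(query: str, max_results: int) -> list[str]:
    """Hypothetical tool body; a real one would query a search index."""
    return [f"result {i + 1} for {query!r}" for i in range(max_results)]


def dispatch(raw_call: str) -> list[str]:
    """Parse the model's JSON tool call (assumed envelope), validate it, and run the tool."""
    call = json.loads(raw_call)
    if call["name"] != "search_docs":
        raise ValueError(f"Unknown tool: {call['name']}")
    return search_docs(**validate_args(call["arguments"], SEARCH_DOCS_SCHEMA))


print(dispatch('{"name": "search_docs", "arguments": {"query": "vacation policy"}}'))
```

When validation fails, a reasonable choice is to return the error message to the model as an observation so it can correct the call on the next iteration, instead of crashing the run.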
Tool descriptions are the agent's API contract
The agent decides which tool to call largely from the tool's description string, along with its name and parameter schema. If your "send_email" tool says "sends a message," the agent may call it when it only intends to draft something. Write descriptions that are explicit about what action is performed, what side effects occur, and when NOT to use the tool.
Planning strategies
Simple agents plan implicitly: the ReAct loop handles one step at a time, and the model figures out the next action from context. This tends to work for short tasks (roughly five steps or fewer) but falls apart on complex goals, where the agent loses track of the overall objective mid-execution.
Plan-and-Execute agents generate an explicit task plan upfront, then execute each subtask sequentially. If a subtask fails, they can re-plan. LangGraph and AutoGen both support this pattern. The tradeoff is that plans generated before execution may not account for information discovered during execution.
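The control flow of Plan-and-Execute, including re-planning after a failure, can be sketched in a few lines. This is not how LangGraph or AutoGen implement it. `plan_and_execute`, `make_plan`, and `run_subtask` are hypothetical names, and the two callables stand in for what would be LLM-backed planner and executor calls.

```python
from typing import Callable


def plan_and_execute(
    goal: str,
    make_plan: Callable[[str, list[str]], list[str]],
    run_subtask: Callable[[str], bool],
    max_replans: int = 2,
) -> list[str]:
    """Run subtasks in order; on failure, ask the planner for a fresh plan for the remaining work."""
    completed: list[str] = []
    plan = make_plan(goal, completed)
    replans = 0
    while plan:
        subtask = plan.pop(0)
        if run_subtask(subtask):
            completed.append(subtask)
        elif replans < max_replans:
            replans += 1
            plan = make_plan(goal, completed)  # re-plan knowing what is already done
        else:
            raise RuntimeError(f"Gave up on {subtask!r} after {max_replans} re-plans")
    return completed


# Usage with deterministic stand-ins for the planner and executor.
STEPS = ["read function", "write tests", "run tests"]
result = plan_and_execute(
    "add unit tests",
    make_plan=lambda goal, done: [s for s in STEPS if s not in done],
    run_subtask=lambda subtask: True,
)
print(result)
```

The cap on re-plans is itself a stopping condition: without it, a subtask that can never succeed would keep the planner looping indefinitely.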
For complex tasks, a hierarchical design is often the most reliable current approach: a planner agent breaks the goal into subtasks, and executor agents handle each subtask in a short (3-5 step) loop. The planner reviews results and adjusts. This keeps each individual loop short, which limits how far per-step errors can compound, while still handling complex goals.
Stopping conditions