Parallel tool execution
Execute independent tool calls concurrently instead of sequentially, cutting latency by 2-5x on agent turns where multiple tools have no data dependencies.
TL;DR
- When an agent needs multiple tools in a single turn, independent calls run concurrently instead of sequentially, cutting latency from sum(latencies) to max(latencies).
- A dependency analyzer classifies tool calls as read-only or state-modifying. All-read batches execute in parallel; any write in the batch forces sequential execution.
- Real-world gains: 40-50% latency reduction is typical (Anthropic documentation), with 2-5x improvements on tool-heavy turns.
- OpenAI, Anthropic, and Google all support parallel tool calls natively in their function-calling APIs.
- The tradeoff: parallel execution burns through context windows faster (all results arrive at once) and requires accurate tool classification to avoid race conditions.
The Problem It Solves
Your coding agent receives a user request: "Find all TODO comments, check the test coverage report, and look up the deployment status." Three independent operations. The agent dutifully calls grep_search, then waits for the result. Then calls read_coverage_report, then waits. Then calls check_deploy_status, then waits. Each tool takes 1-3 seconds. Total wall-clock time: 3-9 seconds of sequential waiting.
None of these tools depend on each other. The grep results don't feed into the coverage check. The deploy status doesn't need the grep output. Run concurrently, the batch would finish in 1-3 seconds, the latency of the slowest call. The agent is burning 2-6 seconds of latency for no reason, just because it processes tool calls one at a time.
I've profiled agent workflows where 70% of total execution time was spent waiting for sequential tool calls that had zero data dependencies between them. The agent was architecturally single-threaded in a world where every tool call is an independent network request. That's the equivalent of loading a web page by fetching images one at a time instead of in parallel.
What Is It?
Parallel tool execution dispatches independent tool calls concurrently within a single agent turn, reducing wall-clock latency from the sum of individual tool latencies to the latency of the slowest tool.
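To make the sum-versus-max arithmetic concrete, here is a minimal sketch for the three tools from the example request. The latency values are made-up assumptions for illustration, not measurements.

```python
# Hypothetical per-tool latencies in seconds (illustrative, not measured).
latencies = {
    "grep_search": 1.2,
    "read_coverage_report": 2.5,
    "check_deploy_status": 1.8,
}

# Sequential: each call waits for the previous one to finish.
sequential_time = sum(latencies.values())

# Parallel: all calls are in flight at once, so the slowest one dominates.
parallel_time = max(latencies.values())

speedup = sequential_time / parallel_time

print(f"sequential: {sequential_time:.1f}s")
print(f"parallel:   {parallel_time:.1f}s")
print(f"speedup:    {speedup:.1f}x")
```

With these assumed numbers the gain is a little over 2x. The speedup grows with the number of independent calls and shrinks when one call is much slower than the rest.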
Think of it like a restaurant kitchen during dinner rush. A good chef doesn't grill the steak, then start the vegetables after the steak finishes, then begin the sauce after the vegetables are done. Independent dishes fire simultaneously on different burners. The meal is ready when the slowest component finishes, not when the sum of all cooking times elapses. But if one step depends on another (you can't plate until the sauce reduces), that dependency forces sequencing.
The orchestrator sits between the LLM and the tool execution layer. When the model requests multiple tool calls in a single turn, the orchestrator analyzes dependencies, groups independent calls, dispatches them concurrently, collects results, and returns them to the model in the original request order.
How It Works
Step 1: Tool Classification
Every tool in the agent's toolkit must declare its operational nature. This is the foundation that makes safe parallelism possible.
Two categories matter:
Read-only tools inspect state without modifying it. File search, database queries, API GET requests, status checks. These are safe to run concurrently because they don't create side effects. Running two read-only tools simultaneously produces the same results as running them sequentially.
State-modifying (write) tools change files, update databases, invoke external APIs with side effects, or execute shell commands. These are unsafe to parallelize without dependency analysis because they might create race conditions. Writing to the same file from two tools, or executing a shell command that depends on a prior file edit, produces unpredictable results.
from dataclasses import dataclass

# Tool classification declaration
@dataclass
class Tool:
    name: str
    is_read_only: bool  # The critical classification

# Examples
tools = {
    "grep_search": Tool("grep_search", is_read_only=True),
    "file_read": Tool("file_read", is_read_only=True),
    "web_search": Tool("web_search", is_read_only=True),
    "file_write": Tool("file_write", is_read_only=False),
    "bash_execute": Tool("bash_execute", is_read_only=False),
    "db_insert": Tool("db_insert", is_read_only=False),
}
The classification must be conservative. If there's any doubt about whether a tool modifies state, mark it as state-modifying. A false negative (marking a write tool as read-only) causes race conditions. A false positive (marking a read tool as write) only costs some parallelism.
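One way to enforce the conservative default is to treat any tool that is not explicitly registered as read-only as state-modifying. The sketch below assumes a hypothetical `REGISTRY` dict and helper name `is_safe_to_parallelize`; neither comes from a specific framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    is_read_only: bool


# Hypothetical registry; only tools listed here can be trusted as read-only.
REGISTRY = {
    "grep_search": Tool("grep_search", is_read_only=True),
    "file_write": Tool("file_write", is_read_only=False),
}


def is_safe_to_parallelize(tool_name: str) -> bool:
    """Return True only when the tool is explicitly registered as read-only.

    Unknown tools fall through to False, so they are treated as
    state-modifying and cost parallelism rather than correctness.
    """
    tool = REGISTRY.get(tool_name)
    return tool is not None and tool.is_read_only
```

A newly added tool that nobody has classified yet then degrades to sequential execution instead of silently joining a parallel batch.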
Step 2: Dependency Analysis and Batch Formation
When the LLM returns multiple tool calls in a single turn, the orchestrator inspects the batch before execution.
The logic is straightforward:
- Check every tool call in the batch for its classification.
- If ALL tools are read-only, execute the entire batch in parallel.
- If ANY tool is state-modifying, execute the entire batch sequentially in the order the LLM requested.
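The three rules above can be sketched as a single decision function. The `READ_ONLY` table and the `choose_mode` name are assumptions for illustration; a real orchestrator would read the flag from its own tool registry.

```python
from enum import Enum


class Mode(Enum):
    PARALLEL = "parallel"
    SEQUENTIAL = "sequential"


# Hypothetical classification table: tool name -> is_read_only.
READ_ONLY = {
    "grep_search": True,
    "file_read": True,
    "web_search": True,
    "file_write": False,
    "bash_execute": False,
}


def choose_mode(tool_names: list[str]) -> Mode:
    """All-or-nothing rule: parallel only if every call is read-only."""
    # Unknown tools default to False (state-modifying) to stay conservative.
    if all(READ_ONLY.get(name, False) for name in tool_names):
        return Mode.PARALLEL
    return Mode.SEQUENTIAL


print(choose_mode(["grep_search", "file_read", "web_search"]))
print(choose_mode(["grep_search", "file_write"]))
```

The first batch is all reads and can run in parallel. The second contains one write, so the whole batch falls back to sequential execution in request order.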
This is the "conservative" strategy. More advanced implementations build a dependency DAG (directed acyclic graph) that enables partial parallelism even when write tools are present.
The advanced DAG approach identifies which specific tool calls have data dependencies. If the batch contains [grep_search, file_read, file_write, web_search], the DAG might determine that grep_search and web_search are independent of everything else and that file_read and file_write target different files, so all four can run concurrently even though one of them is a write. But this complexity is rarely worth it in practice. I've found the simple all-or-nothing classification handles 85%+ of real agent batches correctly.
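For comparison, here is a rough sketch of what the DAG-style planner could look like. It assumes each call declares the resources it reads and writes as opaque strings compared by exact match (so a directory and a file inside it would not be detected as overlapping). The `Call` type, `conflicts`, and `plan_waves` are hypothetical names, and this is one possible scheduling scheme rather than a reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    name: str
    reads: set[str] = field(default_factory=set)
    writes: set[str] = field(default_factory=set)


def conflicts(a: Call, b: Call) -> bool:
    """Two calls conflict if either one writes a resource the other touches."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)


def plan_waves(calls: list[Call]) -> list[list[str]]:
    """Group calls into waves; calls in the same wave can run concurrently.

    A call is placed one wave after the latest earlier call it conflicts
    with, so request order is preserved between conflicting calls.
    """
    wave_of: list[int] = []
    for i, call in enumerate(calls):
        earlier = [wave_of[j] for j in range(i) if conflicts(calls[j], call)]
        wave_of.append(max(earlier) + 1 if earlier else 0)
    waves: list[list[str]] = [[] for _ in range(max(wave_of, default=-1) + 1)]
    for call, wave in zip(calls, wave_of):
        waves[wave].append(call.name)
    return waves


independent = plan_waves([
    Call("grep_search", reads={"src/app.py"}),
    Call("file_read", reads={"a.py"}),
    Call("file_write", writes={"b.py"}),
    Call("web_search"),
])
dependent = plan_waves([
    Call("file_write", writes={"a.py"}),
    Call("file_read", reads={"a.py"}),
])
print(independent)
print(dependent)
```

The first batch collapses into a single wave because no call touches a resource another call writes. The second is split into two waves because the read targets the file being written. The extra bookkeeping, and the need for every tool to declare its resources accurately, is the complexity the all-or-nothing rule avoids.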
Step 3: Concurrent Dispatch and Result Collection
For parallel batches, the orchestrator fires all tool calls simultaneously and waits for all results. The critical detail: results must be returned to the LLM in the original request order, regardless of which tool finished first.
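A minimal sketch of this step with Python's `asyncio` is shown below. `fake_tool` is a stand-in that sleeps to simulate I/O, and the delays are arbitrary assumptions chosen so that the calls finish out of request order.

```python
import asyncio


async def fake_tool(name: str, delay: float) -> str:
    """Stand-in for a real tool call; sleeps to simulate I/O latency."""
    await asyncio.sleep(delay)
    return f"{name}:done"


async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # asyncio.gather returns results in the order the awaitables were
    # passed in, not the order in which they completed.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


# The first call is the slowest, so it finishes last but is still listed first.
calls = [
    ("grep_search", 0.03),
    ("read_coverage_report", 0.01),
    ("check_deploy_status", 0.02),
]
results = asyncio.run(run_parallel(calls))
print(results)
```

Because `asyncio.gather` preserves input order, no extra reordering step is needed here. An orchestrator built on a completion-ordered primitive such as `asyncio.as_completed` would have to track each call's original index and sort before replying to the model.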