Code-then-execute

TL;DR

Instead of calling predefined tools with fixed schemas, the agent writes code (Python, SQL, shell) that solves the task, then executes it in a sandboxed environment.
Moves arithmetic, data manipulation, and multi-step logic from probabilistic LLM generation into deterministic code execution, reducing hallucination on computation tasks by 80-95%.
The sandbox is non-negotiable: Docker containers, E2B, Pyodide, or Firecracker microVMs. Never execute LLM-generated code on the host.
Two modes: generate-and-run (fast, autonomous) or generate-review-run (safer, with human or automated approval before execution).
Limitation: the agent can only use libraries pre-installed in the sandbox image. Missing imports are the number one failure mode.

The Problem It Solves

Your agent needs to answer: "What's the standard deviation of revenue across our top 50 customers, grouped by region?" You have a tool called query_database that returns rows, and a tool called calculate_statistics that computes mean and median. Neither tool computes standard deviation grouped by region. You'd need to build a new tool for every novel analytical question.

This happens constantly. Predefined tool schemas are rigid. Every new task shape requires a new tool definition, a new implementation, and a new deployment. I've watched teams spend weeks building tool after tool for an analytics agent, only to realize they're reimplementing pandas one function at a time.

The deeper issue: LLMs are bad at arithmetic but excellent at writing code that does arithmetic correctly. Asking GPT-4 to compute a standard deviation directly produces wrong answers roughly 15-20% of the time on multi-step calculations. Asking it to write Python that computes the answer produces correct results over 95% of the time.

Why code is more reliable than direct generation

LLMs predict tokens probabilistically. Producing "The standard deviation is 42.7" directly requires the model to carry every intermediate computation implicitly in its internal activations while it emits tokens, which is lossy. Writing df.std() delegates the computation to deterministic code. The model only needs to know the right function name, not the right answer.
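As a rough sketch of what the agent might write for the regional standard deviation question: the rows below are made-up sample values, the function name stdev_by_region is hypothetical, and the standard library's statistics.stdev stands in for pandas so the block runs without extra packages.

```python
from collections import defaultdict
from statistics import stdev

# Hypothetical sample rows (region, revenue); real data would come from the database.
rows = [
    ("EMEA", 120.0),
    ("EMEA", 80.0),
    ("EMEA", 100.0),
    ("APAC", 200.0),
    ("APAC", 260.0),
]

def stdev_by_region(rows):
    """Group revenue by region and return the sample standard deviation per region."""
    groups = defaultdict(list)
    for region, revenue in rows:
        groups[region].append(revenue)
    # stdev needs at least two values; regions with fewer are reported as None.
    return {
        region: (stdev(values) if len(values) >= 2 else None)
        for region, values in groups.items()
    }

result = stdev_by_region(rows)
print(result)
```

With pandas available in the sandbox image, the same computation is df.groupby("region")["revenue"].std(), which also uses the sample (n-1) definition by default.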

What Is It?

Code-then-execute flips the tool-use paradigm: instead of the agent calling predefined functions, the agent writes code that accomplishes the task, and a runtime executes that code in a sandbox. The agent becomes a programmer, and the sandbox becomes its computer.

Think of it as hiring a contractor. Traditional tool-use is like giving the contractor a fixed menu of services (paint walls, replace faucet, install shelf). Code-then-execute is like giving the contractor a fully equipped workshop and saying "build whatever you need." The workshop has walls (the sandbox), so they can't accidentally demolish your house, but inside those walls, they can build anything.

How It Works

The generation phase

The agent receives a task and decides that writing code is the best approach. This decision can be explicit (the system prompt instructs "write Python to solve analytical tasks") or emergent (the model recognizes that code is the right tool for the job). The agent generates a complete, executable code block.

The prompt engineering matters here. Vague instructions like "write code to help" produce meandering scripts. Specific instructions like "write a self-contained Python script that takes no arguments, performs the computation, and prints the result as JSON to stdout" produce reliable, parseable output.

I've found that the single most impactful prompt tweak is requiring the agent to print results as JSON. This eliminates the parsing ambiguity that causes 30-40% of downstream failures in naive implementations.

Here's the difference between a vague and a precise code generation prompt:

# Vague prompt (produces parsing failures ~35% of the time)
"Write Python code to analyze this data."

# Precise prompt (parsing failures drop below 5%)
"""Write a self-contained Python script that:
1. Takes no command-line arguments
2. Reads data from /tmp/input.csv
3. Computes the requested analysis
4. Prints ONLY a valid JSON object to stdout
5. Uses only: pandas, numpy, json, math, datetime
6. Contains no print() calls other than the final JSON output
"""
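On the receiving side, a strict parser makes the "JSON only" contract enforceable. The following is a minimal sketch, assuming the sandbox hands back stdout as a string; parse_result is a hypothetical helper name, not part of any library.

```python
import json

def parse_result(stdout: str):
    """Return (True, data) if stdout is exactly one JSON object, else (False, reason)."""
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return False, f"stdout is not valid JSON: {exc}"
    if not isinstance(data, dict):
        return False, f"expected a JSON object, got {type(data).__name__}"
    return True, data

# A clean result parses; a stray progress message before the JSON does not.
print(parse_result('{"std_dev": 42.7}\n'))
print(parse_result('Loading data...\n{"std_dev": 42.7}'))
```

Rejecting anything other than a single JSON object keeps the failure visible, so the reason string can be handed back to the agent instead of being guessed around.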

The sandbox execution phase

The generated code runs inside an isolated environment with strict resource boundaries. The sandbox provides:

Filesystem isolation: the code can only read/write within its container. No access to the host filesystem.
Network restrictions: typically no outbound network access, or restricted to specific allow-listed endpoints.
Resource limits: CPU time (usually 30-60 seconds), memory (256MB-1GB), disk space (100MB-500MB).
Pre-installed libraries: a curated set of packages (pandas, numpy, requests, etc.) available inside the image.
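One way to express the boundaries listed above, sketched under the assumption that Docker is the sandbox: the function below only builds the docker run command line and does not execute it. The image name agent-sandbox:latest, the paths, and the specific limit values are hypothetical; input data delivery is left out.

```python
def docker_run_args(image: str, script_path: str) -> list[str]:
    """Build a `docker run` command line that mirrors the limits listed above."""
    return [
        "docker", "run", "--rm",
        "--network", "none",             # no outbound network access
        "--read-only",                   # root filesystem is read-only
        "--tmpfs", "/tmp:rw,size=100m",  # writable scratch space, capped at 100 MB
        "--memory", "512m",              # memory ceiling
        "--cpus", "1",                   # at most one CPU core
        "--pids-limit", "64",            # bound the number of processes
        "--volume", f"{script_path}:/sandbox/main.py:ro",  # generated code, read-only
        image,
        "python", "/sandbox/main.py",
    ]

# Hypothetical image name and host path, for illustration only.
args = docker_run_args("agent-sandbox:latest", "/host/job/main.py")
print(" ".join(args))
```

A wall-clock limit is not a docker run flag, so the caller still has to enforce the 30-60 second budget itself, for example with a timeout around the process that launches the container plus cleanup of the container afterwards.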

Output capture and feedback loop

The sandbox captures three streams: stdout (the intended output), stderr (warnings and errors), and the exit code. The agent receives all three and decides what to do next.
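The three captured values can be modeled as a small record. This is a sketch only: ExecutionResult and run_code are hypothetical names, and a local child process stands in for the real sandbox call so the capture logic is runnable.

```python
import subprocess
import sys
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    stdout: str
    stderr: str
    exit_code: int

def run_code(code: str, timeout_s: float = 30.0) -> ExecutionResult:
    """Run `code` in a child Python process and capture its output streams.

    NOTE: a local subprocess is NOT a sandbox. It stands in for the real
    sandbox call here only so the capture logic can be demonstrated.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Exit code 124 is an arbitrary choice for this sketch (it mirrors GNU timeout).
        return ExecutionResult("", f"timed out after {timeout_s} seconds", 124)
    return ExecutionResult(proc.stdout, proc.stderr, proc.returncode)

ok = run_code('import json; print(json.dumps({"total": 3}))')
bad = run_code("print(1 / 0)")
print(ok)
print(bad.exit_code, bad.stderr.splitlines()[-1])
```

Keeping stderr and the exit code alongside stdout is what lets the agent tell "no output" apart from "crashed before printing".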

On success, the agent parses stdout (ideally JSON) and incorporates the result into its response. On failure, the agent reads the error traceback and generates a corrected version of the code. Most implementations allow 2-3 retry attempts before giving up.

This retry loop is where code-then-execute shines compared to traditional tool calls. When a tool call fails, the agent often gets only a generic error. When code execution fails, the agent gets a full Python traceback with the exception type, the error message, and the line number of each frame in the call stack. I've measured that agents fix their own code errors on the first retry about 75% of the time with a good traceback.
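The retry loop can be sketched as follows. Everything here is illustrative: solve, execute, and scripted_generate are hypothetical names, scripted_generate stands in for the LLM call, and a local subprocess stands in for the sandbox.

```python
import subprocess
import sys

MAX_ATTEMPTS = 3  # the initial attempt plus two retries

def execute(code: str):
    """Stand-in for the sandbox call: returns (stdout, stderr, exit_code).

    A local subprocess is NOT a sandbox; it only keeps this sketch runnable.
    """
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout, proc.stderr, proc.returncode

def solve(task: str, generate, execute=execute, max_attempts: int = MAX_ATTEMPTS):
    """Generate code, run it, and feed the traceback back in on failure."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(task, feedback)
        stdout, stderr, exit_code = execute(code)
        if exit_code == 0:
            return {"attempts": attempt, "stdout": stdout}
        # The traceback becomes part of the next generation prompt.
        feedback = f"Previous code:\n{code}\n\nIt failed with:\n{stderr}"
    raise RuntimeError(f"gave up after {max_attempts} attempts:\n{stderr}")

def scripted_generate(task, feedback):
    """Hypothetical stand-in for the LLM: buggy first draft, fixed after feedback."""
    if feedback is None:
        return "import json\nprint(json.dumps({'mean': sum([1, 2, 3]) / 0}))"
    return "import json\nprint(json.dumps({'mean': sum([1, 2, 3]) / 3}))"

outcome = solve("mean of 1, 2, 3", scripted_generate)
print(outcome)
```

Passing both the failed code and its stderr back to the generator is the design choice that matters: the model sees exactly which line raised and why, rather than a bare "execution failed".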

Figure: Code-then-execute pipeline (Task Input, Code Generation, Static Checks, Sandbox Execute, Parse Output): the agent generates code, the sandbox runs it, and results flow back for the agent to interpret or retry
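The pipeline names a static-checks stage without describing it. One plausible check, assumed here rather than taken from the article, is to parse the generated code and compare its imports against the allow-list from the prompt, which catches syntax errors and missing-library failures before any sandbox time is spent. check_imports and ALLOWED_MODULES are hypothetical names.

```python
import ast

# Mirrors the "Uses only" line of the precise prompt; adjust to the sandbox image.
ALLOWED_MODULES = {"pandas", "numpy", "json", "math", "datetime"}

def check_imports(code: str, allowed=ALLOWED_MODULES):
    """Return a list of problems found before the code is ever executed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # "from . import x" has no module name and is rejected in this sketch.
            names = [node.module or "<relative import>"]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level not in allowed:
                problems.append(f"line {node.lineno}: import of '{name}' is not allowed")
    return problems

print(check_imports("import pandas as pd\nimport requests\nfrom os import path"))
```

This is a convenience check, not a security boundary: dynamic imports slip past it, so the sandbox remains the only real containment.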

The security model

The security hierarchy for agent code execution is clear and non-negotiable:

Predefined tools (safest): fixed functions with validated inputs. The agent can only call what you built.
Code in a sandbox (safe): the agent writes arbitrary code, but it runs inside walls. Resource limits prevent abuse.
Code on the host (dangerous): never do this. One prompt injection and attacker-controlled code runs on your infrastructure.

