Hook-based safety guardrails
Register composable pre-tool and post-tool hooks that inspect arguments and outputs for safety violations, an approach more reliable than system-prompt safety instructions alone.
TL;DR
- Pre-tool hooks inspect tool call arguments before execution, blocking shell injection, path traversal, SQL injection, and credential access patterns with deterministic code rather than probabilistic LLM reasoning.
- Post-tool hooks scan tool results before returning them to the agent, preventing PII, secrets, and sensitive data from entering subsequent reasoning steps.
- Hooks are composable middleware: each hook has one responsibility and can be unit-tested independently without running the full agent. A typical production agent runs 5-10 hooks per tool call in under 20ms total.
- Deterministic code beats probabilistic prompts for safety enforcement. A regex check for rm -rf never fails because the model was distracted by a long context window.
- Limitation: hooks only catch what they're programmed to catch. Creative attacks that don't match known patterns slip through. Hooks are one layer in defense-in-depth, not a complete solution.
The Problem It Solves
Your autonomous coding agent runs overnight on a large refactoring task. At 3 AM, the agent decides the cleanest fix is rm -rf node_modules && npm install. Reasonable enough. Except it constructs the path wrong and runs rm -rf / instead. Your CI server is gone.
You could add "never run destructive commands" to the system prompt. Suppose that works 99% of the time. 99% reliability on destructive operations still means 1 in 100 runs nukes something. At 50 agent runs per day, that is an expected 0.5 failures per day, or a catastrophic failure roughly every two days.
The fundamental problem: system prompt safety instructions execute as probabilistic LLM reasoning. The model interprets them, weighs them against the task context, and sometimes decides the task is more important than the safety instruction. You can't build production safety on a mechanism that occasionally chooses to ignore your rules.
What Is It?
Hook-based safety guardrails are deterministic code functions that execute at defined interception points in the agent's tool-execution pipeline. Pre-tool hooks inspect arguments before a tool runs. Post-tool hooks inspect results before they return to the agent. Each hook enforces a single safety rule and allows the operation, allows it with a warning, or blocks it.
Think of it as airport security. The TSA doesn't rely on passengers promising not to bring weapons (that's the system prompt approach). Instead, every bag goes through an X-ray machine (pre-tool hook) and every passenger walks through a metal detector (post-tool hook). The checks are deterministic, composable (add new scanners without changing existing ones), and independent of the passenger's intentions.
How It Works
The hook execution model
Every tool call in an agent framework passes through two interception points: pre-execution and post-execution. Each point runs an ordered chain of hook functions. If any pre-hook returns a "block" signal, the tool call is cancelled and the agent receives an error message explaining why. Post-hooks can redact, transform, or flag the output but typically can't undo execution (the tool already ran).
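The execution model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular framework's API: run_tool, the (action, reason) tuple returned by pre-hooks, and the hook signatures are all hypothetical names chosen for this example.

```python
from typing import Callable

# Hypothetical conventions for this sketch (not a real framework's API):
#   pre-hook:  (tool_name, args) -> (action, reason), action in {"allow", "warn", "block"}
#   post-hook: (tool_name, output) -> possibly modified output
PreHook = Callable[[str, dict], tuple]
PostHook = Callable[[str, str], str]


def run_tool(tool_name: str, args: dict, tool_fn: Callable[..., str],
             pre_hooks: list, post_hooks: list) -> dict:
    """Run one tool call through the pre-hook chain, the tool, then the post-hook chain."""
    warnings = []
    for hook in pre_hooks:
        action, reason = hook(tool_name, args)
        if action == "block":
            # The tool never runs; the agent gets an explanatory error instead.
            return {"ok": False,
                    "error": f"blocked by {hook.__name__}: {reason}",
                    "warnings": warnings}
        if action == "warn":
            warnings.append(reason)

    output = tool_fn(**args)

    # Post-hooks cannot undo execution; they can only rewrite what the agent sees.
    for hook in post_hooks:
        output = hook(tool_name, output)
    return {"ok": True, "output": output, "warnings": warnings}
```

Because hooks are plain functions in ordered lists, adding a new check means appending to a list rather than editing existing hooks, and each hook can be called directly in a unit test.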
I've worked on agent systems where all the safety logic lived in the system prompt. It worked fine in demos. In production, with long conversations and complex multi-step tasks, the agent would "forget" safety instructions roughly once per 200 tool calls. Hooks eliminated that failure mode entirely.
Pre-tool hooks: inspecting before execution
Pre-tool hooks are the first line of defense. They receive the tool name and its arguments, apply pattern-matching rules, and decide whether to allow, warn, or block the call. The four most common pre-tool hooks cover the majority of agent safety failures.
Dangerous command blocker pattern-matches shell commands for destructive operations: rm -rf, git reset --hard, git clean -fd, DROP TABLE, chmod 777, curl | bash. It blocks the tool call and returns an error to the agent. This single hook prevents the most catastrophic agent failures.
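As a rough sketch, a dangerous command blocker can be a list of compiled regular expressions checked against the command string. The hook name, the "command" argument key, and the exact patterns below are assumptions for illustration; a real deployment would tune the list to its environment.

```python
import re

# Illustrative patterns only; a production list would be longer and tuned per environment.
DANGEROUS_PATTERNS = [
    # rm with both -r and -f in one flag group (rm -rf, rm -fr, rm -Rf, ...)
    ("recursive forced delete", re.compile(r"\brm\s+-(?=[a-z]*r)(?=[a-z]*f)[a-z]+", re.I)),
    ("git reset --hard", re.compile(r"\bgit\s+reset\s+--hard\b", re.I)),
    ("git clean with force", re.compile(r"\bgit\s+clean\s+-[a-z]*f", re.I)),
    ("DROP TABLE", re.compile(r"\bdrop\s+table\b", re.I)),
    ("chmod 777", re.compile(r"\bchmod\s+(?:-[a-z]+\s+)*777\b", re.I)),
    ("pipe curl to shell", re.compile(r"\bcurl\b[^|]*\|\s*(?:ba)?sh\b", re.I)),
]


def dangerous_command_hook(tool_name, args):
    """Pre-tool hook (hypothetical signature): block commands matching a known-bad pattern."""
    command = str(args.get("command", ""))
    for label, pattern in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return ("block", f"matches dangerous pattern: {label}")
    return ("allow", "")
```

Note that this sketch blocks every rm -rf, including the harmless rm -rf node_modules. That is deliberate: the hook cannot tell a correct path from a mistyped one, so the agent has to take a safer route or ask for approval.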
Path traversal checker validates that file operations stay within the allowed workspace. It blocks paths containing ../, absolute paths outside the workspace root, and symlinks resolving outside the boundary. Without this, an agent asked to "clean up temp files" could delete system directories.
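One way to sketch the path check is with pathlib: reject any path containing a .. component, then resolve the path (which follows symlinks) and confirm the result still sits under the workspace root. The factory name make_path_hook and the "path" argument key are assumptions for this example.

```python
from pathlib import Path


def make_path_hook(workspace_root: str):
    """Build a pre-tool hook (hypothetical signature) bound to one workspace root."""
    root = Path(workspace_root).resolve()

    def path_traversal_hook(tool_name, args):
        raw = args.get("path")
        if raw is None:
            return ("allow", "")  # this tool call has no path argument
        if ".." in Path(raw).parts:
            return ("block", "path contains '..'")
        # Joining root with an absolute path yields that absolute path, so both
        # relative and absolute inputs are covered. resolve() follows symlinks.
        resolved = (root / raw).resolve()
        if resolved != root and root not in resolved.parents:
            return ("block", f"{resolved} is outside the workspace")
        return ("allow", "")

    return path_traversal_hook
```

Resolving before comparing is the important step: a plain string prefix check would miss a symlink inside the workspace that points outside it.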
SQL injection detector scans database query arguments for injection patterns: unescaped quotes, UNION SELECT, ; DROP, comment sequences (--, /*). For agents that generate SQL, this is non-negotiable.
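A minimal sketch of such a detector follows, assuming the query arrives in a "query" argument (a hypothetical key). The quote check is a crude heuristic, and the comment patterns will also flag legitimate queries that contain comments, so treat this as a backstop alongside parameterized queries rather than a replacement for them.

```python
import re

# Illustrative patterns taken from the list above; not an exhaustive ruleset.
SQL_INJECTION_PATTERNS = [
    ("UNION SELECT", re.compile(r"\bunion\s+(?:all\s+)?select\b", re.I)),
    ("stacked DROP statement", re.compile(r";\s*drop\b", re.I)),
    ("line comment (--)", re.compile(r"--")),
    ("block comment (/*)", re.compile(r"/\*")),
]


def sql_injection_hook(tool_name, args):
    """Pre-tool hook (hypothetical signature): block queries with injection markers."""
    query = str(args.get("query", ""))
    for label, pattern in SQL_INJECTION_PATTERNS:
        if pattern.search(query):
            return ("block", f"query contains {label}")
    # Crude heuristic for an unescaped quote: after removing escaped pairs (''),
    # an odd number of single quotes means a string literal was left open.
    if query.replace("''", "").count("'") % 2 == 1:
        return ("block", "query contains an unbalanced single quote")
    return ("allow", "")
```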
Credential access preventer blocks tools from accessing credential files (.env, ~/.aws/credentials, ~/.ssh/), keychain APIs, or environment variables containing secrets. The agent doesn't need your AWS keys to write code.
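The sketch below covers two of the cases above: credential file paths and references to secret-looking environment variables. It scans every string argument, so it applies to both file tools and shell commands. The hook name and the variable-name keywords are assumptions, and keychain APIs are not covered here.

```python
import re

# Path patterns for the credential locations named above.
SENSITIVE_PATHS = [
    re.compile(r"(?:^|/)\.env(?:\.[\w.-]+)?$"),   # .env, .env.local, ...
    re.compile(r"(?:^|/)\.aws/credentials$"),
    re.compile(r"(?:^|/)\.ssh(?:/|$)"),
]
# $NAME or ${NAME} references in a shell command.
ENV_REFERENCE = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)")
# Assumed keyword list for "looks like a secret"; adjust to your naming scheme.
SECRET_NAME = re.compile(r"SECRET|TOKEN|PASSWORD|API_?KEY|CREDENTIAL", re.I)


def credential_access_hook(tool_name, args):
    """Pre-tool hook (hypothetical signature): block access to credential files and secret env vars."""
    for value in args.values():
        if not isinstance(value, str):
            continue
        # Check each whitespace-separated token so "cat ~/.ssh/id_rsa" is caught too.
        for token in value.split():
            token = token.strip("'\"")
            if any(p.search(token) for p in SENSITIVE_PATHS):
                return ("block", f"access to credential path: {token}")
        for name in ENV_REFERENCE.findall(value):
            if SECRET_NAME.search(name):
                return ("block", f"reference to secret environment variable: {name}")
    return ("allow", "")
```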
Post-tool hooks: sanitizing after execution
Post-tool hooks process the tool's output before it enters the agent's context. Unlike pre-hooks, they can't prevent execution (the tool already ran), but they can redact sensitive information from the result.
Secret scanner detects API keys, tokens, passwords, and connection strings in tool output. If the agent reads a config file containing a database password, the secret scanner replaces it with [REDACTED] before the agent sees it. This prevents secrets from appearing in the agent's reasoning (and your logs).
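A secret scanner along these lines can be sketched as a list of (pattern, replacement) pairs applied in order. The patterns below are illustrative assumptions: one for passwords embedded in connection URLs, one for the AKIA-prefixed AWS access key ID format, and one for assignments whose name suggests a secret. Dedicated secret-scanning tools ship far larger rule sets.

```python
import re

SECRET_PATTERNS = [
    # Password embedded in a connection URL: scheme://user:password@host
    (re.compile(r"(\w+://[^\s:/@]+:)[^\s@]+(@)"), r"\1[REDACTED]\2"),
    # AWS access key IDs: "AKIA" followed by 16 uppercase alphanumerics
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED]"),
    # name=value or name: value where the name suggests a secret
    (re.compile(
        r"(?i)([\w.-]*(?:password|passwd|secret|token|api[_-]?key)[\w.-]*"
        r"[\"']?[ \t]*[=:][ \t]*)(\S+)"),
     r"\1[REDACTED]"),
]


def secret_scanner_hook(tool_name, output):
    """Post-tool hook (hypothetical signature): redact secrets before the agent sees the output."""
    for pattern, replacement in SECRET_PATTERNS:
        output = pattern.sub(replacement, output)
    return output
```

The sketch errs toward over-redaction: a false positive costs the agent a bit of context, while a false negative puts a live credential into its reasoning and your logs.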
PII detector scans tool results for personally identifiable information: name and email combinations, phone numbers, SSNs, and addresses. This complements the PII tokenization pattern by catching PII that enters through tool results rather than user input.
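The regex-friendly part of PII detection can be sketched as follows. This example assumes US-style phone numbers and SSNs and replaces matches with typed placeholders. Names and street addresses are not handled here; they generally need more than regular expressions (for example, a named-entity recognizer).

```python
import re

# Illustrative patterns; formats outside the US conventions assumed here are not covered.
PII_PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)")),
    ("[EMAIL]", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
]


def pii_detector_hook(tool_name, output):
    """Post-tool hook (hypothetical signature): replace detected PII with typed placeholders."""
    for placeholder, pattern in PII_PATTERNS:
        output = pattern.sub(placeholder, output)
    return output
```

For example, pii_detector_hook("db_query", "Contact Jane at jane.doe@example.com or 555-123-4567. SSN 123-45-6789.") returns "Contact Jane at [EMAIL] or [PHONE]. SSN [SSN]." The name survives, which shows the limit of the regex-only approach.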
Output size limiter truncates excessively large tool outputs. An agent that runs cat on a 50MB log file would fill its context window with a single tool result, destroying its ability to reason about anything else. The limiter caps output at a configurable threshold (typically 10-50KB) and appends a truncation notice.
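A size limiter is the simplest hook to sketch. The version below measures UTF-8 bytes, keeps the head of the output, and appends a notice. The factory name, the 32KB default (picked from the 10-50KB range above), and the notice wording are assumptions for this example.

```python
def make_output_limiter(max_bytes: int = 32 * 1024):
    """Build a post-tool hook (hypothetical signature) that caps output size."""

    def output_size_limiter(tool_name, output):
        raw = output.encode("utf-8")
        if len(raw) <= max_bytes:
            return output
        # errors="ignore" drops a multi-byte character cut in half at the boundary.
        kept = raw[:max_bytes].decode("utf-8", errors="ignore")
        notice = f"[output truncated: showing first {max_bytes} of {len(raw)} bytes]"
        return f"{kept}\n{notice}"

    return output_size_limiter
```

Keeping the head is a design choice. For log files the tail is often more useful, so a variant might keep the last bytes instead, or some of each.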
The Swiss cheese model of safety
No single hook catches everything. Hooks work like the Swiss cheese model in aviation safety: each layer has holes, but stacking multiple layers means a threat has to pass through every hole simultaneously to succeed.