Learn how constrained decoding forces LLMs to produce valid JSON every time, why regex-guided generation works, and how to choose between JSON mode, tool-use schemas, and libraries like Outlines.
25 min read | 2026-04-10 | medium | structured-output, json-mode, llm, ai, constrained-decoding
Structured output forces the LLM to produce valid JSON, XML, or any schema-compliant format by constraining the token sampler at generation time.
OpenAI's structured output mode and Anthropic's tool-use with input schemas achieve this via constrained decoding: the model physically cannot emit an invalid token.
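For open-source models, libraries like Outlines intercept the logit distribution and mask out tokens that would violate the schema. Instructor takes a different route: it validates the response against a Pydantic model and re-asks on failure rather than masking logits.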
Structured output eliminates the "parse and retry" anti-pattern that adds 200-500ms latency and wastes tokens on malformed responses.
JSON mode alone is not enough for production. You need schema enforcement (specific fields, types, enums) not just "some valid JSON."
Your invoice extraction pipeline looks clean in staging. You prompt the model to return JSON with invoice_number, amount, and currency, and it does. Then you deploy and JSONDecodeError starts appearing in your logs.
The model returns something like "Here is the JSON you requested:\n```json\n{...}\n```" for about 2% of inputs. Another 1.5% return amount as the string "149.99" rather than a float, so downstream arithmetic throws a TypeError. Your response.choices[0].message.content parser expected raw JSON and received a markdown code block.
# What you asked for (and received in staging):
{"invoice_number": "INV-2024-001", "amount": 149.99, "currency": "USD"}

# What production returned at 2am on a Friday:
"Here is the JSON you requested:\n```json\n{\n \"invoice_number\": \"INV-2024-001\",\n \"amount\": \"149.99\",\n \"currency\": \"USD\"\n}\n```\nLet me know if you need any adjustments!"
The textbook fix is a parse-and-retry loop: catch the error, append "please return valid JSON only" to the prompt, and call the model again. At 100K documents per day with a 3% failure rate, that is 3,000 retries generating roughly 300K wasted tokens daily. Each retry adds 200-500ms of latency.
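To make the cost concrete, here is a minimal sketch of the parse-and-retry loop described above. The `call_model` parameter and the `fake_model` stand-in are hypothetical; a real pipeline would pass in its own client call. The stub fails once and then succeeds, so the loop pays for two model calls to get one usable answer.

```python
import json
from typing import Callable


def parse_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> tuple[dict, int]:
    """Return (parsed_json, attempts_used); raise ValueError if every attempt fails."""
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        raw = call_model(current_prompt)
        try:
            return json.loads(raw), attempt
        except json.JSONDecodeError:
            # Every failure costs another full model call: more tokens, more latency.
            current_prompt = prompt + "\nPlease return valid JSON only."
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


# Hypothetical stand-in for a model: the first reply has a prose preamble, the second is clean.
_replies = iter([
    'Here is the JSON you requested: {"amount": 149.99}',
    '{"invoice_number": "INV-2024-001", "amount": 149.99, "currency": "USD"}',
])


def fake_model(prompt: str) -> str:
    return next(_replies)


invoice, attempts = parse_with_retry(fake_model, "Extract the invoice as JSON.")
print(attempts)  # 2
```

Note that the loop only catches syntax failures. A reply that parses but carries the wrong types passes straight through.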
The root cause is that prompt instructions operate at the semantic level but token generation is stochastic. No matter how carefully you word "output only valid JSON", some fraction of generations will include markdown fences, prose preambles, or type coercions. You cannot fix this with better prompting alone.
Structured output is a generation-time constraint that forces the model to emit only tokens that keep the partial output schema-valid at every step.
Think of it like the difference between a blank essay and a tax form. A blank page lets the model say anything, but the reader has to parse structure from prose. A fill-in-the-blank form guarantees the fields exist in exactly the right places and in the right format. Structured output turns the model's blank page into a fill-in-the-blank form at the token level, not after the fact.
The constraint is enforced by a finite-state machine that tracks which schema states are open at each step. Any token that would advance the output into an invalid state is masked to negative infinity before sampling. The model still applies all of its learned knowledge, but only within the token paths that keep the schema valid.
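A toy version of that state machine can be sketched as follows. This is an illustration under strong simplifying assumptions, not how any particular library builds its FSM: the vocabulary is nine whole-string tokens, the schema is a single object with one enum field, and the names `VOCAB`, `TRANSITIONS`, `allowed_tokens`, and `advance` are made up for the example.

```python
# Toy vocabulary; a real tokenizer has on the order of 100,000 entries.
VOCAB = ['{', '}', '"currency"', ':', '"USD"', '"EUR"', '"GBP"', '"CAD"', 'Here']

# state -> {token: next_state} for the schema {"currency": "USD" | "EUR" | "GBP"}
TRANSITIONS = {
    0: {'{': 1},
    1: {'"currency"': 2},
    2: {':': 3},
    3: {'"USD"': 4, '"EUR"': 4, '"GBP"': 4},
    4: {'}': 5},
    5: {},  # accepting state: the object is complete
}


def allowed_tokens(state: int) -> set[str]:
    """Tokens that keep the partial output on a schema-valid path."""
    return set(TRANSITIONS[state])


def advance(state: int, token: str) -> int:
    """Move to the next state, rejecting any token the schema does not allow here."""
    if token not in TRANSITIONS[state]:
        raise ValueError(f"token {token!r} is invalid in state {state}")
    return TRANSITIONS[state][token]


state = 0
for token in ['{', '"currency"', ':', '"EUR"', '}']:
    state = advance(state, token)

print(state)                      # 5
print(sorted(allowed_tokens(3)))  # ['"EUR"', '"GBP"', '"USD"']
```

In state 0 only `{` is allowed, so a prose preamble such as `Here` can never start the output, and in state 3 `"CAD"` is excluded because it is not in the enum.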
The model generates output one token at a time. Without constraints, the sampler picks from all ~100,000 vocabulary tokens according to the model's probability distribution. Constrained decoding adds a gating step between the logit computation and the sampler.
Before sampling each token, a finite-state machine (FSM) derived from the JSON schema computes the set of valid next tokens: tokens that, if appended to the partial output so far, would keep it on a valid path toward a schema-complete output. Any token outside that set is masked:
logits'[k] = logits[k] if token k is valid per FSM, else -∞
The softmax then operates only over valid tokens. The model's relative preferences among valid tokens are preserved. If three tokens are all valid next steps, the model still picks the one it finds most likely given the context.
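The masking step and the claim about preserved preferences can be checked with a few lines of standard-library Python. This is a sketch over a four-token vocabulary with made-up logits; `masked_softmax` is a name chosen for the example, and real implementations apply the same idea to tensors on the accelerator.

```python
import math


def masked_softmax(logits: list[float], valid: set[int]) -> list[float]:
    """Softmax over logits after setting every invalid token's logit to -inf."""
    if not valid:
        raise ValueError("the FSM must allow at least one token")
    masked = [x if k in valid else -math.inf for k, x in enumerate(logits)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 3.0]  # token 3 is the model's favourite, but the FSM forbids it
unconstrained = masked_softmax(logits, {0, 1, 2, 3})
constrained = masked_softmax(logits, {0, 1, 2})

print(constrained[3])  # 0.0
# The ratio between two valid tokens is the same with and without the mask.
print(math.isclose(constrained[0] / constrained[1],
                   unconstrained[0] / unconstrained[1]))  # True
```

The forbidden token ends up with probability exactly zero, and the remaining probability mass is redistributed among the valid tokens in the same proportions the model assigned them.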
The FSM compilation happens once per schema and is typically cached at startup. Masking runs at every token, but its cost is small: 10-40ms in total across a full generation, not per token.
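One way to get the compile-once behaviour is to key a cache on a canonical serialization of the schema. The sketch below assumes a hypothetical `compile_fsm` function whose body is only a placeholder for the expensive schema-to-FSM step; the caching pattern around it is the point.

```python
import json
from functools import lru_cache

compile_calls = 0  # counts how often the expensive step actually runs


@lru_cache(maxsize=128)
def compile_fsm(schema_json: str) -> tuple[str, ...]:
    """Hypothetical stand-in for schema-to-FSM compilation."""
    global compile_calls
    compile_calls += 1
    schema = json.loads(schema_json)
    # Placeholder "FSM": just the sorted property names.
    return tuple(sorted(schema.get("properties", {})))


def fsm_for(schema: dict) -> tuple[str, ...]:
    # Dicts are unhashable, so the cache key is a canonical JSON string.
    return compile_fsm(json.dumps(schema, sort_keys=True))


schema = {
    "type": "object",
    "properties": {"amount": {"type": "number"}, "currency": {"type": "string"}},
}
fsm_for(schema)
fsm_for(dict(schema))  # an equal schema in a different dict object hits the cache

print(compile_calls)  # 1
```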
JSON mode and structured output are routinely conflated, and the distinction matters in production.
JSON mode (available from OpenAI, Anthropic, and most providers) guarantees that the output is syntactically valid JSON. The model will not emit markdown fences, prose preambles, or malformed brackets. It can still return {}, omit required fields, return numbers as strings, or include extra fields you did not ask for. JSON mode solves the "garbage text" problem but not the "wrong schema" problem.
Structured output (OpenAI's strict: true, Anthropic tool_use with a defined input schema, Google's response_schema) enforces your exact schema: specific field names, specific types, enum values, and required vs. optional distinctions. If you define amount: float, the model cannot return "149.99". If you define currency: Literal["USD", "EUR", "GBP"], the model cannot return "Canadian dollars".
I have seen teams ship JSON mode to production, watch the parse failure rate drop from 3% to 0.5%, and declare victory. That remaining 0.5% is worse: it fails silently when invoice_number is missing, because the response is technically valid JSON and no exception fires.
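The silent failure is easy to reproduce. In the sketch below, every input parses without an exception, which is all JSON mode promises, yet only one of them matches the invoice schema. The `schema_errors` helper is a hand-rolled check written for this example; in practice a validation library such as Pydantic would do this job.

```python
import json

# Field name -> accepted Python types after json.loads.
REQUIRED = {"invoice_number": (str,), "amount": (int, float), "currency": (str,)}
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


def schema_errors(raw: str) -> list[str]:
    """Return schema violations for a payload that is already syntactically valid JSON."""
    data = json.loads(raw)  # with JSON mode this line does not raise
    errors = []
    for field, expected in REQUIRED.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}: {type(data[field]).__name__}")
    currency = data.get("currency")
    if isinstance(currency, str) and currency not in ALLOWED_CURRENCIES:
        errors.append(f"currency not in enum: {currency}")
    return errors


good = '{"invoice_number": "INV-2024-001", "amount": 149.99, "currency": "USD"}'
bad = '{"amount": "149.99", "currency": "Canadian dollars"}'

print(schema_errors(good))  # []
print(schema_errors(bad))
# ['missing field: invoice_number', 'wrong type for amount: str',
#  'currency not in enum: Canadian dollars']
```

Schema-enforced structured output makes the second payload impossible to generate, rather than something you have to detect afterwards.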
For self-hosted models, constrained decoding is implemented in the inference library rather than the model weights. Three tools are in common use today.
Outlines converts a Pydantic schema (or a regex pattern) into a token-level FSM, then injects the logit mask at each generation step. It works with any HuggingFace-compatible model and adds roughly 15-40ms of overhead per generation call.
import outlines
from pydantic import BaseModel
from typing import Literal

class Invoice(BaseModel):
    invoice_number: str
    amount: float
    currency: Literal["USD", "EUR", "GBP"]
    line_items: list[str]

# Load any HuggingFace-compatible model
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Outlines compiles the Invoice schema to a token-level FSM automatically
generator = outlines.generate.json(model, Invoice)

prompt = "Extract invoice data: INV-2024-001, $149.99 USD, office supplies"
invoice: Invoice = generator(prompt)

# invoice is a fully typed Pydantic object. No parsing, no try/except.
print(invoice.amount)    # 149.99 (float, not string)
print(invoice.currency)  # "USD" (validated enum, not "US dollars")