Learn how function calling lets LLMs trigger real APIs and return structured data, why it's essential for production AI systems, and how to design tool schemas that work reliably.
29 min read | 2026-04-05 | medium | function-calling, tool-use, llm, ai, structured-output
Function calling lets you declare tool schemas alongside your LLM request; the model decides when to invoke them and returns structured JSON you execute on your side.
It replaces fragile "parse JSON from text" patterns with schema-enforced, protocol-level structured output.
Parallel tool calls (GPT-4o, Claude 3.5+) let the model invoke multiple independent tools in a single round-trip, which can cut latency roughly in half for multi-tool queries.
Tool descriptions are load-bearing: the model reads them to decide whether a tool applies. Bad descriptions cause wrong tool selection more than any other factor.
OpenAI's structured output feature uses constrained decoding: tokens that would violate the schema are masked out during generation, so responses comply with the schema without retries.
Function calling is the bridge between "chatbot" and "agent." Every production AI system that takes actions uses it.
Ask an LLM to return a JSON object with a temperature field. Eight times out of ten, you get valid JSON. The other two times, the model wraps it in a markdown code fence, prepends "Sure! Here is the JSON:", or invents extra fields your parser does not expect. Your downstream code breaks.
You can throw regex at it. Teams do. They write string-stripping logic, retry on parse failure, and add increasingly desperate system prompt instructions like "RESPOND WITH ONLY JSON. NO EXPLANATION." It works until it does not.
The deeper problem is that text generation is probabilistic. Every token is a dice roll. Without structural enforcement, there is no guarantee the model's output will conform to any schema. You cannot build reliable systems on "usually works."
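To make the failure mode concrete, here is a minimal sketch of the kind of sanitizer teams end up writing. The function name `extractJsonFromText` and the sample strings are illustrative and not taken from any library. It handles the code fence and the chatty preamble, and it still lets unexpected fields through.

```typescript
// Hypothetical sanitizer: strip known wrappers, then try to parse.
// Returns null when nothing parseable is found, which forces the caller to retry.
function extractJsonFromText(raw: string): Record<string, unknown> | null {
  // Remove a markdown code fence if the model added one.
  const fenced = raw.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/);
  const candidate = fenced?.[1] ?? raw;
  // Drop any preamble or trailing chatter by cutting to the outermost braces.
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(candidate.slice(start, end + 1)) as Record<string, unknown>;
  } catch {
    return null;
  }
}

// Sample outputs of the kinds described above (made up for illustration).
const fence = "`".repeat(3);
const wrapped =
  "Sure! Here is the JSON:\n" + fence + "json\n" + '{"temperature": 22}' + "\n" + fence;
const parsedWrapped = extractJsonFromText(wrapped);

// The sanitizer cannot help here: the JSON is valid, but the shape is not what we asked for.
const parsedExtra = extractJsonFromText('{"temperature": 22, "mood": "sunny"}');
```

Every new wrapper the model invents means another branch in this function, and none of it constrains which keys come back.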
I have watched teams spend weeks building retry logic and output sanitizers when they could have switched to function calling in an afternoon. The model is not unreliable; you are asking it to do something (produce perfectly formatted text) that its architecture does not guarantee. Function calling moves the constraint from "hope the text is valid" to "the protocol enforces it."
Function calling (also called "tool use") is a model API feature where you declare a list of tool schemas alongside your request. Each schema specifies a tool name, a natural language description, and a JSON Schema for its parameters. The model reads the descriptions, examines the conversation, and decides: should I answer directly, or should I invoke one of these tools first?
Think of it like a doctor's office. The doctor (the model) listens to your symptoms (the user query), decides which lab tests to order (tool calls), waits for results (tool execution), and then gives you a diagnosis (final response). The doctor does not run the tests personally. They write an order, the lab executes it, and the results come back.
When the model decides to invoke a tool, it returns a structured JSON object specifying which tool to call and what arguments to pass. Your application code executes the tool, collects the result, and sends it back to the model. The model then incorporates that result into its final response. The model never calls your function directly. It returns intent; you execute.
This separation is a core architectural principle. The model stays stateless and sandboxed. Your execution environment stays in control of what actually happens, including authentication, rate limiting, validation, and error handling.
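A minimal sketch of that separation, under some assumptions: the `ToolCall` shape loosely mirrors OpenAI-style responses (arguments arrive as a JSON-encoded string), but exact field names vary by provider, and `toolRegistry` and `dispatchToolCall` are illustrative names. The weather values are stubbed.

```typescript
// Simplified shape of a tool call as returned by the model.
// Assumption: arguments arrive as a JSON-encoded string, as in OpenAI-style APIs.
interface ToolCall {
  id: string;
  name: string;
  arguments: string;
}

type ToolHandler = (args: Record<string, unknown>) => unknown;

// Hypothetical registry: only tools listed here can ever run.
const toolRegistry: Record<string, ToolHandler> = {
  // Stubbed implementation; a real handler would call a weather API.
  get_weather: (args) => ({ city: args["city"], temperature: 22, condition: "clear" }),
};

// The model returned intent; this function is where execution actually happens.
function dispatchToolCall(call: ToolCall): unknown {
  const handler = toolRegistry[call.name];
  if (!handler) {
    throw new Error(`Unknown tool: ${call.name}`);
  }
  const args = JSON.parse(call.arguments) as Record<string, unknown>;
  return handler(args);
}

// What the model might hand back for "What's the weather in Tokyo?"
const modelOutput: ToolCall = {
  id: "call_1",
  name: "get_weather",
  arguments: '{"city":"Tokyo"}',
};
const result = dispatchToolCall(modelOutput);
```

Because the registry is the only path to execution, a tool name the model makes up fails closed instead of reaching your infrastructure.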
Every tool is defined by three fields: name, description, and parameters (a JSON Schema object). The description is the most important field because the model reads it to decide when this tool applies.
// TypeScript: defining a tool schema for OpenAI's API
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description:
        "Get the current weather for a specific city. Returns temperature " +
        "in Celsius, condition, and humidity. Use when the user asks about " +
        "current weather conditions.",
      parameters: {
        type: "object",
        properties: {
          city: {
            type: "string",
            description: "City name, e.g. 'Tokyo' or 'San Francisco'",
          },
          units: {
            type: "string",
            enum: ["celsius", "fahrenheit"],
            description: "Temperature unit. Defaults to celsius.",
          },
        },
        required: ["city"],
      },
    },
  },
];
Notice the parameter-level descriptions. The city field includes example values. The units field uses an enum constraint. These details directly reduce argument hallucination. I have seen schemas with zero parameter documentation where the model consistently passes malformed inputs, and the fix is always the same: write better descriptions.
Interview tip: the one-sentence value prop
Say "function calling is how LLMs push data out to the world and pull real-time data in." That framing makes the value proposition concrete. Then describe the round-trip: model returns a structured call, you execute it, you return the result.
A function calling interaction is a multi-turn conversation. The user sends a message, the model returns a tool call instead of text, you execute it, return the result, and the model generates the final answer.
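The loop can be sketched with a stubbed model in place of a real API call. Everything here is an assumption made for illustration: the `Message` and `ModelReply` types are simplified stand-ins for what a provider SDK defines, and `runConversation`, `stubModel`, and `stubRunTool` are hypothetical names.

```typescript
type Message = { role: "user" | "assistant" | "tool"; content: string; toolCallId?: string };

type ModelReply =
  | { kind: "text"; content: string }
  | { kind: "tool_call"; id: string; name: string; args: Record<string, unknown> };

type Model = (messages: Message[]) => Promise<ModelReply>;
type ToolRunner = (name: string, args: Record<string, unknown>) => Promise<string>;

// Drive the conversation until the model answers in text.
// maxTurns is a safety cap (an assumption of this sketch) so a confused model cannot loop forever.
async function runConversation(
  model: Model,
  runTool: ToolRunner,
  userMessage: string,
  maxTurns = 5,
): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userMessage }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await model(messages);
    if (reply.kind === "text") return reply.content;
    // Record the model's request, execute it on our side, then feed the result back.
    messages.push({ role: "assistant", content: JSON.stringify(reply), toolCallId: reply.id });
    const output = await runTool(reply.name, reply.args);
    messages.push({ role: "tool", content: output, toolCallId: reply.id });
  }
  throw new Error("Model did not produce a final answer within maxTurns");
}

// Stub model standing in for a real API call: asks for the weather once, then answers.
const stubModel: Model = async (messages) => {
  const last = messages[messages.length - 1];
  if (last && last.role === "tool") {
    return { kind: "text", content: `Weather report: ${last.content}` };
  }
  return { kind: "tool_call", id: "call_1", name: "get_weather", args: { city: "Tokyo" } };
};

// Stub tool runner returning canned data.
const stubRunTool: ToolRunner = async (name, args) =>
  JSON.stringify({ tool: name, city: args["city"], temperature: 22 });

const finalAnswer = runConversation(stubModel, stubRunTool, "What's the weather in Tokyo?");
```

The shape to notice is that the message history grows by two entries per tool call: the model's request and your result.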
The critical detail: between the tool_call and the tool_result, your application code runs. This is where you validate arguments, check permissions, call real APIs, handle errors, and sanitize results before returning them. The model never touches your infrastructure directly.
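As one example of what runs in that gap, here is a hand-rolled argument check for the `get_weather` tool. In practice a JSON Schema validator library would likely do this job; the manual version keeps the sketch self-contained. `validateWeatherArgs` and the `Validation` type are illustrative names.

```typescript
interface WeatherArgs {
  city: string;
  units: "celsius" | "fahrenheit";
}

type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

// Mirrors the get_weather schema by hand: required city, optional enum units.
function validateWeatherArgs(rawArguments: string): Validation<WeatherArgs> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return { ok: false, error: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "arguments must be a JSON object" };
  }
  const { city, units } = parsed as Record<string, unknown>;
  if (typeof city !== "string" || city.trim() === "") {
    return { ok: false, error: "city is required and must be a non-empty string" };
  }
  // Apply the documented default when the model omits units.
  let resolvedUnits: WeatherArgs["units"] = "celsius";
  if (units === "celsius" || units === "fahrenheit") {
    resolvedUnits = units;
  } else if (units !== undefined) {
    return { ok: false, error: "units must be 'celsius' or 'fahrenheit'" };
  }
  return { ok: true, value: { city, units: resolvedUnits } };
}

const accepted = validateWeatherArgs('{"city":"Tokyo"}');
const rejected = validateWeatherArgs('{"city":"Tokyo","units":"kelvin"}');
```

Returning a descriptive error string instead of throwing makes it easy to pass the problem back to the model as a tool result.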
If tool execution fails (API timeout, invalid arguments, authorization failure), you return an error message as the tool result. The model sees that error and can either try a different approach or explain the failure to the user. Build this error path explicitly; do not let exceptions silently break the loop.
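A minimal sketch of an explicit error path. The `{ ok, data | error }` envelope is a convention chosen for this example, not something any provider requires, and `executeSafely` and `failingLookup` are hypothetical names.

```typescript
interface ToolResultMessage {
  role: "tool";
  toolCallId: string;
  content: string;
}

// Run a tool and always hand the model a result message, even when execution fails.
async function executeSafely(
  toolCallId: string,
  run: () => Promise<unknown>,
): Promise<ToolResultMessage> {
  try {
    const data = await run();
    return { role: "tool", toolCallId, content: JSON.stringify({ ok: true, data }) };
  } catch (err) {
    // Pass along a short message only; stack traces and internals stay on our side.
    const message = err instanceof Error ? err.message : String(err);
    return { role: "tool", toolCallId, content: JSON.stringify({ ok: false, error: message }) };
  }
}

// Hypothetical failing tool, standing in for an upstream API timeout.
const failingLookup = async (): Promise<unknown> => {
  throw new Error("weather API timed out");
};

const errorResult = executeSafely("call_1", failingLookup);
```

The loop keeps going either way: the model receives a tool result for every call it made and can decide whether to retry, switch tools, or explain the failure.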
Modern models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) support parallel tool calling. When the model determines that two or more tools are independent, it returns all calls in a single response. You execute them concurrently and return all results at once.
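On the application side, concurrent execution can be as simple as `Promise.all` over the returned calls. This is a sketch under assumptions: the `ToolCall` and `ToolResult` shapes are simplified, and both tools are stubs returning canned data.

```typescript
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface ToolResult {
  toolCallId: string;
  content: string;
}

type AsyncTool = (args: Record<string, unknown>) => Promise<unknown>;

// Stubbed tools; real ones would call external APIs.
const asyncTools: Record<string, AsyncTool> = {
  get_weather: async (args) => ({ city: args["city"], temperature: 22 }),
  get_time: async (args) => ({ city: args["city"], time: "09:00" }),
};

// Execute every call from one model response concurrently.
// Promise.all preserves input order, so results line up with the calls.
async function executeInParallel(calls: ToolCall[]): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolResult> => {
      try {
        const tool = asyncTools[call.name];
        if (!tool) throw new Error(`Unknown tool: ${call.name}`);
        return { toolCallId: call.id, content: JSON.stringify(await tool(call.args)) };
      } catch (err) {
        // One failing call should not sink the whole batch.
        const message = err instanceof Error ? err.message : String(err);
        return { toolCallId: call.id, content: JSON.stringify({ error: message }) };
      }
    }),
  );
}

const batchResults = executeInParallel([
  { id: "call_1", name: "get_weather", args: { city: "Tokyo" } },
  { id: "call_2", name: "get_time", args: { city: "Tokyo" } },
]);
```

Each result carries the id of the call it answers, which is how the model matches results to requests when several come back at once.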
Sequential: 4 round-trips. Parallel: 2 round-trips. For a three-tool query, the savings compound. In production, I have measured parallel tool calls cutting p50 latency by 40-60% for multi-tool queries.
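The round-trip arithmetic can be written down directly. This sketch assumes one particular way of counting: each request to the model is one round-trip, and a sequential flow returns one tool call per response. Other ways of counting will give different absolute numbers.

```typescript
// Sequential: one model request per tool call, plus one more for the final answer.
function sequentialRoundTrips(independentToolCalls: number): number {
  return independentToolCalls + 1;
}

// Parallel: one request returns every call, a second request produces the final answer.
// With no tool calls at all, the model answers in a single request.
function parallelRoundTrips(independentToolCalls: number): number {
  return independentToolCalls === 0 ? 1 : 2;
}

const threeToolQuery = {
  sequential: sequentialRoundTrips(3), // 4
  parallel: parallelRoundTrips(3), // 2
};
```

Under this counting the parallel figure stays flat while the sequential one grows with every extra tool, which is why the gap widens for larger queries.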
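Not all calls can be parallelized. If tool B depends on the output of tool A (e.g., "look up the user, then check their order status"), the calls have to run in sequence: the model requests A, waits for the result, then requests B. Models generally infer this ordering on their own from the parameters each tool needs, but it is not guaranteed, so keep validating arguments before you execute.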
JSON mode is the blunt instrument: you get valid JSON, but its shape is unpredictable. The model might return {"answer": "Tokyo"} or {"result": {"city": "Tokyo", "temp": 22}} depending on its mood.
Function calling gives schema enforcement, but only when the model decides to invoke a tool. If it answers directly, you get free-form text.
Structured output (OpenAI, August 2024) uses constrained decoding to guarantee that every response conforms to your JSON Schema. It modifies the token probabilities during generation so invalid tokens are impossible. This is the most reliable option when you always need a specific shape, regardless of tool invocation.
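For comparison, here is roughly what the two request shapes look like for OpenAI's Chat Completions API. Only the request bodies are built; no network call is made. The schema, the `weather_report` name, and the model choice are illustrative assumptions. Strict mode expects `additionalProperties: false` and every property listed in `required`.

```typescript
// JSON mode: valid JSON is requested, but no particular shape.
const jsonModeFormat = { type: "json_object" };

// Illustrative schema for the response shape we always want back.
const weatherReportSchema = {
  type: "object",
  properties: {
    city: { type: "string" },
    temperature: { type: "number" },
    condition: { type: "string" },
  },
  required: ["city", "temperature", "condition"],
  additionalProperties: false,
};

// Structured output: the schema is attached to the request and enforced during decoding.
const structuredOutputRequest = {
  model: "gpt-4o-2024-08-06",
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "weather_report",
      strict: true,
      schema: weatherReportSchema,
    },
  },
};
```

With the strict variant, the parsing code on your side can assume the three fields are present, though it is still worth handling the case where the model declines to answer.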