Function Calling and Tool Use: OpenAI's Approach Explained
Function calling is the mechanism that turns GPT from a text generator into something that can do things. You describe functions — name, description, parameters as JSON Schema — and the model decides when to call them and with what arguments. It doesn't execute anything itself. It says "I'd like to call get_weather with {"city": "Chicago"}" and hands that back to you. You execute it, return the result, and the model incorporates it into its response. This loop — describe, delegate, execute, return — is the foundation of every tool-using AI application built on OpenAI's API. Getting it right matters more than most developers realize when they first implement it.
What The Docs Say
OpenAI's function calling documentation lays out a clean four-step process. First, you include function definitions in your API call — each function gets a name, a description, and a parameters object defined as JSON Schema. Second, the model processes the user's message and decides whether to call a function — and if so, which one with what arguments. Third, you receive the function call in the response, execute it in your own environment, and send the result back. Fourth, the model uses that result to generate its final response to the user.
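A minimal sketch of that four-step loop against the Chat Completions API. The `get_weather` implementation, the model name, and the single-round structure are placeholders; real code needs error handling and a loop for multi-round tool use. The `client` argument is expected to be an `openai.OpenAI` instance, duck-typed here so the sketch stays self-contained:

```python
import json

# Step 1: describe the function -- name, description, parameters as JSON Schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    # Hypothetical implementation -- swap in a real weather lookup.
    return {"city": city, "temp_f": 41, "conditions": "cloudy"}

def run_turn(client, messages):
    """Run one user turn; `client` duck-types openai.OpenAI."""
    # Step 2: the model decides whether to call a function, and with what arguments.
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    if not msg.tool_calls:
        return msg.content
    # Step 3: execute each requested call in our environment and send the
    # result back as a "tool" message keyed by tool_call_id.
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(get_weather(**args)),
        })
    # Step 4: the model folds the results into its final response.
    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    return final.choices[0].message.content
```

Note that the model never runs anything: `run_turn` owns execution, which is exactly where logging, retries, and validation belong.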
The docs describe several control mechanisms. tool_choice lets you force the model to call a specific function, require that it call some function, forbid function calls entirely, or leave the decision to the model. parallel_tool_calls enables the model to request multiple function calls in a single response turn. And the structured outputs feature — applied to function calling via strict: true — guarantees that the model's function call arguments conform exactly to your JSON Schema.
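Here is how those knobs appear on a Chat Completions request. The tool definition and model name are illustrative; the parameter names are from the API:

```python
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Weather in Chicago and NYC?"}],
    "tools": [weather_tool],
    # Force this specific function. Alternatives: "auto" (model decides),
    # "required" (must call something), "none" (no calls allowed).
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
    # Allow the model to request both cities in a single turn.
    "parallel_tool_calls": True,
}
```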
The parallel function calling feature is presented as an efficiency gain. Instead of the model calling get_weather("Chicago"), waiting for the result, then calling get_weather("New York"), it can request both in one turn. The docs frame this as a natural extension of how tools should work — batch when you can.
Structured outputs with function calling get their own section. When you set strict: true on a function definition, OpenAI guarantees that the returned arguments are valid JSON matching your schema. Not "usually valid" — guaranteed valid. The docs describe this as the reliability feature for production applications where malformed arguments would cause downstream failures.
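A strict-mode definition looks like an ordinary one plus `strict: true`, with the schema restrictions the API reference imposes: every property listed in `required` and `additionalProperties` set to false (fields that are optional in spirit get a nullable type instead). The `create_task` function here is illustrative:

```python
create_task = {
    "type": "function",
    "function": {
        "name": "create_task",
        "description": "Create a task in the given project.",
        "strict": True,  # arguments are guaranteed to match the schema
        "parameters": {
            "type": "object",
            "properties": {
                "project_id": {"type": "string"},
                "title": {"type": "string"},
                # Optional-in-spirit field: strict mode still requires it,
                # so make it nullable and let the model pass null.
                "due_date": {"type": ["string", "null"]},
            },
            "required": ["project_id", "title", "due_date"],
            "additionalProperties": False,
        },
    },
}
```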
What Actually Happens
The function description matters more than the schema. This is the single most important practical insight about OpenAI's function calling, and the docs understate it. The model uses your function descriptions to decide when to call which function. A function described as "Get the current weather for a given city" will get called differently than one described as "Retrieve real-time meteorological data for a specified geographic location, including temperature, humidity, and wind speed." The description is the model's instruction manual for your tool. Poorly described functions get called at the wrong times, with the wrong expectations, for the wrong purposes.
I've tested this extensively. Take a function set with five tools, give them terse one-line descriptions, and the model will misroute calls about 15-20% of the time — calling a search function when it should have called a lookup function, or calling a create function when the user asked to update. Rewrite the descriptions to be specific about when each function should be used, what it expects, and what it returns — the misroute rate drops to under 5%. The schema defines the shape of the arguments. The description defines the semantics of the tool. The model cares more about semantics.
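To make the contrast concrete, here are a terse and a specific description for the same hypothetical catalog-search tool. The second tells the model when to call it, what it returns, and how it differs from a sibling lookup tool (`get_product_by_sku`, also hypothetical):

```python
# Terse: the model has to guess when this applies.
terse = "Search products."

# Specific: when to call it, when not to, and what comes back.
specific = (
    "Search the product catalog by free-text query. Call this when the "
    "user describes a product without naming it exactly ('a red running "
    "shoe'). If the user gives an exact product name or SKU, call "
    "get_product_by_sku instead. Returns up to 10 matches, each with "
    "SKU, name, and price."
)

search_products = {
    "type": "function",
    "function": {
        "name": "search_products",
        "description": specific,
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
```

The schema is identical either way; only the description changed, and the description is what drives routing.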
Parallel function calling is a net positive for batch operations and a footgun for order-dependent ones. When a user says "what's the weather in Chicago and New York," parallel calls to get_weather are correct and efficient. When a user says "create a new project and then add a task to it," parallel calls to create_project and add_task will fail because add_task needs the project ID that create_project hasn't returned yet. The model is reasonably good at detecting order dependency, but "reasonably good" means it gets it wrong often enough that you need to handle it. The pragmatic fix is to either disable parallel function calls for order-dependent tool sets or to build your execution layer to detect and serialize dependent calls.
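One defensive pattern for the execution layer, sketched under assumptions: when a parallel batch contains a call that depends on an earlier call's output, execute only the independent calls and send the dependent ones back as errors, so the model re-issues them with real values on the next turn. The dependency table and tool names are hypothetical:

```python
# (earlier_tool, later_tool) pairs where the later call needs the
# earlier call's output -- declare your own ordering rules here.
DEPENDENT_PAIRS = {("create_project", "add_task")}

def plan_execution(tool_calls):
    """Split a parallel batch into (run_now, reject_as_errors)."""
    run_now, rejected = [], []
    seen = set()
    for call in tool_calls:
        name = call["name"]
        if any((earlier, name) in DEPENDENT_PAIRS for earlier in seen):
            # This call needs an ID the model hasn't seen yet; bounce it
            # back so the model retries with the real value.
            rejected.append(call)
        else:
            run_now.append(call)
            seen.add(name)
    return run_now, rejected
```

The other pragmatic option, disabling the feature outright, is a single request parameter: `parallel_tool_calls: false`.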
Structured outputs with strict: true solve the schema compliance problem and are worth using for essentially every production function definition. Before strict mode, the model would occasionally return arguments that were close-but-not-quite — a string where you expected a number, an array with the wrong structure, a missing required field. With strict mode, the arguments match your schema or the call doesn't happen. The cost is minor: the first call with a new schema has a small latency bump while OpenAI compiles the schema [VERIFY], and your schema needs to conform to a subset of JSON Schema — no oneOf, no $ref across definitions, and a few other restrictions documented in the API reference. For most function definitions, these restrictions don't matter. For complex nested schemas, you might need to restructure.
The common failure modes cluster into a few categories. The model hallucinating parameter values is the most dangerous — it will confidently generate a user ID, a product SKU, or a file path that doesn't exist, especially when the correct value wasn't present in the conversation. This isn't a schema problem — the value is a syntactically valid string. It's a semantic problem — the model is guessing rather than admitting it doesn't know. The fix is to validate returned arguments against your actual data before executing, which sounds obvious but is easy to skip when the function calling loop feels reliable most of the time.
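That pre-execution validation can be as simple as a membership check before the real work, with a structured error returned as the tool result so the model asks the user instead of acting on a guessed ID. The order store here is a hypothetical stand-in for your database:

```python
# Hypothetical stand-in for a real order lookup table.
KNOWN_ORDERS = {"12345", "12346"}

def safe_get_order_status(order_id):
    # Schema-valid is not the same as real: check the value against
    # actual data before executing anything.
    if order_id not in KNOWN_ORDERS:
        return {"error": f"No order with id {order_id!r}. "
                         "Ask the user to confirm the order number."}
    return {"order_id": order_id, "status": "shipped"}
```

Returning the error as the tool result, rather than raising, keeps the loop alive: the model sees the failure and can recover in its next response.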
Calling the wrong function happens less with good descriptions but doesn't disappear entirely. The model sometimes calls a delete function when the user asked to "remove" something from a list, conflating "remove from view" with "delete from database." It sometimes calls a search function when the user referenced a specific item by name, wasting a search call when a direct lookup would be correct. These misroutes are manageable with good error handling but annoying in production.
Not calling functions when it should — the model answering from its training data instead of calling your lookup tool — is a persistent issue. If a user asks "what's the status of order #12345" and the model has a plausible guess from the conversation context, it will sometimes just answer rather than calling get_order_status. The tool_choice: "required" parameter forces a function call, but that's a blunt instrument — sometimes the model correctly decides no function call is needed. The better fix is to make your function descriptions explicit about when they should be called: "Call this function whenever the user asks about order status, even if previous context seems to contain the answer."
Comparing OpenAI's function calling to Anthropic's tool use: the core concept is identical — describe tools, model selects and parameterizes calls, you execute. The implementation differences are in the details. Anthropic's tool use tends to be slightly more conservative about calling tools — it's less likely to speculatively call a function and more likely to ask for clarification [VERIFY]. OpenAI's implementation is more aggressive, which means faster workflows when it's right and more wasted calls when it's wrong. Anthropic supports a tool_use block in the response that's structurally similar to OpenAI's function call format. The strict mode for guaranteed schema compliance is an OpenAI-specific feature as of early 2026 — Anthropic handles schema compliance through model behavior rather than a guaranteed constraint [VERIFY]. In practice, both work. The reliability profiles differ by a few percentage points, not by category.
When To Use This
Function calling is the right approach whenever your AI application needs to interact with external systems — databases, APIs, internal tools, file systems, anything that exists outside the model's training data. If your application is "user asks a question, model answers from knowledge," you don't need function calling. If your application is "user asks a question, model needs to check current data, perform an action, or interact with a system to answer," you do.
Strict mode should be on by default for production applications. The guaranteed schema compliance eliminates an entire category of downstream bugs — malformed arguments causing runtime errors in your execution layer. The constraints on JSON Schema are minor. The latency penalty on first use is negligible. Turn it on and forget about it.
Parallel function calling should be enabled for tool sets where batch operations are common and disabled — or handled defensively in your execution layer — for tool sets where order dependency exists. If your functions are all read-only lookups, parallel is free performance. If your functions include create/update/delete operations, think carefully about dependency chains.
Invest heavily in function descriptions. Write them like you're explaining the tool to a new team member who needs to know not just what the function does but when to use it, when not to use it, and what to expect back. Include examples of when this function is the right choice versus another function in the set. This is the highest-leverage optimization in any function calling implementation.
When To Skip This
Skip function calling when the model can answer from context alone. If you're building a chatbot backed by a RAG pipeline and the retrieved context contains everything the model needs, adding function calling for the retrieval step adds complexity without benefit — just inject the retrieved content into the messages.
Skip it when your tool set is so large that the model can't reliably choose between functions. Empirically, function calling works well with up to about 10-15 well-described functions [VERIFY]. Beyond that, misrouting increases and you start needing routing layers — a first model call to select the right function category, then a second to parameterize the specific call. At that point, you're building an agent framework, and you should evaluate whether a purpose-built agent library serves you better than raw function calling.
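A routing layer can be as small as a category map: the first model call picks a category from a handful of options, and the second call sees only that category's tools. The categories and tool names below are hypothetical:

```python
# Hypothetical category map for a large tool set.
TOOL_CATEGORIES = {
    "orders": ["get_order_status", "cancel_order", "refund_order"],
    "catalog": ["search_products", "get_product_by_sku"],
    "account": ["get_profile", "update_email", "reset_password"],
}

def tools_for_category(category, all_tools):
    """Narrow the tool list before the second, parameterizing call."""
    names = set(TOOL_CATEGORIES[category])
    return [t for t in all_tools if t["function"]["name"] in names]
```

The first call can select the category via a single forced function whose only parameter is an enum of category names, which keeps the router itself inside the same function calling machinery.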
Skip the Assistants API's function calling wrapper if you want direct control over the execution loop. The Assistants API handles function calling within its run lifecycle, which adds structure but reduces visibility. If you want to log every function call, implement custom retry logic, or add pre-execution validation — raw Chat Completions with function calling gives you cleaner access to the full loop.
The bottom line: function calling is the most important practical feature in OpenAI's API for building applications that do things rather than just say things. The implementation is solid, the strict mode is a genuine reliability feature, and the failure modes are manageable with good descriptions and defensive execution. The skill isn't in calling the API — it's in writing descriptions that make the model use your tools correctly, and in building an execution layer that handles the cases where it doesn't.
This is part of CustomClanker's GPT Deep Cuts series — what OpenAI's features actually do in practice.