Structured Output: Getting JSON, Markdown, and Formats You Can Actually Use
LLMs produce natural language by default. They want to write paragraphs. Getting them to produce structured data — clean JSON, consistent markdown tables, valid CSV, properly nested YAML — requires either explicit prompt-level constraints or API-level enforcement features. Both approaches work. Neither is foolproof. The gap between "the model returned something that looks like JSON" and "the model returned valid JSON that matches my schema every time" is where most production pipelines break.
What The Docs Say
All three major providers have invested heavily in structured output capabilities at the API level, and all three document it as a recommended approach for production use.
OpenAI's structured outputs feature — shipped in 2024 and refined since — lets you define a JSON schema and guarantees the model's response will conform to it. Not "strongly encourages." Guarantees. The model's token generation is constrained at the sampling level so it literally cannot produce tokens that would break the schema. This is the nuclear option for format enforcement, and it works. OpenAI's documentation positions it as the preferred approach for any application that parses LLM output programmatically.
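To make the mechanism concrete, here is a sketch of what the request payload looks like. The schema and field values are illustrative, and the `response_format` shape follows OpenAI's structured outputs documentation at the time of writing — verify field names against the current API reference before relying on them.

```python
# Request payload sketch for OpenAI structured outputs. The product schema
# is hypothetical; the "strict": True flag is what turns the schema into a
# sampling-level constraint rather than a suggestion.
payload = {
    "model": "gpt-4o",  # any structured-outputs-capable model
    "messages": [
        {"role": "user", "content": "Extract the product from: 'Widget, $9.99'"}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "product",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["name", "price"],
                "additionalProperties": False,
            },
        },
    },
}
```

The payload is plain JSON, so it can be sent with the official SDK or any HTTP client; the point is that the schema travels with the request rather than living in the prompt text.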
Anthropic takes a different path. Claude doesn't have a dedicated "JSON mode" in the same way. Instead, Anthropic recommends using tool use (function calling) as a structured output mechanism — you define a tool with the schema you want, and Claude "calls" the tool with structured arguments that match your schema. It's a clever repurposing of the function calling API, and in practice it works reliably for getting consistent structured output. Anthropic's documentation explicitly suggests this pattern for applications that need guaranteed schema conformance.
Google's Gemini API offers response schemas — you specify a JSON schema in the API call (alongside a JSON response MIME type), and the model constrains its output accordingly. The approach is similar in spirit to OpenAI's, with the response schema documented as a constraint on generation rather than a suggestion, though enforcement guarantees vary by model version, so check the current docs for the models you use.
At the prompt level, all three providers recommend the same thing: include the expected schema in your prompt, provide few-shot examples of the desired output, and be explicit about format requirements. The docs agree that prompt-level format instructions work for simple structures but become unreliable for complex nested schemas. The API-level features exist precisely because prompting alone doesn't get you to 100% reliability.
What Actually Happens
The prompt-only approach works better than you'd expect for simple structures and worse than you'd hope for complex ones. That's the one-sentence summary of every structured output experience I've had across hundreds of tests.
For flat JSON — a single object with 5-10 string and number fields — putting the schema in the prompt and showing one example produces valid output roughly 95% of the time across Claude, GPT, and Gemini. The model sees the pattern, understands you want JSON, and returns clean, parseable output. The 5% failure rate comes from the model adding conversational text before or after the JSON block ("Here's the JSON you requested:"), occasionally omitting optional fields, or wrapping the JSON in markdown code fences when you didn't ask for them. These are annoyances, not showstoppers — a simple regex or string trim handles most of them.
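The "simple regex or string trim" for those failure modes can be sketched in a few lines. This is a best-effort cleanup for prompt-level output, assuming the response contains exactly one top-level JSON object; the function name is hypothetical.

```python
import json
import re

def extract_json(text: str) -> dict:
    """Strip common wrappers (preamble, markdown fences) and parse the JSON."""
    # Prefer the contents of a fenced code block if one is present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # Otherwise trim to the outermost braces to drop pre/postamble chatter.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])

messy = 'Sure, here is the JSON:\n```json\n{"name": "Widget", "price": 9.99}\n```\nLet me know!'
data = extract_json(messy)  # {"name": "Widget", "price": 9.99}
```

This handles the preamble and code-fence annoyances; it does nothing about schema drift or type errors, which is why a validation layer still sits behind it.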
For nested JSON — objects containing arrays of objects, multiple levels of nesting, conditional fields — prompt-only reliability drops to around 80-85%. The model handles the outer structure fine but starts making mistakes in the inner levels. Field names drift from your schema. Types come back wrong — a number comes back as a string, or an array comes back as a single object when there's only one element. Nested structures expose the fundamental tension between the model's desire to generate natural language and your requirement for machine-parseable output.
For complex schemas — deeply nested structures with 20+ fields, arrays of polymorphic objects, fields with specific enum constraints — prompt-only drops below 70% reliability and you should stop trying to make it work with prompting alone. This is where API-level structured output features earn their keep. The schema constraint at the sampling level eliminates an entire category of failure modes that no amount of prompt engineering can fully address.
The most reliable prompt-only technique for structured output is few-shot examples — and this is where the few-shot article in this series connects directly. Showing the model two or three examples of the exact output format, including edge cases like empty arrays and null fields, gets you further than a page of written format instructions. The model matches what it sees. If it sees three examples of perfectly formatted JSON with the same field names and types, it pattern-matches against those examples rather than trying to interpret your schema description.
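A few-shot prompt for structured output might be assembled like this. The field names and examples are hypothetical; the point is that the examples pin down names, types, and edge cases (a null field, an empty array) more firmly than a prose schema description would.

```python
# Hypothetical few-shot examples: (input, desired JSON output) pairs.
# Note the edge cases: a null price and an empty tags array.
EXAMPLES = [
    ('"Widget, $9.99, tools"',
     '{"product_name": "Widget", "price": 9.99, "tags": ["tools"]}'),
    ('"Gadget, price unknown"',
     '{"product_name": "Gadget", "price": null, "tags": []}'),
]

def build_prompt(item: str) -> str:
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in EXAMPLES)
    return (
        "Extract product data as JSON. Respond with only the JSON object, "
        "no additional text.\n\n"
        f"{shots}\n\nInput: {item}\nOutput:"
    )

prompt = build_prompt('"Sprocket, $4.50"')
```

Ending the prompt at "Output:" nudges the model to continue the established pattern rather than open with conversational preamble.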
The Common Failure Modes
Understanding how structured output breaks tells you what to validate and where to add guardrails.
Preamble and postamble. The most common failure: the model wraps your structured output in conversational text. "Sure, here's the JSON:" before the output and "Let me know if you need any changes." after it. The fix at the prompt level is to say "respond with only the JSON, no additional text." This works most of the time, but not always — particularly when the model encounters an edge case it wants to explain. At the API level, structured output mode eliminates this entirely.
Schema drift. The model returns valid JSON, but the field names don't match your schema. You asked for product_name and got productName or name or product. This happens more often with complex schemas where the model is inferring field names from context rather than copying them from your example. Few-shot examples are the strongest defense — if the model sees product_name in three examples, it'll use product_name.
Type coercion. Numbers come back as strings. Booleans come back as "yes" and "no" instead of true and false. Arrays with one element come back as a bare value. These are the failures that look like valid JSON to a casual reader but break your parser. A validation layer — Pydantic in Python, Zod in TypeScript — catches these before they propagate.
Hallucinated fields. The model adds fields you didn't ask for. Your schema has name, price, and category. The model adds description, rating, and in_stock because it "helpfully" figured you'd want those too. This is the LLM equivalent of scope creep, and it's particularly common with GPT, which tends to be more eager to add information than Claude. Strict schema validation rejects responses with unexpected fields.
Missing fields. The inverse problem. The model omits fields it couldn't confidently fill — which is arguably the right behavior, but it breaks parsers that expect every field to be present. Defining default values in your schema and handling missing fields in your validation layer is the standard fix.
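The last three failure modes — type coercion, hallucinated fields, missing fields — all fall to the same validation layer. A stdlib sketch of what that layer does, with a hypothetical schema (in production you would reach for Pydantic or Zod, which automate all of this):

```python
# Schema as {field: (type, default)}; a None default marks the field required.
SCHEMA = {
    "name": (str, None),
    "price": (float, None),
    "in_stock": (bool, True),
}

def validate(record: dict) -> dict:
    # Reject hallucinated fields the schema never asked for.
    extra = set(record) - set(SCHEMA)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    clean = {}
    for field, (typ, default) in SCHEMA.items():
        if field not in record:
            # Fill missing optional fields; fail on missing required ones.
            if default is None:
                raise ValueError(f"missing required field: {field}")
            clean[field] = default
            continue
        value = record[field]
        if typ is bool and isinstance(value, str):
            value = value.lower() in ("yes", "true")  # "yes"/"no" -> bool
        clean[field] = typ(value)  # e.g. "29.99" -> 29.99
    return clean

result = validate({"name": "Widget", "price": "29.99"})
```

Every failure becomes an explicit `ValueError` with a message you can log — or, per the retry pattern below, send back to the model.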
Function Calling as Structured Output
The most reliable prompt-level technique for structured output — across all providers — is defining a function or tool that the model "calls" with structured arguments. You're not actually calling a function. You're exploiting the fact that function calling requires the model to produce structured arguments that match a defined schema, and the providers have optimized their models to be very good at this because tool use is a core feature.
In Anthropic's API, you define a tool with the parameters matching your desired output schema, then tell the model to use that tool. Claude returns a tool_use response with structured arguments that match your schema. In OpenAI's API, you define a function and either let the model choose to call it or force a function call. The result is structured arguments conforming to your function definition.
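A sketch of the Anthropic side of this pattern. The tool name and schema are illustrative, and the request shape follows Anthropic's tool-use documentation at the time of writing — confirm the exact field names against the current reference.

```python
# Hypothetical tool whose input_schema IS the output schema we want.
record_product = {
    "name": "record_product",
    "description": "Record structured product data extracted from the text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
        },
        "required": ["product_name", "price"],
    },
}

# With the SDK you would pass the tool and force its use, roughly:
#   client.messages.create(..., tools=[record_product],
#                          tool_choice={"type": "tool", "name": "record_product"})
# then read the structured arguments out of the tool_use content block.
```

No function ever runs; the tool definition exists purely so the model's trained function-calling behavior produces arguments matching your schema.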
This works better than prompt-only structured output for a specific reason: the models have been specifically trained and fine-tuned to produce valid function arguments. It's a behavior path that gets more training signal and more optimization attention than "respond in JSON format." The reliability difference is meaningful — in my testing, function calling lands around 98% schema conformance versus 85-90% for prompt-only JSON instructions on moderately complex schemas.
The downside is complexity. Setting up function calling requires using the API, defining the function schema in the provider's specific format, and parsing the response from the tool_use or function_call response structure rather than plain text. For a one-off task in the chat interface, it's overkill. For a production pipeline, it's the right investment.
The Validation Layer
The single most important piece of advice for structured output is this: never trust LLM output without parsing and validating it. Not sometimes. Not for simple schemas. Never.
In Python, Pydantic is the standard tool. You define your schema as a Pydantic model, parse the LLM output through it, and Pydantic handles type coercion, missing field defaults, extra field rejection, and validation errors. The instructor library wraps this pattern — it takes your Pydantic model, creates the appropriate API call with structured output constraints, and automatically validates and retries on failure.
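A minimal sketch of the Pydantic pattern, assuming Pydantic v2 (the model and field names are hypothetical):

```python
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: float
    category: str = "uncategorized"  # default fills a field the model omits

# Simulated LLM output: price came back as a string, category is missing.
raw = '{"name": "Widget", "price": "29.99"}'
product = Product.model_validate_json(raw)  # coerces price, applies default

try:
    Product.model_validate_json('{"name": "Widget"}')  # missing required price
except ValidationError as err:
    error_text = str(err)  # this message is what you'd feed back in a retry
```

Pydantic's default (lax) mode handles the string-to-number coercion silently; strict mode would reject it instead, which is the right choice when you'd rather retry than coerce.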
In TypeScript, Zod serves the same role. Define your schema with Zod, parse the LLM output through it, and get typed, validated data or an error. The Vercel AI SDK integrates Zod schemas directly with LLM calls, creating a pipeline where the schema is both the API constraint and the validation layer.
The retry pattern is worth calling out explicitly. When validation fails — and it will fail, even with API-level structured output on edge cases — the standard approach is to send the validation error back to the model along with the original request and ask it to fix its output. "Your response failed validation: field 'price' expected number, got string '29.99'. Please correct your response." Most models fix the error on the first retry. Build this retry loop into any production pipeline that depends on structured output.
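The retry loop itself is short. In this sketch, `call_model` is a stand-in for a real API call, stubbed here to fail validation once and then succeed; the validation check is deliberately minimal.

```python
import json

# Stubbed model: returns a bad type first, a valid response on retry.
_responses = iter(['{"price": "oops"}', '{"price": 29.99}'])

def call_model(prompt: str) -> str:
    return next(_responses)  # stand-in for a real API call

def get_validated(prompt: str, max_retries: int = 2) -> dict:
    for _ in range(max_retries + 1):
        data = json.loads(call_model(prompt))
        if isinstance(data.get("price"), (int, float)):
            return data
        # Send the validation error back so the model can correct itself.
        prompt += (
            f"\n\nYour response failed validation: field 'price' expected "
            f"number, got {data.get('price')!r}. Please correct your response."
        )
    raise RuntimeError("validation failed after retries")

result = get_validated("Return the price as JSON.")  # succeeds on retry
```

Cap the retries — a model that fails twice on the same schema usually needs a prompt or schema fix, not a third attempt.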
Prompt-Level vs. API-Level: When To Use Each
Use prompt-level structured output (schema in prompt + few-shot examples) when you're working in the chat interface, when your schema is simple and flat, when you're prototyping and don't want API setup overhead, or when you're willing to do light post-processing on the output. This approach is fast, flexible, and good enough for most non-critical tasks.
Use API-level structured output (JSON mode, function calling, response schemas) when you're building a production pipeline, when your schema is complex or nested, when you need 99%+ schema conformance, or when downstream systems will parse the output without human review. The API-level approach trades setup complexity for reliability — and in production, reliability is the only thing that matters.
The worst approach is combining a complex schema with prompt-only instructions and no validation layer, then being surprised when the output breaks your parser at 3am on a Saturday. If the output matters enough to parse programmatically, it matters enough to validate.
When To Use This
Use structured output techniques whenever your LLM output will be consumed by code rather than humans — API responses, database entries, configuration files, data pipeline inputs. Use few-shot examples for simple structures and API-level features for complex ones. Always validate. Build the retry loop.
When To Skip This
Skip structured output when the output is for human consumption and natural language is fine. If you're generating an email draft, a blog post, a summary, or a conversational response, forcing structured output adds complexity without benefit. The model's natural language output is the structured output — paragraphs, sentences, words in a sequence that a human reads. The structure tooling is for machines, not people.
This is part of CustomClanker's Prompting series — what actually changes output quality.