Extended Thinking: When It Helps and When It's Overhead
Extended thinking is Claude's way of showing its work. Turn it on, and before Claude gives you an answer, it reasons through the problem in a visible chain of thought — working through steps, considering alternatives, catching its own mistakes. Anthropic frames this as a significant capability upgrade: better answers on hard problems, more reliable reasoning, fewer logical errors. That framing is accurate for a specific category of tasks. For everything else, extended thinking is a slower, more expensive way to get the same answer you'd get without it. The skill is knowing which category you're in before you toggle it on.
What The Docs Say
According to Anthropic's documentation, extended thinking gives Claude the ability to "think through complex problems step-by-step before providing a response." When enabled via the API, Claude generates a thinking block — a chain-of-thought reasoning trace — before producing its final response. The thinking is visible to the user (or developer) and precedes the main output. Anthropic states that extended thinking improves performance on "complex tasks such as math, coding, and analysis" and recommends it for problems that "benefit from step-by-step reasoning."
In the API, you enable extended thinking by setting the thinking parameter with a budget_tokens value that caps how many tokens Claude can spend on the reasoning step. Per the API reference, the budget starts at a minimum of 1,024 tokens and must fit within your max_tokens setting, since thinking tokens count toward your total token usage and billing. In Claude.ai, extended thinking is exposed as a toggle (the interface has evolved, but the core behavior is the same). When you turn it on, Claude's thinking appears in a collapsible section above the response. Anthropic's documentation also notes that when extended thinking is enabled, temperature must be set to 1 for API calls, and that the thinking process cannot be pre-filled or directly constrained with system prompts: you set the budget, and Claude decides how to use it.
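A minimal sketch of what that request shape looks like with the Anthropic Python SDK. The model name and token values here are illustrative assumptions, not recommendations; check the current API reference before copying any of them. Building the kwargs in a helper keeps the constraints visible:

```python
# Sketch of a Messages API request with extended thinking enabled.
# The model name is a placeholder assumption; substitute a current one.

def thinking_request(prompt: str, budget: int = 8000, max_tokens: int = 16000) -> dict:
    """Build kwargs for client.messages.create() with extended thinking on."""
    if budget < 1024:
        raise ValueError("budget_tokens has a documented minimum of 1,024")
    if budget >= max_tokens:
        raise ValueError("budget must leave room in max_tokens for the answer")
    return {
        "model": "claude-sonnet-4-20250514",  # illustrative model name
        "max_tokens": max_tokens,
        "temperature": 1,  # required when extended thinking is enabled
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "messages": [{"role": "user", "content": prompt}],
    }
```

Usage would be `client.messages.create(**thinking_request("..."))`; the response content then interleaves thinking blocks and text blocks, which you can distinguish by each block's type field.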
What Actually Happens
I tested extended thinking across four categories of tasks over ten days, tracking response quality, response time, and token usage. The categories: mathematical and logical reasoning, code architecture and debugging, factual questions and simple lookups, and creative writing. Each task was run with extended thinking on and off, using the same prompt.
Math and logic: genuine improvement. This is where extended thinking earns its keep. I gave Claude a series of multi-step math problems — the kind where getting step 3 wrong cascades through steps 4 through 8. Without extended thinking, Claude got about 70% of these fully correct. With extended thinking, that jumped to roughly 90%. The thinking traces showed why: Claude would start down a wrong path, catch the error mid-reasoning, backtrack, and try a different approach. Without extended thinking, those wrong first intuitions became final answers. The same pattern held for logic puzzles, constraint satisfaction problems, and probability questions. If the problem has a non-obvious answer that requires chained reasoning, extended thinking helps significantly. The thinking trace literally shows Claude going "wait, that can't be right" and correcting course. That self-correction is the mechanism, and it works.
Code architecture: noticeable improvement on hard problems. For straightforward coding tasks — write a function, fix a bug, add a feature — extended thinking didn't change the output quality. Claude already handles these well by default. But for architectural decisions — designing a system with multiple interacting components, refactoring a module while maintaining backward compatibility, debugging a race condition — extended thinking produced meaningfully better results. The thinking traces for these tasks showed Claude mapping out dependencies, considering edge cases it would have missed otherwise, and reasoning about trade-offs between approaches. I gave Claude a moderately complex refactoring task: decompose a 500-line function into a set of smaller functions while preserving all existing behavior. Without extended thinking, Claude produced a reasonable but incomplete decomposition that missed a subtle dependency between two code paths. With extended thinking, the thinking trace explicitly identified that dependency and handled it correctly. This isn't a toy example — it's the kind of thing that matters in real codebases.
Simple queries: no improvement, just slower. For factual questions, simple lookups, short code snippets, and tasks where Claude's first instinct is already right, extended thinking adds latency and cost without improving the answer. I tested this with 20 straightforward questions — "what's the difference between let and const in JavaScript," "convert this JSON to a Python dict," "summarize this paragraph." The answers with and without extended thinking were effectively identical. The thinking traces for these tasks were perfunctory — Claude would briefly restate the question, arrive at the obvious answer, and move on. It was thinking for the sake of having the thinking block, not because the problem required it.
Creative writing: mixed to negative. This surprised me slightly. For creative tasks — writing a short story opening, generating marketing copy, brainstorming names — extended thinking sometimes produced worse results than without it. The thinking traces showed Claude over-deliberating: considering and rejecting options, second-guessing creative choices, reasoning about what makes "good" writing in a way that produced more cautious, less interesting output. Creative work often benefits from the first interesting impulse, not from systematic evaluation of alternatives. Extended thinking turns Claude into an editor when you want a writer. For structured creative work — outlining a complex plot, designing a game system, writing something with specific technical constraints — extended thinking helps. For free-form creative output, it's a headwind.
The Budget Parameter
The budget_tokens parameter controls how many tokens Claude can spend thinking. The docs explain what this parameter is but don't give great guidance on what values to use. Here's what I found in practice.
Low budgets (1,024 to 4,000 tokens) produce brief thinking that's useful for moderate complexity. Claude will outline its approach and check one or two things. This is the sweet spot for tasks that benefit from a quick sanity check but don't need deep exploration. Medium budgets (4,000 to 16,000 tokens) let Claude explore more thoroughly — considering multiple approaches, checking edge cases, doing deeper analysis. This is where I saw the biggest quality gains on genuinely hard problems. High budgets (16,000+ tokens) show diminishing returns in most cases. Claude will use the budget — it tends to think more when you give it room to think — but the additional thinking often circles back to conclusions it reached earlier. I tested the same hard math problem at 4K, 8K, 16K, and 32K budgets. Accuracy plateaued at 8K. The 16K and 32K traces contained more reasoning but didn't produce better answers.
The practical implication: start with a moderate budget. If the answers seem shallow or miss nuances you'd expect, increase it. If the thinking traces show circular reasoning or obvious padding, decrease it. There is no universally optimal budget because the optimal amount of thinking depends on the problem, and you often don't know the problem's difficulty in advance. The good news is that Claude tends to stop thinking when it's reached a conclusion, even if budget remains — it doesn't always use the full allocation.
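One way to operationalize "start with a moderate budget" is a small lookup of starting points. This is a hypothetical heuristic built from the ranges observed above — the category names and numbers are this article's, not anything from Anthropic's docs — so treat it as a tuning scaffold, not a recommendation:

```python
# Hypothetical starting-point budgets, keyed by how hard the task feels.
# Numbers come from the ranges discussed above; tune against your own tasks.
BUDGETS = {
    "sanity_check": 2048,   # quick outline-and-verify for moderate tasks
    "hard_problem": 8192,   # where the biggest quality gains appeared
    "exploratory": 16384,   # rarely worth it; returns diminished past ~8K
}

def pick_budget(kind: str) -> int:
    """Return a starting budget_tokens value for a rough task category."""
    try:
        return BUDGETS[kind]
    except KeyError:
        raise ValueError(f"unknown task kind: {kind!r}") from None
```

The point of encoding it this way is that the adjust-up, adjust-down loop described above becomes an edit to one table rather than a scattering of magic numbers.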
Reading The Thinking
The thinking trace isn't just a diagnostic artifact — it's genuinely useful information about whether Claude is on track. I developed a habit of scanning the thinking block before reading the answer, and it caught issues the final answer wouldn't have revealed. Things to look for: if the thinking trace shows confident, linear reasoning that arrives at the answer quickly, the problem probably didn't need extended thinking. If it shows exploration, backtracking, and consideration of alternatives, the problem was genuinely hard and the thinking is doing work. If it shows repetition — the same point stated three different ways — Claude is stalling, and you should either rephrase the problem or accept that it's at the limit of what reasoning will help.
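The repetition check in particular can be partly automated. Here is a crude, hypothetical heuristic — simple sentence-level string matching, nothing from any official tooling — that flags traces restating the same point multiple times:

```python
# Rough heuristic for "is this thinking trace stalling?": the fraction of
# sentences that repeat an earlier sentence after lowercasing. Crude string
# matching, but enough to flag traces worth a closer manual look.
import re
from collections import Counter

def repetition_score(trace: str) -> float:
    """Fraction of sentences in the trace that duplicate an earlier one."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", trace) if s.strip()]
    if not sentences:
        return 0.0
    counts = Counter(sentences)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(sentences)

# e.g. repetition_score("A is true. B follows. A is true.") -> 1/3
```

A high score doesn't prove the trace is padding — paraphrased repetition slips past exact matching — but it's a cheap first-pass filter before reading the trace yourself.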
A particularly useful pattern: when Claude's thinking trace considers an approach and rejects it, read why. Sometimes the rejection reveals a misunderstanding of your requirements. You can then clarify in a follow-up, and the next thinking trace will be more focused. This iterative approach — read the thinking, refine the prompt — is how extended thinking becomes a collaboration tool rather than a black box.
The Cost
Extended thinking is not free and not cheap. Thinking tokens are billed at the same rate as output tokens, which for Claude are more expensive than input tokens. A moderate thinking budget of 8,000 tokens can exceed the length of the visible answer itself, so for API users, enabling extended thinking can double or more the billed output tokens on every request where it's on. For Claude.ai users on Pro plans, the cost is less direct: you'll hit usage limits faster. This isn't a reason to avoid extended thinking. It's a reason to be intentional about when you enable it. Leaving it on by default for all requests is like leaving your car in four-wheel drive on dry pavement: it works, it's just unnecessary wear.
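The arithmetic is worth making explicit. Since thinking tokens bill at the output rate, the added cost per request is just tokens times price. The price below is a placeholder assumption, not Anthropic's actual rate — substitute the current per-million-token output price from the pricing page:

```python
# Back-of-envelope cost of the thinking portion of one request.
# The $15/MTok figure used below is a placeholder, not a quoted price.

def thinking_cost(thinking_tokens: int, output_price_per_mtok: float) -> float:
    """Added cost in dollars: thinking tokens billed at the output rate."""
    return thinking_tokens / 1_000_000 * output_price_per_mtok

# An 8,000-token trace at a placeholder $15/MTok output rate:
# thinking_cost(8000, 15.0) -> 0.12 dollars per request.
```

A fraction of a dollar per request is easy to ignore in interactive use and hard to ignore across thousands of batch requests, which is exactly the "be intentional" point above.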
Response time also increases noticeably. The thinking step adds seconds to every response. For interactive use — chatting with Claude, iterating on code — those seconds add up. For batch processing or offline tasks where latency doesn't matter, it's irrelevant. But for the typical use case of a developer going back and forth with Claude, the slowdown is tangible and worth considering.
When To Use This
Turn on extended thinking when the problem has a non-obvious answer. Multi-step math, complex debugging, system design, anything where your own first instinct would be "let me think about this for a minute." If the problem is the kind where a smart person would pause before answering, Claude benefits from pausing too. Also turn it on when correctness matters more than speed — tax calculations, security reviews, logic that will run in production. The additional reasoning time is cheap insurance against confident-but-wrong outputs.
When To Skip This
Default to off. Turn it on when you need it, not as a permanent setting. Skip it for simple queries, creative writing, casual conversation, and any task where Claude's default speed and quality are already sufficient. Skip it when latency matters — interactive coding sessions, real-time chat applications, anywhere that a 5-10 second delay per response degrades the experience. And skip it when cost matters and the task doesn't justify the premium. Most Claude interactions don't need chain-of-thought reasoning any more than most car trips need four-wheel drive. The feature is there for the terrain that requires it. Use it accordingly.
The honest summary: extended thinking makes Claude noticeably better at hard problems and noticeably slower at everything else. That's not a criticism — it's a design trade-off, and a reasonable one. The mistake is treating it as a universal upgrade rather than a specialized tool. Know your problem. Then decide if it needs thinking time.
This article is part of the Claude Deep Cuts series at CustomClanker.