Temperature and Sampling Parameters: What the Sliders Actually Do
Temperature is the one generation parameter most people have heard of, and it's also the one most people adjust without understanding what it does. The short version: temperature controls how random the model's word choices are. Low temperature means predictable, high-probability outputs. High temperature means more varied, lower-probability outputs. The honest advice is that the default is correct for most tasks, and the other sliders — top-p, top-k, frequency penalty, presence penalty — should be left alone unless you have a specific reason to touch them.
What The Docs Say
Every model provider exposes temperature as a generation parameter, and most explain it with the same mechanical description. Temperature divides the logits — the raw probability scores for each possible next token — before the model samples from them. At temperature 0 (or near 0), the model almost always picks the highest-probability token, and the output becomes deterministic and repetitive. Below 1.0, the distribution sharpens toward the top tokens; at 1.0, the model samples from the unmodified distribution; above 1.0, the distribution flattens, making low-probability tokens more likely to be selected. The output becomes more creative — and also more likely to be incoherent.
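The scaling step can be sketched in a few lines of NumPy. The logits here are a toy distribution over four candidate tokens, not any real model's scores:

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Divide logits by the temperature, then softmax into probabilities.
    Lower temperature sharpens the peak; higher temperature flattens it."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                  # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [4.0, 2.0, 1.0, 0.5]               # toy scores for four candidate tokens

cold = temperature_softmax(logits, 0.3)     # near-greedy: top token dominates
base = temperature_softmax(logits, 1.0)     # unscaled distribution
hot  = temperature_softmax(logits, 2.0)     # flattened: tail tokens gain mass

print(cold.round(3))                        # top token takes almost all the mass
print(hot.round(3))                         # probability spreads toward the tail
```

At 0.3 the top token takes nearly all the probability; at 2.0 the tail tokens become live options. That shift is the entire mechanism behind the slider.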
Top-p (nucleus sampling), introduced by Holtzman et al. in "The Curious Case of Neural Text Degeneration," takes a different approach. Instead of scaling the distribution, it truncates it. Top-p 0.9 means the model only considers tokens that collectively make up the top 90% of the probability mass, then samples from that subset. This cuts off the long tail of unlikely tokens without flattening the distribution like high temperature does. In theory, it gives you creative variety without the incoherence risk.
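A minimal nucleus-sampling filter looks like this, again over toy probabilities; a real implementation does the same sort-and-cut over the full vocabulary:

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    zero out the long tail, and renormalize (nucleus sampling)."""
    order = np.argsort(probs)[::-1]            # tokens from most to least likely
    cum = np.cumsum(probs[order])              # ascending cumulative mass
    cutoff = np.searchsorted(cum, p) + 1       # include the token that crosses p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
nucleus = top_p_filter(probs, 0.9)
print(nucleus)                                 # tail removed, survivors renormalized
```

Note that the distribution's shape among the surviving tokens is untouched; only the tail is gone. That is the contrast with temperature, which reshapes everything.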
Top-k is the simplest filter: only consider the top K most probable tokens, ignore everything else. If top-k is 50, the model picks from its 50 best guesses. It's less commonly exposed in API settings than temperature and top-p, but it's the same idea — constrain the candidate pool to avoid unlikely (and often nonsensical) token choices.
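The top-k filter is even shorter as a sketch, over the same kind of toy distribution:

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything but the k most probable tokens, then renormalize."""
    keep = np.argsort(probs)[::-1][:k]         # indices of the k best guesses
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.4, 0.25, 0.15, 0.1, 0.1])
trimmed = top_k_filter(probs, 2)
print(trimmed)                                 # only the top two candidates survive
```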
Frequency penalty and presence penalty address a different problem: repetition. Frequency penalty reduces the probability of tokens proportional to how many times they've already appeared in the output. Presence penalty applies a flat reduction to any token that's appeared at all, regardless of how many times. Both are set to 0 by default in most APIs, which means no penalty for repetition.
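OpenAI's documentation describes the penalties as a per-token logit adjustment; here is a sketch of that formula, with invented logits and counts for illustration:

```python
def penalize(logits, counts, frequency_penalty=0.0, presence_penalty=0.0):
    """OpenAI-style repetition penalties: frequency scales with how many times
    a token has already appeared; presence is a flat hit for appearing at all."""
    return [
        logit
        - count * frequency_penalty                         # grows with each repeat
        - (1.0 if count > 0 else 0.0) * presence_penalty    # flat, applied once
        for logit, count in zip(logits, counts)
    ]

logits = [2.0, 2.0, 2.0]     # three tokens the model likes equally
counts = [0, 1, 3]           # times each has already appeared in the output
adjusted = penalize(logits, counts, frequency_penalty=0.5, presence_penalty=0.2)
print(adjusted)              # the token with three repeats takes the biggest hit
```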
OpenAI's API reference documents temperature, top-p, and both penalties; top-k is not exposed in its chat API. Anthropic's API exposes temperature, top-p, and top-k, and recommends top-k for advanced use cases only. Google's Gemini API exposes temperature, top-p, and top-k. The documentation across providers converges on the same practical advice: temperature is the parameter to adjust, top-p is an alternative approach, and the rest are edge-case controls.
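The parameters map onto request payloads roughly like this. The field names follow the public API references, but the model names are placeholders and exact support varies by provider, so check the current docs before copying:

```python
# Sketches of where each sampling parameter lives in a request.
# Model names are placeholders; field support varies by provider.
openai_style = {
    "model": "gpt-4o-mini",          # placeholder model name
    "messages": [{"role": "user", "content": "Classify this email."}],
    "temperature": 0,                # deterministic: extraction, classification
    "top_p": 1,                      # default: no truncation
    "frequency_penalty": 0,          # default: no repetition penalty
    "presence_penalty": 0,
}

anthropic_style = {
    "model": "claude-sonnet-4",      # placeholder model name
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Draft a product blurb."}],
    "temperature": 0.8,              # varied, natural-sounding drafts
    # "top_k": 40,                   # available, but flagged for advanced use
}
```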
What Actually Happens
Here's what actually changes when you move the temperature slider, tested across hundreds of prompts on Claude, GPT, and Gemini.
At temperature 0-0.3, the model is robotic and predictable. Ask it the same question twice and you'll get nearly identical answers. This is exactly what you want for data extraction, classification, code generation, and any task where consistency matters more than creativity. If you're building a pipeline that processes 10,000 customer emails and needs to classify each one, temperature 0 is your friend. The output is boring, reliable, and reproducible — which is the definition of production grade.
At temperature 0.7-1.0 — where most defaults sit — the model has enough randomness to produce varied, natural-sounding text without going off the rails. Ask it the same question twice and you'll get different phrasings, different examples, different angles — but the substance stays coherent. This is the sweet spot for writing, brainstorming, conversation, and most general-purpose use. The default exists at this range for a reason: it's where most users get the best experience without thinking about it.
At temperature above 1.0, things get weird. The model starts selecting low-probability tokens, which means unexpected word choices, unusual metaphors, and — past about 1.5 — outright incoherence. Sentences start strong and end in a direction nobody predicted, including the model. This has legitimate uses in brainstorming, where you want surprising combinations and don't mind filtering garbage. But "temperature 2.0 for creative writing" is bad advice that sounds good — what you actually get is word salad with occasional flashes of inspiration buried in nonsense.
The interaction between temperature and top-p is where most people get confused. Both control randomness, but through different mechanisms. Setting temperature to 1.5 and top-p to 0.95 at the same time doesn't give you "more creative" output — it gives you a blend of two different randomness strategies that's hard to reason about. The practical advice from every provider is the same: adjust one or the other, not both. If you're using temperature, leave top-p at 1.0 (the default, which means no truncation). If you're using top-p, leave temperature at 1.0.
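You can see the interaction numerically by stacking the two filters on a toy distribution: raising the temperature moves where the top-p cutoff lands, which is exactly what makes the combination hard to reason about. A sketch, using the standard softmax and nucleus definitions:

```python
import numpy as np

def softmax(logits, t=1.0):
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def nucleus(probs, p):
    """Keep the smallest top set reaching cumulative mass p, renormalize."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

logits = [3.0, 2.0, 1.0, 0.0, -1.0]   # toy scores for five candidate tokens

# One knob at a time is easy to reason about:
only_temp = softmax(logits, t=1.5)
only_topp = nucleus(softmax(logits), p=0.95)

# Stacking both flattens the distribution AND truncates it, and the
# truncation point now depends on the temperature you picked:
both = nucleus(softmax(logits, t=1.5), p=0.95)

print(only_topp.round(3))             # two tail tokens cut at t=1.0
print(both.round(3))                  # flatter distribution, only one token cut
```

At temperature 1.0 the 0.95 nucleus drops two tail tokens; at 1.5 the flattened distribution pushes one of them back inside the nucleus. Two controls, one coupled effect.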
Top-k is even simpler. Unless you're running a local model where you have granular control over the sampling pipeline — ComfyUI for image generation, llama.cpp for text — you probably don't need to think about top-k. The API-level providers have already made reasonable default choices, and top-k interacts with temperature and top-p in ways that make simultaneous tuning an exercise in diminishing returns.
Frequency and presence penalties are the parameters that solve a real problem — repetition — but that most people never need to adjust. Modern models are already trained to avoid obvious repetition, and the default penalty of 0 works fine for most conversational and writing tasks. If you're generating very long outputs (5,000+ tokens) and noticing the model circling back to the same phrases, a small frequency penalty (0.3-0.5) can help. For standard use, leave them alone.
The Temperature Ranges That Matter
Rather than thinking about temperature as a continuous slider, think about it as three zones with a clear use case for each.
The factual zone (0-0.3): Use this when the answer should be the same every time. Data extraction, classification, code generation, structured output, any task where creativity is a liability. If you're asking "what is the capital of France?" and the model says "Paris" at temperature 0 and "Well, the vibrant heart of Parisian culture beats in..." at temperature 1.0, you want temperature 0. Most API-based production pipelines should live here.
The default zone (0.5-1.0): Use this for general-purpose tasks. Writing, conversation, analysis, summarization, drafting. The default temperature for most providers falls in this range because it balances coherence with variety. Unless your outputs are either too robotic or too chaotic, this is where you should stay. The model providers have spent significant effort calibrating their defaults; assuming you know better than the default usually costs more than it gains.
The chaos zone (1.0-2.0): Use this for brainstorming, concept generation, and creative exploration where you plan to filter the output heavily. Generate 20 product name ideas at temperature 1.5, expect 15 of them to be garbage, and keep the 5 that surprise you. This is a tool for generating raw material, not finished output. If you're using temperature above 1.0 for anything you plan to ship directly, you're going to have a bad time.
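The three zones reduce to a lookup you could drop into a pipeline. The task labels here are invented for illustration; the mapping is the rule of thumb above, not a provider recommendation:

```python
def pick_temperature(task):
    """Map a task type to one of the three temperature zones.
    A rule of thumb, not a provider recommendation."""
    factual = {"extraction", "classification", "codegen", "structured_output"}
    chaos = {"brainstorm", "naming", "concept_generation"}
    if task in factual:
        return 0.0    # factual zone: same answer every time
    if task in chaos:
        return 1.5    # chaos zone: raw material to filter, not finished output
    return 0.7        # default zone: writing, conversation, analysis

print(pick_temperature("classification"))  # → 0.0
print(pick_temperature("brainstorm"))      # → 1.5
```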
The Honest Advice
Temperature is the only generation parameter that most users should ever touch. And even then, the default is correct for the majority of tasks. The number of people who would benefit from adjusting top-p is very small. The number who would benefit from adjusting top-k, frequency penalty, or presence penalty while using a major provider's API is vanishingly small.
This is not because those parameters don't do anything. They do exactly what the docs say. It's because the models have been trained and the defaults have been calibrated to work well out of the box. Adjusting sampling parameters is a fine-tuning step that belongs at the end of the optimization process — after you've written a clear prompt, added few-shot examples, structured your chain of thought, and specified your output format. If your output is bad, the problem is almost certainly your prompt, not your temperature setting.
The scenario where sampling parameters matter most is production API pipelines processing thousands of requests. In that context, temperature 0 for deterministic tasks, temperature 0.7-0.8 for tasks that need variety, and everything else at default is the configuration that works. If you're in the chat interface — Claude.ai, ChatGPT, Gemini — you're using whatever temperature the provider chose, and it's fine.
The fiddling trap is real here. I've watched people spend an hour adjusting temperature from 0.7 to 0.72 to 0.68, convinced they're optimizing, when the actual output difference is indistinguishable. The model is not a synthesizer where tiny knob adjustments produce audible changes. The difference between temperature 0.7 and 0.75 is, for practical purposes, nothing. The difference between 0.3 and 0.9 is meaningful. Operate at the level of meaningful differences and you'll save yourself a lot of wasted time.
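The scale of the difference is easy to check numerically. On a toy distribution, the total variation distance between the 0.7 and 0.75 distributions is a rounding error next to the gap between 0.3 and 0.9:

```python
import numpy as np

def dist(logits, t):
    """Temperature-scaled softmax over toy logits."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def tv(p, q):
    """Total variation distance: half the L1 gap between two distributions."""
    return 0.5 * np.abs(p - q).sum()

logits = [3.0, 2.0, 1.0, 0.0]                  # invented four-token example

tiny = tv(dist(logits, 0.70), dist(logits, 0.75))
big  = tv(dist(logits, 0.30), dist(logits, 0.90))
print(round(tiny, 3), round(big, 3))           # the 0.05 tweak barely registers
```

The exact numbers depend on the logits, but the order-of-magnitude gap between the two comparisons is the point: tune at the level where the distribution actually moves.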
When To Use This
Adjust temperature when you're building an API pipeline and need deterministic output (set it low), when your creative outputs are too predictable and generic (raise it slightly), when you're brainstorming and want quantity over quality (raise it a lot), or when you've already optimized your prompt and are squeezing the last few percent of output quality for a production system. Adjust top-p only when you specifically want to control the randomness distribution shape rather than the scale — and if that sentence doesn't mean anything to you, stick with temperature.
When To Skip This
Skip parameter tuning for one-off tasks in the chat interface, for any task where the default output quality is acceptable, for any task where prompt changes would produce a larger improvement than parameter changes, and for any situation where you're adjusting parameters instead of improving your examples or instructions. The parameters are the last 2% of output quality. The prompt is the first 90%. Optimize in order.
This is part of CustomClanker's Prompting series — what actually changes output quality.