System Prompts: What Changes Output vs. What's Superstition
Every Claude user writes system prompts. Almost nobody tests them. The gap between those two facts explains most of the confusion around prompt engineering — people add instructions, observe that Claude does something, and conclude the instructions worked. Maybe they did. Maybe Claude would have done the same thing anyway. The system prompt is the most powerful lever you have over Claude's behavior, and also the one most frequently wasted on instructions that sound good but change nothing. After two weeks of systematic A/B testing — same queries, with and without specific instructions, across both the API and Claude.ai — I have a clearer picture of what actually moves the needle.
What The Docs Say
Per Anthropic's prompt engineering documentation, the system prompt is a special message at the beginning of a conversation that sets context, defines Claude's role, and establishes behavioral constraints. It's processed before any user messages and influences how Claude interprets and responds to everything that follows. Anthropic's docs recommend using system prompts for "setting a role, defining output format, establishing tone, and providing important context." The documentation also notes that Claude is designed to follow system prompt instructions faithfully and that system prompts take precedence over default behaviors. Anthropic explicitly positions Claude as more system-prompt-responsive than competitors — the model is trained to treat system instructions as authoritative rather than suggestive.
In the API, the system prompt is a dedicated field separate from user messages. In Claude.ai, it appears as the custom instructions in a Project or, less formally, as the first message in a conversation. The docs recommend keeping system prompts concise and specific, though they don't specify an ideal length.
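That separation is easy to see in the request body itself. A minimal sketch, assuming the Messages API shape (the model name here is illustrative, not a recommendation):

```python
import json

# The system prompt is a dedicated top-level field in the request body --
# it is NOT an entry in the "messages" list.
request_body = {
    "model": "claude-sonnet-4-5",  # illustrative model name
    "max_tokens": 1024,
    "system": "Respond in Markdown. Limit answers to 200 words.",
    "messages": [
        {"role": "user", "content": "Explain Python's GIL."}
    ],
}

# No message carries role "system"; the top-level field does that work.
assert all(m["role"] != "system" for m in request_body["messages"])
print(json.dumps(request_body, indent=2))
```

Because `system` is its own field, toggling it on and off for comparison testing is a one-line change.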
What Actually Changes Output
I tested 15 different system prompt patterns against a control (no system prompt) across 20 standard queries — a mix of coding tasks, writing tasks, analysis tasks, and factual questions. Here's what consistently produced measurably different output.
Format constraints work. "Respond in bullet points," "use Markdown headers," "limit responses to 200 words," "return JSON" — these change output reliably and immediately. Claude follows formatting instructions with near-perfect compliance. If your system prompt does nothing else, format constraints alone justify it. This also extends to structural patterns: "start with a one-sentence summary, then provide detail" or "always include a code example after explaining a concept." Claude treats these as hard rules, not suggestions.
Explicit rules with concrete examples work. "Never use the word 'delve'" changes output. "Write in a professional tone" often doesn't. The difference is specificity. Rules that Claude can mechanically check — does this response contain the forbidden word, does this code use the required naming convention, does this output start with the required prefix — get followed. Rules that require judgment about vibes often don't change behavior because Claude's default judgment already produces something within the range of "professional" or "friendly" or "concise." I tested "always use snake_case in Python examples" against Claude's default. Default Claude already uses snake_case in Python most of the time. The instruction changes nothing because it matches the default. But "always use camelCase in Python examples" — a rule that contradicts the default — gets followed consistently. The lesson: system prompts earn their tokens when they push Claude away from defaults, not when they confirm them.
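"Mechanically checkable" has a useful side effect: if a human or a script can verify the rule with a regex, you can reuse that same check when grading A/B test outputs. A small sketch (the helper names are mine, not from any library):

```python
import re

def violates_forbidden_words(text: str, forbidden: list[str]) -> list[str]:
    """Return the forbidden words that actually appear in a response."""
    return [
        w for w in forbidden
        if re.search(rf"\b{re.escape(w)}\b", text, re.IGNORECASE)
    ]

def identifiers_are_camel_case(code: str) -> bool:
    """Rough check: do assigned names in a Python snippet look like camelCase?"""
    names = re.findall(r"^\s*([A-Za-z_]\w*)\s*=", code, re.MULTILINE)
    return bool(names) and all(
        re.fullmatch(r"[a-z]+(?:[A-Z]\w*)*", n) for n in names
    )

# A checkable rule either passes or fails -- no judgment about "tone" needed.
sample = "We will delve into the details."
print(violates_forbidden_words(sample, ["delve"]))  # ['delve']
```

Rules like "write in a professional tone" have no equivalent check, which is a decent heuristic for whether they belong in your prompt at all.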
Role framing changes behavior, but only specific role framing. "You are a senior Python developer" produces subtly different code than "you are a junior developer learning Python" — the senior role produces more idiomatic code, uses more advanced patterns, and includes less basic explanation. This effect is real but mild. What works much better is role framing with constraints: "You are a security auditor. Your job is to find vulnerabilities in code. Do not suggest fixes unless asked — only identify problems." The role plus the behavioral constraint together produce output that's dramatically different from default Claude. The role alone moves the needle slightly. The constraint alone moves it significantly. Together they're the strongest pattern I tested.
Few-shot examples work. Including 2-3 examples of desired input-output pairs in the system prompt is the single most reliable way to get a specific output format or style. I tested this with code review comments: I included three examples of the review style I wanted (terse, issue-focused, with severity ratings), and Claude matched the pattern almost exactly on new code. Without the examples but with a prose description of the same style, Claude got maybe 60% of the way there. With the examples, it was 95%. Few-shot examples are expensive in tokens but cheap in ambiguity, and that trade-off is almost always worth it.
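Assembling those examples into the prompt is just string work. A minimal sketch, with a hypothetical helper of my own naming and made-up example pairs in the code-review style described above:

```python
def build_few_shot_system_prompt(
    instructions: str, examples: list[tuple[str, str]]
) -> str:
    """Fold input/output example pairs into a single system prompt string."""
    parts = [instructions, "", "Examples of the expected style:"]
    for i, (inp, out) in enumerate(examples, start=1):
        parts.append(f"\nExample {i} input:\n{inp}")
        parts.append(f"Example {i} output:\n{out}")
    return "\n".join(parts)

prompt = build_few_shot_system_prompt(
    "Review code tersely. Flag issues with a severity rating.",
    [
        ("def f(x): return eval(x)",
         "[HIGH] eval on untrusted input: code injection risk."),
        ("for i in range(len(xs)): print(xs[i])",
         "[LOW] Iterate directly: `for x in xs`."),
    ],
)
print(prompt)
```

Two or three pairs is usually enough; the examples carry the style far more precisely than the prose instructions that precede them.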
What Doesn't Reliably Work
Personality adjectives are mostly superstition. "Be friendly," "be concise," "be helpful," "be creative" — I tested each of these against no system prompt. The outputs were functionally identical in most cases. Claude is already friendly, already attempts conciseness, already tries to be helpful. These instructions confirm defaults and consume tokens. The exception is when personality adjectives are extreme enough to push outside Claude's comfort zone: "be blunt to the point of rudeness" does produce noticeably different output, because it requires Claude to suppress its default politeness. But "be direct" and no system prompt produced outputs I couldn't reliably distinguish in a blind comparison.
Vague vibes instructions do nothing measurable. "Think like a startup founder," "channel the energy of a great teacher," "approach this with curiosity and rigor." I tested these individually and in combination. In every case, the output with the instruction was indistinguishable from the output without it. These instructions feel meaningful when you write them. They aren't. They're the prompt engineering equivalent of a mission statement on a company website — present, ignored, and burning context window space that could hold something useful.
Contradictory instructions produce inconsistent behavior, not compromise. "Be extremely concise but also thorough and detailed" doesn't produce medium-length responses. It produces responses that alternate between terse and verbose, sometimes within the same output, depending on which instruction Claude's attention lands on. If your system prompt contains contradictions, you haven't set a balance — you've set a coin flip. Clean these out. If you want concise answers with optional detail, say "provide a 2-3 sentence summary, then a detailed section under a 'Details' header." That's a format constraint, and format constraints work.
Long system prompts have diminishing returns. I tested identical core instructions at 200 tokens, 500 tokens, 1,000 tokens, and 2,000 tokens (padded with additional personality descriptions, background context, and edge case instructions). Performance on the core instructions stayed roughly flat through 500 tokens, then started degrading. At 2,000 tokens, Claude occasionally missed instructions that it followed perfectly at 200 tokens. This isn't a hard cliff — the degradation is gradual and task-dependent — but the principle is clear. Every token in your system prompt competes with every other token for Claude's attention. Padding your prompt with nice-to-haves dilutes the must-haves. This matters more for the API, where you're paying per token, but it matters everywhere because attention is finite.
The Testing Method
The testing approach is simple, and everyone should do it before committing to a system prompt. Take 5-10 representative queries — the kinds of things you'll actually ask in this context. Run each query twice: once with your system prompt, once without. Compare the outputs. If you can't tell which is which, your system prompt isn't doing work. If you can tell, identify which specific instructions are causing the difference, and cut the rest. This takes maybe 30 minutes and will improve your system prompt more than any guide or template.
For the API, this is straightforward — same request, toggle the system field. For Claude.ai Projects, it's slightly harder: you need a Project with instructions and a vanilla conversation for comparison. But the principle is the same. Test the delta: keep the instructions that produce one, cut the ones that don't. I keep a small test suite of queries for each of my Projects and re-run it whenever I change the instructions. This sounds tedious. It takes five minutes and has saved me from plenty of cargo-cult prompting.
Claude-Specific Behavior
Claude handles system prompts differently from other models in a few notable ways. First, Claude is more responsive to system prompts than GPT-4 or Gemini in my testing — instructions that get partial compliance from other models get near-full compliance from Claude. Anthropic has clearly invested in system prompt adherence during training, and it shows. This is a genuine advantage but also a trap: because Claude follows instructions so well, you notice less when your instructions are counterproductive. A bad instruction that GPT-4 ignores will be faithfully executed by Claude, giving you worse output that you might not attribute to the prompt.
Second, Claude has strong defaults around safety, helpfulness, and honesty that the system prompt can modify but not override. You can make Claude more terse, more technical, more casual — but you can't make it confidently lie or bypass its safety training through system prompt instructions. This is documented behavior, not a limitation you need to work around. It means system prompts operate within a range, and knowing that range prevents frustration. Third, Claude handles conditional instructions well. "If the user asks about code, respond with examples. If the user asks about concepts, respond with analogies." Other models sometimes flatten conditionals into always-on rules. Claude genuinely branches on them.
When To Use This
System prompts are worth the effort when you have specific, testable requirements that differ from Claude's defaults. If you need a particular output format, a particular code style, a particular response length, or behavior that contradicts Claude's natural tendencies — write a system prompt. If you're using the API, you should always have a system prompt, even if it's just a format constraint, because the consistency gain is free in effort and marginal in cost.
When To Skip This
If you're using Claude.ai for casual questions, a system prompt is overhead. Claude's defaults are good. If your system prompt would just say "be helpful and answer my questions" — save the keystrokes. Also skip the prompt-engineering rabbit hole of trying to find the perfect phrasing. The difference between a good system prompt and a perfect one is negligible. The difference between a good system prompt and no system prompt is significant. Get to good and move on.
The core principle fits in one sentence: specific, testable instructions that push Claude away from its defaults produce measurably different output. Everything else is ceremony.
This article is part of the Claude Deep Cuts series at CustomClanker.