Demo Vs Delivery

Demo Conditions vs. Your Conditions

Rza

13 Oct 2025 — 6 min read

The demo worked. You watched someone run a prompt through an AI tool and get exactly the output they wanted — clean, fast, usable. You tried the same tool with your data and got something that looked like it was assembled by a committee of interns who'd each read a different brief. Same tool. Same model. Wildly different results. The demo wasn't fake. Your attempt wasn't stupid. The gap between them is made of six things you couldn't see in the recording.

Every AI tool demo is, by necessity, a controlled environment. The person giving the demo has selected data that works, refined prompts through iteration, and is running on infrastructure they've optimized for. None of this is dishonest — you can't demo a tool with bad data and broken prompts. But the result is that every demo represents the ceiling of a tool's performance, and your first experience with it represents something closer to the floor. Understanding the specific gaps between those two conditions is the difference between productive troubleshooting and frustrated abandonment.

The Pattern

There are six gaps between demo conditions and your conditions. They compound.

The data gap. Demo data is clean. I don't mean "polished" — I mean structurally consistent. When someone demos an AI tool processing a spreadsheet, that spreadsheet has consistent column headers, no merged cells, no empty rows with notes someone typed in 2019, no mix of date formats across columns. Your spreadsheet — the real one, the one you actually need processed — has all of that. I tested Claude's data analysis capabilities with a client's actual CRM export last year, and the first run failed on a column where "phone number" contained entries like "ask Sarah," "same as above," and a literal emoji. The demo spreadsheet would never contain those values. Your data always does.

This gap is the most common source of "it doesn't work" frustration. According to OpenAI's documentation for the Assistants API, file inputs should be "well-structured" for best results — which is a polite way of saying the tool was designed for data that looks like the demo, not data that looks like yours. The fix isn't to blame the tool. It's to budget time for data cleaning, or to test with your messiest data first so you discover the gap immediately rather than after you've built a workflow around the assumption that it handles edge cases.

The prompt gap. The prompt in the demo was version 47. You're starting at version 1. This is probably the most underappreciated gap. When someone demonstrates an AI tool producing perfect output, the prompt they're using has been refined through dozens — sometimes hundreds — of iterations. They've discovered that adding "respond in bullet points, not paragraphs" fixes a formatting issue. They've learned that specifying "do not include a conclusion section" prevents the model from padding. They've found that their specific use case needs a system prompt that says "you are a technical editor who prioritizes accuracy over comprehensiveness."

You don't know any of this yet. Your first prompt will be a reasonable guess, and reasonable guesses produce mediocre outputs. I tested Cursor for a week of real coding work in early 2026 and found that my output quality improved roughly 40% between day one and day five — not because the tool changed, but because I learned how to talk to it. Day one prompts were vague ("fix this function"). Day five prompts were specific ("refactor this function to handle the null case on line 34, maintain the existing return type, and add a guard clause matching the pattern in the adjacent function"). Same tool. Different operator.

The infrastructure gap. Demos run on fast internet, premium API tiers, and recent hardware. This matters more than you'd think. I've seen tool demos running on a MacBook Pro M3 Max with 64GB RAM demonstrate local LLM performance that is physically impossible on a MacBook Air M1 with 8GB. The demo doesn't mention specs because specs aren't exciting, but the gap between running Ollama with a 7B parameter model on 8GB of RAM versus a 70B model on 64GB is not incremental — it's categorical. The smaller model produces visibly worse output, and the demo showed you the larger one.

API rate limits are the invisible infrastructure gap. Demo environments either use premium tiers or are hitting the API at a volume low enough that limits don't apply. Your production workflow, running hundreds of requests per hour, will hit rate limits that the demo never encountered. Per OpenAI's rate limit documentation, free tier users get roughly 3 requests per minute on GPT-4-class models [VERIFY]. The demo made 3 requests total. Your workflow makes 300 per hour. These are not the same conditions.

The skill gap. The person giving the demo knows the tool's quirks. They know that the JSON output mode occasionally wraps responses in markdown code blocks. They know that context window limits mean you can't paste in an entire codebase. They know that the tool hallucinates specific types of citations and they've learned to spot-check those. They've internalized workarounds for problems you haven't encountered yet. When the demo looks smooth, part of what you're seeing is an experienced operator unconsciously avoiding failure modes you'll hit face-first.

The context gap. The demo solves a clean, well-defined problem. "Summarize this article." "Generate a function that sorts by date." "Extract names and emails from this PDF." Your problem is: "We have 400 PDFs from different vendors in different formats, some are scanned images, some have tables, some are in German, and we need to extract contact information into a CRM format that matches our existing Salesforce fields — but only for vendors in three specific categories that aren't labeled consistently across documents." The demo problem fits in a tweet. Your problem fits in a requirements document that doesn't exist yet because you're still figuring out the scope.

The selection gap. The demo shows the run that worked. Not the four runs before it that produced garbage. Not the edge case that broke the pipeline during testing. The best output from the best run with the best data. This isn't dishonest — it's how demos work. But it means you're calibrating your expectations against a highlight reel.

The Psychology

These gaps are invisible by design, and that's what makes them psychologically effective. When you watch a demo, your brain encodes it as "this tool can do X" — a binary. It doesn't encode "this tool can do X under conditions A, B, C, D, E, and F, several of which I don't control." The result is that when you try the tool and it doesn't perform at demo level, you blame yourself ("I'm doing something wrong"), blame the tool ("it doesn't actually work"), or blame the demo ("they faked it"). None of these are usually correct. What's actually happening is that your conditions are different from demo conditions in specific, diagnosable, fixable ways.

The pernicious version of this is when the gap makes you believe you're not technical enough. You watched someone get perfect results effortlessly. You're getting mediocre results with effort. The obvious conclusion — and the wrong one — is that you lack some fundamental capability. In reality, you lack the prompt refinements, data preparation, and tool-specific knowledge that the demo operator accumulated over weeks or months. That's not a capability gap. It's an experience gap, and it closes with use.

The Fix

Accept the learning curve explicitly. Not "I'll figure it out" — explicitly. Budget time for it. If a demo makes a tool look like it takes 5 minutes to produce useful output, budget 2 hours for your first session. If the tool is complex (multi-step pipelines, API integrations, custom configurations), budget a full day. This isn't pessimism. It's accurate scheduling based on the gap between demo conditions and your conditions.

Test with your worst data first. Not your cleanest spreadsheet. Not your most well-structured document. The messiest, ugliest, most representative data you actually need to process. If the tool handles your worst data acceptably, it'll handle everything. If it chokes on your worst data, you've discovered the gap in 10 minutes instead of 10 hours.

Check system requirements before you evaluate performance. If the demo ran on hardware or API tiers you don't have, your benchmark is invalid. Check the docs for minimum specs, recommended specs, and which API tier the features require. Per Anthropic's documentation, Claude's tool use and extended context features behave differently across pricing tiers — what works on a Pro plan may not work on a Free plan.

Set a realistic performance expectation: you'll hit about 60% of demo quality on day one, improving to roughly 80% over your first week of regular use. That remaining 20% gap between your best and the demo's best may never close, because some of it is selection bias (you're seeing their best run, you're experiencing your average run) and some of it is environment-specific (your data is genuinely harder than demo data). Eighty percent of demo quality, achieved reliably on your actual data, is the realistic target. If you calibrate to that, you'll make good decisions about which tools to adopt. If you calibrate to the demo, you'll abandon good tools and chase promises.

The demo is the ceiling. Your first hour is the floor. The useful question isn't "why doesn't this work like the demo" — it's "how do I close the specific, identifiable gaps between my conditions and demo conditions." That question has answers. The other one just has frustration.

This article is part of the Demo vs. Delivery series at CustomClanker.

Demo Conditions vs. Your Conditions

Rza

The Pattern

The Psychology

The Fix

Read more

The YouTube + AI Pipeline

The Weekly Drop

The Tool Collector's Guide to Owning Nothing

Self-Hosting & Tinkering