The Version 1 Mirage

You signed up on launch day. The UI was clean. The onboarding was smooth. You ran your first task and the output was — actually pretty good. You told two coworkers about it. You posted about it. You started thinking about how to integrate it into your workflow. That was three weeks ago. Today you've found four limitations that matter, two bugs that haven't been fixed, and a "coming soon" feature that you need now. The tool is still useful, but the gap between what you felt on day one and what you know on day 21 is wide enough to be its own product category.

This is the Version 1 Mirage. It's not a defect in the tool. It's a defect in how we evaluate tools — specifically, in how first impressions work when the thing being evaluated is designed to maximize first impressions.

The Pattern

V1 of any software product is optimized for one thing: initial adoption. This isn't cynical — it's rational. A tool that nobody tries can't become a tool that everybody uses. So the V1 experience is engineered around the first session. The onboarding is polished. The default settings are tuned for the most common use case. The happy path is paved smooth. Everything that makes a first impression good has been prioritized, and everything that makes the thirtieth use good has been deferred.

This creates a specific shape of experience that I've started thinking of as the enthusiasm curve. It peaks on first use — everything works, nothing's broken, you haven't hit the edges yet. It crashes on first real failure, which usually arrives somewhere between day three and day ten. Maybe the API times out on a large request. Maybe the output quality drops on your specific domain. Maybe you discover that the feature you assumed existed doesn't. The crash isn't proportional to the failure. It's proportional to the distance between your expectation and reality, and V1 marketing has spent considerable effort maximizing that distance.

After the crash, the curve either stabilizes at honest utility or declines to abandonment. The stabilization point — the level of usefulness you can actually count on, week after week, without being surprised — is the real value of the tool. It's almost always lower than day one suggested. Sometimes it's still high enough to be worth using. Sometimes it isn't. But you can't know which until you've gotten past the mirage.

I tested this pattern deliberately with three tools launched in the last six months. Without naming them — because the specific tools matter less than the pattern — I signed up on launch day, used each tool daily for 30 days on real work tasks, and tracked my subjective satisfaction on a 1-10 scale. The curves were remarkably similar. Day one: 8, 9, 8. Day seven: 7, 6, 7. Day fourteen: 5, 4, 6. Day thirty: 6, 3, 6. The tool that stabilized at 3 was the one with the most impressive demo. The ones that stabilized at 6 had less dramatic launches but more honest documentation about what they couldn't do.
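If you want to run the same experiment on your own shortlist, the bookkeeping fits in a few lines. Here's a rough Python sketch of how the tracking could work; the scores are the ones from the paragraph above, and the tool names are placeholders because I'm not naming the tools.

```python
from statistics import mean

# One 1-10 satisfaction score logged per day, keyed by day number.
# Names are placeholders; the numbers match the curves described above.
scores = {
    "tool_a": {1: 8, 7: 7, 14: 5, 30: 6},
    "tool_b": {1: 9, 7: 6, 14: 4, 30: 3},
    "tool_c": {1: 8, 7: 7, 14: 6, 30: 6},
}

def settled_score(daily, settle_after=14):
    """Average of the scores logged once the novelty has worn off."""
    return mean(score for day, score in daily.items() if day >= settle_after)

for tool, daily in scores.items():
    gap = daily[1] - settled_score(daily)
    print(f"{tool}: day one {daily[1]}, settled {settled_score(daily):.1f}, gap {gap:.1f}")
```

The gap between the day-one score and the settled score is the size of the mirage; the settled score is the number worth making decisions with.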

The Psychology

Novelty is a drug, and V1 is pure novelty. When everything is new, your brain processes each interaction with heightened attention and engagement. The output feels better because you're paying more attention to it. The experience feels smoother because you're comparing it to nothing — there's no baseline, no history of frustrations, no accumulated awareness of limitations. By day 30, you've built that baseline. You know where the tool stutters. You know which prompts produce garbage. You know that the "advanced" mode is just the regular mode with more settings that don't help. The tool hasn't changed. Your perception has.

There's also an anchoring effect from the demo and the launch marketing (see article one in this series). Your expectations were set by the best possible output under ideal conditions, and every real-world interaction is unconsciously compared to that anchor. When the tool produces B+ work, it feels like a failure because the demo showed A+ work. B+ might be exactly what you need. It might save you hours. But it doesn't feel like what you were promised, and the feeling of disappointment is sticky in a way that the reality of usefulness isn't.

The "coming soon" problem deserves its own paragraph because it's so pervasive in AI tooling specifically. V1 launches with a feature roadmap. The roadmap includes things you need. You evaluate V1 partly based on what V2 will be, which is a mistake so common it barely registers as a mistake. You're not buying V2. You're buying V1. V2 might arrive in three months or three years or never. Startups pivot. Priorities change. The feature you're waiting for might get deprioritized because a bigger customer wants something else. I've watched at least a dozen AI tools promise features in Q1 that shipped in Q4 of the following year, if they shipped at all. Evaluate what exists. Discount what's promised.

The "try before you invest" problem makes this harder than it sounds. You can't properly evaluate a tool in one session — article two in this series covers why — but you also can't spend a month evaluating every tool that crosses your feed. There are too many launches, too many demos, too many "this changes everything" tweets. The evaluation bottleneck is time, and V1 is specifically designed to front-load the good experience into the limited time you're willing to spend.

The Fix

The single most useful thing you can do is evaluate with a real task, not a test task. Not "let me see what this does with some sample data." A real task with a real deadline and real stakes — even small stakes. The test task is a controlled environment, and controlled environments are where V1 shines. Real tasks have edge cases, time pressure, and actual consequences for bad output. They surface limitations that test tasks never will.

Here's my evaluation protocol, which I've refined over the last year of testing tools for this site.

Day one: sign up, complete onboarding, run one real task. Note the first impression, but don't trust it.
Day three: run a different real task. Deliberately use the tool for something slightly outside its obvious sweet spot. Note where it struggles.
Day seven: try to do something you assumed it could do but haven't tested. This is usually where the first real disappointment arrives.
Day fourteen: assess honestly. Is this tool saving me time, net of the time I spend correcting its output? Would I pay for this if the free trial ended today?
Day thirty: final assessment. Has the tool earned a place in my actual workflow, or am I keeping it around out of obligation to the setup time I invested?
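If a paragraph-length protocol is hard to keep in your head, it collapses to a lookup table. This is a hypothetical sketch, not something I actually ship; the milestone days and questions are the ones above, and the function just tells you which check is currently due.

```python
from datetime import date

# Milestone days and the question due on each, taken from the protocol above.
PROTOCOL = {
    1: "Sign up, finish onboarding, run one real task. Note the first impression; don't trust it.",
    3: "Run a different real task, slightly outside the obvious sweet spot. Note where it struggles.",
    7: "Try something you assumed it could do but haven't tested yet.",
    14: "Is it saving time net of corrections? Would you pay if the free trial ended today?",
    30: "Final call: has it earned a place in the workflow, or is it surviving on sunk setup time?",
}

def todays_check(signup, today=None):
    """Return the most recent milestone check that has come due."""
    days_in = ((today or date.today()) - signup).days + 1
    due = [day for day in PROTOCOL if day <= days_in]
    return f"Day {days_in}: {PROTOCOL[max(due)]}" if due else "Evaluation hasn't started yet."

# Example: a tool signed up for on May 1st, checked a week later.
print(todays_check(signup=date(2025, 5, 1), today=date(2025, 5, 7)))
```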

Most tools don't survive to day 30 in my workflow. That's not a failure of the tools — it's a success of the evaluation. The ones that do survive are genuinely useful, and I can recommend them with specificity about what they're good at and what they're not. That specificity is impossible to develop on day one.

For the "V2 will fix it" trap, apply a simple rule: if the tool isn't useful enough in its current state to justify its current cost, don't buy it. The roadmap is a wish list, not a contract. If V2 ships and it actually fixes your problems, great — sign up then. You lose nothing by waiting except the sunk cost of premature commitment, and that's a cost worth avoiding.

There's also a case for when V1 is good enough, and it's worth stating because not every V1 is a mirage. Some tools launch with a narrow scope, do that narrow thing well, and are honest about their limitations. A tool that says "we do X well and we don't do Y yet" is telling you something valuable. If X is what you need, V1 might be genuinely sufficient. The mirage isn't the tool being limited — all V1s are limited. The mirage is the tool appearing unlimited. The tools that announce their boundaries on day one are the ones most likely to be useful on day thirty.

The version 1 mirage is the gap between first impression and sustained utility. It exists in all software, but it's amplified in AI tools because AI output quality is variable in ways that traditional software isn't. A button either works or it doesn't. An LLM produces output that ranges from impressive to embarrassing depending on input, context, and something uncomfortably close to luck. V1 shows you the impressive end of that range. Real use shows you the full distribution. The only way to see the full distribution is to use the tool long enough to encounter it, and the only way to do that efficiently is to use it on real work from day one.

The demo showed you the peak. The peak is real — the tool can do that. But it can also do much worse, and the ratio of peak to average is the number that determines whether the tool is worth your time. You won't find that ratio on the landing page. You'll find it on day 30.
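If you want to see why the peak is such a misleading number, simulate it. The quality scores below are made up, but the arithmetic is the point: the demo shows you the maximum, day 30 shows you the average, and the distance between them is the mirage.

```python
import random

# Fifty made-up quality scores (0-100) standing in for a month of real tasks.
random.seed(7)
quality = [min(100, max(0, random.gauss(70, 15))) for _ in range(50)]

peak = max(quality)                    # what the demo shows you
average = sum(quality) / len(quality)  # what day 30 shows you
print(f"peak {peak:.0f}, average {average:.0f}, peak-to-average {peak / average:.2f}")
```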


This article is part of the Demo vs. Delivery series at CustomClanker.