The Demo That Finally Fooled Me
I think a lot about demos. It's a professional hazard — I evaluate AI tools for a living, and every tool arrives wearing its best demo like a first-date outfit. I've built a framework for seeing through them. I know the tricks: cherry-picked examples, clean environments, tasks designed to show strengths and hide weaknesses, latency edited out, failures not mentioned. I've written about all of this. I teach people to be skeptical. And last month, a demo fooled me completely.
The tool was Google's Gemini 2.5 Pro, and the task was full-codebase comprehension.
What the Demo Showed
Google's developer relations team published a video — I won't link it because the point isn't to drive traffic to marketing — showing Gemini 2.5 Pro ingesting an entire large codebase, understanding its architecture, and answering detailed questions about component relationships, data flow, and potential bugs. The codebase was substantial — hundreds of files, multiple services, a real application with real complexity. The model didn't just search for keywords. It reasoned about how components interacted, identified implicit dependencies, and explained architectural decisions that weren't documented anywhere in comments or README files.
I watched this and my skeptic alarm didn't fire. Here's why: the million-token context window is real. The codebase fit. The questions were specific and technical. The answers demonstrated understanding that couldn't come from surface-level pattern matching. I'd seen enough genuine improvements in long-context processing over the previous year to believe that this was plausible. The trajectory supported it. Large context windows had been getting better — less "lost in the middle" degradation, better attention over long distances, more coherent reasoning about distributed information. This demo looked like the next step on that curve.
So I did what I always tell people to do: I tested it myself. Except I didn't test it well. And that's the part of the story that's actually useful.
What I Did Wrong
I took one of my own codebases — a moderately complex application with about 150 files — loaded it into Gemini 2.5 Pro's context window, and started asking questions. The first few answers were impressive. "What's the data flow from the webhook endpoint to the database?" Gemini traced the path correctly, naming the specific files and functions involved. "Where is the authentication middleware applied?" Correct again, with the right file paths and the right order of operations. "What would break if I changed the response format of this endpoint?" A detailed, accurate answer that identified three downstream consumers I'd have to update.
I was sold. I told two people about it. I started planning a workflow where I'd use Gemini for codebase comprehension and Claude Code for the actual editing. It seemed like a genuine unlock — the first tool that could hold a whole project in its head while I worked on individual pieces.
Then I kept testing, and the cracks appeared.
I asked about a subtle bug I knew existed — a race condition in a queue processing system where two workers could grab the same job under specific timing conditions. This bug existed because of an interaction between three files that were architecturally separate but operationally coupled. Gemini didn't find it. Not because it couldn't see the files — they were in context. Because the bug required understanding not just what the code did, but what could happen when two instances of the code ran simultaneously with specific timing. That's a different kind of comprehension — temporal, concurrent, emergent — and the model couldn't do it.
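The class of bug I'm describing reduces to a check-then-act pattern: a worker reads a job's status, then claims it in a separate step, and another worker can slip in between the two. Here's a minimal, deterministic sketch of that pattern — every name (`Job`, `check`, `claim`) is hypothetical, not from my codebase, and the interleaving is forced by hand rather than by real thread timing:

```python
# Hypothetical sketch of a check-then-act race in a job queue.
# In production this would involve two worker processes and real timing;
# here the unlucky interleaving is written out explicitly.

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.claimed_by = None

def check(job):
    # Step 1 (read): is the job unclaimed?
    return job.claimed_by is None

def claim(job, worker):
    # Step 2 (write): mark it claimed. Nothing makes steps 1 and 2
    # atomic, so another worker can pass check() in between.
    job.claimed_by = worker

job = Job(42)
a_sees_free = check(job)   # worker A checks: job looks free
b_sees_free = check(job)   # worker B checks before A claims: also free
if a_sees_free:
    claim(job, "A")
if b_sees_free:
    claim(job, "B")        # B overwrites A's claim

# Both workers now believe they own job 42 and will process it.
```

Nothing in any single function is wrong, which is exactly why the bug lives in the interaction rather than in any file the model could point to. The usual fix is to make check-and-claim one atomic operation (a conditional UPDATE, a compare-and-swap, or a lock around both steps).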
I asked about the performance characteristics of a database query that I knew was slow because of a missing index combined with a table that had grown beyond the threshold where the query planner's strategy shifted. Gemini told me the query "might be slow for large tables," which is the kind of generic answer you could give by looking at any SELECT statement with a WHERE clause. It didn't identify the specific index issue, the table size threshold, or the query planner behavior. It understood the code's logic without understanding its performance.
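The diagnosis I wanted is something you get by asking the database, not by reading the SELECT statement. As a tiny illustration of the simpler half of it — whether a query uses an index at all — here's a sqlite3 sketch; the `jobs` table and index name are invented for the example, and SQLite's planner is far simpler than the one my bug involved:

```python
# Compare the query plan for the same query before and after adding
# an index, using SQLite's EXPLAIN QUERY PLAN. Table/index names are
# hypothetical; this only illustrates the diagnostic technique.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)"
)

def plan(sql):
    # Row format: (id, parent, notused, detail); detail describes the step.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM jobs WHERE status = 'pending'"
before = plan(query)  # no index on status: full table scan ("SCAN ...")
conn.execute("CREATE INDEX idx_jobs_status ON jobs (status)")
after = plan(query)   # planner now reports "... USING INDEX idx_jobs_status"
```

The subtler part of my actual bug — the planner switching strategies once the table crossed a size threshold — only shows up when you run `EXPLAIN` against the real table with real statistics, which is precisely the kind of runtime evidence a model reading static source can't have.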
And then I asked a question that the demo would never have included: "What's wrong with this codebase?" Not a specific question about a specific component. An open-ended question that a senior engineer could answer after spending a day reading the code. Gemini produced a list of issues that were technically accurate but entirely superficial — missing error handling in some functions, inconsistent naming conventions, TODO comments that hadn't been addressed. These are real issues. They're also the issues that any static analysis tool could find. The actual problems — the architectural coupling that made testing hard, the abstraction that leaked in ways that had caused bugs twice, the authentication flow that worked but was fragile because it depended on a third-party service's undocumented behavior — none of that appeared.
What Fooled Me
In retrospect, I can identify exactly what happened. The demo's questions were designed to test the thing Gemini is genuinely good at: retrieval and tracing over long contexts. "What's the data flow from A to B" is a retrieval task. The information is in the code, distributed across files, and the model's job is to find and connect the relevant pieces. Gemini is good at this — genuinely, measurably good. Better than what I'd seen from comparable models six months earlier.
My first test questions were the same type — retrieval and tracing. So my first results confirmed the demo's claim. This is confirmation bias operating in real time: I asked questions that the tool could answer well, got good answers, and concluded the tool was good. I didn't start with the hard questions because I was excited, and excitement makes you test easy things first so you can keep being excited.
The demo also benefited from a controlled codebase. The application Google used was presumably clean, well-structured, and architected in ways that made the relationships between components explicit. My codebase is also relatively clean — but real codebases have implicit relationships, undocumented assumptions, and behaviors that emerge from the interaction of components rather than being stated in any single file. The demo's codebase was the demo's conditions. My codebase was my conditions. The gap between them is where the demo's claim broke down.
And there's a subtler thing that fooled me: the answers sounded right. LLMs are exceptionally good at generating text that reads like a knowledgeable response. When Gemini gave a detailed, confident answer about my codebase's architecture, the answer had the texture of expertise. It used the right terminology, referenced specific files, and structured its response the way a senior developer would. The form was correct even when the substance was incomplete. And for the first few questions — where the substance was also correct — the form reinforced my confidence. By the time the substance started falling short, I'd already formed an impression.
The Lesson I Keep Having to Learn
I preach a specific methodology for evaluating AI tools: test on your data, test hard tasks, test failure modes, and measure output against what you'd produce without the tool. I wrote the methodology. I teach the methodology. And when a demo was good enough and my excitement was high enough, I shortcut the methodology. I tested on my data (good) but tested easy tasks first (bad), didn't systematically test failure modes (bad), and didn't measure the output against what I'd produce without the tool until several days later (bad).
The lesson isn't that Gemini 2.5 Pro is bad. It's genuinely useful for long-context retrieval tasks. The lesson is that even informed skeptics get fooled when the demo hits the right combination of plausibility and desire. I wanted full-codebase comprehension to work because it would be genuinely useful in my workflow. That wanting created a vulnerability that the demo — which was well-made and not dishonest — exploited by showing me exactly the part that works.
Every demo shows you exactly the part that works. The discipline is in testing the parts they didn't show you. And the hardest time to maintain that discipline is when the part they showed you is genuinely impressive.
Where This Tool Actually Sits
After more careful testing, Gemini 2.5 Pro with a full codebase in context is useful for a specific category of tasks: navigating unfamiliar codebases, tracing data flows, understanding which files are involved in a given feature, and answering "where does X happen" questions. These are real tasks that come up regularly — especially when you're working on a codebase you didn't write or haven't looked at in months.
It's not useful for: finding subtle bugs, evaluating architectural quality, identifying performance issues, understanding emergent behavior, or any task that requires the kind of judgment that comes from actually running the code and seeing what happens. For those tasks, a human developer reading the code — possibly with AI assistance for the reading part — is still the only reliable approach.
That's a narrower value proposition than the demo implied, and a genuine one. The problem was never the tool. The problem was the distance between the demo's implicit promise and the tool's actual capability, amplified by my own desire for the promise to be true.
I've adjusted my evaluation process. Hardest question first now. Before I test what works, I test what I expect to fail. If the failure mode is acceptable, I move to the success cases with clear eyes. If I test success first, the excitement compounds and the skepticism erodes. Hard things first. Always.
This article is part of the Weekly Drop at CustomClanker — one take, every week, no fluff.