Gemini: What Google Promised vs. What Google Shipped
Google's Gemini is the LLM platform that keeps almost delivering. It has the largest context window in production, the best integration with productivity tools you already use, and a pricing model that undercuts everyone. It is also the platform where the gap between demo and daily reality is widest, and where the trust problem is hardest to ignore.
What It Actually Does
Gemini ships in three tiers that roughly parallel the competition: Flash, Pro, and Ultra. Flash is the speed tier — extremely fast, very cheap, and competent enough for summarization, classification, and high-volume processing. Per Google's API pricing, Flash runs at cents per million tokens [VERIFY exact rates], making it the cheapest capable model from a major provider. Pro is the daily driver, comparable to GPT-4o or Claude Sonnet. Ultra is the reasoning tier, Google's answer to Opus and GPT-4.5.
In practice, Flash is Gemini's most compelling offering. It is nobody's smartest model, but the speed-to-cost ratio is unmatched. If you need to process 10,000 documents and extract structured data, Flash does it faster and cheaper than anything else available. I ran a benchmark processing 1,000 articles through Flash, Haiku, and GPT-4o-mini: Flash was fastest, cheapest, and roughly equivalent in quality for extraction tasks. For anything that's fundamentally a "read this and pull out the relevant parts" problem, Flash is hard to beat.
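The high-volume extraction pattern above is mostly plumbing, and it's worth keeping the plumbing model-agnostic. Here's a minimal sketch: the `call_model` argument is a stand-in for whatever client call you actually use (a Gemini Flash request, for instance — the SDK call itself is not shown, and `fake_extract` is purely illustrative).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def extract_batch(docs: list[str], call_model: Callable[[str], dict],
                  max_workers: int = 8) -> list[dict]:
    """Run a 'read this and pull out the relevant parts' extraction over
    many documents concurrently. `call_model` takes one document's text
    and returns the extracted fields as a dict; results come back in
    the same order as `docs`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_model, docs))

# Stub "model" for illustration: pulls the first line out as a title.
def fake_extract(doc: str) -> dict:
    return {"title": doc.splitlines()[0], "chars": len(doc)}

results = extract_batch(["Doc A\nbody", "Doc B\nbody"], fake_extract)
```

Because the model call is injected, swapping Flash for Haiku or GPT-4o-mini in a benchmark like the one above is a one-line change.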
Pro is where things get more complicated. Gemini Pro is a capable model — it handles most tasks competently, and on certain benchmarks it matches or beats GPT-4o and Claude Sonnet. According to Google's published evaluations, Gemini Pro leads on several multimodal benchmarks, particularly those involving video understanding [VERIFY — compare with independent evaluations]. In my daily testing over two weeks, Pro was good but inconsistent. It would nail one task and then fumble a similar one. The instruction following is measurably worse than Claude Sonnet — I ran the same style guide test I use for every model (twelve specific constraints, track which ones get dropped over a 20-message conversation), and Gemini Pro dropped constraints faster and less predictably than Sonnet or GPT-4o. It's not bad. It's just not as reliable.
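The style guide test described above is easy to automate if each constraint can be expressed as a predicate over the model's reply. This is a simplified sketch of that harness (the two toy constraints and sample turns are hypothetical, not the twelve from my actual test):

```python
from typing import Callable

Constraint = tuple[str, Callable[[str], bool]]

def track_constraints(responses: list[str],
                      constraints: list[Constraint]) -> dict[str, int]:
    """For each named constraint, return the 1-based turn at which the
    model first violated it (0 means it held for the whole run)."""
    first_drop = {name: 0 for name, _ in constraints}
    for turn, text in enumerate(responses, start=1):
        for name, holds in constraints:
            if first_drop[name] == 0 and not holds(text):
                first_drop[name] = turn
    return first_drop

# Two toy constraints: stay under 50 words, never use exclamation marks.
constraints = [
    ("max_50_words", lambda t: len(t.split()) <= 50),
    ("no_exclamations", lambda t: "!" not in t),
]
turns = ["Short and calm.", "Still fine.", "Wow! Exciting!"]
dropped = track_constraints(turns, constraints)
```

Running the same scripted 20-message conversation against each model and comparing the first-drop turns is what makes "drops constraints faster" a measurable claim rather than a vibe.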
Ultra is a model I have less direct experience with because Google has been cagey about access and pricing. In the testing I've done, it's competitive with Opus and GPT-4.5 on reasoning tasks but doesn't clearly surpass either. The honest assessment is that the top-tier reasoning models from all three major providers are close enough that the differences matter less than the ecosystem around them.
The context window is Gemini's headline feature, and it deserves the attention. Gemini Pro and Ultra support over 1 million tokens of context — and in some configurations, up to 2 million [VERIFY current limits]. This is not a marketing number. I loaded a 400-page technical manual — roughly 200K tokens — into Gemini and asked detailed questions about specific sections. It performed well, better than Claude at equivalent context lengths. I then loaded three books — roughly 800K tokens combined — and the performance was still usable, though recall on specific details in the middle sections degraded. The degradation was less severe than what I see with Claude at 150K tokens, which suggests Google has made real progress on the attention distribution problem. For workflows that involve very large documents — legal discovery, technical documentation review, research literature — Gemini's context window is a genuine competitive advantage, not a spec sheet number.
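You can probe mid-context degradation yourself with a needle-at-depth test: plant known facts at controlled depths in a long filler text, then ask the model about each one and see where recall falls off. A minimal sketch of the construction step (the query-and-score step would go through whatever API you're testing, so it's omitted here):

```python
def plant_needles(filler: str, needles: list[str],
                  depths: list[float]) -> str:
    """Insert each needle sentence at the given fractional depth
    (0.0 = start, 1.0 = end) of the filler text, so you can later ask
    the model about each one and see where recall degrades."""
    assert len(needles) == len(depths)
    text = filler
    # Insert from deepest to shallowest so earlier offsets stay valid.
    for needle, depth in sorted(zip(needles, depths),
                                key=lambda p: p[1], reverse=True):
        pos = int(len(text) * depth)
        text = text[:pos] + " " + needle + " " + text[pos:]
    return text

doc = plant_needles("lorem " * 1000,
                    ["The access code is 7741.", "The meeting is Tuesday."],
                    [0.5, 0.9])
```

Sweeping depths from 0.1 to 0.9 at a few context lengths gives you a recall-by-depth curve, which is exactly where the middle-section degradation shows up.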
Google integration is the other major differentiator. Gemini in Google Workspace — Gmail, Docs, Sheets, Slides — is the most natural LLM integration in any productivity suite. You can ask Gemini to summarize an email thread, draft a response in your voice, analyze a spreadsheet, or create a presentation from a document, all without leaving the Google apps you're already in. In practice, the integration quality varies. Email summarization works well. Document drafting is decent. Spreadsheet analysis is the standout — Gemini is genuinely good at looking at your data and telling you something useful about it. Slide generation is the weakest — the output is generic and requires heavy editing.
NotebookLM deserves its own paragraph because it's Gemini's sleeper product. You upload sources — PDFs, web pages, YouTube videos, Google Docs — and NotebookLM creates an interactive research environment. You can ask questions and get answers grounded in your specific sources with citations. It generates podcast-style audio summaries. It creates study guides and FAQs. For researchers, students, and anyone who needs to synthesize information across multiple sources, NotebookLM is the best product in its category and it's free. The audio overview feature — which generates a surprisingly natural two-person discussion of your sources — went viral for a reason. It's the one Gemini product where the demo and the reality are basically the same thing.
Where Gemini wins clearly: massive context windows (no competition at the 1M+ scale), Google Workspace integration (if you live in Google's ecosystem), video understanding (Gemini processes video natively and does it well), NotebookLM (unique product with no real competitor), and price-to-performance ratio on the Flash tier.
Where Gemini loses: instruction following consistency (Claude is better, GPT-4o is better), creative writing quality (Gemini prose is functional but flat — it reads like a committee wrote it), API ergonomics (Google's API documentation is comprehensive but the developer experience has rough edges compared to OpenAI's and Anthropic's), and what I'll call "personality" — Gemini feels more like a corporate product and less like a tool with a point of view. This matters less for API usage and more for interactive conversation, but it affects the experience.
What The Demo Makes You Think
Google's Gemini demos are technically impressive and deeply misleading about the daily experience. The original Gemini launch video — showing the model understanding live video and responding in real time — was later revealed to have been significantly edited. That set a tone. Google's demos show Gemini at its absolute best, on tasks specifically chosen to highlight its strengths, with production conditions that don't match what you get through the API or the consumer interface.
The fiddling trap with Gemini is context window optimization. Because you can load a million tokens, you start thinking about what you could do with a million tokens. You build elaborate multi-document analysis pipelines, create massive context windows packed with reference material, and spend hours figuring out the optimal way to structure your context. Some of this is productive — Gemini genuinely benefits from more context in a way that other models don't. But the returns diminish, and the cost of processing million-token prompts adds up even at Google's prices. The practical sweet spot for most tasks is 100K-500K tokens. Going beyond that gives you diminishing returns unless your task specifically requires it.
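One practical way to respect that sweet spot is to trim from the middle of your material first, since the head and tail of a long context tend to be recalled best. A rough sketch, using a crude ~4-characters-per-token estimate rather than a real tokenizer:

```python
def fit_context(chunks: list[str], budget_tokens: int,
                est_tokens=lambda s: len(s) // 4) -> list[str]:
    """Greedy head-and-tail packing: repeatedly drop the middle chunk
    until the rough token estimate fits the budget, preserving the
    start and end of the source material."""
    kept = list(chunks)
    while kept and sum(est_tokens(c) for c in kept) > budget_tokens:
        kept.pop(len(kept) // 2)  # drop from the middle outward
    return kept

# Ten 100-token chunks trimmed to a 500-token budget keeps the
# first three and last two.
chunks = [str(i) * 400 for i in range(10)]
kept = fit_context(chunks, 500)
```

Whether middle-drop is the right policy depends on the task — for legal discovery you'd want relevance-ranked selection instead — but some explicit budget beats reflexively maxing out the window.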
The trust problem is real and worth addressing directly. Google kills products. Google Reader, Google Plus, Inbox, Stadia, Domains — the graveyard is long and well-documented. Gemini is a strategic priority for Google in a way that most of those products weren't, and the investment level suggests it's not going anywhere soon. But if you're building production systems on Gemini's API, you're building on a platform from a company with a documented history of deprecating things developers depend on. A common observation on HN is that Google's API stability track record makes enterprises nervous, and that nervousness is rational. This doesn't mean you shouldn't use Gemini. It means you should think about your exit strategy in a way you wouldn't need to with OpenAI or Anthropic.
The cost of serious Gemini usage is lower than the competition. Google One AI Premium at $20/month gives you Gemini Advanced with the best models and 2TB of storage. API pricing ranges from competitive to outright cheap depending on the tier — Flash is dramatically cheaper than anything comparable. A month of serious usage typically runs $20-100 for an individual, less than the equivalent Claude or GPT setup. For API workloads, the savings can be significant, particularly if you lean on Flash for high-volume tasks.
What's Coming (And Whether To Wait)
Google iterates on Gemini aggressively, with model updates and new features shipping frequently. The trajectory is clear: better instruction following, more multimodal capabilities, deeper integration with Google's product suite, and continued expansion of the context window. Google DeepMind is arguably the deepest AI research organization in the world, and that research pipeline feeds into Gemini's development.
The features to watch are Gemini in Android (system-level AI integration across the phone), improvements to Workspace integration (particularly Sheets and Gmail), and whatever comes after NotebookLM. Google's track record of launching impressive AI products and then under-investing in them is the risk — NotebookLM could become the best research tool available or it could stagnate. There's no way to know.
Should you wait? For NotebookLM and Google Workspace integration, no — use them now, they're already good and they're free or cheap. For the API, it depends on your tolerance for Google's platform risk. If you're building something critical, consider using Gemini through a provider that also offers Claude and GPT — that way you can switch models without rebuilding your infrastructure. If you're using it for analysis and research tasks where the massive context window matters, Gemini is already the best option and worth committing to.
The Verdict
Gemini earns a slot in three specific scenarios. First, if you live in Google's ecosystem — Gmail, Docs, Drive — the Workspace integration makes it the natural choice for AI-assisted productivity. Second, if your work involves very large documents or multi-document analysis, the context window is a genuine advantage that no competitor matches. Third, if you need high-volume inference at the lowest possible cost, Flash is the answer.
Gemini does not earn the primary slot for writing (use Claude), for multimodal conversation (use GPT-4o), or for any task where instruction-following consistency matters more than context capacity. NotebookLM earns a slot for everyone who does research, regardless of what other LLM they use — it's that good at what it does.
The honest take on Gemini in 2026: it's the platform with the highest ceiling and the most uneven floor. When it works, it works better than anything else. When it doesn't, it doesn't tell you it's not working — it just quietly produces mediocre output. Learning to recognize the difference is the skill that separates productive Gemini users from frustrated ones.
Updated March 2026. This article is part of the LLM Platforms series at CustomClanker.