When The AI Confabulates About Other AIs — Models Lying About Models

You asked Claude about GPT-4o's capabilities. You got a detailed answer — context window, supported modalities, pricing per million tokens, benchmark scores, feature list. It sounded precise. Half of it was wrong. The pricing was from three months ago. The context window number was from a different tier. One of the benchmark scores was misattributed from a different model entirely. You then asked GPT about Claude's MCP support and got a response that described MCP as if it were a plugin system with a marketplace — a description that borrows from how ChatGPT plugins used to work, not how MCP actually works. The AI wasn't confused about itself. It was confused about the other AI. And it told you so with the same calm detail it uses for everything else.

The Pattern

AI models have limited and often outdated information about competing models. This shouldn't be surprising — the training data for any model is a snapshot of the internet at a point in time, and AI capabilities change faster than almost anything else on the internet. But the effect is more pronounced than people expect, because the AI doesn't flag the staleness. It describes a competitor's capabilities the way it describes everything else: confidently, specifically, and with enough detail to sound like a primary source.

The competitive blind spot is structural, not accidental. An AI model's training data contains more content about itself — or more precisely, about the company and products associated with it — than about its competitors. Anthropic's docs, blog posts, and technical papers are well-represented in Claude's training data. OpenAI's documentation is well-represented in GPT's training data. When you ask a model about its own capabilities, the information density in the training data is relatively high. When you ask it about a competitor, the information is sparser, older, and more likely to be drawn from secondary sources — blog posts, comparisons, social media discussions — rather than official documentation. The model fills the gaps the same way it always does: by generating the most statistically plausible text. Plausible, not accurate.

The benchmark confabulation is one of the most common and most verifiable failure modes. Ask an AI to compare its own benchmark scores against a competitor's. You will frequently get numbers that are outdated, misattributed, or simply fabricated — generated because a benchmark score in that general range for that category of model would be statistically plausible. "Claude scores [X] on HumanEval" — check the actual leaderboard. The number the AI gave you might be from an older version, from a different benchmark, or from a blog post speculating about what the score might be. Verify any specific benchmark number an AI gives you about any model, including itself, against the actual published results.
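One low-tech way to make that verification stick is a small ledger: what the AI claimed, next to the number you actually found in the published results, with a rule that nothing unverified makes it into a decision. Here's a minimal sketch in Python; the model names, scores, and source URL are placeholders, not real figures.

    # A minimal claims ledger: record what the AI asserted, then fill in the
    # number you actually found in published results before relying on it.
    # Model names, scores, and the source URL are illustrative placeholders.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class BenchmarkClaim:
        model: str
        benchmark: str
        claimed_score: float                     # what the AI told you
        verified_score: Optional[float] = None   # what the published results say
        source_url: str = ""                     # where you checked it

        def status(self) -> str:
            if self.verified_score is None:
                return "UNVERIFIED - do not cite"
            if abs(self.claimed_score - self.verified_score) < 0.5:
                return f"matches published result ({self.source_url})"
            return (f"WRONG - claimed {self.claimed_score}, "
                    f"published {self.verified_score} ({self.source_url})")

    claims = [
        BenchmarkClaim("model-a", "HumanEval", claimed_score=88.0),
        BenchmarkClaim("model-b", "HumanEval", claimed_score=92.0,
                       verified_score=84.9, source_url="https://example.com/model-b-card"),
    ]

    for claim in claims:
        print(f"{claim.model} / {claim.benchmark}: {claim.status()}")

Nothing clever is happening here. The value is the discipline: a claimed score with no verified score next to it is not a fact yet.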

The feature attribution problem is subtler and harder to catch. Model capabilities overlap enough that the AI can easily conflate them. You ask about one model's feature and get a description that's actually about a different model's implementation of a similar-sounding capability. Claude's artifacts get described using language that sounds more like GPT's canvas. GPT's function calling gets described with terminology from Claude's tool use. The descriptions aren't completely wrong — the models do have analogous features — but the specific details are muddled. Parameters, limitations, naming conventions, and implementation details get swapped across models, creating a description that's accurate about the category but wrong about the specifics.

Pricing confabulation deserves its own category because it's so common and so consequential. AI model pricing changes frequently — new tiers get added, per-token costs get adjusted, free tiers get modified, enterprise pricing gets restructured. The AI's training data contains pricing information from whatever the pricing was at training time. If you're using that pricing information to budget a project or compare the cost of different models, you're working with numbers that are almost certainly wrong. Not approximately wrong — the pricing structures themselves may have changed. A model that was priced per token might now offer a flat subscription tier. A model that had a generous free tier might have restricted it. The AI will quote you the old numbers with the same precision it uses for everything else.
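To see how much a stale snapshot can move a budget, here's a small worked example. The per-million-token prices are invented for illustration and don't correspond to any provider's actual rates.

    # Worked example: the same workload costed with a stale price snapshot
    # versus the numbers on the provider's current pricing page.
    # All prices here are invented for illustration.
    def monthly_cost(input_tokens: int, output_tokens: int,
                     price_in_per_m: float, price_out_per_m: float) -> float:
        """Dollar cost given per-million-token prices."""
        return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

    workload = dict(input_tokens=400_000_000, output_tokens=80_000_000)  # tokens per month

    stale = monthly_cost(**workload, price_in_per_m=8.00, price_out_per_m=24.00)    # what the AI quoted
    current = monthly_cost(**workload, price_in_per_m=3.00, price_out_per_m=15.00)  # from the pricing page today

    print(f"budget from the AI's quoted prices: ${stale:,.2f}")
    print(f"budget from current pricing:        ${current:,.2f}")
    print(f"overestimate: {(stale - current) / current:.0%}")

With these made-up numbers, the stale quote comes out at more than twice the real cost. The arithmetic is trivial; it's only as good as the prices you feed it, and those belong to the provider's current pricing page, not the model's memory.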

The context window telephone game is a specific and remarkably common confabulation. Context window sizes are one of the most frequently discussed and compared specs in AI model marketing. They're also one of the most frequently changed specs. A model that launched with a 128K context window might now support 200K — or it might have been downgraded, or the actual effective context window might differ from the advertised one, or different tiers might have different limits. Ask one AI about another AI's context window and you'll get a specific number delivered with confidence. That number is from the training data snapshot. The current number may be different. The effective number — how much context the model can actually use effectively, versus how much it technically accepts — is a different question entirely that the AI is even less equipped to answer about a competitor.
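If the effective context window actually matters for your decision, the only trustworthy answer comes from testing the current model yourself, not from asking another model about it. Below is a rough needle-in-a-haystack probe; call_model is a hypothetical stand-in you'd wire to whatever client you actually use, and the characters-per-token estimate is deliberately crude.

    # Rough probe of an *effective* context window: bury a known fact deep in
    # filler text and check whether the current model can still retrieve it.
    # call_model() is a hypothetical placeholder - wire it to the real client
    # for the model you're evaluating.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("replace with a real API call")

    NEEDLE = "The vault code is 7431."
    FILLER = "This sentence is padding and carries no useful information. "

    def needle_retrieved(approx_tokens: int) -> bool:
        filler_chars = approx_tokens * 4          # crude ~4 chars/token estimate
        haystack = FILLER * (filler_chars // len(FILLER))
        prompt = (NEEDLE + "\n" + haystack +
                  "\nWhat is the vault code? Answer with the number only.")
        return "7431" in call_model(prompt)

    for size in (50_000, 100_000, 150_000, 200_000):
        try:
            result = "retrieved" if needle_retrieved(size) else "missed"
        except NotImplementedError as exc:
            result = f"skipped ({exc})"
        print(f"{size:>7} tokens: {result}")

A real harness would vary where the needle sits and repeat trials, but even this crude version answers the question the other AI can't: what the model in front of you does today.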

The model comparison use case is where all of these problems converge. If you're using an AI to help you decide which AI model to use for a project — and this is a common and reasonable workflow — every piece of information in the comparison is suspect. The capabilities, the pricing, the benchmarks, the limitations, the feature set — each one is drawn from the AI's training data snapshot of a competitor's product that may have changed significantly since that snapshot was taken. You're making a decision based on a comparison where one side is described from outdated secondary sources and the other side is described from slightly-less-outdated primary sources. The comparison feels authoritative because it's structured and specific. But structure and specificity are what make it persuasive; they are not evidence that it's right.

The Psychology

There's an almost comic dimension to asking one AI about another AI. You're asking a participant in a competition to describe its competitors. Not that the AI has competitive intent — it doesn't. But the training data skew means it knows more about itself than about the others, and it fills the knowledge gap with generation rather than silence. The result is functionally similar to asking a salesperson to compare products: you get a detailed answer shaped more by available information than by even-handed analysis.

The reason people trust these cross-model comparisons is that the AI doesn't sound biased. It doesn't say "we're better." It gives you a structured comparison with apparent objectivity — specifications, features, use cases. The format reads as neutral. But neutrality of format is not neutrality of information. The AI may describe its own capabilities from detailed documentation and a competitor's capabilities from a six-month-old blog post, and both descriptions land with the same confident tone. The information asymmetry is invisible in the output.

There's also a convenience factor. Comparing AI models by going to each model's official documentation, pricing page, and changelog is tedious. It's exactly the kind of research task that AI assistants are supposed to help with. So you ask the AI to do the comparison, and the AI obliges — generating a plausible-sounding comparison that saves you the research time but may send you down the wrong path. The convenience is real. The accuracy is not guaranteed. The tradeoff is invisible until you discover the pricing changed, or the context window number was wrong, or the feature you chose the model for doesn't actually work the way the AI described.

The Fix

Never trust an AI's description of a competing AI's capabilities. This is the one domain where AI confabulation is most predictable and most systematic. The information the model has about competitors is sparser, older, and more likely to be drawn from secondary sources than the information it has about itself — and even the information it has about itself may be outdated if the company shipped updates after the training cutoff.

For model comparisons, go to the source. Each model provider publishes its own documentation, pricing page, and changelog. These are the primary sources. They're current. They're authoritative. The fifteen minutes it takes to check each provider's current pricing page will give you more reliable information than any AI-generated comparison table.
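If "go to the source" feels like too much friction to repeat every time, make it a checklist. A small manifest of primary sources, with a note of when you last actually looked at each one, is enough; the URLs and dates below are placeholders you'd swap for the real documentation, pricing, and changelog pages.

    # A manifest of primary sources to re-check before trusting any model
    # comparison. URLs and last-checked dates are placeholders - point them
    # at each provider's real documentation, pricing, and changelog pages.
    from datetime import date, timedelta

    SOURCES = {
        "provider-a": {
            "docs": "https://example.com/provider-a/docs",
            "pricing": "https://example.com/provider-a/pricing",
            "changelog": "https://example.com/provider-a/changelog",
        },
        "provider-b": {
            "docs": "https://example.com/provider-b/docs",
            "pricing": "https://example.com/provider-b/pricing",
            "changelog": "https://example.com/provider-b/changelog",
        },
    }

    LAST_CHECKED = {"provider-a": date(2024, 1, 10), "provider-b": date(2024, 3, 2)}
    MAX_AGE = timedelta(days=30)   # model facts go stale fast; pick your own threshold

    for provider, pages in SOURCES.items():
        age = date.today() - LAST_CHECKED[provider]
        flag = "RE-CHECK" if age > MAX_AGE else "recent enough"
        print(f"{provider}: last verified {age.days} days ago [{flag}]")
        for name, url in pages.items():
            print(f"  {name}: {url}")

Thirty days is an arbitrary threshold; given how fast model specs move, you may want it shorter.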

For benchmark claims, check the actual leaderboards. Public leaderboards like the LMSYS Chatbot Arena and the Open LLM Leaderboard, along with the model providers' own published benchmark results, are the primary sources. Verify any specific number the AI gives you — the number might be from the wrong model, the wrong version, the wrong benchmark, or simply generated because it sounded plausible.

For feature claims — "Model X supports function calling with streaming" or "Model Y has a 200K context window" — go to the model's current documentation. Not last month's blog post. Not a comparison article from three months ago. The current, official documentation. AI model capabilities change on cycles measured in weeks. The AI's training data was collected months ago. The gap is where the confabulation lives.

The meta-lesson is this: AI models are least reliable when describing the thing that changes fastest — which, right now, is AI models. The landscape is updating weekly. The training data is months old. The gap between those two facts is filled with plausible-sounding text that may or may not describe the current state of the world. Use the AI to understand concepts and patterns. Use primary sources for current facts. And when an AI tells you something about another AI, verify it — because that's the one topic where the model is almost guaranteed to be working from outdated information.


This is part of CustomClanker's AI Confabulation series — when the AI in your other tab is confidently wrong.