GPT Limitations: The Honest List
Every article in this series covers what works. This one collects what doesn't. Not as a hit piece — GPT is a genuinely capable tool that millions of people use productively every day — but as an honest inventory of the limitations that OpenAI's marketing glosses over and that users discover through frustration rather than documentation. If you've used ChatGPT for more than a week, you've hit at least three of these. If you've built on the API, you've hit all of them.
These limitations aren't unique to GPT. Every large language model shares most of them to varying degrees. But GPT's specific profile — broad capability, inconsistent reliability, fast iteration that breaks things — creates a particular experience that's worth mapping clearly. Knowing where the tool fails is at least as valuable as knowing where it succeeds, because the failures are where you waste time.
Hallucination
GPT makes things up. Confidently. With formatting that suggests accuracy. With citations that don't exist.
This has improved meaningfully from GPT-3.5 to GPT-4 to GPT-4o. The rate of outright fabrication is lower. The model is better at saying "I'm not sure" when it genuinely doesn't know. But the improvement is from "frequently hallucinates" to "occasionally hallucinates," not from "occasionally hallucinates" to "never hallucinates." And the occasions cluster in exactly the wrong places — specific factual claims, numerical data, academic citations, legal references, and any domain where being 95% right and 5% fabricated is worse than being completely unhelpful.
The practical problem isn't that GPT hallucinates. It's that you can't tell when it's hallucinating without checking. The model's confidence level is not correlated with its accuracy. A correct fact and a fabricated fact are delivered with identical certainty, identical formatting, and identical conversational fluency. There's no built-in signal for "I'm less sure about this part." The logprobs API gives developers a technical mechanism for confidence estimation, but it's not exposed to consumers, and even for developers, mapping token probabilities to factual accuracy is more research project than practical tool.
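For developers who do want to experiment with logprobs, the core move is converting per-token log probabilities into something readable. This is a minimal sketch of that conversion only; the hard part, mapping the score to factual accuracy, remains the open problem described above. The function name and the heuristic are illustrative, not an OpenAI API.

```python
import math

def average_token_confidence(token_logprobs):
    """Rough confidence heuristic: mean probability across generated tokens.

    token_logprobs is a list of floats, one log probability per token,
    of the shape returned when the Chat Completions API is called with
    logprobs=True. A high average does NOT guarantee factual accuracy;
    that gap is exactly the research problem noted above.
    """
    if not token_logprobs:
        return 0.0
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)
```

In practice this is most useful as a relative signal, flagging the lowest-confidence spans of a response for human review rather than trusting the absolute number.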
The web browsing feature in ChatGPT helps — when it works. The model can search for current information and cite real sources. But browsing is slow, the sources it finds are not always authoritative, and the model sometimes synthesizes information from browsing results in ways that don't match what the sources actually say. Browsing reduces hallucination for current events and verifiable facts. It doesn't eliminate it.
For any task where factual accuracy matters — research, reporting, legal analysis, medical information, financial data — GPT's output is a first draft, not a finished product. Verify everything. This is not a disclaimer. It's a workflow requirement.
The Knowledge Cutoff
GPT's training data has a date. For GPT-4o, OpenAI has reported a knowledge cutoff of October 2023, though cutoffs can differ between model snapshots, so check the model card for the version you're actually using. The model does not know what happened after that date unless it uses web browsing to look it up. And the model doesn't always know what it doesn't know — it will sometimes answer questions about recent events using outdated information rather than acknowledging that its training data might not cover the topic.
Web browsing partially addresses this, but it introduces its own problems. Browsing is slow — a simple fact-check that takes a human five seconds on Google takes ChatGPT 10-30 seconds of browsing. Browsing results are not always the most authoritative sources. And the model sometimes fails to browse when it should, answering from potentially outdated training data instead. There's no reliable way to force browsing for every factual question — the model decides based on its own assessment of whether browsing is needed, and that assessment is sometimes wrong.
For developers using the API, there is no browsing. The model's knowledge ends at its cutoff date. If your application needs current information, you need to provide it — through function calling, RAG, or direct context injection. The API model doesn't know it's outdated and won't tell you when its information might be stale.
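The simplest form of that injection is just prompt assembly: fetch the current data yourself, then build the message list around it. A hedged sketch of the pattern, with illustrative function and field names:

```python
def build_messages(question, fresh_facts):
    """Inject current information into the prompt, since the API model
    cannot browse. fresh_facts is data your application fetched itself,
    e.g. from a database, a search API, or a RAG retrieval step."""
    context_block = "\n".join(f"- {fact}" for fact in fresh_facts)
    return [
        {"role": "system",
         "content": "Answer using ONLY the facts below. If the facts do not "
                    "cover the question, say so instead of guessing.\n"
                    "Facts:\n" + context_block},
        {"role": "user", "content": question},
    ]
```

The explicit "say so instead of guessing" instruction matters: without it, the model will happily fall back on stale training data when the injected facts don't cover the question.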
Context Window Degradation
GPT-4o's 128K token context window sounds enormous — roughly 100,000 words, or a full-length novel. The number is real. The performance across that entire window is not uniform.
Performance degrades as context grows, and the degradation is not linear. The first 10K tokens of context are processed reliably. By 50K tokens, the model starts missing references that appear in the middle of the context — the well-documented "lost in the middle" problem. By 100K tokens, you're dealing with a model that has access to all the text but reliably processes only the beginning and the end. Information buried in the middle of a very long context gets effectively lost.
This matters for any task that involves large documents, long conversations, or multi-document analysis. If you upload a 50-page report and ask about a detail on page 25, the answer might be right or might be a hallucination based on the model's training data rather than the document you uploaded. The model won't tell you it missed the relevant section — it will give you a confident answer regardless.
The practical workarounds are the same as for every LLM with this problem: keep contexts focused, put the most important information at the beginning and end, break large documents into chunks and process them separately, and don't trust the model's recall of specific details from long contexts without verification. The 128K number is a ceiling, not a sweet spot.
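The chunking workaround can be sketched in a few lines. This version counts characters for simplicity; production code would count tokens with a tokenizer, but the overlap idea, repeating a slice of each chunk at the start of the next so details near boundaries aren't lost, is the same. Sizes here are arbitrary assumptions.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split a long document into overlapping chunks so each one fits
    comfortably in the reliably-processed early region of the context
    window, instead of burying content in the middle of 100K tokens."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then processed in its own focused request, and the per-chunk answers are merged in a final pass.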
Inconsistency
Same prompt. Different responses. Every time.
This is by design — language models are stochastic systems. The temperature parameter controls randomness, and even at temperature 0, GPT is not perfectly deterministic: OpenAI's seed parameter and the accompanying system_fingerprint field exist precisely because the API promises only mostly reproducible outputs, not identical ones. For creative tasks, this variability is a feature. For tasks requiring reproducibility — test automation, data processing, content generation with specific requirements — it's a problem.
The inconsistency isn't just in wording. It's in substance. Ask GPT the same analytical question three times and you might get three different conclusions, each presented with equal confidence. Ask it to format data the same way it did in the previous message and it might change the format without acknowledgment. Ask it to follow the same rules it just demonstrated and it might drift. The model has no strong mechanism for self-consistency — each generation is influenced by the prompt and the randomness seed, not by an internal commitment to consistency.
Structured outputs help. When you force the model to return JSON matching a schema, the structure is guaranteed. But the content within that structure — the actual values, the actual analysis, the actual decisions — still varies. Structured outputs solve the format problem. They don't solve the substance problem.
For applications requiring consistency, the mitigation is engineering: deterministic seeds where available, structured outputs for format control, multiple generations with voting or aggregation for substance, and human review for anything that matters. None of these eliminate inconsistency. They manage it.
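The voting mitigation mentioned above reduces, in its simplest form, to running the same prompt several times and only accepting an answer a majority of runs agree on. A minimal sketch, assuming the generations have already been normalized into comparable strings:

```python
from collections import Counter

def majority_answer(generations, min_agreement=0.5):
    """Accept an answer only if more than min_agreement of the runs
    produced it; otherwise return None to signal 'escalate to a human'.
    generations is a list of normalized answer strings from repeated
    calls with the same prompt."""
    if not generations:
        return None
    answer, count = Counter(generations).most_common(1)[0]
    if count / len(generations) > min_agreement:
        return answer
    return None  # no consensus: route to human review
```

This multiplies cost by the number of runs, which is why it's reserved for decisions that matter, not applied everywhere.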
The Safety Over-Refusal Problem
GPT refuses legitimate requests. Regularly. The content policy is designed to prevent harmful outputs — no detailed instructions for violence, no generation of CSAM, no impersonation of real people. These are reasonable guardrails. The implementation, however, is broad enough to catch a significant volume of legitimate professional and creative work in the crossfire.
Examples that hit the refusal wall in practice: creative fiction involving conflict or violence. Medical information requests that the model interprets as seeking self-harm advice. Security research questions that the model interprets as malicious. Historical analysis of atrocities. Legal analysis involving criminal scenarios. Academic discussion of sensitive topics. Artistic prompts involving nudity or mature themes. The model doesn't distinguish between a novelist writing a thriller, a medical student studying pharmacology, and someone with harmful intent — all three get the same refusal for certain categories of requests.
The refusals are inconsistent, which makes them particularly frustrating. A request that gets blocked in one conversation might succeed in another with minor rephrasing. The line between "the model will answer this" and "the model will refuse this" is unpredictable, context-dependent, and impossible to map in advance. This makes GPT unreliable for any workflow that regularly touches sensitive content — not because the model can't help, but because you can't predict when it will decide not to.
The competitive landscape matters here. Claude has its own refusal patterns — different categories, different thresholds. Neither is clearly more or less permissive overall. But both exhibit the same fundamental problem: content policies designed for the worst-case user applied to all users, with no mechanism for authenticated professionals to access a less restricted tier.
Math and Reasoning
GPT-4o is better at math than GPT-3.5. The o1 and o3 series models — with chain-of-thought reasoning — are meaningfully better still. But "better" is relative, and the baseline for LLM math was very low.
Simple arithmetic, basic algebra, straightforward word problems — GPT handles these reliably. Multi-step reasoning, problems requiring careful tracking of state across steps, problems with subtle logical dependencies — the error rate climbs. Not because the model can't do math, but because it does math by pattern matching on similar problems it's seen in training, not by executing a formal reasoning process. When the problem is close to a pattern it knows, it gets the right answer. When the problem deviates from known patterns, it confidently gets the wrong answer.
Code Interpreter (Advanced Data Analysis) in ChatGPT mitigates this — the model writes Python code to solve math problems rather than reasoning about them in text, and the code runs in a sandbox. This is the correct approach for anything beyond basic calculation. But Code Interpreter is a ChatGPT feature, not an API feature (unless you use the Assistants API with Code Interpreter enabled), and even in ChatGPT, the model doesn't always choose to use it when it should.
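The principle behind Code Interpreter is available to anyone: express the computation as code and execute it, rather than asking the model to reason about it in text. A hypothetical example of the kind of multi-step, state-tracking arithmetic that trips up free-text reasoning — splitting a bill by weighted shares, exactly, in cents:

```python
from fractions import Fraction

def split_bill(total_cents, shares):
    """Split total_cents across weighted shares with no rounding drift.
    Fraction keeps the intermediate math exact; leftover cents from
    flooring are handed out one at a time so the parts always sum to
    the total. Code execution gets this right every time; free-text
    LLM arithmetic gets it right only when the problem matches a
    familiar pattern."""
    weight_sum = sum(shares)
    exact = [Fraction(total_cents * s, weight_sum) for s in shares]
    rounded = [int(x) for x in exact]       # floor each share
    remainder = total_cents - sum(rounded)  # leftover cents to distribute
    for i in range(remainder):
        rounded[i] += 1
    return rounded
```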
The o1/o3 models represent a genuine step forward for reasoning tasks. They spend more compute on thinking before responding, and the results for math, logic, and multi-step reasoning are measurably better. The trade-off is speed — o1/o3 responses take longer and cost more. For tasks where reasoning accuracy matters, they're worth it. For tasks where speed matters more than deep reasoning, GPT-4o is fine. OpenAI doesn't surface this trade-off clearly to users — choosing the right model for the task is a skill the product doesn't teach.
Long Conversation Degradation
Start a conversation with GPT. Have a productive first hour. By the second hour, notice that responses are getting more generic, instructions you stated early in the conversation are being ignored, and the model is occasionally contradicting things it said earlier. By the third hour, you're essentially talking to a degraded version of the model that has lost most of the conversation's structure.
This is the context window problem manifesting as conversation quality. As the conversation grows, older messages get pushed further back in the context. The model prioritizes recent messages, and the instructions, constraints, and decisions from earlier in the conversation receive decreasing attention. The model doesn't forget them in a binary sense — they're still in the context — but they lose influence relative to the most recent exchanges.
The workaround is shorter conversations with explicit handoffs. End a conversation, summarize the state, start a new one with the summary as context. This works but it's a workflow cost that the product doesn't acknowledge or facilitate. ChatGPT's interface is designed for long, flowing conversations. The model's capability degrades in long, flowing conversations. The product design and the technical reality are in tension.
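Developers building on the API can make the handoff explicit in code. A sketch of the pattern, where the summarizer is any function that turns a message history into a short state summary (it could itself be one cheap model call); the function name is illustrative:

```python
def handoff(history, summarize):
    """Collapse a long conversation into a summary and seed a fresh
    message list with it, so the carried-over state sits at the very
    start of the new context instead of decaying in the middle of a
    long one."""
    summary = summarize(history)
    return [{"role": "system",
             "content": "Context carried over from a previous session:\n"
                        + summary}]
```

In ChatGPT itself, the equivalent is manual: ask the model to summarize decisions and constraints, then paste that summary into a new chat.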
The Deprecation Treadmill
This is a developer problem, not a user problem, but it's significant enough to include.
OpenAI deprecates models. Regularly. The timelines are published — usually 6 to 12 months from announcement to shutdown — but the impact is real. If you've fine-tuned a model that gets deprecated, you need to re-fine-tune. If you've built prompts optimized for a specific model's behavior, those prompts may perform differently on the successor. If you've set quality benchmarks based on a model's output, the successor may not meet them in the same way.
The deprecation cadence is among the fastest of any AI provider. OpenAI ships new models roughly quarterly, deprecates old ones on a rolling basis, and expects developers to keep up. For startups that ship fast and iterate constantly, this is manageable. For enterprises with change management processes, compliance requirements, and regression testing pipelines, the deprecation treadmill is a genuine operational burden.
The Assistants API evolution is the sharpest example — v1 to v2 was a breaking migration that required code changes. The Responses API may eventually supersede parts of the Assistants API. Plugins were deprecated entirely. Building on OpenAI means accepting that the ground under your integration will shift, and building accordingly.
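One practical defense is to never scatter model IDs through a codebase. Centralize them, and pin dated snapshots rather than floating aliases, so a deprecation becomes a one-line change plus a regression run. The specific model names below are examples, not recommendations:

```python
# Single source of truth for model IDs. Pinning a dated snapshot
# (e.g. "gpt-4o-2024-08-06") instead of the floating "gpt-4o" alias
# means behavior changes only when you choose to migrate.
MODELS = {
    "chat": "gpt-4o-2024-08-06",
    "cheap": "gpt-4o-mini",
    "reasoning": "o3-mini",
}

def model_for(task):
    """Look up the pinned model ID for a task category."""
    return MODELS[task]
```

When a deprecation notice lands, you update one entry, rerun your evaluation suite against the successor, and ship.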
Speed vs. Quality: The Model Selection Problem
OpenAI offers multiple models — GPT-4o, GPT-4o mini, o1, o3-mini, and various specialized variants. Each has a different speed, capability, and cost profile. The right model for the task depends on the task, and OpenAI does a poor job of helping users make this choice.
GPT-4o is fast and broadly capable. It's the default in ChatGPT and the workhorse for most use cases. GPT-4o mini is faster and cheaper but less capable — useful for simple tasks, insufficient for complex ones. The o1/o3 series are slower and more expensive but better at reasoning. Choosing between them requires understanding your task's requirements, and most users don't have a framework for making that assessment.
The result is that most users default to GPT-4o for everything — overpaying for simple tasks and underperforming on complex reasoning tasks. The model selection problem is a UX failure. The capability differences are real and significant, but the product doesn't surface them in a way that helps users make informed choices.
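For API users, even a crude router beats defaulting to one model for everything. This is a toy heuristic; the task categories and the model choices are illustrative assumptions, not an OpenAI recommendation:

```python
def pick_model(task_type, needs_reasoning=False):
    """Route simple tasks to the cheap model, reasoning-heavy tasks to
    the slower reasoning model, and everything else to the capable
    default. Categories and model names are illustrative."""
    if needs_reasoning:
        return "o3-mini"      # slower and pricier, better multi-step reasoning
    if task_type in {"classify", "extract", "summarize_short"}:
        return "gpt-4o-mini"  # fast and cheap, sufficient for simple tasks
    return "gpt-4o"           # broadly capable default
```

Even this much encodes the speed/quality trade-off the product itself doesn't surface.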
The Honest Framing
Every limitation listed here exists, in some form, in every large language model. Claude hallucinates. Gemini has a knowledge cutoff. Llama degrades over long contexts. The safety over-refusal problem is industry-wide. Context window degradation is architectural, not a bug.
GPT's specific profile is: broader capability than most alternatives, with inconsistent reliability and a platform that evolves faster than most developers can track. The breadth is genuine — no other AI product offers the same range of features in one place. The inconsistency is also genuine — no other AI product produces such variable quality across the same task depending on phrasing, model choice, context length, and whether you happened to hit a good or bad generation.
The users who get the most value from GPT are the ones who understand these limitations well enough to work around them. They verify factual claims. They keep contexts focused. They choose the right model for the task. They don't trust the first generation for anything important. They treat GPT as a drafting tool, not a finishing tool. That's not a criticism of the product. It's a description of how the product actually works, which is more useful than a description of how the marketing says it works.
This is part of CustomClanker's GPT Deep Cuts series — what OpenAI's features actually do in practice.