September 2026: What Actually Changed in AI Tools
Fall conference season arrived, and with it the annual firehose of product announcements designed to make you feel like everything is changing all at once. Some of it is. Most of it is a slide deck with a waitlist attached. September's job is to separate the two — what actually shipped and works from what was a keynote demo that won't reach general availability until someone's Q1 roadmap gets around to it.
Here's the scorecard.
What Shipped (For Real)
OpenAI launched GPT-5 — and it's actually available. Not a research preview. Not a limited rollout for enterprise. GPT-5 hit ChatGPT Plus and the API in September, and the capability jump is genuine [VERIFY]. Reasoning tasks that consistently tripped up GPT-4o — multi-step logic, maintaining constraints across long outputs, catching its own errors mid-generation — are handled measurably better. The context window expanded to 256K tokens with less degradation at the edges [VERIFY]. The pricing is higher than GPT-4o's, which surprised nobody, and the rate limits are tighter at launch, which surprised nobody either. The model is real. The access constraints are also real. Both matter.
The more interesting development: GPT-5's tool use is noticeably more reliable. Function calling that used to require careful prompt engineering to avoid malformed JSON now just works most of the time. If you've built agents on the OpenAI API and spent the last year writing defensive JSON parsing around function calls, you can probably rip out 40% of that code. Probably. Test first.
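For a sense of what that defensive layer looks like, here's a minimal sketch in Python. It references no specific SDK; the function name and recovery steps are illustrative of the pattern, not anyone's published API.

```python
import json

def parse_tool_args(raw: str) -> dict:
    """Defensively parse function-call arguments from a model response.

    A sketch of the pattern, not any particular SDK: older models
    sometimes returned arguments wrapped in prose or markdown fences,
    so agent code grew recovery layers like this around every tool call.
    """
    try:
        return json.loads(raw)  # happy path: well-formed JSON
    except json.JSONDecodeError:
        pass

    # Common failure mode: valid JSON buried inside surrounding text.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(raw[start : end + 1])
        except json.JSONDecodeError:
            pass

    raise ValueError(f"unrecoverable tool arguments: {raw[:80]!r}")
```

If GPT-5's function calling is as reliable as it looks, the happy path at the top is the only part that survives. The rest is the 40%.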
Anthropic shipped Claude 4 Opus. The "big Claude" returned with a model that Anthropic is positioning as the quality ceiling — not the fastest, not the cheapest, but the most reliable for tasks where being wrong is expensive [VERIFY]. Early benchmarks show it trading blows with GPT-5 on reasoning while pulling ahead on instruction following and long-document analysis. The pricing reflects the positioning: this is not your autocomplete model. This is the model you use when the output matters enough to pay for it.
Claude Code got a corresponding upgrade, and the difference is less about raw capability and more about reduced failure modes. The edit-test-fix loop described in our Claude Code review runs tighter now — fewer confident-but-wrong first attempts, better recovery when the first attempt fails, and noticeably improved handling of large codebases where context management used to degrade. It's not a revolution. It's the tool getting 15% better at the thing it was already best at.
Meta released Llama 4 as open-weight. The open-source AI crowd got its next foundation model, and it's competitive enough to matter [VERIFY]. Llama 4's 70B parameter version matches or exceeds GPT-4o on most benchmarks, which means the floor for "what you can run locally or self-host" just rose significantly. The practical impact: companies that can't send data to external APIs now have a local option that doesn't require apologizing for the quality gap. The 400B+ version exists but requires hardware that most organizations don't have lying around.
For individual developers, Llama 4 means the Ollama-on-your-laptop experience went from "impressive party trick" to "genuinely useful for real work" — at least for the kinds of tasks where you'd previously have used GPT-4o via API. Running a competitive model locally with zero API costs and zero data exposure is no longer a compromise. It's a choice.
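If you want to kick the tires, Ollama exposes a local REST endpoint, so a smoke test is a few lines of Python. One assumption flagged loudly: the "llama4" model tag below is a guess at the pull name; substitute whatever `ollama list` actually shows you.

```python
import json
import urllib.request

# Assumes Ollama is running locally with a Llama 4 model pulled.
# The "llama4" tag is a guess; use the name `ollama list` reports.
payload = {
    "model": "llama4",
    "prompt": "Summarize the tradeoffs of running LLMs locally.",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```

No API key, no data leaving the machine. Whether the quality holds for your specific tasks is the part you still have to test.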
What Was Just a Demo
Google's "Project Astra" remained firmly in demo territory. The multimodal AI assistant that can see through your camera, understand your environment, and have a real-time conversation about it got another impressive stage demo at Made by Google [VERIFY]. It looked incredible. It is not available. The gap between "works on stage with controlled conditions" and "works in your hand with real lighting and background noise" is where most demo magic goes to die. Check back when there's a download link.
Microsoft's "Copilot Vision" was announced for Edge with no shipping date. The concept: Copilot can see your browser tab and discuss it with you. The reality: a blog post, a promotional video, and the words "coming soon" [VERIFY]. This is a feature that will probably ship eventually and probably work okay when it does. But announcing a browser feature without a browser you can install it in is just marketing wearing a product announcement costume.
Adobe's "Project Turntable" — the 3D object rotator powered by generative AI — got stage time at Adobe MAX. It looks magical: take a 2D illustration, rotate it in 3D space, and the AI maintains style consistency. It is not in any shipping Adobe product [VERIFY]. The demo drew genuine gasps. The "coming to Creative Cloud in 2027" footnote drew less attention. A year is a long time in generative AI. By the time this ships, the competition will have had four more release cycles to match or exceed it.
September Casualties
Notion AI got rolled back to a simpler feature set. After a year of expanding Notion AI into increasingly ambitious territory — AI-generated databases, automated workflows, "AI as your project manager" — Notion quietly trimmed the feature set back toward what people actually use: summarization, writing assistance, and Q&A over your workspace [VERIFY]. The ambitious features weren't bad. They were just unused. Notion discovered what every productivity tool discovers: users want AI to make their existing workflow faster, not to replace their existing workflow with a different one.
Character.ai lost several key engineers to Google DeepMind. The talent drain that started when Noam Shazeer returned to Google continued, with multiple senior researchers departing. Character.ai still has users — a lot of them — but the brain trust that built the thing is increasingly working somewhere else. The product continues to function. The question is whether it continues to improve.
Hugging Face shut down its inference API free tier. The change was announced as a "restructuring," but the effect is simple: if you were running models through Hugging Face's API without paying, that stopped working in September [VERIFY]. The free tier was always more generous than it was sustainable, and Hugging Face needs revenue like every other company that raised money at peak AI valuations. Open-source models are still free. Running them on someone else's hardware is not.
What Got Leapfrogged
Anthropic's Artifacts feature got outclassed by OpenAI's Canvas. Canvas shipped a September update that added real-time collaboration, version history, and the ability to fork outputs into multiple variations [VERIFY]. Artifacts — Claude's "generate a React component right in the chat" feature — was the original in this space, but it's still essentially a single-pane preview. Canvas now feels like a lightweight collaborative editor built into a chat interface, while Artifacts still feels like a demo window. The irony of OpenAI outshipping Anthropic on a feature category Anthropic created is not lost on anyone watching.
ElevenLabs' voice cloning got matched by an unexpected competitor: Sesame AI. Sesame shipped a voice model in September that produces more natural conversational speech — the pauses, the micro-hesitations, the way a real human says "um" when thinking — than anything ElevenLabs offers [VERIFY]. ElevenLabs still wins on raw voice-cloning fidelity and production audio. But for conversational AI, where naturalness matters more than polish, Sesame moved the bar in a way that ElevenLabs will need to respond to.
What AI Was Confidently Wrong About
AI-generated "best LLM" rankings continued to be circular and self-serving. Ask GPT-5 which model is best, and it recommends GPT-5. Ask Claude, and it hedges but steers toward Claude. Ask Gemini, and Google's models rank suspiciously well. None of this is intentional deception — it's training data bias — but it means that the most common way people choose AI tools (asking an AI tool) produces systematically biased results. The tool comparison industry exists because AI cannot be trusted to compare itself honestly. We have job security for exactly this reason.
Conference keynote claims about "10x developer productivity" went unquestioned by AI-generated summaries. Multiple AI-generated recaps of September conferences reported productivity claims from vendor keynotes as factual findings rather than marketing assertions. "According to Microsoft, Copilot makes developers 55% more productive" is a real sentence that appeared in AI-generated conference recaps without the context that this is a claim by the company selling the product, based on their own study, measuring their own metric. AI summarizers strip nuance from source material the way a juice extractor strips fiber from fruit. What comes out is smoother and less nutritious.
Perplexity cited a retracted paper as evidence in a medical query. A user reported Perplexity's research mode surfacing a study that had been retracted months earlier, presenting its findings as current evidence with a working citation link. The retraction notice was on the paper's page. Perplexity didn't check. This is the specific risk of AI research tools: they are excellent at finding sources and terrible at evaluating whether those sources are still considered valid by the communities that produced them.
Sleeper Pick: Simon Willison's LLM CLI Tool
In a month dominated by billion-dollar model launches, the most practically useful AI tool release might have been an update to Simon Willison's llm command-line tool [VERIFY]. It's a Unix-philosophy CLI for working with language models: pipe text in, get text out, works with any model provider, logs everything to a SQLite database. The September update added support for GPT-5 and Claude 4 on day one, plus a plugin for working with embeddings that makes semantic search a one-liner.
It's not glamorous. There's no UI, no waitlist, no keynote demo. It just works. You can pipe your git log into it and get a changelog. You can pipe error logs and get debugging suggestions. You can chain it with other CLI tools in ways that the big AI companies haven't thought of because they're too busy building walled gardens.
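The git-log-to-changelog trick is literally a pipe. Here's the same workflow driven from Python via subprocess, in case shell one-liners aren't your thing; the system prompt is mine, and the model is whatever your llm config defaults to.

```python
import subprocess

# Grab recent history, then pipe it into llm with a system prompt.
# `llm -s` sets the system prompt; the model comes from your llm
# configuration (`llm models default`), so none is hardcoded here.
log = subprocess.run(
    ["git", "log", "--oneline", "-20"],
    capture_output=True, text=True, check=True,
).stdout

changelog = subprocess.run(
    ["llm", "-s", "Turn this git log into a user-facing changelog."],
    input=log, capture_output=True, text=True, check=True,
).stdout
print(changelog)
```

Swap the first command for a git diff or a grep over your error logs and the second half doesn't change. That's the Unix-philosophy part doing its job.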
If you work in a terminal and you're not using llm, you're doing AI tool interaction the hard way. The tool exists. It costs nothing. It works with everything. September's biggest launches will get the attention. This will get the daily use.
Conference Season Scorecard
Who delivered substance:
- OpenAI (GPT-5 shipped and is available)
- Anthropic (Claude 4 shipped and is available)
- Meta (Llama 4 shipped and is downloadable)
Who delivered slides:
- Google (Project Astra is still a demo)
- Microsoft (Copilot Vision has no ship date)
- Adobe (Project Turntable is a 2027 promise)
The pattern is clear: companies that ship models delivered. Companies that ship platforms demoed. The model makers are in an execution race. The platform makers are in a marketing race. If you're trying to decide where to invest your time learning, bet on the things you can install today.
This is part of CustomClanker's Monthly Drops — what actually changed in AI tools this month.