The AI boom has been built on a basic assumption: bigger models are more powerful, and the most powerful models win. As costs rise, the industry is about to find out what happens when that assumption breaks down. Users are starting to reconsider smaller, cheaper models, and this cost-conscious model shopping could have significant implications.
Coinbase co-founder Brian Armstrong predicts that the vast majority of tasks will shift to 99% cheaper models within 12 to 18 months. He stated on X, "[D]emand for intelligence is near infinite, but 80% of workloads will be running on 99% cheaper models, while 20% will still run on the latest gen models."
If Armstrong's prediction holds, it would reshape the economics of AI. Until now, most AI companies have competed on quality, typically defaulting to the most advanced models available. If those same tasks can be handled by cheaper models without a loss in quality, the change could hit the finances of major labs like OpenAI and Anthropic just as they approach their IPOs.
Early tests suggest that, when orchestrated well, cheaper models can replace high-end models without sacrificing quality. A recent test by the legal AI tool Harvey showed that the company could reduce inference costs by 3x without compromising quality. The test, conducted in partnership with the inference platform Fireworks AI, combined Claude Opus with Fireworks’ GLM 5.1 and sent only the most intensive tasks to Opus, resulting in substantially lower server time and overall costs.
Gabe Pereyra, co-founder of Harvey, remarked, "Quality comes first, and in legal it always will, but the definition of quality is evolving from simply using the most powerful model for everything, to using the best model that gets the right answer most efficiently."
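For a rough sense of scale, Armstrong's split can be turned into a blended cost figure. The sketch below is back-of-the-envelope arithmetic, not anything Armstrong or Harvey published: it assumes "80% of workloads" means 80% of what the bill would have been at frontier prices, and the function name `blended_cost` is made up for illustration.

```python
def blended_cost(cheap_share: float, cheap_discount: float) -> float:
    """Return new spend as a fraction of an all-frontier baseline.

    cheap_share:    fraction of baseline spend moved to the cheaper model
                    (assumption: workload share equals share of spend).
    cheap_discount: how much cheaper that model is, e.g. 0.99 for "99% cheaper".
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if not 0.0 <= cheap_discount <= 1.0:
        raise ValueError("cheap_discount must be between 0 and 1")
    frontier_part = 1.0 - cheap_share                  # still billed at full price
    cheap_part = cheap_share * (1.0 - cheap_discount)  # billed at the discounted price
    return frontier_part + cheap_part


# Armstrong's scenario: 80% of workloads on 99% cheaper models.
ratio = blended_cost(0.80, 0.99)
print(f"Spend vs. baseline: {ratio:.3f}")    # 0.208
print(f"Cost reduction: {1 / ratio:.1f}x")   # 4.8x
```

Under that assumption, total spend falls to about a fifth of the baseline, and almost all of what remains is the 20% still running on frontier models. That puts Armstrong's scenario in the same range as the 3x reduction Harvey reported.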
This trend is often framed as a competition between major labs and Chinese or open-weight models, but that misses the larger point. The real divide is between large models and small ones. You can save money by switching from GPT-5.5 to DeepSeek’s V4 Flash, but switching to GPT-5.4-mini works just as well. There is an active price war between in-house inference from the big labs and independently served open-weight models, but for the larger question of small versus large, it doesn’t really matter which type of small model wins.
This may seem obvious (of course you shouldn’t use more compute than necessary), but it runs counter to the scaling-first approach that has dominated the industry until now. Inspired by the "bitter lesson," the observation that general methods that scale with compute tend to win out, labs have leaned hard into training the most compute-intensive models possible, pushing the frontier of what AI models can achieve. With prices heavily subsidized by investors, clients had no reason to choose anything but the most advanced option. Now, with token prices rising and subsidies slowing down, users are facing cost pressure for the first time.
It remains to be seen whether this cost pressure will actually drive enterprise users to smaller models. They could just as easily economize by making fewer calls, using less context, or simply giving up on the least promising deployments. But if it turns out that most deployments run just as well on a smaller model, that could dampen the growing demand for inference and raise new questions about how to justify the cost of training a frontier model.
Blogger's Review: This trend could mark a turning point for the AI industry. Will enterprises embrace smaller models while still demanding high performance? The effect of this shift on the business models of the major labs warrants close attention.