The gloves came off on Tuesday at VB Transform 2025 as alternative chip makers directly challenged Nvidia’s dominance narrative during a panel on inference, exposing a fundamental contradiction: How can AI inference be a commoditized “factory” and still command 70% gross margins?
Jonathan Ross, CEO of Groq, didn’t mince words when discussing Nvidia’s carefully crafted messaging. “AI factory is just a marketing way to make AI sound less scary,” Ross said during the panel. Sean Lie, CTO of rival chip maker Cerebras, was equally direct: “I don’t think Nvidia minds having all of the service providers fighting it out for every last penny while they’re sitting there comfortable with 70 points.”
Hundreds of billions in infrastructure investment and the future architecture of enterprise AI are at stake. For CISOs and AI leaders currently locked in weekly negotiations with OpenAI and other providers for more capacity, the panel exposed uncomfortable truths about why their AI initiatives keep hitting roadblocks.
The capacity crisis no one talks about
“Anyone who’s actually a big user of these gen AI models knows that you can go to OpenAI, or whoever it is, and they won’t actually be able to serve you enough tokens,” explained Dylan Patel, founder of SemiAnalysis. “There are weekly meetings between some of the biggest AI users and their model providers to try to persuade them to allocate more capacity. Then there’s weekly meetings between those model providers and their hardware providers.”
Panel participants also pointed to the token shortage as exposing a fundamental flaw in the factory analogy. Traditional manufacturing responds to demand signals by adding capacity. However, when enterprises require 10 times more inference capacity, they discover that the supply chain can’t flex. GPUs require two-year lead times. Data centers need permits and power agreements. The infrastructure wasn’t built for exponential scaling, forcing providers to ration access through API limits.
According to Patel, Anthropic jumped from $2 billion to $3 billion in ARR in just six months. Cursor went from essentially zero to $500 million ARR. OpenAI crossed $10 billion. Yet enterprises still can’t get the tokens they need.
Why ‘factory’ thinking breaks AI economics
Jensen Huang’s “AI factory” concept implies standardization, commoditization and efficiency gains that drive down costs. But the panel revealed three fundamental ways this metaphor breaks down:
First, inference isn’t uniform. “Even today, for inference of, say, DeepSeek, there’s a number of providers along the curve of sort of how fast they provide at what cost,” Patel noted. DeepSeek serves its own model at the lowest cost but only delivers 20 tokens per second. “Nobody wants to use a model at 20 tokens a second. I talk faster than 20 tokens a second.”
Second, quality varies wildly. Ross drew a historical parallel to Standard Oil: “When Standard Oil started, oil had varying quality. You could buy oil from one vendor and it might set your house on fire.” Today’s AI inference market faces similar quality variations, with providers using various techniques to reduce costs that inadvertently compromise output quality.
Third, and most critically, the economics are inverted. “One of the things that’s unusual about AI is that you can spend more to get better results,” Ross explained. “You can’t just have a software application, say, I’m going to spend twice as much to host my software, and applications can get better.”
When Ross mentioned that Mark Zuckerberg praised Groq for being “the only ones who launched it with the full quality,” he inadvertently revealed the industry’s quality crisis. This wasn’t just recognition. It was an indictment of every other provider cutting corners.
Ross spelled out the mechanics: “A lot of people do a lot of tricks to reduce the quality, not intentionally, but to lower their cost, improve their speed.” The techniques sound technical, but the impact is straightforward. Quantization reduces precision. Pruning removes parameters. Each optimization degrades model performance in ways enterprises may not detect until production fails.
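To make that trade-off concrete, here is a minimal, purely illustrative sketch in Python (an assumption for illustration, not any provider’s actual serving pipeline) of naive symmetric int8 quantization: the weights become smaller and faster to move, and every value picks up a rounding error that compounds across billions of parameters.

```python
# Illustrative sketch only: naive symmetric int8 quantization of a weight matrix,
# showing the precision that cheaper, faster inference trades away.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

# Map the float range onto [-127, 127] with a single scale factor.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 4x smaller than float32
restored = quantized.astype(np.float32) * scale

# The round-trip error is the per-weight "quality tax"; summed over billions of
# parameters and dozens of layers, it surfaces as degraded model output.
error = np.abs(weights - restored)
print(f"mean abs error: {error.mean():.2e}, max abs error: {error.max():.2e}")
```

Production serving stacks use far more careful schemes, such as per-channel scales and calibration data, which is precisely why output quality varies so much from one inference provider to the next.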
The Standard Oil parallel Ross drew illuminates the stakes. Providers betting that enterprises won’t notice the difference between 95% and 100% accuracy are betting against companies like Meta that have the sophistication to measure the degradation.
This creates immediate imperatives for enterprise buyers.
- Establish quality benchmarks before selecting providers.
- Audit existing inference partners for undisclosed optimizations.
- Accept that premium pricing for full model fidelity is now a permanent market feature. The era of assuming functional equivalence across inference providers ended when Zuckerberg called out the difference.
The $1-per-million-tokens paradox
The most revealing moment came when the panel discussed pricing. Lie highlighted an uncomfortable truth for the industry: “If these million tokens are as valuable as we believe they can be, right? That’s not about moving words. You don’t charge $1 for moving words. I pay my lawyer $800 for an hour to write a two-page memo.”
This observation cuts to the heart of AI’s price discovery problem. The industry is racing to drive token costs below $1.50 per million while claiming those same tokens will transform every aspect of business. The panel implicitly agreed that the math doesn’t add up.
“Pretty much everyone is spending, like all of these fast-growing startups, the amount that they’re spending on tokens as a service almost matches their revenue one to one,” Ross revealed. This 1:1 spend ratio on AI tokens versus revenue represents an unsustainable business model that panel participants contend the “factory” narrative conveniently ignores.
Performance changes everything
Cerebras and Groq aren’t just competing on price; they are also competing on performance. They’re fundamentally changing what is possible in terms of inference speed. “With the wafer scale technology that we’ve built, we’re enabling 10 times, sometimes 50 times, faster performance than even the fastest GPUs today,” Lie said.
This isn’t an incremental improvement. It’s enabling entirely new use cases. “We have customers who have agentic workflows that might take 40 minutes, and they want these things to run in real time,” Lie explained. “These things just aren’t even possible, even if you’re willing to pay top dollar.”
The speed differential creates a bifurcated market that defies factory standardization. Enterprises needing real-time inference for customer-facing applications can’t use the same infrastructure as those running overnight batch processes.
The real bottleneck: power and data centers
While everyone focuses on chip supply, the panel revealed the actual constraint throttling AI deployment. “Data center capacity is a big problem. You can’t really find data center space in the U.S.,” Patel said. “Power is a big problem.”
The infrastructure challenge goes beyond chip manufacturing to fundamental resource constraints. As Patel explained, “TSMC in Taiwan is able to make over $200 million worth of chips, right? It’s not even… it’s the speed at which they scale up is ridiculous.”
But chip production means nothing without infrastructure. “The reason we see these big Middle East deals, and partially why both of these companies have big presences in the Middle East is, it’s power,” Patel revealed. The global scramble for compute has enterprises “going across the world to get wherever power does exist, wherever data center capacity exists, wherever there are electricians who can build these electrical systems.”
Google’s ‘success disaster’ becomes everyone’s reality
Ross shared a telling anecdote from Google’s history: “There was a term that became very popular at Google in 2015 called Success Disaster. Some of the teams had built AI applications that began to work better than human beings for the first time, and the demand for compute was so high, they were going to need to double or triple the global data center footprint quickly.”
This pattern now repeats across every enterprise AI deployment. Applications either fail to gain traction or experience hockey stick growth that immediately hits infrastructure limits. There’s no middle ground, no smooth scaling curve that factory economics would predict.
What this means for enterprise AI strategy
For CIOs, CISOs and AI leaders, the panel’s revelations demand strategic recalibration:
Capacity planning requires new models. Traditional IT forecasting assumes linear growth. AI workloads break this assumption. When successful applications increase token consumption by 30% monthly, annual capacity plans become obsolete within quarters. Enterprises must shift from static procurement cycles to dynamic capacity management. Build contracts with burst provisions. Monitor usage weekly, not quarterly. Accept that AI scaling resembles viral adoption curves, not traditional enterprise software rollouts, as the back-of-envelope sketch below shows.
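A quick illustration of that compounding (the 3x headroom figure is an assumption for illustration, not a number from the panel): 30% month-over-month growth overruns a generously sized static plan in a matter of months.

```python
# Back-of-envelope sketch: 30% month-over-month token growth vs. a static annual plan
# sized at 3x current consumption (an illustrative assumption).
monthly_growth = 0.30
tokens = 1.0          # normalized current monthly token consumption
plan_headroom = 3.0   # capacity procured up front for the year

for month in range(1, 13):
    tokens *= 1 + monthly_growth
    if tokens > plan_headroom:
        print(f"Plan exhausted in month {month} at {tokens:.1f}x baseline")
        break

# Compounded over a full year, 30% monthly growth is roughly a 23x increase.
print(f"Year-end consumption: {1.3 ** 12:.0f}x baseline")
```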
Speed premiums are permanent. The idea that inference will commoditize to uniform pricing ignores the massive performance gaps between providers. Enterprises need to budget for speed where it matters.
Architecture beats optimization. Groq and Cerebras aren’t winning by doing GPUs better. They’re winning by rethinking the fundamental architecture of AI compute. Enterprises that bet everything on GPU-based infrastructure may find themselves stuck in the slow lane.
Power infrastructure is strategic. The constraint isn’t chips or software but kilowatts and cooling. Smart enterprises are already locking in power capacity and data center space for 2026 and beyond.
The infrastructure reality enterprises can’t ignore
The panel revealed a fundamental truth: the AI factory metaphor isn’t just wrong, it’s dangerous. Enterprises building strategies around commodity inference pricing and standardized delivery are planning for a market that doesn’t exist.
The real market operates on three brutal realities.
- Capacity scarcity creates power inversions, where suppliers dictate terms and enterprises beg for allocations.
- Quality variance, the difference between 95% and 100% accuracy, determines whether your AI applications succeed or catastrophically fail.
- Infrastructure constraints, not technology, set the binding limits on AI transformation.
The path forward for CISOs and AI leaders requires abandoning factory thinking entirely. Lock in power capacity now. Audit inference providers for hidden quality degradation. Build vendor relationships based on architectural advantages, not marginal cost savings. Most critically, accept that paying 70% margins for reliable, high-quality inference may be your smartest investment.
The alternative chip makers at Transform didn’t just challenge Nvidia’s narrative. They revealed that enterprises face a choice: pay for quality and performance, or join the weekly negotiation meetings. The panel’s consensus was clear: success requires matching specific workloads to appropriate infrastructure rather than pursuing one-size-fits-all solutions.
