
Inference is often treated as a single step in the AI pipeline, but it is actually two distinct workloads, according to Shar Narasimhan, director of product in Nvidia's Data Center group: the context (or prefill) phase and the decode phase, each with different demands on the underlying AI infrastructure.
The prefill phase is compute-intensive, while the decode phase is memory-intensive, yet until now a single GPU has been asked to handle both even though it really does only one of those tasks well. The Rubin CPX has been engineered to improve memory performance, Narasimhan said.
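To make the distinction concrete, here is a minimal sketch of the two phases in autoregressive inference. It is illustrative only, using a toy single-head attention layer in numpy with made-up dimensions rather than any Nvidia API or real model: prefill processes the entire prompt in one large, compute-bound matrix pass, while decode emits one token at a time and must re-read the ever-growing key/value cache at every step, which makes it bandwidth-bound.

```python
import numpy as np

# Toy dimensions, chosen only for illustration.
D_MODEL = 64        # hidden size
PROMPT_LEN = 512    # tokens in the input prompt (context)
GEN_LEN = 4         # tokens to generate

rng = np.random.default_rng(0)
w_q = rng.standard_normal((D_MODEL, D_MODEL))
w_k = rng.standard_normal((D_MODEL, D_MODEL))
w_v = rng.standard_normal((D_MODEL, D_MODEL))

def attention(q, k, v):
    """Single-head scaled dot-product attention (causal mask omitted)."""
    scores = q @ k.T / np.sqrt(D_MODEL)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# --- Prefill (compute-bound): the whole prompt is processed in one
# large matmul pass, and its keys/values are computed once and cached.
prompt = rng.standard_normal((PROMPT_LEN, D_MODEL))
k_cache = prompt @ w_k
v_cache = prompt @ w_v
_ = attention(prompt @ w_q, k_cache, v_cache)

# --- Decode (memory-bound): each step does a small amount of math for
# one new token but must re-read the entire, ever-growing KV cache,
# so memory bandwidth, not FLOPs, dominates.
token = rng.standard_normal((1, D_MODEL))
for _ in range(GEN_LEN):
    k_cache = np.vstack([k_cache, token @ w_k])  # cache grows every step
    v_cache = np.vstack([v_cache, token @ w_v])
    token = attention(token @ w_q, k_cache, v_cache)
```

The asymmetry is visible even in this toy: prefill is one big matrix multiplication, while each decode step streams the full cache through memory to produce a single token, which is why one-size-fits-all hardware leaves performance on the table for one phase or the other.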
The Rubin CPX is thus purpose-built with both phases in mind, offering processing power as well as high throughput and efficiency. "It will dramatically increase the productivity and performance of AI factories," said Narasimhan. It achieves this through massive token generation: tokens are the units of work in AI, particularly generative AI, so the more tokens generated, the more revenue generated.
Nvidia is also announcing a new Vera Rubin NVL144 CPX rack, offering 7.5 times the performance of an NVL72, the current top-of-the-line system. Narasimhan said the NVL144 CPX enables AI service providers to dramatically increase their profitability, delivering $5 billion in revenue for every $100 million invested in infrastructure, a 50x return.
Rubin CPX is offered in multiple configurations, including the Vera Rubin NVL144 CPX, which can be combined with the Quantum-X800 InfiniBand scale-out compute fabric or with the Spectrum-X Ethernet networking platform with Nvidia Spectrum-XGS Ethernet technology and Nvidia ConnectX-9 SuperNICs.
Nvidia Rubin CPX is expected to be available at the end of 2026.