
Nvidia is planning a new class of GPU called Rubin CPX, designed specifically for the compute-heavy analysis phase in AI models. The strategy, known as split inference, is backed by new benchmark records from Nvidia’s Blackwell Ultra architecture, which uses a similar approach in software.


When an AI model tackles a complex task, it typically works in two phases. For example, if you ask a language model to summarize a long book, it first has to read and analyze the whole text. This is the compute-intensive "analysis" or "context phase." Only after that does the "generation phase" begin, where the summary is created word by word. Nvidia argues that these two phases have different hardware requirements, and Rubin CPX is its new chip built specifically for the demanding analysis phase.
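The two phases can be illustrated with a toy sketch. This is not how any real model or Nvidia software is implemented: `KVCache`, `context_phase`, and `generation_phase` are hypothetical names, and the "tokens" are placeholder strings. The sketch only shows the shape of the work: the context phase sees the whole prompt at once, while the generation phase produces one token per step and rereads the accumulated state each time.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Hypothetical stand-in for the per-token state kept after the context phase."""
    entries: list = field(default_factory=list)


def context_phase(prompt_tokens):
    """Process the whole prompt in a single pass and build the cache."""
    cache = KVCache()
    # All prompt tokens are known up front, so real hardware can work on
    # them in parallel. Here we simply store them.
    cache.entries.extend(prompt_tokens)
    return cache


def generation_phase(cache, max_new_tokens):
    """Produce output one token at a time, extending the cache each step."""
    output = []
    for _ in range(max_new_tokens):
        # Each step depends on everything in the cache so far and yields
        # exactly one new (placeholder) token.
        next_token = f"tok{len(cache.entries)}"
        cache.entries.append(next_token)
        output.append(next_token)
    return output


cache = context_phase(["a", "b", "c"])
summary_tokens = generation_phase(cache, max_new_tokens=2)
```

In this toy version the context phase is one call over the full input, while the generation phase is a sequential loop, which is the asymmetry Nvidia's argument rests on.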

According to an official announcement, Rubin CPX is a derivative of the Rubin product line, which is planned for 2026, and is itself set to launch at the end of 2026. It will be available as a plug-in card or a standalone computer for data centers.

A specialized chip for massive context windows

Nvidia says Rubin CPX will shine on tasks that need to process huge amounts of data at once. This includes AI applications that require millions of tokens as context, like analyzing entire codebases or generating videos. CEO Jensen Huang calls CPX the first CUDA GPU built specifically for "massive-context AI."


The technical foundation for this is Nvidia’s "disaggregated inference" strategy. As detailed in a technical blog post, the compute-bound context phase places different demands on hardware than the memory bandwidth-bound generation phase that follows. Rubin CPX will use a monolithic die design, deliver 30 petaFLOPs of NVFP4 compute, pack 128 GB of GDDR7 memory, and offer triple the attention layer acceleration compared to Blackwell.
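The compute-bound versus bandwidth-bound distinction can be made concrete with a rough roofline-style estimate. The sketch below is a simplification under stated assumptions: it counts about 2 FLOPs per weight per token, assumes the weights are read from memory once per pass, and ignores attention and cache traffic. The peak compute and bandwidth figures are hypothetical placeholders, not specifications of Rubin CPX or any other chip.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_weight):
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs (multiply + add) per weight per token and that
    weights are read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


def limiting_resource(intensity, peak_flops, mem_bandwidth):
    """Classify a workload against the hardware's ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical hardware: 1e15 FLOP/s peak, 1e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 1e12

# Context phase: many prompt tokens share one read of the weights.
context = arithmetic_intensity(tokens_per_pass=100_000, bytes_per_weight=0.5)
# Generation phase: one new token per pass over the same weights.
generation = arithmetic_intensity(tokens_per_pass=1, bytes_per_weight=0.5)

print(limiting_resource(context, PEAK_FLOPS, MEM_BANDWIDTH))
print(limiting_resource(generation, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the context pass does far more arithmetic per byte moved than a single generation step, which is why the two phases stress different parts of the hardware.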

Blackwell results back up the approach

Nvidia points to recent benchmark results to support its strategy. In MLPerf Inference v5.1, the latest round of the industry-standard benchmark, Nvidia submitted Blackwell Ultra results for the first time and set new records.

Blackwell Ultra, running in the GB300 NVL72 rack system, delivered up to 45 percent higher performance per GPU than standard Blackwell, according to Nvidia. Throughput on the new DeepSeek-R1 benchmark is reportedly five times higher than on the previous Hopper architecture.

Nvidia notes that these tests already used a software approach to split inference. For the Llama 3.1 405B model, "disaggregated serving" was used to divide the context and generation phases across different GPUs. This method, managed by the Nvidia Dynamo Framework, boosted per-GPU throughput by nearly 1.5x compared to traditional methods. With Rubin CPX, this software approach gets a dedicated hardware solution. Nvidia says partners like Cursor (code editor), Runway (video AI), and Magic (AI agents) are already evaluating the technology for their own use cases.
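As a rough illustration of what dividing the two phases across different GPUs means, the sketch below assigns each request a GPU from a context pool and another from a generation pool. It does not use or represent the Nvidia Dynamo API: `DisaggregatedRouter`, the GPU labels, and the round-robin policy are all assumptions made for this example.

```python
from itertools import cycle


class DisaggregatedRouter:
    """Hypothetical router that keeps separate GPU pools for each phase."""

    def __init__(self, context_gpus, generation_gpus):
        # Round-robin over each pool independently, so the pools can be
        # sized differently.
        self._context = cycle(context_gpus)
        self._generation = cycle(generation_gpus)

    def plan(self, request_id):
        """Return which GPU handles each phase of one request."""
        return {
            "request": request_id,
            "context_gpu": next(self._context),
            "generation_gpu": next(self._generation),
        }


router = DisaggregatedRouter(["ctx0", "ctx1"], ["gen0", "gen1", "gen2"])
plans = [router.plan(i) for i in range(3)]
```

A real system would also have to transfer the context-phase state to the generation GPU and balance load dynamically. The sketch leaves both out.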

Summary
  • Nvidia has announced Rubin CPX, a new GPU designed for the compute-heavy "analysis phase" of AI tasks, aiming to optimize workloads like processing millions of tokens for applications such as code analysis and video generation.
  • Rubin CPX, set for release at the end of 2026, features a monolithic die, 128 GB of GDDR7 memory, and triple the attention layer acceleration of Blackwell. It is the hardware piece of Nvidia’s "disaggregated inference" strategy, which separates the compute-bound analysis phase from the memory bandwidth-bound generation phase.
  • Recent MLPerf Inference v5.1 results for Nvidia’s Blackwell Ultra architecture show up to 45 percent higher performance per GPU than standard Blackwell and, on the DeepSeek-R1 benchmark, five times the throughput of Hopper. Split inference in software was already used for the Llama 3.1 405B model, supporting Nvidia’s approach as partners in the code, video, and AI agent sectors begin evaluating the technology.
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.