SemiAnalysis details NVIDIA's new chip "Rubin CPX": completely changing the inference architecture and reshaping the industry roadmap

Wallstreetcn
2025.09.16 13:13

The SemiAnalysis report argues that the launch of Rubin CPX is second in importance only to that of the GB200 NVL72 Oberon rack-level form factor in March 2024. By specializing for the prefill stage, the chip emphasizes compute FLOPS rather than memory bandwidth. This may dent HBM demand while driving a surge in GDDR7 demand, with Samsung the biggest beneficiary. In addition, NVIDIA's competitors may have to reconfigure their entire roadmaps once again, just as the Oberon architecture reshaped the industry's roadmap.

With the "inference era" of large AI models now fully underway, NVIDIA has launched the Rubin CPX GPU. Research firm SemiAnalysis believes the GPU may transform the inference field, calling its release second in significance only to the GB200 NVL72 rack of March 2024.

Citigroup, for its part, recently published a notable research report stating that the Rubin CPX GPU, which NVIDIA unveiled at the AI Infrastructure Summit, is designed for long-context inference and is expected to deliver a striking return on investment of roughly 50x, far above the roughly 10x of the earlier GB200 NVL72.

The release is not just a step forward for NVIDIA; it reshapes the entire industry's roadmap. As the SemiAnalysis report emphasizes, the significance of the Rubin CPX launch is second only to that of the GB200 NVL72 Oberon rack-level form factor in March 2024. By specializing for the prefill stage and emphasizing compute FLOPS over memory bandwidth, the chip revolutionizes disaggregated inference serving.

The release will also force all of NVIDIA's competitors to redraw their roadmaps. AMD and the ASIC vendors had already poured significant resources into catching up with NVIDIA's rack-scale solutions; now they must additionally develop their own prefill chips, further delaying the day they close the gap with NVIDIA.

The SemiAnalysis report examines the Rubin CPX in detail, showing how the chip reshapes the industry roadmap by optimizing for the different stages of inference. The key points follow:

Breaking Through Memory Wall Limitations: Dedicated Chip Architecture Design

According to SemiAnalysis, the core idea behind NVIDIA's launch of the Rubin CPX is to decouple the inference process into two stages: "Prefill" and "Decode," and to design specialized hardware for each stage.

The report points out that the prefill stage of an LLM request (producing the first token) is typically compute-intensive (FLOPS-bound) but makes little use of memory bandwidth.

Although HBM is extremely valuable for both training and inference, how efficiently it is used varies sharply across the stages of inference: it delivers high value only in the decode step. Using chips fitted with expensive HBM for prefill is therefore a waste of resources.

The Rubin CPX was born to address this pain point: it "slims down" memory bandwidth and emphasizes compute FLOPS instead. The Rubin CPX delivers 20 PFLOPS of dense FP4 compute but carries only 2TB/s of memory bandwidth and 128GB of GDDR7 memory. By comparison, the dual-die R200 offers 33.3 PFLOPS of dense FP4 compute, 20.5TB/s of memory bandwidth, and 288GB of HBM.
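
A rough way to see this division of labor is to compare each chip's machine balance (peak FLOPS divided by memory bandwidth) with the arithmetic intensity of the two inference stages. The sketch below uses only the spec figures quoted above; the per-stage intensity reasoning in the comments is a simplification for a dense model.

```python
# Sketch: machine balance (peak FLOPs per byte of memory bandwidth),
# computed from the spec figures quoted above.

PFLOPS, TBps = 1e15, 1e12

chips = {
    "Rubin CPX": (20 * PFLOPS, 2 * TBps),      # GDDR7, prefill-oriented
    "R200":      (33.3 * PFLOPS, 20.5 * TBps), # HBM, decode-oriented
}

for name, (flops, bw) in chips.items():
    print(f"{name}: ~{flops / bw:,.0f} FLOPs per byte of bandwidth")

# Decode reads every weight byte for only ~2 FLOPs of work per token, so
# it needs a low balance point (lots of bandwidth). Prefill reuses each
# weight byte across thousands of prompt tokens, so it can keep the CPX's
# ~10,000 FLOPs/byte balance busy while the R200's bandwidth sits idle.
```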

This translates into a major gain in cost-effectiveness. SemiAnalysis reports that switching from HBM to cheaper GDDR7 memory cuts the cost per GB by more than 50%. During the prefill phase, the Rubin CPX can therefore deliver efficient compute at far lower cost than the R200, significantly reducing total cost of ownership (TCO).
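
Putting rough numbers on the memory bill makes the TCO argument concrete. The per-GB prices below are placeholders; the only figure taken from the report is that GDDR7 costs less than half as much per GB as HBM.

```python
# Sketch of the memory-cost gap, using the capacities quoted above.
# The HBM price is an illustrative placeholder, not a report figure.

hbm_price_per_gb = 10.0                      # assumed, for illustration
gddr7_price_per_gb = hbm_price_per_gb * 0.5  # "over 50% cheaper per GB"

r200_mem_cost = 288 * hbm_price_per_gb   # 288 GB of HBM
cpx_mem_cost = 128 * gddr7_price_per_gb  # 128 GB of GDDR7

print(f"R200 memory bill:      ${r200_mem_cost:,.0f}")
print(f"Rubin CPX memory bill: ${cpx_mem_cost:,.0f}")
print(f"CPX memory spend is {cpx_mem_cost / r200_mem_cost:.0%} of the R200's")
```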

SemiAnalysis notes that the chip's design resembles a next-generation successor to the RTX 5090 or RTX PRO 6000 Blackwell: a large monolithic die with a 512-bit GDDR7 memory interface. But whereas the consumer Blackwell die delivers only about 20% of the FLOPS of its HBM-equipped datacenter counterpart, the Rubin CPX reaches roughly 60% of the R200's, since it is a standalone die design much closer to the R200 compute die.

New Rack-Level Architecture: Three Deployment Options

NVIDIA has launched three Vera Rubin rack configurations: VR200 NVL144 (Rubin only), VR200 NVL144 CPX (Rubin + Rubin CPX hybrid), and the Vera Rubin CPX dual rack solution. Specifically:

  • NVL144 CPX Rack: NVIDIA has introduced the VR NVL144 CPX (Vera Rubin NVL144 CPX) rack, which pairs Rubin GPUs with Rubin CPX GPUs. Each compute tray holds 4 R200 GPUs (for decode) and 8 Rubin CPX GPUs (for prefill). This heterogeneous configuration lets the system serve both stages of inference efficiently at once.
  • Dual Rack Solution: The Vera Rubin CPX dual-rack solution offers greater flexibility, letting customers deploy VR NVL144 (pure Rubin GPU) racks and VR CPX (pure Rubin CPX GPU) racks separately according to their workloads, precisely tuning the prefill-to-decode ratio (PD ratio); a sizing sketch follows this list.
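
How an operator might choose that PD ratio can be sketched from the traffic mix: prefill demand scales with prompt tokens, decode demand with output tokens. Every number below (token counts, per-GPU throughputs, request rate) is an assumed workload, not a figure from the report.

```python
# Sketch: sizing the prefill-to-decode GPU ratio (PD ratio) for a workload.
# All traffic and throughput numbers are illustrative assumptions.

avg_prompt_tokens = 32_000  # long-context requests
avg_output_tokens = 1_000

prefill_tput = 40_000  # prompt tokens/s one prefill GPU ingests (assumed)
decode_tput = 2_000    # output tokens/s one decode GPU emits (assumed)

requests_per_s = 10.0

prefill_gpus = requests_per_s * avg_prompt_tokens / prefill_tput
decode_gpus = requests_per_s * avg_output_tokens / decode_tput

print(f"prefill GPUs needed: {prefill_gpus:.1f}")
print(f"decode GPUs needed:  {decode_gpus:.1f}")
print(f"PD ratio ~ {prefill_gpus / decode_gpus:.1f} : 1")

# The NVL144 CPX tray fixes the ratio at 8 CPX : 4 R200; the dual-rack
# option lets operators match it to measured traffic instead.
```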

SemiAnalysis offers a detailed analysis of the innovations in cable design. Because the high-density layout leaves no room for cable routing, NVIDIA uses PCB midplanes and Amphenol Paladin board-to-board connectors for signal transmission. The CX-9 NIC has moved from the rear half of the chassis to the front half, shortening the runs for 200G Ethernet/InfiniBand signals, while the lower-speed PCIe Gen6 signals take the longer runs, improving reliability and maintainability.

Cooling uses a layered liquid-cooling design: the Rubin CPX and the CX-9 NICs are stacked so that they share a single cold plate, maximizing GPU density and cooling efficiency within the 1U tray. NVIDIA used a similar arrangement in the GTX 295 back in 2009.

Prefill Pipeline Parallelism: The Key to Efficient Resource Utilization

Another significant advantage of the Rubin CPX is its optimization for pipeline parallelism during prefill.

  • Reduced Network Costs: Communication demands during prefill are modest, so the Rubin CPX can forgo an expensive high-speed scale-up network such as NVLink. The bandwidth of PCIe Gen6 x16 (roughly 1 Tbit/s) is sufficient for the prefill needs of modern MoE LLMs (see the bandwidth sketch after this list).
  • Higher Throughput: Pipeline parallelism yields higher token throughput per GPU, because it needs only simple send and receive operations rather than the all-to-all collectives of expert parallelism (EP).
  • Significant TCO Savings: NVLink scale-up networking costs about $8,000 per GPU, more than 10% of total cluster cost. By avoiding this expensive networking gear, the Rubin CPX delivers substantial savings to end users.
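
One way to sanity-check the claim that PCIe suffices is to time the handoff of a long prompt's KV cache from a prefill GPU to a decode GPU over a ~1 Tbit/s link. The per-token KV-cache size below is an assumption for a large MoE model, not a report figure.

```python
# Sketch: can PCIe Gen6 x16 (~1 Tbit/s, per the article) carry the
# prefill-to-decode KV-cache handoff? The KV sizing is an assumption.

link_bytes_per_s = 1e12 / 8  # ~1 Tbit/s unidirectional -> bytes/s

kv_bytes_per_token = 160e3   # assumed ~160 KB/token for a large MoE model
prompt_tokens = 128_000      # a long-context request

kv_cache_bytes = kv_bytes_per_token * prompt_tokens
transfer_s = kv_cache_bytes / link_bytes_per_s

print(f"KV cache: {kv_cache_bytes / 1e9:.1f} GB, "
      f"handoff ~ {transfer_s:.2f} s over PCIe Gen6 x16")

# ~20 GB moves in well under a second, small next to the multi-second
# prefill of a 128k prompt, so the link is not the bottleneck.
```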

Technical Breakthroughs in Disaggregated Inference Serving

SemiAnalysis recounts that the industry's first attempt to tame interference between the two workloads was to route prefill and decode requests to different compute units. That approach makes service-level agreements (SLAs) easier to manage, but it still suffers from a mismatch problem: pure prefill almost always leaves memory bandwidth severely underused.

SemiAnalysis emphasizes that processing an LLM request has two stages: prefill determines time to first token (TTFT) and is usually compute-bound, while decode determines time per output token (TPOT) and is essentially always memory-bandwidth-bound.
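
The two SLA metrics compose into end-to-end latency in a simple way; the sketch below merely encodes the definitions, with assumed stage timings.

```python
# Sketch: how TTFT and TPOT compose into request latency.
# The stage timings are illustrative assumptions.

def request_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """TTFT covers prefill plus the first token; each further token costs one TPOT."""
    return ttft_s + tpot_s * (output_tokens - 1)

ttft = 2.0   # assumed prefill-dominated time to first token (compute-bound)
tpot = 0.02  # assumed per-token decode step (memory-bandwidth-bound)

print(f"latency for 500 output tokens: {request_latency_s(ttft, tpot, 500):.1f} s")
```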

The analysis shows that once sequence length exceeds 32k, FLOPS utilization reaches 100% while memory bandwidth utilization falls away. Running pure prefill on an R200 wastes $0.90 per hour of total cost of ownership; the Rubin CPX sharply reduces that waste by using lower-cost memory.
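
A toy roofline-style time model reproduces that crossover. The R200 specs are the figures quoted earlier; the MoE shape (about 1 TB of weights streamed per pass, about 25B active parameters per token) is an assumption tuned to land the crossover near 32k.

```python
# Sketch: FLOPS vs. bandwidth utilization of prefill as prompts grow,
# on the R200's quoted specs. The model shape is an assumption.

peak_flops = 33.3e15  # dense FP4 FLOPs/s
peak_bw = 20.5e12     # bytes/s

weight_bytes = 1e12   # assumed: all expert weights streamed once per pass
active_params = 25e9  # assumed active parameters per token

t_memory = weight_bytes / peak_bw  # fixed cost of reading the weights

for seq_len in (1_000, 8_000, 32_000, 128_000):
    t_compute = 2 * active_params * seq_len / peak_flops
    t = max(t_compute, t_memory)
    print(f"S={seq_len:>7,}: FLOPS util {t_compute / t:4.0%}, "
          f"BW util {t_memory / t:4.0%}")

# Past ~32k tokens prefill saturates FLOPS while the bandwidth goes idle,
# which is exactly the stranded-HBM cost the Rubin CPX is built to avoid.
```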

In pipeline-parallel inference, the Rubin CPX's PCIe Gen6 x16 interface provides roughly 1 Tbit/s of unidirectional bandwidth, enough to handle the prefill work of modern frontier MoE LLMs. The Rubin CPX offers large memory capacity but uses "lower-grade" GDDR7 memory, which costs less than half as much per GB as HBM. From the memory suppliers' perspective, GDDR7 also carries lower margins, because the technical bar is lower and competition fiercer (Samsung, for example, can supply it).

Will HBM Demand Decline? Will the Overall Memory Market Grow?

CPX-based systems reduce HBM's share of total system cost: of every dollar spent on a VR200 NVL144 CPX or VR CPX rack, a smaller fraction goes to HBM than in a standalone VR200 NVL144 rack. Holding AI system spending fixed, HBM demand per dollar spent therefore falls.
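
The dilution is simple arithmetic once you assign spend shares; the rack costs and the HBM share below are illustrative placeholders, not report figures.

```python
# Sketch: how adding CPX racks dilutes HBM's share of total AI spend.
# All cost figures and shares are illustrative placeholders.

vr200_rack_cost = 1.00  # normalized cost of a VR200 NVL144 rack
vr200_hbm_share = 0.50  # assumed: half of an HBM rack's cost is HBM

cpx_rack_cost = 0.60    # assumed cheaper all-GDDR7 VR CPX rack
                        # (carries no HBM at all)

pure = vr200_hbm_share
mixed = (vr200_rack_cost * vr200_hbm_share) / (vr200_rack_cost + cpx_rack_cost)

print(f"HBM share of spend, pure VR200 racks:   {pure:.0%}")
print(f"HBM share of spend, VR200 + CPX racks:  {mixed:.0%}")
```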

At the same time, SemiAnalysis argues that although the Rubin CPX architecture trims HBM content per system dollar, it may actually expand the overall memory market and reshape the GDDR7 supply chain landscape.

The technical reality is more nuanced. The mechanism of the Rubin CPX is to lower the cost of prefill, and hence of tokens. When token costs fall, demand rises, which lifts decode demand in turn. As with many cost-reducing innovations, demand growth often outpaces the cost decline, ultimately expanding the overall market.
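
That argument is just price elasticity of demand: if a 1% price cut raises demand by more than 1%, total spend grows as prices fall. The elasticity value below is an assumption for illustration, not a report figure.

```python
# Sketch: Jevons-style arithmetic behind "cheaper tokens expand the market".
# The elasticity value is an assumption.

elasticity = 1.5  # assumed: demand rises 1.5% for every 1% price cut
price_cut = 0.40  # tokens become 40% cheaper

new_price = 1 - price_cut
new_demand = new_price ** (-elasticity)  # constant-elasticity demand curve
new_spend = new_price * new_demand

print(f"demand: x{new_demand:.2f}, total spend: x{new_spend:.2f}")

# With elasticity above 1, a 40% price cut lifts demand ~2.15x and total
# spend ~1.29x, so cheaper prefill can enlarge the memory market overall.
```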

The surge in demand for GDDR7 driven by Rubin CPX is reshaping the memory supply chain landscape, and its effects are already beginning to manifest. Notably, the RTX Pro 6000 also uses GDDR7 memory, but at a lower speed of 28 Gbps. NVIDIA has already placed large-scale supply chain orders for the RTX Pro SKU.

In this surge of GDDR7 demand, Samsung has emerged as the biggest beneficiary. Because it could absorb NVIDIA's sudden influx of high-volume orders, those orders have flowed primarily to Samsung. SK Hynix and Micron, by contrast, could not meet the demand, largely because their wafer capacity is tied up by HBM orders and other business.

Competitors Left Far Behind

The SemiAnalysis report states that the introduction of the Rubin CPX has widened the gap in rack-system design capability between NVIDIA and its competitors into a chasm.

All of NVIDIA's competitors may have to reconfigure their entire roadmaps yet again, just as the Oberon architecture changed the whole industry's roadmap. They will need to invest anew in developing their own prefill chips, further delaying their timelines for narrowing the gap with NVIDIA.

SemiAnalysis believes that Google's TPU, whose 3D torus scale-up network supports clusters of up to 9,216 TPUs, should develop a dedicated prefill chip to preserve its cost-performance edge. AMD's catch-up strategy faces serious challenges: the 72-GPU MI400 rack-scale system was originally expected to compete with the VR200 NVL144 on TCO, but NVIDIA has raised the VR200's memory bandwidth to 20.5TB/s, matching the MI400. If the MI400's actual FP4 performance proves comparable to or lower than the VR200 NVL144's, AMD will fall behind NVIDIA once again.

According to SemiAnalysis, AMD lacks strong internal workloads and must develop rack-scale systems and improve its software while also standing up a dedicated prefill chip line if it is to have any chance of catching NVIDIA by 2027.

Suppliers with internal workloads, such as AWS (Trainium3) and Meta (MTIAv4), are better positioned to develop dedicated prefill chips. AWS, however, faces engineering challenges from the limited space of its 1U compute tray, which may require solutions such as EFA NIC sidecars and external PCIe AEC cables.