Key takeaways from Hot Chips 2025: Google's TPU performance surge, Meta's computing power investment, optical modules, and Ethernet's push into Scale Up

Wallstreetcn
2025.09.04 10:42

JP Morgan said the Hot Chips 2025 conference points to strong growth in demand for AI infrastructure: Google's TPU performance has improved roughly tenfold over TPU v5p, rapidly narrowing the gap with NVIDIA GPUs; Meta is expanding its 100,000+ GPU clusters and expects tenfold growth over the next decade; Ethernet is expanding into the Scale Up domain, becoming a key growth point for networking; and optical integration is accelerating to address power consumption limits.

The demand for AI is far from slowing down, and multiple technological breakthroughs are reshaping the industry landscape.

On September 3rd, JP Morgan stated in its latest research report that, after attending the Hot Chips 2025 conference, its analysts believe the explosive growth of AI on both the consumer and enterprise sides will continue to drive a strong demand cycle for advanced computing, memory, and networking technologies for years to come.

The report noted that every session at the conference emphasized that AI is the most important driving force behind technological advancement and product demand, with the core message being: the growth momentum of AI infrastructure demand remains strong and is expanding from mere computing power competition to a comprehensive upgrade of networking and optical technologies. The bank believes that the following important trends are worth noting:

Google's Ironwood TPU has significantly improved performance, rapidly narrowing the gap with NVIDIA GPUs;

Meta is expanding its 100k+ GPU cluster scale, expected to grow tenfold in the next decade;

Networking technology has become a key growth point for AI infrastructure, with Ethernet expanding into the Scale Up domain;

Optical integration technology is accelerating development to address power consumption limitations.

Google's Ironwood TPU: Performance Leap Narrows the Gap with GPUs

JP Morgan stated that Google revealed the latest details of its Ironwood TPU (TPU v7) at the conference, showcasing remarkable performance improvements: compared to TPU v5p, Ironwood's peak FLOPS performance has increased by about 10 times, with a 5.6-times improvement in efficiency.

Memory capacity and bandwidth have also improved significantly: Ironwood is equipped with 192GB of HBM3E memory at 7.3TB/s of bandwidth, up sharply from TPU v5p's 96GB of HBM2 at 2.8TB/s.

The Ironwood supercluster can scale up to 9,216 chips (a significant increase from the previous 4,096), comprising 144 racks of 64 chips each, with a total of 1.77PB of directly addressable HBM memory and 42.5 exaflops of FP8 compute.

Performance comparisons show that Ironwood's 4.2 TFLOPS/watt efficiency is only slightly below the 4.5 TFLOPS/watt of NVIDIA's B200/B300 GPUs. JP Morgan stated:

This data highlights that advanced AI-specific chips are rapidly narrowing the performance gap with leading GPUs, driving hyperscale cloud service providers to increase investments in custom ASIC projects.
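
These figures are internally consistent. Here is a minimal Python sketch of the arithmetic; the per-chip values are derived from the quoted aggregates, not stated in the report:

```python
# Back-of-the-envelope check of the Ironwood supercluster figures.
# All inputs are quoted in the article; per-chip values are derived.

chips_per_rack = 64
racks = 144
hbm_per_chip_gb = 192        # HBM3E per Ironwood chip
cluster_fp8_ef = 42.5        # aggregate FP8 exaflops
tflops_per_watt = 4.2        # quoted efficiency

chips = chips_per_rack * racks                            # 9,216 chips
total_hbm_pb = chips * hbm_per_chip_gb / 1e6              # ~1.77 PB of HBM
per_chip_pflops = cluster_fp8_ef * 1e3 / chips            # ~4.61 PFLOPS FP8 per chip
implied_watts = per_chip_pflops * 1e3 / tflops_per_watt   # ~1,100 W per chip

print(chips, round(total_hbm_pb, 2), round(per_chip_pflops, 2), round(implied_watts))
```

The rack math reproduces both the 9,216-chip count and the 1.77PB HBM total, and the implied per-chip power of roughly 1.1kW is in line with current accelerator-class parts.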

According to JP Morgan's forecast, this chip, produced using a 3-nanometer process in collaboration with Broadcom, is expected to enter mass production in the second half of 2025. Ironwood is expected to bring $9 billion in revenue to Broadcom over the next 6-7 months, with total lifecycle revenue exceeding $15 billion.

Meta's Customized Deployment Highlights MGX Architecture Advantages

The report pointed out that Meta detailed the architecture of Catalina, its customized NVL72 system, at the conference. Unlike NVIDIA's standard NVL72 reference design, Catalina is distributed across two IT racks and is equipped with four auxiliary cooling racks.

In terms of internal configuration, each B200 GPU is paired with one Grace CPU, rather than the standard configuration of two B200s per Grace CPU. This design doubles the total number of Grace CPUs in the system to 72; LPDDR memory increases from 17.3TB to 34.6TB, and total cache-coherent memory rises from 30TB to 48TB, an increase of 60%.
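
The memory deltas follow directly from doubling the Grace CPU count. A minimal sketch, assuming 480GB of LPDDR per Grace CPU and roughly 13.5TB of aggregate B200 HBM per system (both are illustrative assumptions; the article quotes only the totals):

```python
# Rough reconstruction of the NVL72 vs. Catalina memory totals.
# 480 GB LPDDR per Grace CPU and ~13.5 TB of total GPU HBM are assumed
# values, not figures from the report.

LPDDR_PER_GRACE_TB = 0.48
TOTAL_GPU_HBM_TB = 13.5

for name, grace_cpus in [("standard NVL72", 36), ("Catalina", 72)]:
    lpddr_tb = grace_cpus * LPDDR_PER_GRACE_TB
    coherent_tb = lpddr_tb + TOTAL_GPU_HBM_TB
    print(f"{name}: {lpddr_tb:.1f} TB LPDDR, ~{coherent_tb:.0f} TB coherent")

# standard NVL72: 17.3 TB LPDDR, ~31 TB coherent   (article: 17.3 TB / 30 TB)
# Catalina:       34.6 TB LPDDR, ~48 TB coherent   (article: 34.6 TB / 48 TB)
```

Under these assumptions the LPDDR figures match the article exactly, and the coherent-memory totals land within rounding of the quoted 30TB and 48TB.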

Meta stated that the choice of the custom NVL72 design is primarily based on model requirements and physical infrastructure considerations. Model requirements include not only large language models but also ranking and recommendation engines. In terms of physical infrastructure, there is a need to deploy these power-intensive systems into traditional data center infrastructure.

Meta emphasized that this kind of customization is possible because NVIDIA's design follows the OCP-compliant MGX modular reference architecture, which allows the system to be tailored to customer-specific architectural needs.

Network Technology Becomes the Focus, Scale Up Brings New Opportunities

Networking was a major topic at the conference, with significant growth opportunities emerging in both the Scale Up and Scale Out domains.

Broadcom highlighted its newly launched 51.2Tbps Tomahawk Ultra switch, which the company describes as a "low-latency Scale Up switch built for HPC and AI applications."

Tomahawk Ultra follows Broadcom's 102.4Tbps Tomahawk 6 switch and supports the company's strategy of driving Ethernet adoption in both the Scale Up and Scale Out domains.
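
Aggregate switch bandwidth maps directly onto port radix. A minimal sketch; the port speeds are common Ethernet rates chosen for illustration, not configurations cited in the report:

```python
# Port counts implied by aggregate switch bandwidth at common Ethernet
# speeds. Port rates here are illustrative, not from the report.

def port_count(switch_tbps: float, port_gbps: int) -> int:
    return int(switch_tbps * 1000 // port_gbps)

for name, tbps in [("Tomahawk Ultra", 51.2), ("Tomahawk 6", 102.4)]:
    for gbps in (400, 800):
        print(f"{name}: {port_count(tbps, gbps)} x {gbps}G ports")

# Tomahawk Ultra: 128 x 400G, or 64 x 800G
# Tomahawk 6:     256 x 400G, or 128 x 800G
```

Higher radix at a given port speed is what lets a Scale Up switch connect more accelerators in a single flat tier, which is the design point Broadcom is targeting.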

Analysts noted that Scale Up in particular represents an important expansion of Broadcom's total addressable market (TAM), especially as hyperscale cloud service providers deploy increasingly large XPU clusters.

NVIDIA continues to advance its Ethernet roadmap, launching "Spectrum-XGS" Ethernet technology aimed at the "scale-across" opportunity arising from customers operating distributed clusters across multiple data centers.

NVIDIA claims that Spectrum-XGS has several advantages over off-the-shelf Ethernet solutions, including unlimited scalability and automatic load balancing, and announced that CoreWeave is the first customer to deploy this technology.

Optical Technology Deep Integration to Address Power and Cost Challenges

Optical technology has become another focal area of the conference, with multiple speakers emphasizing the key drivers for deep integration of optical technology into AI infrastructure, including the limitations of copper interconnects, rapidly growing rack power density, and the relatively high cost and power consumption of optical transceivers.

Lightmatter showcased its Passage M1000 "AI 3D Photonic Interconnect," which addresses the problem that I/O confined to a chip's edges scales connectivity more slowly than compute. The core of the M1000 is an active multi-reticle photonic interposer spanning more than 4,000 square millimeters, capable of hosting large chip complexes within a single package; its optical waveguides are distributed across the chip surface, removing the "beachfront" limitation of edge-bound designs while consuming significantly less power than electrical signaling.

Ayar Labs discussed its TeraPHY optical I/O chiplet for AI Scale Up, which it describes as the first implementation of a UCIe optical retimer, ensuring compatibility and interoperability with chips from other vendors. The part supports bidirectional bandwidth of up to 8.192Tbps, with power efficiency 4-8 times better than traditional pluggable optics driven by electrical SerDes.

Although CPO and other cutting-edge photonic technologies have not yet been widely deployed, analysts expect data center power limits to become the key driver of broad adoption in 2027-2028.
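
The 4-8x efficiency claim is easiest to read in energy-per-bit terms, since link power is simply bandwidth times energy per bit. A minimal sketch with illustrative pJ/bit values (assumptions for each link class, not figures from the report):

```python
# Link power = bandwidth x energy-per-bit. The pJ/bit values below are
# illustrative assumptions for each link class, not report figures.

BW_TBPS = 8.192  # quoted bidirectional bandwidth

links = {
    "pluggable optics + electrical SerDes": 15.0,  # assumed pJ/bit
    "in-package optical I/O":                3.0,  # assumed pJ/bit
}

for name, pj_per_bit in links.items():
    watts = BW_TBPS * 1e12 * pj_per_bit * 1e-12
    print(f"{name}: ~{watts:.0f} W at {BW_TBPS} Tbps")

# ~123 W vs. ~25 W: a 5x gap, inside the 4-8x range cited above
```

Multiplied across the thousands of links in a cluster, gaps of this size are exactly the data center power pressure analysts expect to force adoption in 2027-2028.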

AMD Expands Product Line, MI400 Series to Launch in 2026

AMD provided an in-depth overview of the technical details of its MI350 GPU series at the conference. The MI355X operates at a higher total board power (TBP) and maximum clock frequency, with a TBP of 1.4kW and a clock of 2.4GHz, versus the MI350X's 1.0kW TBP and 2.2GHz clock.

As a result, the MI355X is primarily deployed in liquid-cooled data center infrastructure, while the MI350X mainly serves customers with traditional air-cooled infrastructure.

In terms of performance, the MI355X's compute performance is 9% higher than the MI350X's, consistent with the ratio of their peak clocks (2.4GHz vs. 2.2GHz), while per-chip memory capacity and bandwidth are identical.

In terms of deployment configuration, the MI355X can be deployed in rack systems with up to 128 GPUs, while the MI350X rack supports a maximum of 64 GPUs, which is mainly determined by the thermal management capabilities of air-cooled systems versus direct liquid-cooled systems. However, both have a Scale Up domain of 8 GPUs.
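
The cooling split follows from simple rack power budgets. A minimal sketch counting GPU board power only (real racks add CPUs, NICs, switches, and cooling overhead on top):

```python
# GPU-only rack power for the two deployment configurations above.
# Excludes CPUs, NICs, switches, and cooling overhead.

configs = {
    "MI355X rack (liquid-cooled)": (128, 1.4),  # GPUs per rack, TBP in kW
    "MI350X rack (air-cooled)":    (64, 1.0),
}

for name, (gpus, tbp_kw) in configs.items():
    print(f"{name}: {gpus * tbp_kw:.0f} kW of GPU power")

# MI355X rack: 179 kW, far beyond typical air-cooled rack limits
# MI350X rack: 64 kW, closer to what traditional facilities can handle
```

At roughly 179kW of GPU power alone, the 128-GPU configuration is only practical with direct liquid cooling, while the 64kW MI350X rack stays within reach of conventional air-cooled data centers.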

AMD reiterated that the MI400 series and its "Helios" rack solution remain on track for a 2026 launch (JP Morgan expects the second half of 2026), with the MI500 series planned for 2027.

JP Morgan analysts believe AMD is well positioned in the inference computing market, which is growing faster than the training market, and that its products offer strong performance and total-cost-of-ownership advantages relative to NVIDIA's alternatives.