
JP Morgan Expert Interview: Is there "overcapacity" in AI data centers? How to deploy training and inference infrastructure?

JP Morgan's latest expert interview reveals that concerns about "overcapacity" in AI infrastructure are premature. Lighter-weight algorithms and the recycling of hardware are easing compute anxiety, but the power and cooling problems facing data centers are the more realistic speed bumps on the road to AI's rapid advancement.
Author: Long Yue
Source: Hard AI
JP Morgan recently held a conference call with Sri Kanajan, a data scientist at Scale AI and former senior data scientist at Meta, to discuss trends in ultra-large-scale AI data center architecture.
Kanajan believes that AI infrastructure deployment is still in its early stages, and concerns about overcapacity are limited. Advances in algorithms are reducing the power required for training, and infrastructure is being efficiently recycled through a "training to inference" lifecycle, with training clusters quickly reconfigured for inference workloads once a new generation of GPUs launches. However, power and cooling remain the main bottlenecks for scaling the next generation of data centers.
Algorithm Innovation: Shifting Power Demand from Training to Inference
According to the JP Morgan report, recent algorithmic breakthroughs, such as hybrid model architectures (DeepSeek among them), mixed-precision training, and reinforcement learning strategies, have significantly reduced the overall computational requirements for training AI models. This has prompted the industry to shift its optimization focus to the inference phase.
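The report names these techniques without showing how they are applied. As a minimal sketch of what mixed-precision training looks like in practice, here is a PyTorch loop using automatic mixed precision; the model, data loader, and hyperparameters are illustrative stand-ins, not anything from the interview.

```python
# Minimal mixed-precision training loop (illustrative sketch only).
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 10).cuda()       # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()           # guards fp16 gradients against underflow

for inputs, labels in loader:                  # `loader` is a hypothetical DataLoader
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():            # run the forward pass in fp16/bf16 where safe
        loss = F.cross_entropy(model(inputs.cuda()), labels.cuda())
    scaler.scale(loss).backward()              # backpropagate the scaled loss
    scaler.step(optimizer)                     # unscale gradients, then step
    scaler.update()                            # adapt the loss scale for the next step
```

Running most of the forward and backward pass in 16-bit halves memory traffic and raises throughput, which is one way precision changes cut training power draw.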
Kanajan points out that the industry is actively adopting techniques such as model distillation and compression to slim models down, aiming to improve performance without significantly increasing power investment.
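Distillation itself is a well-established recipe: a small "student" model is trained to match the softened output distribution of a large "teacher." The sketch below shows the classic combined loss; the temperature and weighting values are illustrative assumptions, not figures from the report.

```python
# Classic knowledge-distillation loss (illustrative sketch only).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale to offset the temperature's effect on gradients
    # Hard term: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The resulting student can serve inference traffic at a fraction of the teacher's compute, which is the power-efficiency point Kanajan is making.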
Infrastructure: Dynamic Deployment, Concerns About Overcapacity Are Premature
Kanajan believes that AI infrastructure deployment is still in its early stages, particularly given cloud service providers' long-term expectations for returns on their investments, so current concerns about overcapacity are limited.
A key dynamic-utilization strategy: when a training cycle ends and a new generation of GPUs is released, existing training clusters are quickly reconfigured to support inference workloads. This "training to inference" lifecycle transition ensures that computing resources efficiently adapt as demand shifts from intensive training to steady-state inference serving.
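To make the lifecycle concrete, here is a deliberately simplified sketch of such a reallocation policy. Every name, field, and threshold is hypothetical; a real fleet scheduler would also handle placement, networking, and failure domains.

```python
# Toy model of the "training to inference" cluster lifecycle (hypothetical).
from dataclasses import dataclass

@dataclass
class GpuCluster:
    name: str
    gpu_generation: int
    workload: str  # "training" or "inference"

def rebalance(fleet: list[GpuCluster], newest_generation: int) -> None:
    """Repurpose prior-generation training clusters once newer GPUs land."""
    for cluster in fleet:
        if cluster.workload == "training" and cluster.gpu_generation < newest_generation:
            cluster.workload = "inference"  # recycle the hardware rather than retire it

fleet = [
    GpuCluster("train-a", gpu_generation=1, workload="training"),
    GpuCluster("infer-b", gpu_generation=1, workload="inference"),
]
rebalance(fleet, newest_generation=2)
print([(c.name, c.workload) for c in fleet])
# [('train-a', 'inference'), ('infer-b', 'inference')]
```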
In terms of construction models, training clusters are typically deployed in newly built, isolated facilities ("greenfield") optimized for offline GPU utilization, while inference clusters tend to expand existing data centers ("brownfield"), especially in metropolitan areas, to support always-on online AI services.
Energy Challenges: Power and Cooling Become the Biggest Bottleneck
In Kanajan's view, as data centers pursue higher density and take on more intensive computing loads, power delivery and heat dissipation have become the primary bottlenecks for scaling the next generation of data centers.
Ultra-large-scale operators are actively exploring innovative solutions, such as adopting liquid cooling in Type I architecture designs, and are even evaluating nuclear or alternative energy sources to secure stable 24/7 power. Meanwhile, robust grid-interconnection strategies are crucial to keeping data centers running without interruption.
Meta Leads Innovation in Data Center Architecture
In terms of data center design, the JP Morgan report highlights Meta's innovative practices. Unlike traditional ultra-large-scale vendors that design H-type layouts to support multi-tenant clouds, Meta has chosen a Type I campus-style configuration specifically for internal AI workloads.
According to the report, this design delivers improvements in power consumption, cooling, and rack density, all critical factors for supporting high-performance training clusters.

In terms of hardware strategy, Meta is balancing branded solutions with white-box solutions. On the networking side, while Arista's strong capabilities remain essential in the current infrastructure, Meta is collaborating with white-box suppliers such as Celestica, with the long-term goal of pairing its internal software with white-box hardware.
This article is from the WeChat public account "Hard AI".