Google launches a groundbreaking new Scaling Law. Will the future of intelligence be distributed? $3 trillion AI faces a crossroads

Google has introduced a new Scaling Law for DiLoCo, marking a significant breakthrough in distributed training. The research shows that, across different model scales, DiLoCo is more robust, performs better, needs less bandwidth, and tolerates larger batches than traditional data parallel training. The study was a collaboration between the Google Research, Google Search, and Google DeepMind teams. It highlights the potential of DiLoCo in large-scale model training and suggests that the future of intelligence will be distributed.

After test-time compute, three major teams at Google have joined forces to discover a brand new Scaling Law!

Just now, Google researcher Zachary Charles announced: "Significant breakthroughs have been made in distributed training on increasingly larger models."

At the core of this work is the Scaling Law of DiLoCo.

The new training method holds up as models grow; in the future, training large models across "multiple data centers" may no longer be an issue.

The paper presents four major findings, showing that the Scaling Law of the DiLoCo training method far exceeds "data parallelism":

More Robust (Harder): The hyperparameters of DiLoCo remain stable and predictable across different model sizes.

More Superior (Better): As model size increases, the advantage of DiLoCo over data parallel training grows further.

More Efficient (Faster): The bandwidth required by DiLoCo is several orders of magnitude less than that of data parallel training.

More Powerful (Stronger): DiLoCo can tolerate batch sizes much larger than those of data parallel training.

It is worth mentioning that this monumental work brings together three major teams at Google: Google Research, Google Search, and Google DeepMind.

Under fixed computational budgets, researchers explored the Scaling Law of DiLoCo in training large models.

The paper focuses on analyzing how algorithmic factors (such as the number of model replicas, hyperparameter settings, and token budgets) affect the training process, demonstrating that these impacts can be accurately predicted through the Scaling Law.

The results indicate that DiLoCo exhibits stable and predictable scalability as model size increases. Co-author Arthur Douillard emphasized once again: DiLoCo is effective!

The future of intelligence will be distributed, and DiLoCo may be the key element.

With reasonable tuning, DiLoCo scales better than data parallel training, and it may even outperform data parallel training on small-scale models.

These findings reveal the powerful advantages of DiLoCo: it not only addresses communication bottlenecks but also opens up new possibilities for large-scale model training.

Some netizens expressed amazement: "DiLoCo may redefine how LLM scaling works! Lower bandwidth requirements, higher efficiency."

Is "Data Parallel" Training Coming to an End?

Data parallel training performs excellently on large models, provided that the computational resources are co-located.

If the computation is widely distributed, communication can become a significant bottleneck, and the problem becomes even more severe as the model size increases!

The solutions adopted in machine learning, such as in federated learning and data center training, involve training multiple independent models and periodically synchronizing them.

As the scale of machine learning models expands, the inherent frequent synchronization requirements of data parallel methods can lead to significant performance degradation, posing a critical challenge for further scaling of models.

So, how can we reduce synchronization requirements while maintaining model quality to break through this bottleneck?

The answer may lie in the innovative approach of DiLoCo (Distributed Low-Communication).

Each DiLoCo model replica independently trains H inner optimization steps.

These models synchronize through outer optimization steps, typically introducing a momentum mechanism between the outer optimization steps.

In the example below, there are M=4 model replicas in total.

The success of DiLoCo has been repeatedly validated. Its operation is similar to the FedOpt method in federated learning.

In addition, researchers have repeatedly demonstrated the outstanding performance of DiLoCo in training large language models (LLMs).

So what is the problem with DiLoCo? Simply put—scale.

Unlike data parallel training, DiLoCo introduces additional "outer" hyperparameters, and its behavior in practice differs significantly from theoretical expectations.

This is precisely the purpose of studying scaling laws!

This research builds the scaling law for DiLoCo and data parallel training from scratch to predict their comparative performance on large-scale models.

In data parallel training, each training step processes a data batch of size B.

In this study, batch size refers to the number of tokens in the batch (rather than the number of sequences).

The batch gradient is computed, and an optimization step is taken using the learning rate γ.

During the DiLoCo training process, a global batch size of B is processed at each time step t, and it is evenly distributed across M DiLoCo replicas at the sequence level.

Thus, the global batch size remains B, while the local batch size for each DiLoCo replica is B/M. Similar to data parallel training, each replica computes the batch gradient and performs an inner optimization step using the learning rate γ.
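The batch bookkeeping described above can be made concrete with a small sketch. It assumes the sequence length of 2,048 that the article reports for the experiments; the function name and the choice of B = 2^20 tokens with M = 4 replicas are illustrative only.

```python
def split_global_batch(global_batch_tokens: int, seq_len: int, num_replicas: int) -> dict:
    """Split a global batch (measured in tokens) evenly across replicas at the sequence level."""
    if global_batch_tokens % seq_len != 0:
        raise ValueError("global batch must be a whole number of sequences")
    num_sequences = global_batch_tokens // seq_len
    if num_sequences % num_replicas != 0:
        raise ValueError("sequences must divide evenly across replicas")
    local_sequences = num_sequences // num_replicas
    return {
        "global_sequences": num_sequences,
        "local_sequences": local_sequences,
        "local_batch_tokens": local_sequences * seq_len,
    }


# B = 2^20 tokens, sequence length 2,048, M = 4 replicas (illustrative values)
print(split_global_batch(2**20, 2048, 4))
# {'global_sequences': 512, 'local_sequences': 128, 'local_batch_tokens': 262144}
```

Each replica therefore sees B/M = 2^18 tokens per step in this example, while the global batch stays at 2^20 tokens.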

However, unlike data parallel training, DiLoCo performs an "outer optimization" every H steps, based on the outer gradients calculated in the parameter space, and updates using the learning rate η.

An important comparison is data parallel vs. DiLoCo (M=1).

While they are similar, they are not exactly the same.

In the case of M=1, DiLoCo still includes an outer optimizer (OuterOpt) step, so it can be viewed as a variant of the Lookahead optimizer.

In DiLoCo, OuterOpt typically uses GD with Nesterov momentum, which means that DiLoCo (M=1) is actually a variant of data parallel training, but the momentum operation is performed only once every H steps.
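The procedure described so far can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the loss is a simple quadratic (each replica pulls the parameters toward its own hypothetical `target`), the inner optimizer is plain SGD standing in for AdamW, and all numeric settings are arbitrary.

```python
from typing import List

Vector = List[float]


def inner_sgd(theta: Vector, target: Vector, steps: int, lr: float) -> Vector:
    """Run `steps` plain SGD steps on the toy loss 0.5 * ||theta - target||^2.

    Plain SGD is a stand-in; the article says the paper uses AdamW here.
    """
    theta = list(theta)
    for _ in range(steps):
        grad = [t - c for t, c in zip(theta, target)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def diloco(targets: List[Vector], rounds: int, H: int, inner_lr: float,
           outer_lr: float, momentum: float) -> Vector:
    """Toy DiLoCo loop with M = len(targets) replicas and H inner steps per round."""
    dim = len(targets[0])
    theta = [0.0] * dim      # shared (global) parameters
    velocity = [0.0] * dim   # outer momentum buffer
    for _ in range(rounds):
        # Each replica starts from the global parameters and trains on its own data.
        local_models = [inner_sgd(theta, c, H, inner_lr) for c in targets]
        average = [sum(col) / len(local_models) for col in zip(*local_models)]
        # Outer gradient: average movement of the replicas, in parameter space.
        outer_grad = [g - a for g, a in zip(theta, average)]
        # Outer step: SGD with Nesterov momentum.
        velocity = [momentum * v + d for v, d in zip(velocity, outer_grad)]
        theta = [t - outer_lr * (d + momentum * v)
                 for t, d, v in zip(theta, outer_grad, velocity)]
    return theta


# M = 4 replicas whose toy targets average to [0.0, 1.0].
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 3.0]]
print(diloco(targets, rounds=100, H=5, inner_lr=0.1, outer_lr=0.7, momentum=0.9))
```

In this sketch, setting M=1 with an outer learning rate of 1 and zero momentum makes the outer step simply adopt the replica's parameters, so the loop reduces to ordinary training. Any other outer setting gives the Lookahead-style variant mentioned above.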

A large number of experiments were also conducted, covering various aspects of the training process, providing a comprehensive analysis of their scaling behavior.

Experimental Method

In most experiments, the research team used the training set of the C4 dataset to train the model, with evaluation metrics based on the validation set of C4.

Additionally, zero-shot evaluation metrics were calculated on three downstream tasks: HellaSwag, PIQA, and ARC-Easy.

Model Architecture: Chinchilla Variant

The research team used a decoder-only Transformer architecture similar to "Chinchilla," incorporating QK-LayerNorm and employing z-loss regularization to stabilize training.

They packed multiple sequences into each batch, with a maximum sequence length fixed at 2,048 throughout.

All models were trained from scratch, as the main focus was to study the scaling laws during the pre-training phase.

The research team trained a series of models, adjusting the number of Transformer layers, the number of attention heads, the QKV dimensions, and the hidden dimensions of the feedforward layers.

Unless otherwise specified, they used Chinchilla's token budget and performed extensive hyperparameter tuning on all models except for the two largest ones (4B and 10B parameters).

Algorithms and Optimizers

The research team used AdamW as the data-parallel optimizer and also as the inner optimizer for DiLoCo. The β1 for both algorithms was set to 0.9, and β2 was set to 0.99.

Training began with a 1000-step warm-up, followed by cosine learning rate decay. The weight decay parameter λ was set to T⁻¹, where T is the total number of training steps (depending on batch size and token budget). By the end of training, the learning rate decayed to 5% of its peak.
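The schedule in the paragraph above can be written out as a short function. This is a sketch under one explicit assumption: the article gives only the length of the warm-up, so a linear ramp is assumed here. The function and parameter names are illustrative.

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float,
                  warmup_steps: int = 1000, final_fraction: float = 0.05) -> float:
    """Warm-up followed by cosine decay down to `final_fraction` of the peak."""
    if step < warmup_steps:
        # Linear warm-up is an assumption; the article only states its length.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = final_fraction * peak_lr
    return floor + (peak_lr - floor) * cosine


def weight_decay(total_steps: int) -> float:
    """Weight decay lambda = 1 / T, where T is the total number of training steps."""
    return 1.0 / total_steps


T = 10_000  # illustrative step count
for step in (0, 999, 1000, 5500, T):
    print(step, learning_rate(step, T, peak_lr=1e-3))
print("lambda =", weight_decay(T))
```

Because T depends on batch size and token budget, both the decay curve and λ change whenever the batch size changes.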

To ensure training stability, they clipped the global ℓ2 norm of the (inner) gradients to 1, while the outer gradients were not clipped.
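Clipping by global norm means that all gradient tensors are scaled together by one common factor, rather than each tensor being clipped on its own. A minimal sketch, using nested lists in place of real tensors (the function name is illustrative):

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float = 1.0) -> List[List[float]]:
    """Scale all gradient tensors together so their joint l2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(tensor) for tensor in grads]
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads]


# Two "tensors" with a joint norm of 5 are scaled down to a joint norm of 1.
print(clip_by_global_norm([[3.0], [4.0]]))
```

In the setup described above, this applies only to the inner gradients; the outer gradients are used as they are.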

For DiLoCo, they used SGD with Nesterov momentum as the outer optimizer. The momentum was set to 0.9, and the outer learning rate remained constant.

Built from Scratch, a New Scaling Law Has Arrived

Finding 1: Scale

The evaluation loss of DiLoCo improved relative to data parallelism (Data-Parallel) as N increased.

The scaling law predicts that when M=2, DiLoCo will have a lower loss than data parallelism when the parameters exceed several billion. This phenomenon was validated in the training of the largest tuned models as well as the 4B and 10B models.

Figure 2 below shows a comparison of the performance of DiLoCo and Data-Parallel algorithms at different model scales (N).

Figure (a) shows that as the model scale increases from 2^25 to 2^31, the evaluation loss (EvalLoss) of both DiLoCo (at M=1, 2, 4, 8) and Data-Parallel decreases, but the loss of DiLoCo decreases more significantly, especially at larger M values.

Figure (b) further illustrates the percentage difference in evaluation loss of DiLoCo relative to Data-Parallel. As the model scale increases, DiLoCo's loss becomes increasingly lower relative to that of Data-Parallel, demonstrating DiLoCo's superior performance as model scale expands.

This discovery has two independent but related parts:

DiLoCo (M=1) performs better: As mentioned above, DiLoCo has lower evaluation loss than Data-Parallel for all model sizes when M=1. Moreover, as the model parameter size N increases, the gap between Data-Parallel and DiLoCo (M=1) becomes larger.

Performance of DiLoCo (M≥2): For most model sizes, the evaluation loss of DiLoCo is higher than that of Data-Parallel when M≥2. However, if we look at the signed percentage difference between DiLoCo and Data-Parallel, we find that as N increases, DiLoCo's performance relative to Data-Parallel improves, even surpassing Data-Parallel at M=2 and N=2.4 billion parameters.

For example, the research team listed the evaluation losses of Data-Parallel and DiLoCo at different model sizes N in Table 4 below.

It can be seen that regardless of the value of M, the percentage difference strictly decreases as N increases.

This trend is also shown in Figure 2: as N increases, the relative evaluation loss of DiLoCo gradually decreases.
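The signed percentage difference used in this comparison is easy to state in code. The sketch below shows one natural definition, with Data-Parallel as the baseline; the loss values are placeholders chosen for illustration and are NOT taken from Table 4.

```python
def percent_difference(diloco_loss: float, data_parallel_loss: float) -> float:
    """Signed percentage difference of DiLoCo's loss relative to Data-Parallel."""
    return 100.0 * (diloco_loss - data_parallel_loss) / data_parallel_loss


# Placeholder losses for illustration only (not values from the paper).
print(percent_difference(2.55, 2.50))  # positive: DiLoCo has the higher loss
print(percent_difference(2.45, 2.50))  # negative: DiLoCo has the lower loss
```

A value that falls as N grows, and eventually turns negative, is what "DiLoCo improves relative to Data-Parallel" means here.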

The research team also trained models with 4 billion and 10 billion parameters using hyperparameters tuned with scaling laws to verify this.

Although Figure 2 shows results within the "interpolation" range (based on extensive experimental sweeps), these findings also carry over to the extrapolation regime: with M=1 or 2, DiLoCo trains the 4 billion and 10 billion parameter models to a lower evaluation loss.

Table 5 shows the results of training with extrapolated hyperparameters, comparing the evaluation losses of DiLoCo and Data-Parallel algorithms on larger 4B and 10B models, indicating that DiLoCo performs excellently at larger scales.

Finding 2: Single-Replica DiLoCo

When the number of replicas M=1, the evaluation losses obtained by DiLoCo at different model sizes are all lower than those of Data-Parallel.

Figure 3 below compares the evaluation loss and HellaSwag zero-shot accuracy of DiLoCo and Data-Parallel when M=1, across different model sizes (35M, 550M, 1.3B, 2.4B) and global batch sizes (measured in tokens, from 2^16 to 2^20).

Figure (a) shows that the evaluation loss of DiLoCo is consistently lower than that of Data-Parallel, with the gap widening as the batch size increases; figure (b) indicates that DiLoCo also outperforms Data-Parallel in HellaSwag zero-shot accuracy, with a similar trend.

In almost all cases, when M=1, DiLoCo not only has a lower evaluation loss but also a higher zero-shot accuracy on downstream tasks compared to Data-Parallel.

Moreover, the performance of DiLoCo (M=1) is more stable with respect to batch size: doubling or quadrupling the batch size has a significant impact on the performance of Data-Parallel, but almost no effect on DiLoCo (M=1), as clearly illustrated in figure 3.

Finding 3: The Impact of Batch Size on Performance

DiLoCo increases the optimal batch size, and the optimal global batch size grows with the number of replicas M. This means that DiLoCo improves horizontal scalability compared to Data-Parallel.

Although DiLoCo with M>1 is often slightly worse in evaluation loss when the best experiment is picked across all hyperparameters, its performance with respect to batch size is significantly better.

Both Data-Parallel and DiLoCo (M=1) perform well with small batches, but as the batch size increases, the performance of Data-Parallel declines rapidly.

In contrast, regardless of the number of replicas M, the performance of DiLoCo is much more stable with respect to batch size.

The following figure 4 shows examples of evaluation loss, indicating that for all M values, the optimal batch size of DiLoCo is larger than that of Data-Parallel, and further increases with M.

For example, in the 550M model, the evaluation loss of Data-Parallel is lowest at smaller batch sizes, while DiLoCo performs better at larger batch sizes. A similar trend holds for the 1.3B and 2.4B models.

The figure below (Figure 5) shows the zero-shot accuracy on the HellaSwag dataset. The results indicate that even with a smaller model size, DiLoCo achieves higher accuracy at a larger global batch size when M=2.

For example, in the 550M model, the accuracy curve of DiLoCo outperforms Data-Parallel as the batch size increases; similar trends are observed in the 1.3B and 2.4B models.

Finding 4: Outer Learning Rate

The optimal outer learning rate is essentially independent of the model size N but varies with the number of replicas M.

An important finding is that DiLoCo scales more naturally in a horizontal manner. In all cases, the token budget D is only related to the model size N. This means that if a batch size is increased by 4 times, the number of training steps will be reduced to 1/4.
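The arithmetic behind this is simple: with a fixed token budget D, the number of steps is D divided by the batch size B. The sketch below assumes the common Chinchilla rule of thumb of about 20 tokens per parameter; the article itself only says "Chinchilla's token budget," so that constant and the function name are assumptions.

```python
def training_steps(num_params: int, batch_tokens: int, tokens_per_param: int = 20) -> int:
    """Number of steps T = D / B for a token budget D = tokens_per_param * N.

    tokens_per_param = 20 is an assumed rule of thumb, not a value from the article.
    """
    token_budget = tokens_per_param * num_params
    return token_budget // batch_tokens


n = 550_000_000                      # a 550M-parameter model
base = training_steps(n, 2**18)      # baseline batch size in tokens
bigger = training_steps(n, 2**20)    # 4x the batch size
print(base, bigger, base / bigger)
```

The ratio comes out at roughly 4: the same token budget is consumed in about a quarter of the sequential steps, provided the method tolerates the larger batch.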

For DiLoCo, this still maintains good performance while allowing for more resources to be used at once, thus shortening the total training time. In contrast, Data-Parallel seems to rely more on serial training. This reduction in training time is further amplified by the decrease in communication volume.

The figure below (Figure 6) shows the ideal training time (wall-clock time), simulating scenarios under different network bandwidths.

It can be seen that DiLoCo's tolerance for larger batch sizes allows it to reach a loss comparable to Data-Parallel significantly faster, and this effect is more pronounced in low-bandwidth settings.
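A back-of-the-envelope model shows why bandwidth matters so much. The sketch below is much simpler than the simulation in the paper and rests on loud assumptions: every synchronization ships one full copy of the model, communication never overlaps with computation, and all numbers are hypothetical.

```python
def ideal_wall_clock_seconds(steps: int, compute_s_per_step: float,
                             model_bytes: float, bandwidth_bytes_per_s: float,
                             sync_every: int = 1) -> float:
    """Compute time plus the time spent shipping the model at each synchronization."""
    num_syncs = steps // sync_every
    comm_s_per_sync = model_bytes / bandwidth_bytes_per_s
    return steps * compute_s_per_step + num_syncs * comm_s_per_sync


# Hypothetical setup: 10,000 steps, 1 s of compute per step,
# a 2 GB model, and a 1 GB/s link between workers.
data_parallel = ideal_wall_clock_seconds(10_000, 1.0, 2e9, 1e9, sync_every=1)
diloco = ideal_wall_clock_seconds(10_000, 1.0, 2e9, 1e9, sync_every=100)  # H = 100 (assumed)
print(data_parallel, diloco)  # 30000.0 10200.0
```

In this toy model, synchronizing every H steps cuts communication time by a factor of H, and the slower the link, the larger the share of total time that saving represents.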

Finding 4 (continued): Outer Learning Rate

As shown in Figure 7, for sufficiently large models (N≥335 million parameters), the optimal η for each M is fixed. The larger M is, the larger η seems to be. This is consistent with previous research on federated learning: the outer learning rate should increase with the number of clients.

In fact, the outer learning rate depends only on the number of DiLoCo replicas and the frequency of synchronization. In other words, while the optimal inner learning rate varies with model size N, the optimal outer learning rate η for DiLoCo does not depend on N and is only related to M.

DiLoCo also helps to address the issue of overtraining!

Overtraining can be quite costly, but the larger batch size and reduced communication mean that DiLoCo can typically train with 4x overtraining (OT) in the time data parallel training needs for 1x overtraining.

There is more content in the paper, including the Scaling law itself, and even methods for predicting optimal hyperparameters.

The Scaling law indicates that for models with more than 2 billion parameters, DiLoCo with 2 replicas outperforms data parallel training.

Is Chinchilla dead? The $3 trillion crossroads of AI

DiLoCo makes tuning hyperparameters and training models simpler. But the problem is that the AI model itself remains fundamentally the same: still the same approach as Chinchilla.

After all, the previous pre-training Scaling Law has run its course, while the new AI Scaling Law is unrelated to training.

Nowadays, with the rise of new "reasoning models," a question arises: what will the future of AI look like if Chinchilla is dead?

About 5 years ago, OpenAI researchers discovered that investing more computing power and data into large-scale training could significantly enhance the performance of AI models.

A few years later, Google researchers went a step further: with a model named "Chinchilla," they demonstrated that increasing the amount of data could yield better results.

This combination of "computation + data" has given rise to today's giant models, such as GPT-4.

However, the success of this strategy relies on massive upfront investments. Huge amounts of data are crammed into complex and energy-intensive pre-training processes, with tech giants frantically building data centers filled with NVIDIA GPUs.

But the question arises: how far can this money-and-data-burning model go?

Barclays Capital's top analyst Ross Sandler points out that the future may face two entirely different scenarios:

First, "Chinchilla" continues to dominate, with investment in computing power and data continuing to rise. Second, the "stagnation" alternative, in which new technologies and models achieve stronger performance with fewer resources.

The capital expenditure gap between these two paths exceeds $3 trillion, enough to influence the direction of the entire industry.

The Rise of "Reasoning Models"

Driving this potential transformation is the rise of "reasoning models."

New models such as OpenAI's o1, o3, DeepSeek R1, and Google's Gemini 2.0 Flash Thinking adopt a technique called "test-time compute."

This method breaks down complex queries into smaller tasks, processing them one by one, no longer relying on long pre-training.

Compared to traditional models, reasoning models may respond slightly more slowly, but they produce more accurate outputs and have lower operating costs.

More importantly, they are free from the dependence on large-scale pre-training.

DeepSeek R1 even demonstrates a possibility: open-source reasoning models can achieve performance leaps in a short time.

This means that AI companies may no longer need to spend 18-24 months and huge sums to build the next "behemoth" model.

Additionally, the mixture of experts model (MoE) has also become a widely adopted technology, training multiple small "expert" models to work in conjunction with large models, only calling upon part of the computing power when needed.

This approach reduces infrastructure demands at a stroke.

What’s Next for Chinchilla?

Over the past five years, the Chinchilla strategy has driven the prosperity of the AI supply chain, leading to soaring stock prices for many companies.

However, its sustainability is now being questioned.

Barclays analysts point out, "As input costs surge, such as a single pre-training costing $10 billion, the performance gains may become increasingly marginal, and the cost-effectiveness of this model is declining."

More critically, training data may be running out.

The supply of high-quality data is limited, while AI's "appetite" for data is growing. Without enough "food," how long can Chinchilla survive?

Some industry leaders even predict that companies like OpenAI may stop the endless scaling after GPT-5.

In the face of data exhaustion, the AI industry is pinning its hopes on "synthetic data." Researchers believe that this "self-sufficient" feedback loop can allow models to continuously evolve, pushing technology to new heights.

Chinchilla can essentially survive through "self-feeding."

"If the AI industry makes breakthroughs in synthetic data and recursive self-improvement, we will return to the Chinchilla scaling path, and computational demand will continue to rise rapidly."

Is Chinchilla dead? The AI market will provide the final answer to this question.

If reasoning models and MoE technology mature, AI may move towards a lightweight and highly efficient future, and the trillions of dollars in infrastructure investment may no longer be necessary.

However, if "synthetic data" revives Chinchilla, the computing power competition will return.

Regardless of which future arrives, the evolution of AI is reshaping the entire world.

Source: New Intelligence, Original title: "Google Launches New Scaling Law to Rescue Transformers! $3 Trillion AI Faces a Crossroads"
