Grok 3 ran a 200,000-GPU experiment for the AI community: the Scaling Law has not hit a wall, but pre-training is not necessarily better

Wallstreetcn
2025.02.20 00:21

Grok 3 was trained with 100,000 NVIDIA H100 cards, and the results show that the Scaling Law still holds in the pre-training phase despite the shortage of new data. The Scaling Law has not hit a ceiling: increasing model size can still improve performance, but the cost-effectiveness is low. The Scaling approaches that currently work, ranked by cost-effectiveness from high to low, are: Test Time Scaling Law, RL Scaling Law, and the pre-training-phase Scaling Law.

The media's shift in perspective is so rapid that it can be overwhelming. In the morning, the story was DeepSeek's low cost and high cost-effectiveness: the pre-training Scaling Law is dead, fewer machines and GPU cards are needed, cost-effectiveness comes first, and NVIDIA is finished. By noon, with the release of Grok 3, which reportedly used 100,000 NVIDIA H100 cards and outperformed OpenAI's o3 mini and DeepSeek R1, the narrative flipped: the Scaling Law is still valid, a large number of cards is still needed, NVIDIA's stock price has hope, and miracles still come from brute force...

These two viewpoints are clearly contradictory; if one is true, the other must be false. So what is the truth of the matter? Let's analyze it.

Is the Scaling Law still valid during the pre-training phase?

Is the Scaling Law valid during the pre-training phase? Of course it is. The so-called "Scaling Law hitting a wall" refers to the well-known problem of insufficient data: without a large amount of new data, the growth curve of the pre-training Scaling Law slows down. Note that it slows down but does not stop; the pre-training Scaling Law has not reached its ceiling.

According to the Chinchilla Scaling Law, even without new data, the model's performance can still improve: as long as the base model size is increased, performance keeps rising. However, measured as performance gained per unit of compute spent, this becomes very cost-inefficient. This is why everyone is shifting to the RL Scaling Law and the Test Time Scaling Law: for the same compute, the performance improvement in these latter two phases is more pronounced, so the cost-performance ratio is higher.
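For reference, the Chinchilla result fits pre-training loss with a form roughly like the one below (a generic restatement of that paper's fit, not anything measured on Grok models; the exponent values are approximate):

```latex
% Chinchilla-style parametric loss fit.
%   N = number of model parameters, D = number of training tokens,
%   E, A, B, \alpha, \beta are fitted constants (both exponents are roughly 0.3 in the paper),
%   and total pre-training compute is approximately C \approx 6 N D.
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
% With D frozen (no new data), the B/D^{\beta} term stops improving, but the
% A/N^{\alpha} term still shrinks as N grows: loss keeps falling, only with
% rapidly diminishing returns per extra unit of compute.
```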

Currently, the Scaling methods that can improve model performance, ranked from high to low cost-performance ratio, are: Test Time Scaling Law > RL Scaling Law > Pre-training-phase Scaling Law (where, due to insufficient data, we can only increase the model size). When higher cost-performance Scaling methods are available, they are naturally used first; lower cost-performance methods are adopted only when no better option remains. It is similar to shopping: if high cost-performance items are on offer, nobody buys the low cost-performance ones.

Someone might ask: by this reasoning, is hoarding so many GPUs actually not very useful for training the best models? If we follow the theory above, then indeed it may not be strictly necessary; for example, DeepSeek could also produce a top model with 2,000 cards, right? However, having more cards compresses the time it takes to test new ideas and to train base models. Suppose you need to run experiments over different algorithms, hyperparameters, or data mixtures: with 10 new ideas and only 2,000 cards, it might take 5 days to reach a conclusion, whereas with tens of thousands of cards you might reach it in 1 day. More cards therefore greatly improve exploration efficiency, and it is certainly true that more cards lead to more innovation.
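A toy back-of-the-envelope, using the hypothetical figures from the paragraph above (the function and numbers below are illustrative only, not xAI's or DeepSeek's actual setups), shows how cluster size compresses experiment turnaround:

```python
# Illustrative only: wall-clock days needed to burn through a batch of experiment
# ideas when each experiment occupies a fixed slice of GPUs for a fixed time.
def days_to_finish(num_ideas: int, gpus_per_experiment: int,
                   days_per_experiment: float, total_gpus: int) -> float:
    concurrent = max(1, total_gpus // gpus_per_experiment)  # experiments that fit at once
    waves = -(-num_ideas // concurrent)                      # ceiling division
    return waves * days_per_experiment

# 10 ideas, each hypothetically needing 1,000 GPUs for one day:
print(days_to_finish(10, 1000, 1.0, 2_000))    # 2,000-card cluster  -> 5.0 days
print(days_to_finish(10, 1000, 1.0, 10_000))   # 10,000-card cluster -> 1.0 day
```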

Grok 3 Base Model (comparable to DeepSeek V3, not a logical reasoning model like R1)

Why does Grok 3, as a general-purpose base model, report evaluation metrics only on mathematics, science, and code benchmarks? There is no comparison on general-capability benchmarks such as the commonly used MMLU, which is not standard practice for model comparisons. One can infer that Grok 3's general capabilities have not improved significantly over OpenAI's and DeepSeek's models, which is why those comparisons are not shown.

If one wants to enhance the mathematical, scientific, and coding capabilities of a base model, the difficulty is not great from either a methodological or a cost perspective. The current standard approach is the one used by DeepSeek V3: distill long CoT data for mathematics, code, and other logical problems, i.e., deep-thinking process data, from DeepSeek R1.

This means introducing deep-thinking long CoT data into the post-training phase of the base model, or even earlier into the pre-training phase (the so-called "left foot (DeepSeek base) stepping on the right foot (DeepSeek R1) to lift oneself up" pattern). This can significantly enhance the base model's mathematics- and code-related capabilities, which is what Grok 3 claims to possess with its "chain of thought reasoning and self-correction mechanism". The evaluation metrics will look better, and the total amount of distilled data need not be large (a few hundred thousand examples should be sufficient), so the cost and additional compute are both low.
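As a rough illustration of what such a distillation-and-mixing step might look like in outline (a minimal sketch under assumed names: `teacher_generate` stands in for querying an R1-style reasoning model and is not a real API; the mixing ratio is invented):

```python
# Minimal sketch: mix long-CoT data distilled from a reasoning model into an
# existing SFT (post-training) dataset. All names and numbers here are hypothetical.
from dataclasses import dataclass
import random

@dataclass
class Example:
    prompt: str
    response: str
    is_long_cot: bool

def teacher_generate(prompt: str) -> str:
    # Placeholder for calling a reasoning model and keeping its full thinking trace.
    return f"<think>step-by-step reasoning for: {prompt}</think> final answer"

def build_sft_mix(base_examples, reasoning_prompts, cot_fraction=0.2, seed=0):
    """Add distilled long-CoT examples to an SFT set at a fixed fraction of its size."""
    random.seed(seed)
    cot_pool = [Example(p, teacher_generate(p), True) for p in reasoning_prompts]
    n_cot = min(int(len(base_examples) * cot_fraction), len(cot_pool))
    mixed = base_examples + random.sample(cot_pool, n_cot)
    random.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    base = [Example(f"general prompt {i}", f"answer {i}", False) for i in range(10)]
    math_prompts = [f"math problem {i}" for i in range(5)]
    mix = build_sft_mix(base, math_prompts, cot_fraction=0.2)
    print(sum(e.is_long_cot for e in mix), "long-CoT examples out of", len(mix))
```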

OpenAI will soon release a non-reasoning model, GPT 4.5, which is likely based on a similar idea: distill CoT data from the o3 model and use this deep-thinking data to raise the IQ of the GPT 4.5 base model. The "left foot stepping on the right foot to lift oneself up" method will be the main means of improving base model capabilities in the future.

The computational consumption of Grok 3 is reportedly 10 times that of Grok 2. Under the Chinchilla Scaling Law, the compute-optimal practice would be for Grok 3's training data volume to increase roughly 3 times over Grok 2 and for the model size to increase roughly 3 times as well. (The current trend, however, is to shrink the model and enlarge the data, i.e., the "small model, large data" approach. Although this deviates from the compute-optimal recipe, a smaller model is better suited to online inference serving and reduces serving costs.) If the claim made at the launch event is true, i.e., Grok 3 consumed 10 times the computing power of Grok 2, then there are two possibilities.

One possibility is that the data volume increased substantially, which would mean a large amount of multimodal data was added; for example, the data volume could have grown from 10T to 30T tokens. (Currently, the data volume used by text models tops out at around 18T to 20T tokens, which is basically the limit; to grow much beyond that, multimodal data must be added. However, multimodal data does not help much in improving the intelligence of large models, so this increment should not be too large.) If this is the case, the model scale of Grok 3 increased by about 3 times.

The second possibility is that the training data volume did not grow much beyond 20T tokens. In that case, it can be inferred that Grok 3's model size is much larger than Grok 2's, at least 4 to 5 times larger (if there is little new data, the only way to absorb the extra compute is to enlarge the model). Either way, Grok 3's model size is definitely much larger than Grok 2's, and Grok 2 itself may not be small: the Grok 2 release page claimed performance exceeding Llama 3.1 405B, so neither its data nor its model size should be too small (if it is a dense model, 70B is the lowest plausible estimate). Therefore, Grok 3 is likely to be very large, estimated at somewhere between 200B and 500B parameters.
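The two scenarios can be sanity-checked with the usual compute approximation C ≈ 6·N·D (the token and parameter figures below are the article's guesses, not published numbers for Grok 2 or Grok 3):

```python
# Rough compute accounting under C ~= 6 * N * D.
# If total compute scales by c_mult and the token count by d_mult, the parameter
# count must scale by roughly c_mult / d_mult. All figures are hypothetical.
def implied_model_size_multiplier(c_mult: float, d_mult: float) -> float:
    return c_mult / d_mult

# Scenario 1: data grows 10T -> 30T tokens (3x), compute grows 10x.
print(implied_model_size_multiplier(10, 30 / 10))   # ~3.3x larger model
# Scenario 2: data barely grows, e.g. 18T -> 20T tokens, compute grows 10x.
print(implied_model_size_multiplier(10, 20 / 18))   # ~9x larger model ("at least 4-5x")
```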

It is clear that Grok 3 is still adopting the "traditional" approach of scaling up the base model size, which is the method analyzed in the "Scaling Law" section above to enhance the capabilities of the base model during the pre-training phase. As analyzed above, this approach has a very low cost-performance ratio. A more fashionable approach would be to focus the training on RL Scaling, which would have a much higher cost-performance ratio. But why would they engage in such a loss-making endeavor? A possible explanation will be provided later.

Grok 3 Logical Reasoning Version (Deep Thinking Version, corresponding to DeepSeek R1)

Setting aside the hands-on experience and judging solely by evaluation metrics, the deep thinking version of Grok 3 has reached or exceeded o3 mini, and it is indeed one of the best, if not the best, in terms of performance.

Returning to the question raised above: knowing that scaling up the pre-training model size has a low cost-performance ratio, why does Grok 3 still take this route? The underlying reason may be (this is an inference without direct evidence) that the effect of RL Scaling in the post-training phase is positively correlated with the size of the base model.

In other words, for the same compute spent in the RL phase, a larger base model yields a better RL-phase scaling effect. Only under this assumption is there a need to push the model size as far as possible during pre-training. We can suppose that Grok 3 adopts this power-hungry, seemingly low cost-performance approach in the hope that enlarging the base model will substantially strengthen the deep thinking version.

It also seems that DeepSeek R1 has received great reviews for its effectiveness and open-source release, yet when people actually try to use it, they find the base model too large, making deployment difficult and resource-hungry, which is not very friendly for downstream applications. So why does DeepSeek insist on promoting a model that is clearly oversized for downstream use? (The smaller distilled models look good on metrics, but their real-world performance appears considerably weaker.) Could it also be that if the base model is not large enough, the deep thinking model built on it will not perform as well?

If the above assumption holds, then among the three Scaling Laws (Pre-Train, RL, Test Time), the ranking by cost-effectiveness in improving model intelligence remains, from high to low: Test Time > RL > Pre-Train, as concluded earlier. But it would also imply that Test Time Scaling has the lowest ceiling, a ceiling that depends on the scaling capability of the RL phase, while the ceiling of RL-phase Scaling is the second lowest, depending in turn on scaling in the Pre-Train phase.
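One compact way to write down this hypothesized nesting (purely a restatement of the speculation above, not an established result; the symbols U, f, g are placeholders introduced here):

```latex
% Hypothesized nesting of ceilings: U_{TT}, U_{RL}, U_{PT} denote the attainable
% ceilings of Test Time, RL, and Pre-Train scaling; f and g are unknown increasing
% functions. Cost-effectiveness still ranks Test Time > RL > Pre-Train.
U_{TT} \;\le\; f\!\left(U_{RL}\right), \qquad
U_{RL} \;\le\; g\!\left(U_{PT}\right)
```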

If that is the case, then whenever the ceilings of RL and Test Time are reached, we can launch another round of scaling up the base model, which raises the ceiling of RL-phase Scaling, after which RL and Test Time can be scaled further to obtain a higher-IQ model. If this holds true, does it mean the solution for AGI is already complete, and that no new Scaling Laws are even needed?

The above reasoning rests on the premise that Grok 3's decision to spend so much compute on pushing up the model scale was the result of careful consideration or small-scale pilot experiments, rather than merely inertia from the old belief that more pre-training compute always yields better results. If that premise does not hold, neither does the reasoning above. In any case, all responsibility lies with Musk.

Author of this article: Zhang Junlin, Source: Tencent Technology, Original Title: "Grok 3 Uses 200,000 GPUs to Conduct an Experiment for the AI Community: Scaling Law Has Not Hit a Wall, but Pre-training Is Not Necessarily Better"

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article align with their specific circumstances. Any investment made on this basis is at your own risk.