
Elon Musk explains: how xAI built and launched a training cluster of 100,000 cards in 122 days

Elon Musk hosted the launch event for Grok 3, introducing its core features and the new "Deep Search" tool. The xAI team built the world's largest training cluster in 122 days, coordinating training across 100,000 H100 GPUs. The main challenge was keeping all of those GPUs working in concert so that the failure of any single GPU would not derail a training run. Musk stressed the team's engineering achievement in beating the timeline quoted by data center providers.
Yesterday at noon, Elon Musk hosted the highly anticipated launch event for the "world's strongest artificial intelligence" — Grok 3.
He appeared alongside xAI chief engineer Igor Babuschkin and co-founders Jimmy Ba and Yuhuai "Tony" Wu, detailing the core features of Grok 3, including its significantly enhanced reasoning, stronger natural language processing, and the newly launched "Deep Search" tool. Deep Search is designed to handle complex queries, combining web searches with real-time information from the X platform to give users more accurate and in-depth answers.
In response to the last audience question, Elon introduced how the xAI team achieved another engineering miracle: overcoming numerous challenges to build and launch the world's largest training cluster with 100,000 cards in just 122 days.
Audience Question:
What was the most difficult part of this project (Grok 3)? What excites you about it?
Jimmy Ba:
Looking back, I think the most difficult part was coordinating the training of the entire model across 100,000 H100 GPUs, which was almost like battling the ultimate boss of the universe: entropy. At any moment a cosmic ray can hit a transistor and flip a bit, and a single flipped bit in a gradient update can send the whole update awry.
And now we have 100,000 such GPUs that all have to work together on every step, while at any moment any one of them could fail.
Jimmy Ba, an assistant professor at the University of Toronto of Chinese descent and one of xAI's 12 founding employees, studied under AI pioneer Geoffrey Hinton.
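The talk does not describe xAI's actual fault-handling machinery, but one common defensive pattern against the kind of corruption Ba describes is to validate gradients before applying them and skip any step that contains non-finite values. Below is a minimal PyTorch-style sketch of that idea, purely for illustration; real multi-node jobs would also rely on collective health checks, ECC memory, and checkpoint/restart, none of which are shown here.

```python
import torch

def safe_step(model, optimizer, loss):
    """Apply an optimizer step only if every gradient is finite.

    Illustrative sketch only: it is not xAI's method, just a common
    guard against a single corrupted value poisoning the update.
    """
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # A corrupted value (e.g. from a flipped bit) typically shows up
    # as NaN or Inf somewhere in the gradients.
    grads_ok = all(
        p.grad is None or torch.isfinite(p.grad).all()
        for p in model.parameters()
    )

    if grads_ok:
        optimizer.step()
    else:
        # Skip the update rather than let one bad gradient derail training.
        print("non-finite gradient detected; skipping this step")
    return grads_ok
```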
Elon Musk:
Yes, it’s worth breaking down how we got the world’s most powerful training cluster up and running in 122 days.
Initially, we didn't actually plan to build a data center ourselves. We approached data center providers and asked how long it would take to get 100,000 GPUs running in one place. They quoted 18 to 24 months. We thought, 18 to 24 months means failure is inevitable. So the only option was to do it ourselves.
So we broke the problem down. For example, we needed a building, and we couldn't construct one in time, so we had to use an existing one. We basically looked for abandoned factories in good condition, such as plants that had shut down when their company went bankrupt.
We found an Electrolux factory in Memphis. That's why it's in Memphis—home of Elvis Presley and one of the capitals of ancient Egypt.
It's actually a very nice factory; I don't know why Electrolux left, but it provided shelter for our computers.
Then, we needed power, initially at least 120 megawatts, but the building itself only had 15 megawatts. Ultimately, for 200,000 GPUs, we needed 0.25 gigawatts of power.
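Those figures roughly add up as a back-of-the-envelope estimate. The per-GPU numbers below are assumptions for illustration, not figures stated at the event: an H100 SXM board draws on the order of 700 W, and host CPUs, networking, cooling, and power-conversion losses add substantial overhead.

```python
# Back-of-the-envelope cluster power estimate. The overhead factor is an
# assumption for illustration, not a figure from the talk.
GPU_TDP_W = 700   # approximate H100 SXM board power
OVERHEAD = 1.8    # assumed multiplier for CPUs, networking, cooling, PSU losses

def cluster_power_mw(num_gpus: int) -> float:
    """Rough total facility power in megawatts."""
    return num_gpus * GPU_TDP_W * OVERHEAD / 1e6

print(cluster_power_mw(100_000))  # ~126 MW, near the initial 120 MW figure
print(cluster_power_mw(200_000))  # ~252 MW, i.e. about 0.25 GW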
We initially rented a bunch of generators. On one side of the building, we had rows of generators until we could connect to the utility power.
Then, we also needed cooling. So on the other side of the building, we had rows of cooling equipment. We rented about a quarter of the mobile cooling capacity in the United States.
Next, we needed to install all the GPUs, which are liquid-cooled; liquid cooling is what makes the required density possible. So we had to run all the piping for the liquid cooling system. No one had ever built a liquid-cooled data center at this scale before.
This is the result of a very talented team putting in a tremendous effort.
You might think, it should be up and running now, right?
No. The problem is that the power draw of a GPU cluster fluctuates severely; it's like a huge symphony. Imagine a symphony with 100,000 or 200,000 players, where the whole orchestra goes from quiet to loud in 100 milliseconds. That produces huge power swings, which in turn send the generators out of control; they were never designed for this.
To buffer these swings, we used Tesla Megapacks to smooth the power. The Megapacks had to be reprogrammed, so xAI worked with Tesla to reprogram them to handle these severe fluctuations and deliver smooth power so the computers could run properly.
This method worked, although the process was quite complex.
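Musk does not go into the Megapack control logic, so the following is only a toy illustration of the general idea of battery buffering: the generator side follows a slowly varying setpoint while the battery supplies or absorbs the fast swings from the cluster.

```python
# Toy illustration of battery buffering (not Tesla's actual control logic):
# the generator tracks a slow moving average of demand while the battery
# absorbs the fast load swings from the GPU cluster.

def smooth_load(cluster_load_mw, alpha=0.05):
    """Yield (generator_mw, battery_mw) pairs for each time step."""
    gen = cluster_load_mw[0]
    for load in cluster_load_mw:
        # Generator output ramps slowly (exponential moving average).
        gen += alpha * (load - gen)
        # Battery covers the difference: positive = discharging, negative = charging.
        battery = load - gen
        yield gen, battery

# Cluster jumps from idle (20 MW) to full load (120 MW) in a single step,
# far faster than generators can ramp.
load_profile = [20.0] * 10 + [120.0] * 10
for gen, batt in smooth_load(load_profile):
    print(f"generator {gen:6.1f} MW  battery {batt:+6.1f} MW")
```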
[Photo: Tesla Megapacks arriving in Memphis]
But even then, we still had to get all the computers communicating properly and resolve every networking issue. We debugged countless network cables and were chasing network card problems at 4 a.m. We resolved the issue around 4:20 a.m.
We found many problems, one of which was a BIOS mismatch.
Igor Babuschkin:
That's right, the BIOS was not set correctly. We had to compare the output of the lspci command (note: a Linux command used to list all PCI devices in the system) between two different machines. One was working fine, and the other was not. There were many other issues as well.
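A simple way to hunt for that kind of mismatch, sketched here as a generic approach rather than xAI's actual tooling, is to capture the PCI device listing from a healthy node and a misbehaving one and diff them. The hostnames below are placeholders.

```python
# Sketch of comparing PCI device listings between a healthy and a faulty node
# (illustrative only; hostnames are placeholders).
import subprocess
import difflib

def lspci_over_ssh(host: str) -> list[str]:
    """Return the lspci output of a remote host as a list of lines."""
    result = subprocess.run(
        ["ssh", host, "lspci", "-vv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

good = lspci_over_ssh("node-good")
bad = lspci_over_ssh("node-bad")

# Print only the lines that differ, e.g. link width/speed or missing devices.
for line in difflib.unified_diff(
    good, bad, fromfile="node-good", tofile="node-bad", lineterm=""
):
    print(line)
```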
Elon Musk:
Yes, that's correct. If we really listed all the problems, it would take a long time. But it's interesting; it's not like we magically made it happen.
We had to break down the problems into their components, just like Grok does when reasoning, and then solve each component to complete a coordinated training cluster in a much shorter time than others.
Igor Babuschkin:
Then, once the training cluster was up and running and ready for delivery, we had to ensure it remained healthy throughout the process, which was a huge challenge in itself.
Then, we had to get every detail of the training right to reach a Grok 3-level model, which is actually very, very difficult.
We don't know whether other models have reached Grok 3's level, but anyone who trains a model better than Grok 3 has to excel at every aspect of deep learning science and engineering.
It's not easy to achieve that.
Author: Rubble Villager. Original title: "Musk Explains: How xAI Built and Launched a 100,000-Card Training Cluster in 122 Days"