The threat DeepSeek poses to the United States is intensifying. Just yesterday, DeepSeek's daily active users reached 23% of ChatGPT's, with daily app downloads nearing 5 million!

a16z co-founder Marc Andreessen weighed in with a post as well.

Who would have thought that a key contributor to DeepSeek could have stayed in the United States? Recently, a Harvard University professor pointed out a striking fact: the fourth member of DeepSeek's multimodal team could have received a full-time offer from NVIDIA. In the end, however, he chose to return to China and join DeepSeek. The result: the United States' dominant position in AI has been shaken, related companies have lost roughly a trillion dollars in market value, and the global AI landscape has been upended. Is this outcome a twist of fate, or an inevitability?

The U.S. Missed DeepSeek, Letting Another "Qian Xuesen" Return Home

Recently, political scientist, Harvard University professor, and former U.S. Assistant Secretary of Defense for Policy and Plans Graham Allison asked on X: "Who missed out on DeepSeek?" He lamented that DeepSeek has upended assumptions about the U.S. position in AI, and that the U.S. once had the opportunity to retain one of DeepSeek's key employees, Zizheng Pan:

(DeepSeek surpassing OpenAI's models) has overturned much of our understanding of U.S. AI dominance. It is also a vivid reminder of how seriously the U.S. must take attracting and retaining talent, including talent from China.

Zizheng Pan was the fourth multimodal engineer on DeepSeek's team and played an important role in developing DeepSeek's R1 model. Before returning to China, he had interned at NVIDIA for four months and received a full-time offer from the company. Graham Allison believes Pan made this decision because Silicon Valley companies failed to offer him comparable opportunities in the U.S.

This "brain drain" pains Allison deeply; he even elevated Zizheng Pan's return to the level of Qian Xuesen's return. Super talents like Qian Xuesen, Jensen Huang, and Elon Musk can vote with their feet, pursuing their talents and ambitions anywhere. He believes the United States should do its utmost to avoid this kind of loss:

American university coaches search for and recruit the most talented athletes in the world. In the context of Sino-American technological competition, the U.S. should make every effort to avoid losing more talents like Qian Xuesen and Pan Zizheng.

NVIDIA's Regret Over Losing Talent

After DeepSeek surpassed ChatGPT to top the App Store, Zhiding Yu, a senior research scientist at NVIDIA, shared his thoughts on former intern Pan Zizheng's decision to return to China, expressing happiness about Pan's achievements and offering his views on AI competition:

In the summer of 2023, Zizheng was an intern at NVIDIA. Later, when we were considering whether to offer him a full-time position, he chose to join DeepSeek without hesitation. At that time, DeepSeek's multimodal team had only three members.

I am still impressed by Zizheng's decision back then. At DeepSeek he made significant contributions, participating in several key projects including DeepSeek-VL2, DeepSeek-V3, and DeepSeek-R1. I am personally very pleased with his decision and with what he has achieved.

Zizheng's case is typical of what I have seen in recent years. Many of the best talents come from China, and these talents do not have to succeed only at American companies.
On the contrary, we have learned a lot from them. As early as 2022, a similar "Sputnik moment" occurred in the field of autonomous vehicles (AV), and it will keep happening in the robotics and large language model (LLM) industries.

I love NVIDIA and hope to see it remain a major driving force in the development of AGI and general autonomous systems. But if we keep weaving geopolitical agendas and creating hostility toward Chinese researchers, we will only undermine our own future and lose more of our competitiveness. We need more outstanding talent, higher professional standards, stronger learning ability, more creativity, and stronger execution.

Pan Zizheng is a co-author of DeepSeek-VL2.

When DeepSeek surpassed ChatGPT to top the App Store download chart, Pan Zizheng shared his feelings on X.

Pan Zizheng joined DeepSeek as a full-time researcher in 2024. Before that, he worked as a research intern in the AI algorithms group at NVIDIA. In 2021 he joined Monash University's ZIP Lab to pursue a PhD in Computer Science under the supervision of Professors Bohan Zhuang and Jianfei Cai. Prior to that, he obtained a Master's degree in Computer Science from the University of Adelaide and a Bachelor's degree in Software Engineering from Harbin Institute of Technology (Weihai). During his PhD, his research focused on the efficiency of deep neural networks, including model deployment, Transformer architecture optimization, attention mechanisms, inference acceleration, and memory-efficient training.

Lex Fridman's Hardcore Podcast on How China's AI Rising Star Is Shaking Up the Global Landscape

Recently, Lex Fridman released a five-hour podcast episode featuring AI2 model-training expert Nathan Lambert and SemiAnalysis hardware expert Dylan Patel. In this information-packed discussion, they focused entirely on DeepSeek: how this Chinese AI rising star is shaking up the global landscape, the double-edged sword of the MoE architecture plus MLA, the open-source push by DeepSeek that is forcing the industry toward openness, and the hardware wizardry of Chinese-style extreme optimization.

Did DeepSeek Use OpenAI's Data or Not?

This time the discussion was sharp, going straight at the core questions. For example: did DeepSeek actually use OpenAI's data?

OpenAI has publicly claimed that DeepSeek distilled its models. The Financial Times put it bluntly: "OpenAI has evidence that DeepSeek used their models for training." Does that claim hold up, morally and legally?

Although OpenAI's terms of service forbid users from using its models' outputs to build competitors, the guests saw this so-called rule as a reflection of OpenAI's own hypocrisy. Lex Fridman noted that OpenAI, like most companies, has been training on data scraped from the internet without permission and profiting from it. The guests agreed that OpenAI's claim that DeepSeek trained on its model is an attempt to divert attention and protect its own monopoly.

Moreover, in the past few days many people have distilled DeepSeek's model into Llama: R1 performs very complex reasoning, while Llama is easy to serve. Is this illegal?
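Mechanically, this kind of distillation is nothing exotic: you sample prompts, collect the stronger model's answers (and, for R1, its exposed reasoning traces), and run ordinary supervised fine-tuning on a smaller model. The sketch below is a minimal, illustrative version of that recipe, not anyone's actual pipeline: it assumes DeepSeek's OpenAI-compatible API (base URL https://api.deepseek.com, model name deepseek-reasoner) and Hugging Face's trl for fine-tuning; the prompts, the student checkpoint, and all hyperparameters are placeholders.

```python
# Illustrative only: distill R1-style outputs into a small Llama via supervised fine-tuning.
# Endpoint, model names, and field names are assumptions based on public docs and may change.
import json, os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")  # DeepSeek's OpenAI-compatible endpoint

prompts = ["Prove that the square root of 2 is irrational.",
           "Write a Python function that merges two sorted lists."]  # placeholder prompts

with open("r1_traces.jsonl", "w") as f:
    for p in prompts:
        resp = client.chat.completions.create(model="deepseek-reasoner",
                                              messages=[{"role": "user", "content": p}])
        msg = resp.choices[0].message
        # R1's API exposes the chain of thought separately (reasoning_content); keep it if present.
        thought = getattr(msg, "reasoning_content", None) or ""
        target = f"<think>{thought}</think>\n{msg.content}" if thought else msg.content
        f.write(json.dumps({"messages": [{"role": "user", "content": p},
                                         {"role": "assistant", "content": target}]}) + "\n")

# Ordinary SFT on the collected traces (the student model below is just a placeholder).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="r1_traces.jsonl", split="train")
trainer = SFTTrainer(model="meta-llama/Llama-3.2-1B-Instruct",
                     train_dataset=dataset,
                     args=SFTConfig(output_dir="llama-r1-distill"))
trainer.train()
```

In practice people scale the prompt set to hundreds of thousands of examples and filter the traces, but the structure is the same: collect teacher outputs, then fine-tune the student on them.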
Why Is DeepSeek's Training Cost So Low?

Dylan Patel said DeepSeek's low cost comes down to two key technologies: one is MoE (mixture of experts), and the other is MLA (multi-head latent attention).

The advantage of the MoE architecture is that the model can embed data into a larger parameter space, while during training and inference only a fraction of the parameters need to be activated, greatly improving efficiency. The DeepSeek model has over 600 billion parameters, versus 405 billion for Llama 405B, so in terms of parameter scale it has more room to compress and store world knowledge. At the same time, the DeepSeek model only activates about 37 billion parameters per token, so only those 37 billion need to be computed during training or inference, whereas the dense Llama 405B model must activate all 405 billion parameters for every inference step. (A small numeric sketch of this appears at the end of this section.)

MLA is mainly used to reduce memory usage during inference, and it is also used during training; it relies on some clever low-rank approximation techniques.

Nathan Lambert said that a close look at the attention details shows how much effort DeepSeek has put into the implementation. Beyond the attention mechanism itself, language models have other components, such as the positional embeddings used to extend context length; DeepSeek uses rotary position embeddings (RoPE). Combining RoPE with MLA is not straightforward: it requires a series of extra operations, such as applying rotations to the query and key matrices, which means additional matrix multiplications. The need for these clever designs significantly increases the complexity of implementing DeepSeek's MLA architecture, and the fact that DeepSeek integrated these techniques successfully shows it is at the forefront of efficient language-model training.

Dylan Patel added that DeepSeek has also found ways to make training itself more efficient. One method is to bypass NVIDIA's NCCL library and schedule the communication between GPUs themselves. What sets DeepSeek apart is that they manage GPU communication by scheduling specific streaming multiprocessors (SMs): they control exactly which SMs handle model computation and which handle all-reduce or all-gather communication, switching between them dynamically. This requires extremely advanced programming skill.
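To put rough numbers on the MoE sparsity described above, here is a minimal parameter-counting sketch. The layer shapes below are illustrative assumptions, not DeepSeek's published configuration; the point is only how a model can be huge in total while touching a small slice of its weights per token.

```python
# Illustrative MoE parameter accounting (invented shapes, not DeepSeek's real config).

def gated_ffn_params(d_model: int, d_ff: int) -> int:
    # A gated feed-forward block (e.g. SwiGLU) has three weight matrices: gate, up, down.
    return 3 * d_model * d_ff

d_model   = 7168    # hidden size (assumed)
n_layers  = 60      # transformer blocks (assumed)
n_experts = 256     # routed experts per MoE layer (assumed)
top_k     = 8       # experts activated per token (assumed)
d_ff      = 2048    # per-expert FFN width (assumed; far narrower than a dense FFN)

total_expert_params  = n_layers * n_experts * gated_ffn_params(d_model, d_ff)
active_expert_params = n_layers * top_k     * gated_ffn_params(d_model, d_ff)

print(f"routed-expert parameters, total : {total_expert_params / 1e9:6.1f} B")
print(f"routed-expert parameters, active: {active_expert_params / 1e9:6.1f} B")
print(f"active fraction of experts      : {active_expert_params / total_expert_params:.1%}")
# Attention weights, shared experts, and embeddings are always active and add a dense
# contribution on top, which is why reported "active parameter" figures (such as the
# ~37B quoted above) are larger than the routed-expert number computed here.
```

With these toy numbers the totals land in the same ballpark as the figures quoted above: several hundred billion parameters stored, but only a few tens of billions exercised per token.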
Why DeepSeek Is So Cheap

Among the companies claiming to offer R1 as a service, prices are far higher than DeepSeek's own API, and most of the services barely work, with very low throughput. What stunned the guests is that China not only achieved this capability but did so at such a low price. (R1 is roughly 27 times cheaper than o1; the arithmetic is sketched at the end of this section.)

Training is cheap for the reasons covered above; why is inference also so cheap? First, because of DeepSeek's innovations in model architecture. The MLA attention mechanism is fundamentally different from standard Transformer multi-head attention. This multi-head latent attention can cut the memory used by the attention mechanism by roughly 80% to 90%, which is especially helpful for long contexts.

Moreover, there is a huge gap between DeepSeek's and OpenAI's serving costs, partly because OpenAI runs a very high profit margin on inference, with gross margins above 75%. OpenAI is losing money overall because it has spent so much on training, so its inference margins have to be very high.

Then came the highlight: the guests let their imaginations run and wondered whether it was all a conspiracy, with DeepSeek carefully timing this release and its pricing to short NVIDIA and U.S. tech stocks just as Stargate was announced... That speculation was immediately shot down, with Dylan Patel noting that DeepSeek simply wanted to ship as fast as possible before the Lunar New Year and had no grand plan; otherwise, why release V3 the day after Christmas?
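As for the "27 times cheaper" figure quoted above, it falls straight out of the per-token list prices. The numbers below are the prices as publicly listed around R1's launch (o1 at roughly $15/$60 per million input/output tokens, R1 at roughly $0.55/$2.19); treat them as assumptions that may since have changed.

```python
# Rough per-request cost comparison at launch-time list prices (assumed, may have changed).
O1_IN, O1_OUT = 15.00, 60.00   # USD per million input / output tokens (assumed)
R1_IN, R1_OUT = 0.55, 2.19     # USD per million input / output tokens (assumed)

def cost(in_tok: int, out_tok: int, p_in: float, p_out: float) -> float:
    return in_tok / 1e6 * p_in + out_tok / 1e6 * p_out

# A reasoning-heavy request: short prompt, long chain of thought plus answer.
in_tok, out_tok = 1_000, 20_000
o1 = cost(in_tok, out_tok, O1_IN, O1_OUT)
r1 = cost(in_tok, out_tok, R1_IN, R1_OUT)
print(f"o1: ${o1:.3f}   R1: ${r1:.3f}   ratio: {o1 / r1:.1f}x")  # ~27x with these prices
```

Output tokens dominate for reasoning models, so the ratio essentially tracks the gap in output pricing (60 / 2.19 is about 27).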
China's Industrial Capacity Has Far Surpassed That of the United States

The United States undoubtedly leads China in GPUs and other chips. But can GPU export controls completely stop China? It seems unlikely. Dylan Patel believes the U.S. government knows this too, while Nathan Lambert thinks China will build its own chips.

China may have more talent, more STEM graduates, and more programmers. The U.S. can certainly draw on talent from around the world, but that does not necessarily give it an extra edge. What really matters is computing power.

The total amount of electricity China has at its disposal is staggering. China's steel mills alone are on the scale of the entire U.S. industrial sector, on top of aluminum plants that consume enormous amounts of electricity. Even if the U.S. Stargate project is actually completed and reaches 2 gigawatts, that is still less than China's largest industrial facilities. Put it this way: if China decides to build the world's largest data center, it can do so right away, as long as it has the chips. It is a matter of time, not capability.

Right now, the inputs needed to build data centers, such as power generation, transmission, substations, and transformers, will constrain the U.S. from building ever-larger training systems and deploying more inference compute. By contrast, if China keeps betting on scaling laws as firmly as U.S. executives like Nadella, Zuckerberg, and Pichai do, it could get there even faster than the U.S.

That is why, to slow China's AI development and make sure AGI cannot be trained there at scale, the U.S. has rolled out a series of bans, aiming to choke off the entire semiconductor supply chain by restricting exports of key inputs such as GPUs and lithography machines.

Can OpenAI's o3-mini Catch Up with DeepSeek R1?

Next, the guests put several star reasoning models to the test. Interestingly, Google's Gemini Flash Thinking beats R1 on both price and performance and was released in early December last year, yet nobody seems to care... The guests felt that its behavior is less expressive than o1's and its range of applications is narrower. o1 may not be perfect for every specific task, but it is more flexible and versatile.

Lex Fridman said one thing he personally loves about R1 is that it displays its complete chain-of-thought tokens. On open-ended philosophical questions, for humans who appreciate intelligence, reasoning, and the capacity for reflection, reading R1's raw chain-of-thought tokens has a unique aesthetic appeal; the non-linear thinking is reminiscent of James Joyce's stream-of-consciousness novels "Ulysses" and "Finnegans Wake", which is fascinating. By contrast, o3-mini feels smart and fast but unremarkable, often rather mediocre, lacking depth and novelty.

Inference costs have fallen exponentially from GPT-3 to GPT-3.5 and on to Llama. DeepSeek R1 is the first reasoning model to reach such a low cost, which is an impressive achievement, but its cost is still within the range experts expected. Going forward, with innovations in model architecture, higher-quality training data, more advanced training techniques, and more efficient inference systems and hardware (such as next-generation GPUs and ASICs), the inference cost of AI models will keep falling. Ultimately, that is what will unlock the potential of AGI.

Who Will Win the AGI Race?

Finally, the guests made predictions about who will ultimately win the AGI race. Google looks like the frontrunner thanks to its infrastructure advantage, but in the court of public opinion OpenAI appears to be the leader: it has taken the lead in commercialization and has the highest revenue in the AI field today.

Who is actually making money in AI right now? Has anyone turned a profit? After some analysis, the guests concluded from the financial statements that Microsoft is already profitable on AI, though it has poured huge capital expenditure into infrastructure; Google and Amazon are in a similar position. Meta's massive profits come from its recommendation systems, not from large models like Llama. Anthropic and OpenAI clearly have not turned a profit yet; otherwise they would not need to keep raising money. Viewed purely in terms of revenue versus cost, though, GPT-4 has already become profitable, since its training cost was only a few hundred million dollars.

In the end, no one can say whether OpenAI will suddenly stumble. For now, all of these companies will keep raising money, because once AGI arrives, the returns from AI will be immeasurable. People may not need OpenAI to spend billions of dollars developing "the next state-of-the-art model"; a ChatGPT-level AI service may be enough. Reasoning, code generation, AI agents, and computer use are where the real value of AI will lie; those who do not push hard may be eliminated by the market.

Source: New Intelligence. Original title: "NVIDIA regrets losing key talent from DeepSeek? The U.S. lets AI's 'Qian Xuesen' go, and a Harvard professor is heartbroken."

Risk Warning and Disclaimer

Markets carry risk, and investment requires caution. This article does not constitute personal investment advice and does not take into account individual users' specific investment goals, financial situation, or needs. Users should consider whether any opinions, views, or conclusions in this article fit their particular circumstances. Investing on this basis is at one's own risk.