
DeepSeek Open Source Week Review #2: DeepSeek open-sources in the morning, Nvidia integrates in the afternoon

On day two of Open Source Week, DeepSeek released DeepEP, its MoE expert-parallel (EP) communication library: efficient all-to-all communication, intra-node NVLink and inter-node RDMA support, high-throughput kernels, and low-latency inference kernels. Nvidia integrated it into Megatron-LM that same afternoon, a measure of DeepSeek's impact on the Nvidia ecosystem. Internally, Nvidia reportedly treats DeepSeek support as a key project, with higher priority than Llama.
Today brings the second installment of DeepSeek's Open Source Week, and as expected, the highly anticipated MoE EP communication implementation, DeepEP, has been open-sourced, supporting the following features:
✅ Efficient and optimized all-to-all communication
✅ Both intranode and internode support with NVLink and RDMA
✅ High-throughput kernels for training and inference prefilling
✅ Low-latency kernels for inference decoding
✅ Native FP8 dispatch support
✅ Flexible GPU resource control for computation-communication overlapping
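To make the dispatch/combine pattern concrete, here is a minimal sketch of what EP all-to-all communication does, using plain `torch.distributed` (NCCL backend) with simplified top-1 routing as a stand-in. This is not DeepEP's actual API; the function and parameter names are illustrative only:

```python
import torch
import torch.distributed as dist

def moe_dispatch_combine(tokens: torch.Tensor,     # [num_tokens, hidden]
                         dest_rank: torch.Tensor,  # [num_tokens], rank hosting each token's expert
                         world_size: int) -> torch.Tensor:
    """Hypothetical top-1 MoE dispatch/combine sketch: send each token to the
    rank hosting its expert, run the expert there, then return the outputs."""
    # Sort tokens by destination rank so each rank's send slice is contiguous.
    order = torch.argsort(dest_rank)
    tokens_sorted = tokens[order]
    send_counts = torch.bincount(dest_rank, minlength=world_size)

    # Exchange per-rank token counts (one scalar to/from every rank).
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Dispatch: the all-to-all that DeepEP accelerates over NVLink/RDMA.
    recv = tokens.new_empty(int(recv_counts.sum()), tokens.shape[-1])
    dist.all_to_all_single(recv, tokens_sorted,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    # ... the local expert FFN would run on `recv` here ...

    # Combine: the reverse all-to-all returns expert outputs to token owners.
    combined = torch.empty_like(tokens_sorted)
    dist.all_to_all_single(combined, recv,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    # Undo the sort to restore the original token order.
    out = torch.empty_like(combined)
    out[order] = combined
    return out
```

DeepEP's contribution is replacing these generic all-to-alls with fused kernels that exploit NVLink within a node and RDMA across nodes, dispatch in FP8, and overlap the communication with expert computation.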
To quote one expert's comment: the team behind this communication library is world-class, exactly what you would expect from Tsinghua supercomputing-team alumni who have also interned at NV, not something ordinary engineers could produce:
- Master-level understanding of synchronization mechanisms
- Keenly aware of minimizing the number of read/write instructions, using 64-/128-bit load/store instructions wherever possible
- Avoids going through the CPU's NIC driver path as much as possible
- Uses the extremely niche OpenSHMEM-style communication library (NVSHMEM)
- Drives communication directly from NV's SMs (streaming multiprocessors)
- Possibly understands NV's low-level architecture better than many people at NV
In line with our analysis from yesterday and today, DeepSeek's open-sourcing, especially of its infrastructure, has in the short term greatly strengthened NV's ecosystem moat, handing NV something of an effortless win. For example, DeepEP was open-sourced this morning; by the afternoon, Nvidia had already integrated it into Megatron-LM. It is understood that Jensen Huang has internally raised the priority of supporting DeepSeek above Llama, making it NV's most important open-source project, with internal resources and processes green-lit throughout. DeepSeek, in turn, is deeply optimized for Nvidia GPUs, for example driving communication directly from the SMs, something AMD's GPUs do not support...
Coincidentally, Nvidia also published DeepSeek R1 adaptation results for the B200 today, reaching 21,088 tokens/s. The B200's 8 TB/s memory bandwidth plus FP4 theoretically yields about a 3.33x improvement over the H200, which roughly matches the figures in the official table. With further optimization from NV, TPS should improve further still. Interestingly, NV states that FP4 accuracy is only 0.2% lower than FP8; we look forward to further benchmarks.
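A quick back-of-the-envelope check of that 3.33x figure, assuming the H200's 4.8 TB/s HBM bandwidth and the usual 2x throughput of FP4 over FP8 in bandwidth-bound decoding:

$$
\frac{8\ \mathrm{TB/s}}{4.8\ \mathrm{TB/s}} \times 2 \;\approx\; 1.67 \times 2 \;\approx\; 3.33\times
$$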
However, if models cannot keep scaling up, one implication of DeepEP is already clear: the communication bottleneck of sparse MoE lies in RDMA scale-out rather than scale-up, which may erode NVLink's hardware moat.
There is also news today, via Reuters, that DeepSeek R2 was originally planned for release in the coming months, and the company now hopes to launch it as early as possible.
We have also argued that the release of NSA (Native Sparse Attention) is further preparation for long-context and long-CoT workloads. DeepSeek's experiments likewise show that NSA is both better and faster than traditional full attention on long sequences! This is another piece of infrastructure-level groundwork for R2 and V4.
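For intuition only, the sketch below shows generic top-k block-sparse attention in PyTorch; note that this is not NSA's actual algorithm (which combines compression, selection, and sliding-window branches), and `block`/`topk` are illustrative parameters. The point it conveys is why sparsity wins on long sequences: each query block attends to a fixed number of selected key blocks, so cost grows with `topk * block` rather than the full sequence length.

```python
import torch

def blocksparse_attention(q, k, v, block=64, topk=4):
    """Generic top-k block-sparse attention sketch (NOT NSA's algorithm).
    q, k, v: [T, d] with T divisible by `block` and T // block >= topk."""
    T, d = q.shape
    nb = T // block
    kb = k.view(nb, block, d).mean(dim=1)  # coarse per-block key summaries
    qb = q.view(nb, block, d).mean(dim=1)  # coarse per-block query summaries
    sel = (qb @ kb.T).topk(topk, dim=-1).indices  # top-k key blocks per query block

    out = torch.empty_like(q)
    k_blocks = k.view(nb, block, d)
    v_blocks = v.view(nb, block, d)
    for i in range(nb):
        ks = k_blocks[sel[i]].reshape(-1, d)  # gather only the selected keys
        vs = v_blocks[sel[i]].reshape(-1, d)
        qi = q[i * block:(i + 1) * block]
        attn = torch.softmax(qi @ ks.T / d ** 0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ vs
    return out
```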
According to our understanding, R1 is actually a relatively "rough" piece of work; by refining CoT and data preparation along the lines of the o1-to-o3 progression, a significant leap in intelligence can be expected in the next version. DeepSeek R2 is expected to reach the o3 level, and on coding it may plausibly reach the level of Claude 3.5 Sonnet. Judging from the previously published o3-versus-o1 capability comparison, if such a powerful model is again open-sourced, it will undoubtedly have a huge impact on downstream applications and the entire model ecosystem.
We look forward to DeepSeek's V4 and R2.
Risk Warning and Disclaimer
Markets carry risk, and investment requires caution. This article does not constitute personal investment advice, nor does it take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article suit their particular circumstances. Any investment made on this basis is at one's own risk.