
Alibaba's DeepSeek moment! New open-source architecture model: inference 10 times faster, training cost cut by 90%

This morning, Alibaba open-sourced a new-architecture model, Qwen3-Next-80B-A3B, which adopts a hybrid attention mechanism and a highly sparse MoE, cutting training costs by 90% compared with Qwen3-32B and improving inference efficiency roughly tenfold. The model excels at handling ultra-long texts, with performance comparable to Alibaba's flagship Qwen3-235B and surpassing Google's Gemini-2.5-Flash, making it one of the strongest low-energy open-source models available. Netizens have praised the architecture, calling its design outstanding.
At 2 a.m. today, Alibaba open-sourced its new architecture model Qwen3-Next-80B-A3B, making significant innovations in hybrid attention mechanisms, high sparsity MoE, training methods, and more, marking its own DeepSeek moment.
Qwen3-Next is a mixture-of-experts model with 80 billion total parameters, of which only about 3 billion are activated per token. Its training costs drop by 90% compared with Qwen3-32B, while inference efficiency increases roughly tenfold, especially for ultra-long prompts beyond 32K tokens.
In terms of performance, Qwen3-Next's instruction-tuned model rivals Alibaba's flagship Qwen3-235B on reasoning and long-context tasks, while its reasoning model surpasses Google's latest Gemini-2.5-Flash reasoning model, making it one of the strongest low-energy open-source models currently available.
Online experience: https://chat.qwen.ai/
Open-source address: https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
https://modelscope.cn/collections/Qwen3-Next-c314f23bd0264a
Alibaba API: https://www.alibabacloud.com/help/en/model-studio/models#c5414da58bjgj
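For readers who want to try the open-source weights locally, below is a minimal sketch using the Hugging Face transformers library. The checkpoint id, dtype, and device settings are assumptions for illustration; confirm the exact repository name and required transformers version on the model card linked above.

```python
# Minimal local-inference sketch. The repository id below is an assumption;
# check the Hugging Face collection linked above for the exact checkpoint name
# and the minimum transformers version it requires.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # shard across available GPUs (requires accelerate)
)

messages = [{"role": "user", "content": "Summarize the Qwen3-Next architecture in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```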
Netizens have praised the new model's architecture. One wrote that just half a year ago they had a similar discussion with a co-founder; at the time it was called something like "dynamic weight attention," though they can't recall the exact name. This design is truly outstanding!
Yesterday, I tested several models: ChatGPT-5 in thinking mode, Claude-4, and Grok-4 in expert mode. I just tested Qwen3-Next as well. Of all these models, only yours gave me the correct answer on the first attempt. It's really outstanding!
This model has even beaten Google's Gemini-2.5-Flash; the future looks promising.
Seeing DeltaNet applied here is really surprising! I'm curious: if we switched to the model architecture proposed in the paper about AlphaGo moments, how would this model's performance change?
With 80 billion parameters, ultra-high sparsity, and multi-token prediction, this configuration is stunning! If your GPU has enough memory, running it will definitely be very fast.
Overall, overseas users are very satisfied with Alibaba's innovative model and have offered plenty of praise.
Brief introduction to the Qwen3-Next architecture
Alibaba believes that expanding context length and total parameter count are the two core trends in the future development of large models. To further improve training and inference efficiency in long-context, large-parameter scenarios, it designed a brand-new model architecture called Qwen3-Next.
Compared with Qwen3's MoE structure, Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse MoE structure, optimizations that improve training stability, and a multi-token prediction mechanism that enables faster inference.
In terms of core features, Qwen3-Next adopts an innovative hybrid architecture of gated DeltaNet + gated attention. Linear attention breaks the quadratic complexity of standard attention and is better suited to long-context processing, but relying solely on either linear attention or standard attention has limitations.
Linear attention is fast but has weak recall, while standard attention is costly and slow during inference. Systematic experiments show that the in-context learning ability of gated DeltaNet outperforms commonly used alternatives such as sliding-window attention and Mamba2. Combining it with standard attention in a 3:1 ratio, with 75% of layers using gated DeltaNet and 25% retaining standard attention, consistently outperforms either architecture alone, improving both performance and efficiency.

The standard attention layers also receive several enhancements: an output gating mechanism from earlier research to mitigate low-rank attention issues, an increase in per-head dimension from 128 to 256, and rotary position encoding applied only to the first 25% of position dimensions to improve long-sequence extrapolation.
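To make the 3:1 mix concrete, here is an illustrative Python sketch of how such a layer stack could be laid out. The total depth and the exact interleaving pattern are assumptions for illustration, not Alibaba's released implementation.

```python
# Illustrative 3:1 hybrid layer layout: 75% gated DeltaNet (linear attention),
# 25% gated standard attention. Layer count and ordering are assumptions.
NUM_LAYERS = 48          # assumed total depth for this example
FULL_ATTN_EVERY = 4      # one standard-attention layer out of every four -> 25%

def build_layer_types(num_layers: int, full_attn_every: int) -> list:
    """Return the per-layer block type for a hybrid stack."""
    types = []
    for i in range(num_layers):
        if (i + 1) % full_attn_every == 0:
            # gated standard attention: head dim 256, RoPE on first 25% of dims
            types.append("gated_attention")
        else:
            # gated DeltaNet: linear attention, cost grows linearly with length
            types.append("gated_deltanet")
    return types

layout = build_layer_types(NUM_LAYERS, FULL_ATTN_EVERY)
print(layout.count("gated_deltanet"), "DeltaNet layers,",
      layout.count("gated_attention"), "standard-attention layers")
# -> 36 DeltaNet layers, 12 standard-attention layers
```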
In terms of sparsity, Qwen3-Next employs an ultra-high-sparsity MoE structure: of its 80 billion total parameters, only about 3 billion, roughly 3.7%, are activated at each inference step. Experiments show that, with global load balancing in place, fixing the number of activated experts while increasing the total expert parameters steadily reduces training loss. Compared with Qwen3's MoE, Qwen3-Next expands the total number of experts to 512 and combines 10 routed experts with 1 shared expert, maximizing resource utilization without hurting performance.
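As a quick back-of-the-envelope check of the activation ratio, the sketch below reuses only the figures quoted above (80B total, about 3B active, 512 experts, 10 routed plus 1 shared per token); per-expert parameter counts are not published here and are not assumed.

```python
# Activation-ratio check using only the totals quoted in the article.
TOTAL_PARAMS = 80e9    # total parameters
ACTIVE_PARAMS = 3e9    # parameters activated per token (approximate)

# About 3.75% with these rounded totals; the article quotes roughly 3.7%,
# the exact value depending on the precise parameter counts.
print(f"active fraction ~ {ACTIVE_PARAMS / TOTAL_PARAMS:.2%}")

# Expert configuration described above: 512 experts, 10 routed + 1 shared per token.
NUM_EXPERTS = 512
ROUTED_PER_TOKEN = 10
SHARED_EXPERTS = 1
print(f"experts active per token: {ROUTED_PER_TOKEN + SHARED_EXPERTS} of {NUM_EXPERTS}")
```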
In terms of training stability, the attention output gating mechanism effectively addresses issues such as attention sink and massive activations, keeping the model numerically stable. To counter the abnormal growth of normalization weights seen in some layers with Qwen3's QK-Norm, Qwen3-Next adopts zero-centered RMSNorm and applies weight decay to the normalization weights to prevent unbounded growth. During initialization, the MoE router parameters are normalized so that every expert can be selected without bias early in training, reducing noise from random initialization. These designs make small-scale experiments more reliable and keep large-scale training running smoothly.
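Below is a minimal sketch of what a zero-centered RMSNorm could look like, assuming the idea is to parameterize the learnable scale as 1 + gamma with gamma initialized to zero, so that weight decay on gamma pulls the effective scale toward 1 rather than letting it grow without bound. The actual Qwen3-Next implementation may differ.

```python
# Zero-centered RMSNorm sketch (an interpretation of the description above,
# not Qwen3-Next's actual code): the effective scale is (1 + gamma), gamma is
# zero-initialized, and weight decay applied to gamma bounds its growth.
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.zeros(dim))  # zero-centered scale offset

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard RMS normalization over the last dimension.
        rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
        # Effective scale is 1 + gamma; weight decay on gamma pulls it toward 1.
        return (x.float() * rms * (1.0 + self.gamma)).type_as(x)
```

In an optimizer, gamma would then sit in the weight-decay parameter group, unlike a conventional RMSNorm scale, which is usually excluded from weight decay.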
The multi-token prediction mechanism is another highlight of Qwen3-Next. Its natively integrated multi-token prediction (MTP) mechanism not only provides an MTP module with a high acceptance rate for speculative decoding, but also improves the model's overall performance. In addition, multi-step training keeps training and inference consistent, optimizing MTP's multi-step inference and further raising the acceptance rate of speculative decoding in practical scenarios.
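The sketch below illustrates the general speculative-decoding loop that an MTP module feeds into: a cheap draft head proposes several tokens, the main model verifies them in one pass, and only the accepted prefix is kept. The function names and interfaces are hypothetical; this shows the generic technique, not Qwen3-Next's internal code.

```python
# Generic speculative-decoding loop with a multi-token draft head.
# All names here are hypothetical illustrations of the technique.
from typing import Callable, List

def speculative_step(
    draft_k_tokens: Callable[[List[int]], List[int]],       # MTP head: proposes k tokens
    count_accepted: Callable[[List[int], List[int]], int],  # main model: verifies drafts
    context: List[int],
) -> List[int]:
    """One decoding step: draft k tokens cheaply, verify them in a single
    forward pass of the main model, and keep only the accepted prefix."""
    draft = draft_k_tokens(context)
    num_accepted = count_accepted(context, draft)
    # A higher acceptance rate means more tokens emitted per verification pass,
    # which is where the practical inference speedup comes from.
    return context + draft[:num_accepted]

# Toy usage with dummy stand-ins for the draft head and the verifier.
ctx = [1, 2, 3]
ctx = speculative_step(lambda c: [4, 5, 6, 7], lambda c, d: 3, ctx)
print(ctx)  # [1, 2, 3, 4, 5, 6]
```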
Qwen3-Next is exceptionally efficient in the pre-training phase. Its training data is a uniformly sampled 15T-token subset of Qwen3's 36T-token pre-training corpus; it uses less than 80% of the GPU hours of Qwen3-30B-A3B and only 9.3% of the compute cost of Qwen3-32B, yet achieves better performance. In terms of inference speed, prefill throughput at a 4K context length is nearly 7 times that of Qwen3-32B, and more than 10 times at lengths above 32K; in the decode stage, throughput at 4K is nearly 4 times that of Qwen3-32B, and the advantage stays above 10x beyond 32K. In terms of performance, Qwen3-Next-80B-A3B-Base activates only about 1/10 of the non-embedding parameters of Qwen3-32B-Base, yet performs better on most benchmarks and significantly surpasses Qwen3-30B-A3B.
Post-training results are equally impressive. The instruction model Qwen3-Next-80B-A3B-Instruct clearly outperforms Qwen3-30B-A3B-Instruct-2507 and the non-thinking Qwen3-32B, approaching the flagship Qwen3-235B-A22B-Instruct-2507. On the RULER benchmark, it beats Qwen3-30B-A3B-Instruct-2507, which has more attention layers, at every length, and beats Qwen3-235B-A22B-Instruct-2507, which has more total layers, within a 256K context, confirming the hybrid architecture's advantage on long-context tasks.
The reasoning model Qwen3-Next-80B-A3B-Thinking outperforms the higher-cost Qwen3-30B-A3B-Thinking-2507 and Qwen3-32B-Thinking, beats Gemini-2.5-Flash-Thinking on multiple benchmarks, and approaches Qwen3-235B-A22B-Thinking-2507 on key metrics.