Alibaba AI has made a new move: its latest reasoning model, QwQ-32B, shows that a small parameter count can deliver performance at the level of much larger models. On March 6, Alibaba's Tongyi Qianwen (Qwen) team released QwQ-32B, a reasoning-focused large language model. According to the official announcement, this model, with only 32 billion parameters, matches the performance of DeepSeek-R1, which has 671 billion parameters (37 billion of them activated per token), and even surpasses it on certain tests.

The Qwen team stated that this result highlights the effectiveness of applying reinforcement learning (RL) to strong foundation models that have undergone large-scale pre-training, and that it hopes to demonstrate that combining powerful foundation models with large-scale RL may be a viable path toward artificial general intelligence. Beyond core reasoning, QwQ-32B also integrates agent-related capabilities, allowing it to think critically while using tools and to adjust its reasoning based on environmental feedback.

Parameter Reduction, Performance Maintained, Cost Only 1/10

According to the officially disclosed results, QwQ-32B performed strongly on several key evaluations:

- AIME24 (mathematics): on par with DeepSeek-R1, far ahead of o1-mini and similarly sized R1-distilled models.
- LiveCodeBench (coding): on par with DeepSeek-R1.
- LiveBench, the "most difficult LLM evaluation list" led by Meta chief scientist Yann LeCun: scored higher than DeepSeek-R1.
- IFEval (instruction following, proposed by Google and others): outperformed DeepSeek-R1.
- BFCL (accurate invocation of functions and tools, proposed by UC Berkeley): also surpassed DeepSeek-R1.
Overseas users have charted how different reasoning models compare on LiveBench score versus output-token cost. QwQ-32B's score falls between R1's and o3-mini's, while its cost is roughly one-tenth of R1's and one-twentieth of o3-mini's, indicating a strong balance between performance and cost:

- QwQ-32B: LiveBench score approximately 72.5, cost approximately $0.25
- DeepSeek-R1: score approximately 70, cost approximately $2.50
- o3-mini: score approximately 75, cost approximately $5.00

Reinforcement Learning: The "Secret Weapon" of QwQ-32B

QwQ-32B's outstanding performance is mainly attributed to the large-scale reinforcement learning methods it employs. The Qwen team developed a staged, cold-start-based RL training strategy:

- Initial phase: RL training focused on mathematics and coding tasks. The team abandoned traditional reward models in favor of direct verification: for math problems, feedback comes from checking the correctness of the generated final answer; for code, a code-execution server checks whether the generated program passes its test cases.
- Expansion phase: RL training for general capabilities was added. This phase used a general reward model together with rule-based verifiers, helping the model improve other general abilities while preserving its mathematical and coding skills.

The team reports that as the number of RL training rounds increases, performance on both mathematics and coding continues to improve, confirming the effectiveness of this approach.

QwQ-32B Is Open Source, Promoting the Paradigm Shift from "Great Efforts Yield Miracles" to "Delicate Efforts Yield Wisdom"

QwQ-32B has been open-sourced on the Hugging Face and ModelScope platforms under the Apache 2.0 license. Users can also experience the model directly through Qwen Chat.
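The verification-based rewards described for the initial RL phase can be sketched in a few lines of Python. This is a hypothetical simplification for illustration only: the actual Qwen training stack and its sandboxed code-execution server are not public, and the function names here (as well as the use of `exec` as a stand-in for a sandbox) are this sketch's assumptions.

```python
def math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 if the generated final answer matches the reference, else 0.0."""
    try:
        # Compare numerically when possible, so "0.50" matches "0.5".
        return 1.0 if float(model_answer) == float(reference_answer) else 0.0
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def code_reward(generated_code: str, test_cases: list) -> float:
    """Reward is the fraction of test cases the generated code passes.

    Each test case is a (expression, expected_value) pair, evaluated in the
    namespace produced by executing the generated code. A real system would
    run this inside an isolated code-execution server, not plain exec().
    """
    namespace = {}
    try:
        exec(generated_code, namespace)  # stand-in for the sandboxed server
    except Exception:
        return 0.0  # code that does not even run earns no reward
    passed = 0
    for expression, expected in test_cases:
        try:
            if eval(expression, namespace) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply does not count as passed
    return passed / len(test_cases)
```

For example, `code_reward("def add(a, b): return a + b", [("add(1, 2)", 3)])` returns 1.0, while a buggy implementation that fails half the cases returns 0.5, giving the policy a graded learning signal rather than a single pass/fail bit.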
The Qwen team stated that QwQ-32B is only its first step in enhancing reasoning capabilities through large-scale RL. Going forward, it will focus on combining stronger foundation models with RL backed by scaled-up computing resources, and will actively explore integrating agents with RL to achieve long-horizon reasoning, aiming to unlock higher intelligence through extended reasoning time.

As growth in model parameter scale enters a bottleneck period, how to further improve model capability at a given parameter scale has become an industry focal point. QwQ-32B's breakthroughs may lead a new wave of AI development, further promoting the paradigm shift from "Great Efforts Yield Miracles" to "Delicate Efforts Yield Wisdom." The tech self-media account Digital Life Kazik commented that the open-sourcing of QwQ-32B is highly significant: it demonstrates convincingly that the RLHF route can still produce remarkable results, dispelling some people's excessive pessimism after GPT-4.5 hit a wall. Achieving high performance at a medium scale injects strong confidence into the open-source community, showing that competing with international giants does not require expensive equipment at ultra-large scale.

The release of QwQ-32B aligns closely with Alibaba's recently announced AI strategy. Alibaba Group reportedly plans to invest more than 380 billion yuan over the next three years in cloud and AI hardware infrastructure, exceeding its total investment over the past decade. Earlier, Alibaba's self-developed "Deep Thinking" reasoning model was launched on the Quark AI search platform, making Quark one of the few large-scale consumer-facing (C-end) AI applications in China that has not integrated DeepSeek. At the foundation-model level, Alibaba's Tongyi model family has entered the ranks of the world's top open-source models.
Informed sources have revealed that "larger-scale models will also be gradually integrated into Quark."

Risk Warning and Disclaimer

Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account individual users' specific investment objectives, financial situations, or needs. Users should consider whether any opinions, views, or conclusions in this article fit their particular circumstances. Any investment made on this basis is at the user's own risk.