Track Hyper | Alibaba Open-Sources Tongyi Wanxiang Wan2.2: Breakthroughs and Limitations

Wallstreetcn
2025.08.02 01:35

On July 28, Alibaba open-sourced the Tongyi Wanxiang Wan2.2 video generation model, covering three capabilities: text-to-video, image-to-video, and unified text- and image-to-video generation. The text-to-video and image-to-video models adopt an MoE architecture, generating high-quality videos while saving approximately 50% of computing resources, marking a significant advance for Alibaba in generative AI. The technological path and open-source strategy reflect both industry trends and Alibaba's competitive technology layout.

Author: Zhou Yuan / Wall Street News

On July 28, Alibaba open-sourced the movie-level video generation model Tongyi Wanxiang Wan2.2, which can generate 5 seconds of high-definition video in a single pass.

This release open-sources three models: text-to-video (Wan2.2-T2V-A14B), image-to-video (Wan2.2-I2V-A14B), and a unified text- and image-to-video model (Wan2.2-TI2V-5B).

Among them, the text-to-video and image-to-video models are the first in the industry to use a Mixture-of-Experts (MoE) architecture, with a total of 27B parameters and as many as 14B activated. Each consists of a high-noise expert and a low-noise expert, responsible for the overall layout of the video and for detail refinement, respectively. At the same parameter scale, this saves about 50% of computational resource consumption.

This is an important move by Alibaba in the field of AI video generation. As the latest action of domestic tech giants in the generative AI race, such a technological path and open-source strategy not only reflect industry development trends but also mirror Alibaba's strategic considerations in technological competition.

Differentiated Technical Architecture Attempts

Among the three models open-sourced as Tongyi Wanxiang Wan2.2, the MoE architecture used in the text-to-video and image-to-video models is the most noteworthy technical point for the industry.

By dynamically selecting a subset of experts (sub-models) to take part in each inference step, the MoE architecture improves computational efficiency and model performance, making it especially suitable for training and inference of large neural network models.

This architecture did not emerge out of nowhere but is a targeted design addressing the existing bottlenecks in video generation technology: splitting the model into high-noise expert models and low-noise expert models, where the former is responsible for the overall layout of the video and the latter focuses on detail refinement, forming a clearly defined processing mechanism.
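To make this division of labor concrete, here is a minimal PyTorch sketch of the idea as the article describes it: two denoiser networks ("experts"), with a simple timestep threshold deciding which one runs at each denoising step. The module names, threshold value, and network sizes are illustrative assumptions, not Wan2.2's actual implementation.

```python
# Minimal sketch of a two-expert denoiser routed by denoising stage.
# Expert definitions, the switch threshold, and tensor shapes are
# illustrative assumptions, not Wan2.2's architecture.
import torch
import torch.nn as nn


class ToyExpert(nn.Module):
    """Stand-in for a full video diffusion backbone."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TwoExpertDenoiser(nn.Module):
    """Routes each denoising step to exactly one expert.

    Early (high-noise) steps go to the expert responsible for global
    layout; late (low-noise) steps go to the detail-refinement expert.
    Only the selected expert's parameters participate in a given step,
    so per-step compute tracks the active parameter count, not the total.
    """

    def __init__(self, dim: int = 64, boundary: int = 500):
        super().__init__()
        self.high_noise_expert = ToyExpert(dim)  # global layout
        self.low_noise_expert = ToyExpert(dim)   # detail refinement
        self.boundary = boundary                 # assumed switch point

    def forward(self, x: torch.Tensor, timestep: int) -> torch.Tensor:
        # Higher timestep == noisier latent in a standard diffusion schedule.
        if timestep >= self.boundary:
            return self.high_noise_expert(x)
        return self.low_noise_expert(x)


if __name__ == "__main__":
    model = TwoExpertDenoiser()
    latent = torch.randn(1, 64)
    out_early = model(latent, timestep=900)  # high-noise expert: layout
    out_late = model(latent, timestep=100)   # low-noise expert: detail
    print(out_early.shape, out_late.shape)
```

Because the hand-off here is a hard switch by stage rather than a learned per-token router, only one expert ever sits in the compute path for a given step.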

From a technical logic perspective, this design directly addresses the long-standing efficiency issues in video generation.

Traditional models often struggle to balance quality and efficiency when processing long-sequence videos due to the contradiction between parameter scale and computational resources.

Through dynamic invocation, the MoE architecture activates only 14B of its 27B total parameters at each step, cutting computational resource consumption by about 50% at the same parameter scale.

This resource optimization capability has practical application value against the backdrop of high training costs for current AI large models.

With a total of 27B parameters and 14B activated, the activation ratio exceeds 50%, reaching about 51.85%.
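The ratio itself is simple arithmetic; the short snippet below just makes the figures quoted in the article explicit, comparing against a hypothetical dense model that activates all 27B parameters at every step.

```python
# Back-of-the-envelope check of the figures quoted in the article.
total_params = 27e9    # total parameters across both experts
active_params = 14e9   # parameters activated per denoising step

activation_ratio = active_params / total_params
print(f"activation ratio: {activation_ratio:.2%}")  # ~51.85%

# Per-step compute roughly scales with active parameters, so relative to a
# dense model of the same total size, the saving is about:
saving = 1 - activation_ratio
print(f"approximate per-step compute saving: {saving:.0%}")  # ~48%, i.e. roughly the ~50% quoted
```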

How is such a high activation ratio achieved?

To begin with, this is not an easy task; it requires strong capability in model architecture design and optimization. For comparison, Zhipu AI's flagship GLM-4.5 has an activation ratio of about 9%, which allows it to offer an API price at roughly 10% of Claude's; that result rests on four years of accumulated optimization work on the Transformer architecture.

To build an architecture that allocates responsibilities sensibly between expert models and keeps the high-noise and low-noise experts operating in an orderly way at their respective denoising stages, the team needs an extremely precise grasp of the data flow and processing logic of video generation.

At the same time, dynamic management of parameter activation is a major challenge for the R&D team: based on the characteristics of the input data and the requirements of the denoising task, the model must accurately activate the right 14 billion parameters out of the full set, avoiding the resource waste of ineffective activation while keeping the activated parameters working together efficiently.

This involves complex algorithm design and extensive experimental debugging to find the parameter-activation strategy best suited to video generation tasks. In other words, the technical team must know precisely what data the model needs at each stage and adopt efficient activation strategies and methods accordingly.
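To make the resource argument concrete, the sketch below shows one generic way of ensuring only the active expert consumes accelerator resources at any step: keep the inactive expert off the GPU and swap at the stage boundary. The function, boundary value, and swapping scheme are assumptions for illustration, not a description of how the Wan2.2 team actually schedules its experts.

```python
# Illustrative sketch: only the active expert occupies the accelerator; the
# other expert is swapped in at the stage boundary. A generic pattern assumed
# for exposition, not Wan2.2's actual scheduling code.
import torch
import torch.nn as nn


def run_denoising_loop(high_expert: nn.Module, low_expert: nn.Module,
                       latent: torch.Tensor, num_steps: int = 10,
                       boundary: int = 5) -> torch.Tensor:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    latent = latent.to(device)
    active, inactive = high_expert.to(device), low_expert  # start in the high-noise stage

    for step in range(num_steps, 0, -1):   # count down from noisy to clean
        if step == boundary:               # hand off between stages
            active = active.to("cpu")      # release accelerator memory
            inactive = inactive.to(device)
            active, inactive = inactive, active
        latent = active(latent)            # only one expert's weights do work per step
    return latent


if __name__ == "__main__":
    dim = 64
    high = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    low = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
    print(run_denoising_loop(high, low, torch.randn(1, dim)).shape)
```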

It is worth noting that the MoE architecture has been widely applied in the NLP (Natural Language Processing) field, but it is still a novelty in the video generation domain.

The spatiotemporal complexity of video data far exceeds that of text; how to achieve seamless collaboration among different expert models when processing dynamic images is key to the technical implementation.

The solution proposed by the Tongyi Wanxiang team is to divide expert responsibilities according to the denoising stage. Whether this approach can become an industry-wide paradigm still needs to be validated by the market.

Alibaba's choice to open-source these three models carries significant implications for its business strategy.

Currently, the AI video generation field shows closed-source competition and open-source exploration running in parallel: leading companies tend to treat core models as technical barriers behind commercial services, while the open-source route tries to expand technological influence through ecosystem co-construction.

From the developer's perspective, the open-source of Wan2.2 provides a directly usable technical sample.

Developers can obtain the model code and weights on platforms such as GitHub and HuggingFace, which lowers the research threshold for video generation technology. Small and medium-sized enterprises do not need to build models from scratch; they can do secondary development on the existing framework, which will to some extent accelerate the technology's deployment in real scenarios.
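For developers attempting that secondary development, pulling the released weights locally is the usual first step. The snippet below is a minimal sketch using the huggingface_hub client; the repo id shown is an assumed placeholder for illustration and should be checked against the official Wan2.2 model card before use.

```python
# Minimal sketch of fetching the open-sourced weights for local secondary
# development. The repo id is an assumed placeholder; confirm the exact name
# on the official Wan2.2 model card before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Wan-AI/Wan2.2-TI2V-5B")  # assumed repo id
print(f"Model files downloaded to: {local_dir}")
```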

In terms of industry competition, this open-source move may accelerate the iteration of video generation technology. Many companies in China and abroad have already launched video generation models, but most are offered as closed-source API services.

The open-sourcing of Tongyi Wanxiang Wan2.2 is equivalent to sharing part of the technical path with the industry; other companies may optimize and build on it, driving further technological leaps.

Actual Application Potential and Limitations

From the application scenario perspective, Wan2.2's ability to generate 5-second high-definition videos currently makes it more suitable as a creative tool than a production tool.

In early film and television planning, creators can quickly generate clips from text or images to visualize creative proposals; in the advertising industry, it can help produce draft versions of product-showcase short videos. These scenarios do not demand long videos, and the tool can significantly improve communication efficiency in the early stages.

However, its limitations are equally evident: a single generation yields only 5 seconds of high-definition video, so complex narratives still require manual stitching, which falls short of the actual production needs implied by "cinema-level" quality.

Although Alibaba has said it will extend the duration in the future, lengthening generated video is not a simple add-on: it requires solving problems such as logical coherence and visual consistency over longer sequences, which places higher demands on the model's spatiotemporal modeling capabilities.

On aesthetic control, the "cinema-level aesthetic control system" does lower the threshold for professional aesthetic expression by parameterizing the design of light, shadow, and color.

However, the precision of this control still relies on the professionalism of the prompts. Ordinary users may find it difficult to fully utilize its functions if they lack basic aesthetic knowledge.

Moreover, whether the style of the images generated by the model can truly achieve "cinema-level" still needs to be verified by feedback from professional creators.

In the global coordinate system of AI video generation technology, the open-source release of Wan2.2 is an important statement from Chinese enterprises in this field.

Currently, there are models internationally that have achieved longer-duration video generation and have advantages in visual realism.

Wan2.2's distinguishing feature is the resource-efficiency gain brought by the MoE architecture. Whether this differentiated path can secure a place amid fierce competition will depend on its results in practical applications.

For the entire industry, video generation technology is still in a phase of rapid evolution. The leap from text-to-image generation to text-to-video generation is a comprehensive test of computing power, data, and algorithms.

The emergence of Wan2.2 is essentially a technological milestone in this evolutionary process. Its value lies not in disrupting the industry, but in providing a new technological option for the industry.

In the future, as the duration of models extends and detail processing capabilities improve, video generation technology may gradually penetrate more fields. However, this process requires time and will inevitably be accompanied by breakthroughs in technical bottlenecks and validation of business models.

For enterprises, how to balance investment in technological research and development with commercial returns will be a more challenging issue than technological breakthroughs.

Risk Warning and Disclaimer

Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article fit their specific circumstances. Investing on this basis is at one's own risk.