The "covert battle" in AI video generation is gaining momentum

Wallstreetcn
2025.09.28 13:55

A user-payment model has yet to take hold for large language models, but it is quietly taking root in the AI video generation sector.

In June of this year, the annualized revenue of AI video generation startup Runway exceeded $90 million (approximately RMB 640 million); in the second quarter, Kuaishou (1024.HK)'s AI video generation application "Keling" brought in over RMB 250 million in revenue.

Domestic startups are rushing to join the game.

Beijing Shengshu Technology Co., Ltd. (hereinafter "Shengshu Technology")'s "Vidu" and Beijing Aishi Technology Co., Ltd. (hereinafter "Aishi Technology")'s "PaiWo" have both surpassed 10 million users; Manycore Tech Inc. (hereinafter "Qunhe Technology"), the first of the "Hangzhou AI Six Little Dragons" to file for an IPO, also plans to launch a consumer-facing AI video generation product within the year.

The market's expectations for the commercialization of AI video are not limited to individual creators generating short clips; they extend to film production, embodied intelligence, and other fields.

However, issues such as broken spatial consistency and visible seams between stitched shots have dragged AI video generation models into a "seller's show vs. buyer's show" controversy, where polished demos diverge from what users actually get.

Although the AI video generation industry has yet to see its own "DeepSeek moment," with major companies increasing their investment, the market has reason to believe the path ahead will become increasingly clear.

Length Competition

In February 2024, OpenAI launched Sora 1.0, the world's first AI video generation model capable of producing videos up to 60 seconds long, a breakthrough over the earlier Runway models, which could only generate clips of 3-4 seconds.

Subsequently, domestic models have gradually caught up.

Currently, both internet giants such as ByteDance, Kuaishou, and Baidu and startups such as Shengshu Technology and Aishi Technology are exploring AI video generation applications.

A product manager at a technology company in southern China told Xinfeng that the biggest change in AI video generation this year lies in length: AI can now generate longer videos.

Although companies' models typically generate clips of around 5-10 seconds at a time, a coherent video can already be assembled by generating individual shots and stitching them together.

The film industry is one of the first to experiment.

The 50-episode animated short series "Tomorrow is Monday," launched in August this year, was generated using Shengshu Technology's Vidu AI video model.

In practice, the production team of "Tomorrow is Monday" used hand-drawn core character designs by original artists, and then extended the animation through Vidu's image-to-video and reference generation functions.

Shengshu Technology told Xinfeng that about 80% of the content of "Tomorrow is Monday" was generated with Vidu Q1's image-to-video and reference-to-video functions, which were deeply integrated into core stages from art design to animation production. This allowed a team of fewer than 10 people to complete all of the first season's content in 45 days, averaging less than one day per episode, whereas the traditional production cycle for a 2-minute animated episode can run up to a week, an efficiency gain of at least 7x.

For Kuaishou's "Keling," film and television production is likewise one of its important scenarios.

According to Kuaishou's management on the earnings call, "Keling"'s customer base spans a wide range of creators, including professionals, e-commerce and advertising practitioners, and film production studios.

The length limit is being pushed still further.

Recently, Baidu upgraded its AI video generation model "Baidu Steam Engine" to let users generate AI videos of unlimited length, breaking the previous constraint under which AI could only produce 5-10 second clips or relied on first-and-last-frame control of duration.

In use, users only need to input images and prompts to generate videos of any length.

The aforementioned product manager believes the breakthrough in video length is not merely the result of "stacking computing power"; the key drivers are algorithm optimization and larger data volumes.

According to Baidu, the long-video generation technology mainly introduces an autoregressive diffusion model, combining autoregression's long-sequence capability with diffusion's strong consistency, enabling generation of long videos that obey physical laws and remain highly consistent.
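To give a rough intuition for how this hybrid works, here is an illustrative toy in Python (an assumption-laden sketch, not Baidu's actual architecture): each short chunk is produced by an iterative refinement loop standing in for diffusion denoising, conditioned on the trailing frames of the previous chunk standing in for autoregression, so total length is unbounded.

```python
import random

# Toy sketch of autoregressive chunked generation (hypothetical, NOT Baidu
# Steam Engine's real model). Frames are single floats for illustration.

CHUNK_FRAMES = 5      # frames generated per pass (~one short clip)
CONTEXT_FRAMES = 2    # trailing frames carried over for consistency
DENOISE_STEPS = 4     # iterative refinement steps per chunk

def denoise_chunk(context, rng):
    """Refine random noise toward values continuous with the context frames."""
    target = context[-1] if context else 0.0
    frames = [rng.random() for _ in range(CHUNK_FRAMES)]
    for _ in range(DENOISE_STEPS):
        # each refinement step pulls every frame halfway toward the
        # context-implied target, mimicking diffusion's consistency pressure
        frames = [f + 0.5 * (target - f) for f in frames]
    return frames

def generate_video(total_frames, seed=0):
    """Autoregressively chain chunks until the requested length is reached."""
    rng = random.Random(seed)
    video = []
    while len(video) < total_frames:
        context = video[-CONTEXT_FRAMES:]          # autoregressive conditioning
        video.extend(denoise_chunk(context, rng))  # diffusion-style refinement
    return video[:total_frames]

clip = generate_video(20)
```

Because each chunk is anchored to the previous one, values change smoothly across chunk boundaries; this is the sense in which autoregression supplies unbounded length while the diffusion-style loop supplies consistency.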

Xinfeng took part in the internal testing of Baidu Steam Engine, using a character image as the first frame with prompts such as "1-5s camera follows, character walks quickly; 6-10s camera follows, character walks forward towards the stairs; 11-15s character walks forward, camera follows, pans right; 16-20s character walks forward, camera follows, pans right, circling to the front of the character" to generate a 20-second short video. (See "Baidu Steam Engine" AI video generation sample.)

In the resulting video, the character's facial features occasionally shift enough to resemble a different person, and objects sometimes vanish out of thin air, but the character's movement trajectory is natural and the background does not collapse.

Price War Smoke

Although domestic large language models have yet to find a way to charge consumer (C-end) users, AI video generation companies are already exploring commercialization models.

Pricing differs significantly among companies.

Taking the standard monthly tier as an example, Keling and Shengshu Technology's Vidu are priced at 66 yuan and 59 yuan respectively, while Aishi Technology's PaiWo and ByteDance's Jimeng are both priced at 79 yuan.

However, Vidu and Jimeng offer more volume at the same price, allowing 200 and 216 video generations per month respectively, whereas Keling and PaiWo allow only a few dozen.

Each company's commercialization efforts have borne some fruit.

Currently, Kuaishou is one of the few large companies that have disclosed the commercialization results of AI video generation applications, with "Keling" revenue exceeding 250 million yuan in the second quarter of 2025.

For startups, Shengshu Technology's Vidu surpassed $20 million (about 140 million yuan) in annual recurring revenue (ARR) within 8 months of launch; Aishi Technology's PaiWo says its subscription revenue already covers its costs.

However, large companies have quietly started a price war to win over professional creators. According to Baidu, Baidu Steam Engine is already used in scenarios such as search and marketing, with pricing as low as 70% of the industry norm. Recently, Keling launched its 2.5 Turbo model, one of whose core selling points is being "nearly 30% cheaper than the same-tier 2.1 model," highlighting its cost-performance advantage.

On the other side of the price war, many companies are eager to enter the fray.

Xinfeng learned that Qunhe Technology, which is sprinting for an IPO on the Hong Kong Stock Exchange, is also developing an AI video generation product based on 3D technology, expected to be released within the year.

Insiders at Qunhe Technology revealed to Xinfeng that this AI video generation product will be open to C-end users in the future.

Qunhe Technology's significant advantage lies in its large and physically accurate indoor spatial dataset.

"In building our tools (such as the home decoration design software Kujiale), we accumulated massive data, and it differs from 3D models generated directly by AI: it includes physically accurate interactive models whose materials are also physically correct, with the surfaces' physical coefficients included, along with structured information and structured annotations," said Huang Xiaohuang, chairman of Qunhe Technology.

In August of this year, Qunhe Technology's InteriorGS dataset even topped the trending list on Hugging Face, the world's largest AI open-source community, becoming the world's first large-scale 3D dataset suitable for freely moving embodied agents.

This may bring more pressure to many companies, requiring all parties to further expand the boundaries of commercialization.

Currently, the market's imagination for this industry extends beyond the film and advertising sector to scenarios like robot training.

Robot training has long faced pain points such as scarce training data, limited scene coverage, and high collection costs; AI video generation applications can provide virtual scenes in which robots train, helping them better understand how the real world operates.

Some robotics companies are developing their own algorithms. In March this year, for example, LimX Dynamics (Zhuji Dongli) released LimX VGM, an embodied-intelligence manipulation algorithm that uses video generation technology to push breakthroughs in embodied "brains."

A participant in the project admitted to Xinfeng that, constrained by data volume, the generalization ability of current large video generation models is limited.

However, this person remains optimistic about the industry trend of using AI video generation models for robots' virtual-environment training.

At a previous earnings meeting, Kuaishou's management stated plans to expand the application of "Keling" in game production, professional films, and visual production.

Buyer's Show vs. Seller's Show

Although every AI video generation company currently claims improved spatial consistency, Xinfeng's tests show that problems such as facial distortion while the main subject moves and backgrounds that are partly sharp, partly blurred remain widespread.

Taking "PaiWo" as an example, Xinfeng used image-to-video to generate a short clip of a person dancing but ran into problems such as facial deformation and objects disappearing out of thin air. (See "PaiWo" AI video model sample.)

An industry insider in Hangzhou told Xinfeng that occasional lapses in facial detail and background consistency in complex motion scenes are a technical challenge for the industry as a whole; the core difficulty lies in the model's precise modeling of long-horizon motion trajectories and multi-scale semantic coherence.

Long Tianze, product manager at Qunhe Technology, believes this is related to the source of training data.

"The core issue is that current AI video algorithms learn based on 2D image sequences, so they cannot truly understand 3D space and rules. They learn how to make the previous frame visually resemble the next frame, but they do not understand the real 3D spatial relationships or the basic logic of how the so-called physical world operates," Long Tianze pointed out.

Currently, all parties are mainly addressing the spatial consistency issue from the perspectives of optimizing algorithms and constructing datasets.

Shengshu Technology told Xinfeng that it is currently optimizing along three main paths: first, refining the spatiotemporal joint attention mechanism in its self-developed U-ViT architecture to strengthen the model's ability to predict the correlation between the subject's motion trajectory and the background; second, building a large-scale, high-quality video training dataset to specifically strengthen semantic understanding of complex motion patterns; and third, introducing dynamic masking and consistency-compensation algorithms to repair inter-frame anomalies in real time after generation.
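The third path, post-generation repair of inter-frame anomalies, can be illustrated with a deliberately simplified toy (my own illustration, not Shengshu's actual algorithm): treat each frame as a single number, flag a frame that jumps far from both neighbors while the neighbors agree with each other, and blend it back toward them.

```python
# Toy illustration of post-generation consistency compensation
# (hypothetical, NOT Shengshu's real pipeline). Frames are floats.

def repair_flicker(frames, threshold=0.5):
    """Return frames with isolated outliers blended back toward neighbours."""
    fixed = list(frames)
    for i in range(1, len(frames) - 1):
        prev, cur, nxt = frames[i - 1], frames[i], frames[i + 1]
        # an isolated anomaly deviates strongly from BOTH neighbours,
        # while the neighbours still agree with each other
        if (abs(cur - prev) > threshold and abs(cur - nxt) > threshold
                and abs(nxt - prev) <= threshold):
            fixed[i] = (prev + nxt) / 2  # consistency compensation
    return fixed

# the 0.9 frame is a one-frame "flicker" between two consistent neighbours
smooth = repair_flicker([0.1, 0.2, 0.9, 0.2, 0.3])
```

Real systems operate on full image tensors and learned perceptual metrics rather than scalar thresholds, but the structure is the same: detect an inter-frame anomaly, then repair it using the surrounding frames.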

"Currently, our reference-to-video function has achieved consistency gains at multiple levels, from faces to whole subjects, and our next focus is breaking through the stability limits under large motion," Shengshu Technology stated.

Qunhe Technology, for its part, is building a workflow for 3D-based video generation, hoping to reduce the visible clipping (models passing through objects) and distortion that appear as the environment changes.

The challenge with this approach, however, is that users must learn to prepare the data inputs for video generation, among other skills.

The Boundaries of Privacy

High-quality datasets are the training materials that many AI video generation model companies currently crave.

Some foreign tech giants, seeking to improve the consistency of character subjects in AI video generation models, have even resorted to downloading adult films as training material.

Meta has faced such scrutiny.

In July of this year, two American adult film companies, Strike 3 Holdings and Counterlife Media, brought Meta to court, alleging that it secretly downloaded 2,396 adult films to train its AI model.

"This is indeed a very new type of copyright-infringement case, and Meta will presumably still claim fair use," a practicing intellectual-property lawyer in the U.S. told Xinfeng. "There are no unified rules on training materials yet; the industry can only move forward amid controversy."

By contrast, domestic platforms may have more leeway with training materials, and video platforms in particular enjoy a unique advantage.

Although video platforms do not have exclusive rights to the videos published by users, they generally have usage rights.

For example, Kuaishou's "Basic Function Privacy Policy" explicitly states that, to deliver and place advertising and to help evaluate ad effectiveness and efficiency, it may need to work with advertisers, service providers, and third-party partners to read certain user information and data.

This may mean that video platforms such as Kuaishou and Douyin will hold a greater data advantage in the AI video generation race than other companies.

As the AI video generation race gradually matures, the boundaries of data usage may also become clearer.