
Track Hyper | Baidu's Strategy for Specific Scenarios in AIGC Video

Baidu has launched the MuseSteamer video generation model and the "Hui Xiang" platform, aimed at addressing content production pain points in search, advertising, and recommendation scenarios and marking its entry into the AI video generation field. Although Robin Li has said that the investment cycle for generative video models is long, Baidu's technical team has tackled a core challenge of video generation: the natural coordination of visuals and sound. MuseSteamer focuses on multimodal semantic alignment in the Chinese context, categorizing hundreds of millions of Chinese video samples through a "scene granularity decomposition" approach to improve the accuracy of generated videos.
Author: Zhou Yuan / Wall Street News
In the process of generative AI technology moving from the laboratory to industrial applications, video generation has always been a key area of focus in the industry due to its high technical complexity and diverse scene requirements.
On July 2, Baidu's commercial R&D team launched the MuseSteamer video generation model and the "Hui Xiang" platform, targeting the practical pain points of native content production in search, advertising, and recommendation scenarios. The team aims to explore a feasible path for putting AIGC video into practice through scenario-adapted optimization, and the release marks Baidu's entry into the AI (artificial intelligence) video generation field.
It is worth mentioning that in 2024, the explosive popularity of Sora triggered a wave of large generative video models. Baidu's founder, chairman, and CEO Robin Li said in an internal speech that the investment cycle for video generation models like Sora is too long, potentially taking 10 or 20 years to see business returns, and that no matter how popular they became, Baidu would not pursue them.
Baidu's technical team, apparently unafraid of publicly contradicting Robin Li, has now taken on the core challenge of video generation: making visual elements and audio information work together naturally on the timeline. Also on July 2, it was reported that Robin Li had said at a closed-door meeting in 2024 that, given multimodal demand, a video generation product could be built for relatively specific scenarios.
MuseSteamer is indeed a scenario-specific video generation model, so from this perspective it does not really amount to contradicting the boss.
The model's technical design centers on multimodal semantic alignment in the Chinese context.
Compared with English, Chinese is more semantically ambiguous and context dependent. A phrase like "this product is very powerful" may call for visuals of product performance testing, or it may need to convey admiration through facial expressions, and the matching sound effects vary just as widely.
To solve this problem, MuseSteamer's underlying data processing adopts a "scene granularity decomposition" approach: categorizing hundreds of millions of Chinese video samples into 23 high-frequency scenes such as "life services, e-commerce display, knowledge popularization," and further subdividing each scene into three levels of tags: "action - emotion - effect."
For example, in the e-commerce scene, "clothing display" is decomposed into sub-tags like "static hanging (action) - no emotion (emotion) - fabric texture (effect)," allowing the model to accurately understand the audio-visual representation corresponding to descriptions like "this dress has a great drape."
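To make the tag hierarchy concrete, the following is a minimal Python sketch of such a scene/action/emotion/effect labeling scheme. The class and field names are hypothetical, not Baidu's actual data schema; only the three-level hierarchy follows the description above.

```python
from dataclasses import dataclass

# Hypothetical sketch of a "scene granularity decomposition" labeling scheme.
# Only the scene -> action/emotion/effect hierarchy follows the article;
# the class and field names are illustrative, not Baidu's actual schema.

@dataclass
class SceneTag:
    scene: str    # one of the 23 high-frequency scenes, e.g. "e-commerce display"
    action: str   # what happens visually
    emotion: str  # the emotional tone to convey
    effect: str   # the audio-visual effect to emphasize

# The "clothing display" example from the article, expressed as a tag triple.
clothing_display = SceneTag(
    scene="e-commerce display",
    action="static hanging",
    emotion="no emotion",
    effect="fabric texture",
)

def relevant(tag: SceneTag, prompt: str) -> bool:
    """Toy relevance check; a real system would map prompts to tags with a
    multimodal encoder rather than keyword matching."""
    return tag.effect in prompt.lower() or tag.action in prompt.lower()

print(relevant(clothing_display, "this dress has a great drape, show the fabric texture"))
```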
This scene-based training approach is directly reflected in the generation results.
In tests, for the instruction "explain the mobile phone camera function," the model can automatically match the combination of "lens zoom (visual) + button sound effect (audio) + smooth narration (voice)," while similar English models often exhibit a mismatch of "rapid visual switching paired with slow narration."
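As a rough illustration of what this kind of timeline-level coordination can look like, the sketch below lays visual, sound-effect, and narration tracks on a shared timeline; the structure and values are invented for illustration and are not MuseSteamer's internal representation.

```python
# Invented illustration of coordinating visual, sound-effect, and narration
# tracks on one timeline; not MuseSteamer's internal representation.

plan = {
    "instruction": "explain the mobile phone camera function",
    "duration_s": 10,
    "tracks": {
        "visual": [
            {"span": (0.0, 4.0), "event": "lens zoom onto the camera module"},
            {"span": (4.0, 10.0), "event": "sample shots and UI walkthrough"},
        ],
        "sfx": [
            {"span": (3.8, 4.2), "event": "shutter/button click"},
        ],
        "narration": [
            {"span": (0.0, 10.0), "event": "smooth voice-over at a steady pace"},
        ],
    },
}

def tracks_fit_timeline(plan: dict) -> bool:
    """Basic consistency check: no track segment runs past the video length,
    one simple way a 'fast visuals, slow narration' mismatch would surface."""
    ends = [seg["span"][1] for track in plan["tracks"].values() for seg in track]
    return max(ends) <= plan["duration_s"]

assert tracks_fit_timeline(plan)
print("All tracks fit the 10-second timeline.")
```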
Although Baidu's optimizations do not involve disruptive technological innovations, they precisely address the actual needs of Chinese commercial content production.
Liu Lin, general manager of Baidu's commercial R&D system, said that in digital content creation, breakthroughs in video length and image quality usually mark a qualitative change in creative freedom. MuseSteamer supports the generation of 10-second videos with cinematic aesthetics at 1080p, providing greater expressive space for video creation.
Liu Lin stated that in traditional AIGC video creation practices, videos are generally generated first, followed by voiceovers and sound effects. This fragmented creative process not only consumes a lot of time but also weakens the complete artistic expression of the work.
MuseSteamer innovatively supports the integrated generation of video with sound effects and character dialogue. In terms of length, MuseSteamer can generate videos in two versions, 5 seconds and 10 seconds, both at 1080p.
Baidu simultaneously released a family of MuseSteamer versions, Turbo, Lite, and Pro, along with corresponding voiced versions, each targeting different creative needs and cost considerations.
The version matrix of the "Hui Xiang" platform essentially responds to the differentiated cost structures of different users.
The Turbo version's free public beta targets cost-sensitive small and medium-sized businesses: when Taobao store owners try generated product videos, their biggest worry is "spending money without matching the platform algorithm's recommendation preferences." The free model lets them quickly test how different visual styles correlate with conversion rates.
The Pro version's paid design addresses the time-cost pain points of professional institutions, while the voiced versions across the series keep marginal costs in check.
In traditional advertising production, adding a dialect voiceover means paying extra voice actors; the voiced version, by applying Chinese speech synthesis technology, can instantly generate voiceovers in 8 dialects, including Cantonese and Sichuanese, significantly reducing the marginal production costs of localized marketing content.
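A minimal sketch, assuming a generic speech-synthesis interface, shows why such voiceovers carry near-zero marginal cost: one script can be rendered into any number of dialects in a loop. The synthesize_voiceover function is a placeholder, not a documented Baidu API, and only Cantonese and Sichuanese come from the article.

```python
# Hypothetical sketch: one script, many dialect voice-overs at near-zero
# marginal cost. `synthesize_voiceover` is a placeholder, not a documented
# Baidu API; the article names Cantonese and Sichuanese among 8 supported dialects.

def synthesize_voiceover(script: str, dialect: str) -> bytes:
    """Placeholder: a real implementation would call a speech-synthesis service."""
    return f"[{dialect}] {script}".encode("utf-8")

script = "Limited-time offer: free shipping on your first order."
dialects = ["Cantonese", "Sichuanese"]  # extend with whichever dialects the service supports

voiceovers = {d: synthesize_voiceover(script, d) for d in dialects}
for dialect, audio in voiceovers.items():
    print(dialect, len(audio), "bytes")
```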
Although Baidu was one of the earliest domestic tech companies to invest in large models, in the video generation track it is indeed a "latecomer" compared with rivals such as ByteDance and Kuaishou.
Kuaishou's Keling AI announced in May this year the launch of the new 2.1 series model, generating a 5-second video in high-quality mode (1080p) in less than 60 seconds.
Information from Kuaishou's official website shows that Keling AI's annualized revenue run rate exceeded 100 million USD just 10 months after its launch (in March this year), with monthly payments in April and May exceeding 100 million RMB.
Apart from announcing in 2024 its lead investment in Shengshu Technology, a video large-model company incubated at Tsinghua University, Baidu had made few moves in generative video; in March this year, Baidu released the Wenxin large models 4.5 and 4.5 Turbo, achieving mixed training across text, images, and video.
Compared to its competitors, Baidu appears to be taking a differentiated competitive path in the domestic AIGC video track: focusing on "scene-specific generated videos" rather than an all-scenario model.
Compared with similar products focused on general entertainment content, the core advantage of "Hui Xiang" lies in its deep integration with commercial scenarios such as search and advertising. For example, videos generated by Hui Xiang can plug directly into Baidu's information-flow advertising system as a functional module, automatically matching user search keywords for dynamic optimization; purely tool-based products find it hard to replicate this closed-loop capability of "creation - distribution - feedback."
Such collaborative scenarios are also reflected in the data accumulation layer.
The hundreds of millions of user interaction records on Baidu's advertising platform (such as at which second of a video a user clicked the purchase button) become the optimization basis for MuseSteamer, allowing the model to learn commercial rules such as "conversion rates peak when promotional information appears in the 8th to 10th second of a video." This data barrier is more defensible than model parameters alone.
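The kind of rule described here can be sketched as a simple log analysis: bucket purchase clicks by the second of the video at which they occurred and compare conversion rates per bucket. The log format and numbers below are invented purely for illustration.

```python
from collections import Counter

# Invented example of learning timing rules from interaction logs:
# (click_second, purchased) pairs record at which second of a video a user
# clicked the purchase button and whether a purchase followed.
events = [(2.1, False), (3.2, False), (5.5, False),
          (8.4, True), (8.9, False), (9.0, True), (9.7, True), (9.9, True)]

def conversion_by_bucket(events, bucket_s=2):
    """Group clicks into time buckets and compute the conversion rate per bucket."""
    clicks, buys = Counter(), Counter()
    for t, purchased in events:
        bucket = int(t // bucket_s) * bucket_s   # e.g. 8.4s falls in the 8-10s bucket
        clicks[bucket] += 1
        buys[bucket] += purchased
    return {f"{b}-{b + bucket_s}s": buys[b] / clicks[b] for b in sorted(clicks)}

print(conversion_by_bucket(events))
# With this toy data the 8-10s bucket converts best, echoing the rule cited above.
```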
Ultimately, what Baidu is pursuing is still profit; after all, commercial value is the foundation of a commercial company's existence.
With the continuous iteration of technology, the competitive focus of AIGC video tools has shifted from "can it generate" to "can the generated content be used."
The products Baidu launched this time may not lead in technical parameters, but by accurately capturing the demands of commercial scenarios, they offer the industry a feasible paradigm for putting the technology into practice.
Therefore, the value of Baidu's "Hui Xiang" lies not in disrupting content production, but in using technology to fill the efficiency gaps in traditional workflows. This is a pragmatic path forward, since the ability to commercialize is the main driving force behind the rapid development of technology.
Risk Warning and Disclaimer
The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial conditions, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article align with their specific circumstances. Investing on this basis is at one's own risk.