
Tencent open-sources Hunyuan Voyager: Dominates three major evaluations, crushing all competitors

Tencent's Hunyuan team has launched the Hunyuan World Model - Voyager, which it claims is the industry's first ultra-long-roaming world model with native 3D reconstruction, aimed at breaking the technical bottlenecks in 3D scene generation. Voyager can generate high-fidelity 3D scenes and export the generated videos in 3D formats, advancing fields such as virtual reality, physical simulation, and game development. This innovation marks a new stage for 3D scene generation technology.
In the realm of artificial intelligence and computer vision, 3D scene generation has long been recognized as a tough nut to crack.
In popular fields like virtual reality (VR), augmented reality (AR), and game development, who isn't eagerly waiting for high-quality, interactive 3D scenes? The demand grows day by day, but the technological bottleneck remains firmly in place.
Tencent's Hunyuan team has thrown down a trump card — Hunyuan World Model-Voyager. Claimed to be the industry's first ultra-long roaming world model supporting native 3D reconstruction, it sounds like a complete "regime change" for the 3D scene generation field.
Let's first talk about why this is so difficult.
The technical routes for 3D scene generation have always been quite tangled. One path is purely focused on video generation, which has the advantage of continuous motion, providing an immersive experience. However, the fatal drawback is that what you see is merely an "image," with no real interaction with the scene. Want to do some physical simulation or VR experience in it? That's basically impossible because it lacks a true 3D structure.
The other path is more straightforward, directly generating a 3D world. This route sounds appealing, with strong spatial structure consistency and good potential for future applications. But the problem is, where do you find high-quality 3D training data? It's both expensive and scarce. Moreover, the massive memory consumption of 3D representation makes it difficult for the model to generalize to more diverse and larger scenes. Both paths seem to hit a dead end.
Hunyuan World Model-Voyager breaks through the ceiling of traditional video generation in spatial consistency and exploration range, capable of generating ultra-long-distance, globally consistent roaming scenes. The most impressive part is that it can directly export the generated video into 3D format. This provides the most needed high-fidelity 3D scene roaming capability for fields like virtual reality, physical simulation, and game development. One could say that the emergence of Voyager officially announces the entry of 3D scene generation technology into the next era.
In the words of Tencent's Hunyuan team, Voyager is the official extension of Hunyuan World Model 1.0. It's worth noting that only two weeks have passed since they released the HunyuanWorld 1.0 Lite version; this pace of iteration speaks volumes about Tencent's R&D strength and investment in the AI field.
So, how exactly does this thing work?
Behind Hunyuan World Model-Voyager are two "god-level" core components working in synergy. It is their design that has turned the ideal of long-distance, world-consistent video generation and 3D reconstruction into reality.
The first component is called "World-Consistent Video Diffusion." You can think of it as a "director" that understands both art and physics. Traditional video generation models are mostly "artsy youths," only concerned with whether the visuals look good (generating RGB videos), completely ignoring the depth information of the physical world.
However, Voyager, this "director," is different. When generating videos, it innovatively incorporates scene depth prediction, effectively handling video generation and 3D modeling at the same time. Given an initial image and a specified camera movement trajectory, it synthesizes spatially coherent RGB-D videos whose viewpoint can be freely controlled. The "D" stands for depth: every frame of the video carries per-pixel depth that can be lifted into 3D point clouds (a toy sketch of this conditioning setup follows the list below).
The brilliance of this approach lies in:
First, it is a multi-modal joint generation, producing RGB videos and depth videos simultaneously while ensuring precise alignment, directly saving the hassle of post-processing and maintaining high data quality.
Second, it employs a conditional generation mechanism based on existing world observations, ensuring that no matter how long the generated video is, it remains visually and geometrically unified from start to finish, avoiding bizarre situations like walls tilting or tables disappearing as you move.
Finally, it is an end-to-end generation, unlike older methods that require additional 3D reconstruction tools like COLMAP to "patch things up," inherently ensuring cross-frame consistency.
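To make the idea of joint, pose-conditioned RGB-D generation concrete, here is a minimal toy sketch in PyTorch. It is not Voyager's published architecture; the layer sizes, the pose encoding, and the way the reference frame is injected are illustrative assumptions. The only point it demonstrates is that a single denoiser predicts RGB and depth channels together, conditioned on the initial image and a per-frame camera pose, which is what keeps color and geometry aligned.

```python
# Toy sketch of a pose-conditioned RGB-D denoiser (not Voyager's actual model).
import torch
import torch.nn as nn

class RGBDDenoiser(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Inputs: 4 noisy channels (RGB + depth), 4 reference channels
        # (the user's initial RGB-D frame), 16 channels of pose/time embedding.
        self.net = nn.Sequential(
            nn.Conv2d(4 + 4 + 16, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, 4, 3, padding=1),  # predict noise for RGB and depth jointly
        )
        self.cond = nn.Linear(12 + 1, 16)        # flattened 3x4 camera pose + timestep

    def forward(self, noisy_rgbd, ref_rgbd, pose_3x4, t):
        b, _, h, w = noisy_rgbd.shape
        emb = self.cond(torch.cat([pose_3x4.flatten(1), t[:, None]], dim=1))
        emb = emb[:, :, None, None].expand(b, 16, h, w)  # broadcast pose/time over pixels
        x = torch.cat([noisy_rgbd, ref_rgbd, emb], dim=1)
        return self.net(x)  # RGB and depth share one backbone, so they stay aligned

# One toy denoising step for a 64x64 frame along a camera trajectory.
model = RGBDDenoiser()
noisy = torch.randn(1, 4, 64, 64)      # noisy RGB-D latent
ref = torch.randn(1, 4, 64, 64)        # conditioning: the initial frame
pose = torch.eye(3, 4).unsqueeze(0)    # camera extrinsics for this frame
t = torch.tensor([0.5])                # diffusion timestep
print(model(noisy, ref, pose, t).shape)  # torch.Size([1, 4, 64, 64])
```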
The second component is called "Long-Range World Exploration." If the first component is the "director," then this component is an exploration team with infinite energy. It addresses the problem of traditional models getting lost after running for a while.
Its core secret is an efficient "world caching" mechanism. Specifically, it first generates an initial 3D point cloud as a "base" using Hunyuan World Model 1.0, and then projects this "base" into whatever new viewpoint you want to explore, serving as "navigation" for the diffusion model.
To cope with increasingly large scenes, this "exploration team" has also learned "point cloud culling": it intelligently manages and prunes massive point cloud data, significantly improving computational efficiency. Even more cleverly, it adopts autoregressive inference, which is, simply put, "looking and remembering while walking." Newly generated video frames update the "world cache" in real time, forming a closed-loop system.
As a result, no matter how intricate your camera trajectory is, geometric consistency is maintained; this not only expands the roaming range but also feeds new-viewpoint content back into Hunyuan World Model 1.0, raising the overall generation quality. Coupled with a context-aware consistency technique that keeps video sampling smooth, it ultimately delivers a cinematic, immersive experience.

By combining these two components, Voyager can generate a globally consistent 3D point cloud world from a single static image, letting you explore it at will with a "virtual camera." While exploring, it also generates RGB video with precise depth information, making high-quality 3D reconstruction effortless.
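Under assumed camera conventions, the caching loop described above can be sketched roughly as follows. The intrinsics, the culling rule (naive subsampling here), and the generate_rgbd placeholder are all illustrative; the real system uses Hunyuan's diffusion model and far more sophisticated point-cloud management.

```python
# Schematic "world cache" loop: project cache -> generate RGB-D -> lift back -> merge.
import numpy as np

K = np.array([[256.0, 0, 128], [0, 256.0, 128], [0, 0, 1]])  # assumed intrinsics

def project_to_view(points_world, w2c):
    """Project cached world-space points into a camera view (pixel coords + depth)."""
    p = (w2c[:3, :3] @ points_world.T + w2c[:3, 3:4]).T  # world -> camera
    p = p[p[:, 2] > 1e-3]                                 # keep points in front of the camera
    uv = (K @ p.T).T
    return uv[:, :2] / uv[:, 2:3], p[:, 2]                # perspective divide

def unproject_rgbd(depth, c2w):
    """Lift a depth map into world-space points (one point per pixel)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts_cam = rays * depth.reshape(1, -1)                 # scale rays by depth
    return (c2w[:3, :3] @ pts_cam + c2w[:3, 3:4]).T

def generate_rgbd(guidance_uv, guidance_depth):
    """Placeholder for the video diffusion model; returns a dummy depth map."""
    return np.full((256, 256), 2.0)

cache = np.random.rand(1000, 3) * 2 - 1                   # initial point-cloud "base"
trajectory = [np.eye(4) for _ in range(3)]                # user-chosen camera path
for c2w in trajectory:                                    # "look and remember while walking"
    uv, d = project_to_view(cache, np.linalg.inv(c2w))    # cache -> guidance for this view
    depth = generate_rgbd(uv, d)                          # generate the next RGB-D frame
    new_pts = unproject_rgbd(depth, c2w)                  # lift it back into 3D
    cache = np.concatenate([cache, new_pts[::16]], 0)     # merge (subsampled as crude culling)
print(cache.shape)                                        # the cache grows as exploration continues
```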
A Large Model Fed on "Brute-Force Aesthetics"
How much "mental nourishment" does it take to train a "monster" like Voyager? The team built a system that can be described as a "data perpetual-motion machine": a fully automated video reconstruction pipeline. This system automatically estimates the camera pose and real metric depth from any input video. What does that mean? It means the expensive, time-consuming manual labeling is gone entirely, so training data can be produced at scale and with great diversity.
The workflow of this data engine is roughly as follows:
First, the video is fed in for preprocessing, selecting high-quality frames. Then, using SLAM (Simultaneous Localization and Mapping) and bundle adjustment algorithms, the camera position and orientation for each frame are automatically calculated, which is key to training a controllable camera model.
Next, a depth estimation model predicts the depth information for each frame, pairing it with the RGB image, creating the "RGB-D combo" that Voyager loves to consume. Finally, the system automatically checks alignment and verifies data quality, discarding any unqualified samples.
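In outline, such a pipeline could be wired together as in the sketch below. The function names and the trivial placeholder bodies are assumptions made for illustration; a real pipeline would wrap a SLAM system, a bundle-adjustment solver, and a monocular metric-depth model at the marked steps.

```python
# Hedged sketch of an automated RGB-D + camera-pose labeling pipeline.
from dataclasses import dataclass
import numpy as np

@dataclass
class LabeledFrame:
    rgb: np.ndarray    # H x W x 3 image
    depth: np.ndarray  # H x W metric depth map
    pose: np.ndarray   # 4 x 4 camera-to-world matrix

def select_keyframes(frames, stride=4):
    """Step 1 placeholder: keep a subset of high-quality frames (naive striding here)."""
    return frames[::stride]

def run_slam_and_bundle_adjustment(frames):
    """Step 2 placeholder: a SLAM + bundle-adjustment stage would recover real poses;
    identity poses are returned only so this sketch executes."""
    return [np.eye(4) for _ in frames]

def estimate_metric_depth(rgb):
    """Step 3 placeholder for an off-the-shelf metric depth estimator."""
    return np.full(rgb.shape[:2], 3.0)

def passes_quality_check(depth):
    """Step 4 placeholder: discard samples whose depth and poses do not align."""
    return float(depth.min()) > 0.0

def build_dataset(video_frames):
    frames = select_keyframes(video_frames)
    poses = run_slam_and_bundle_adjustment(frames)
    samples = []
    for rgb, pose in zip(frames, poses):
        depth = estimate_metric_depth(rgb)
        if passes_quality_check(depth):
            samples.append(LabeledFrame(rgb, depth, pose))  # the "RGB-D combo" plus labels
    return samples

video = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(32)]
print(len(build_dataset(video)))  # 8 labeled samples from 32 raw frames
```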
With this automated pipeline, the team combined videos shot in the real world with videos rendered in Unreal Engine, amassing an ultra-large-scale dataset of over 100,000 video clips. The dataset is not only large but also diverse, covering a wide range of scenes and styles, and every sample carries valuable "labels" such as camera pose and metric depth.
It is this high-quality, diverse dataset that has made Voyager so powerful.
To validate the results, the research team used a public dataset called RealEstate10K as the "examiner." This dataset is quite significant: extracted from about 10,000 YouTube videos, it contains roughly 10 million frames together with the corresponding camera motion trajectories, and serves as a gold standard for evaluating video generation and 3D reconstruction tasks. Many of Voyager's key performance numbers were measured on it.
All Talk and No Action is Just Hot Air
To test how capable Voyager really is, the Tencent Hunyuan team conducted a comprehensive "exam" from three dimensions: video generation quality, 3D scene reconstruction ability, and world generation capability.
First is video generation quality. The research team pitted Voyager against four mainstream open-source camera-controllable video generation methods. They randomly selected 150 video clips from the RealEstate10K test set and scored them with three industry-recognized metrics: PSNR, SSIM, and LPIPS, which measure the fidelity, structural consistency, and perceptual similarity between generated images and real images.
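For readers unfamiliar with the metrics, the snippet below shows how PSNR and SSIM are commonly computed with scikit-image, and notes LPIPS as a learned perceptual distance from the lpips package. This is generic open-source usage, not the team's evaluation code.

```python
# PSNR and SSIM (higher is better) with scikit-image; LPIPS (lower is better) sketched in comments.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

real = np.random.rand(256, 256, 3)                                 # ground-truth frame in [0, 1]
fake = np.clip(real + 0.05 * np.random.randn(256, 256, 3), 0, 1)   # generated frame

psnr = peak_signal_noise_ratio(real, fake, data_range=1.0)
ssim = structural_similarity(real, fake, channel_axis=-1, data_range=1.0)
print(f"PSNR={psnr:.2f} dB, SSIM={ssim:.3f}")

# LPIPS compares deep features of the two images with a pretrained network:
# import lpips, torch
# loss_fn = lpips.LPIPS(net="alex")
# to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1  # scale to [-1, 1]
# print(loss_fn(to_t(real), to_t(fake)).item())
```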
What are the results? Just look at the table.
Voyager leads across the board on every metric and takes an undisputed first place. Its PSNR reaches 18.751, nearly 0.5 higher than second place; its SSIM of 0.715 also leads the pack; and for LPIPS, where lower is better, Voyager's 0.277 is the lowest in the field, indicating that its generated content looks most like real video to the human eye.
Looking at the specific generation effect comparison, the gap is even more obvious. Especially in the last set of examples, only Voyager successfully retained the detailed features of the product in the input image. In contrast, the other methods either produced obvious flaws or, as in the first example, completely "collapsed" when the camera moved significantly, generating completely unreasonable results.
Next is a more hardcore evaluation of scene generation quality. Since the competitors can only generate RGB frames, the research team used a tool called VGGT to estimate camera parameters for them before using their generated videos to initialize the point cloud.
On the other hand, Voyager has it much easier, as it directly generates RGB-D content, requiring no intermediate processing at all, and can be directly used for high-quality 3D Gaussian Splatting (3DGS) reconstruction.
From the table data, it can be seen that even when the competitors used the VGGT "plug-in," Voyager's reconstruction results are still the best, indicating that the videos it generates indeed excel in geometric consistency. When Voyager uses its generated depth information to initialize the point cloud (without any post-processing), the effect can be even better, directly proving the strength of its depth generation module.
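To illustrate why "directly generates RGB-D" matters, the sketch below turns a single generated RGB-D frame plus its camera pose into a colored point cloud with Open3D, the kind of initialization that 3DGS training starts from. The intrinsics, resolution, and output file name are illustrative assumptions, not details from the paper.

```python
# One RGB-D frame -> colored point cloud that can seed 3D Gaussian Splatting.
import numpy as np
import open3d as o3d

h, w = 256, 256
rgb = (np.random.rand(h, w, 3) * 255).astype(np.uint8)   # stand-in for a generated RGB frame
depth = np.full((h, w), 2.0, dtype=np.float32)            # stand-in for its generated metric depth

intrinsic = o3d.camera.PinholeCameraIntrinsic(w, h, 256.0, 256.0, 128.0, 128.0)
rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
    o3d.geometry.Image(rgb), o3d.geometry.Image(depth),
    depth_scale=1.0, depth_trunc=10.0, convert_rgb_to_intensity=False)
pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
pcd.transform(np.eye(4))                                   # apply this frame's camera-to-world pose
o3d.io.write_point_cloud("init_points.ply", pcd)           # seed points for 3DGS optimization
```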
From the qualitative results, for example in the last set of chandelier examples, Voyager retains the complex details of the chandelier well, while the other methods cannot even reconstruct its basic shape, making the difference plain.

Finally comes the ultimate test of world generation capability. The team evaluated Voyager on the static benchmark of WorldScore, proposed by Stanford University's Fei-Fei Li team, which is specifically designed for the unified assessment of world generation models and carries significant weight.
The results once again shocked the audience. Voyager topped the list with a comprehensive score of 77.62, leaving other models far behind. In various sub-indicators, it ranked first in object control, content alignment, style consistency, and subjective quality, with camera control in second place, and also performed excellently in 3D consistency and photometric consistency.
This fully demonstrates that Voyager has the capability to compete with top 3D methods in camera motion control and spatial consistency. Especially with the highest score in subjective quality evaluation, it once again verifies the visual realism of the videos it generates.
So, how will this change our world?
The release of Voyager is not just a refresh of technical parameters; it opens up a vast space of applications. As the first world model to bridge "ultra-long roaming" and "native 3D," it brings disruptive possibilities to several industries.
In the fields of virtual reality (VR) and augmented reality (AR), Voyager is like a blessing from heaven. In the past, 3D scenes in VR/AR applications relied heavily on "manual labor": modeling was time-consuming and labor-intensive, and real-time generation of large-scale scenes was out of reach. Now that Voyager has arrived,
it can generate a world-consistent 3D point cloud from a single image and supports custom paths for exploration. This means developers can generate large-scale 3D scenes at lightning speed, significantly reducing both development time and costs. Moreover, the RGB-D videos it generates can be directly used for rendering, maximizing efficiency.
The game development industry has also received a boon. In traditional game development, 3D scene modeling is a heavy and arduous task. Voyager's automated 3D scene generation capability is like a "magical tool" for game developers. Whether for rapid development of game prototypes or for generating scenes in open-world games that require vast maps, Voyager can greatly enhance efficiency. It can even generate dynamic content in real-time based on user input, bringing more possibilities to gameplay.
For film production and animation, Voyager's controllable video generation makes creation freer: complex camera moves that once required elaborate setups could now potentially be accomplished with just an image and a camera path, improving efficiency while unlocking creative freedom.

In the field of architecture and urban planning, Voyager is a powerful visualization tool. Designers can quickly turn design sketches or photos into detailed 3D scenes that can be explored freely, significantly improving communication with clients and colleagues.
Even in the education and training sector, Voyager can shine brightly. Imagine medical students conducting virtual dissection studies using finely detailed 3D models of human organs generated by Voyager, while engineering students can dismantle and observe the 3D structures of complex machinery. This immersive learning experience far surpasses what books and PPTs can offer.
The release of the Hunyuan World Model - Voyager resolves the core trade-off between the two traditional approaches and sets a new technical benchmark for the industry.
The Tencent Hunyuan team also stated that Voyager, along with the previous Hunyuan World Model 1.0 and 1.0 Lite versions, forms a complete technical system.
With its open-source nature, more developers and researchers will be able to stand on the shoulders of this "giant" to explore and create more possibilities.
AIGC Open Community, original title: "Tencent Open Sources Hunyuan Voyager: Dominating Three Major Evaluations, Crushing All Competitors"