
Two papers from NVIDIA introduce a new paradigm of embodied intelligence after VLA

In 2025, VLA (Vision-Language-Action) models became the hot topic in embodied intelligence, but they have serious deficiencies in physical action execution and generalization. Two papers released by NVIDIA in early 2026, "DreamZero" and "DreamDojo," propose a new paradigm that emphasizes learning from video, enables zero-shot generalization, breaks the limitation of data scarcity, and argues that the lack of a world model is VLA's fundamental problem.
In 2025, the hottest term in the field of embodied intelligence is VLA (Vision-Language-Action Model).
It became an industry-wide consensus, the standard answer for embodied foundation models. Over the past year, capital and computing power have flooded into this track, and essentially every major model company has adopted the paradigm.
However, the physical world quickly threw cold water on practitioners, because VLA is weak at executing physical actions.
It can understand extremely complex textual instructions. But when a robotic arm actually goes to grasp something, it may struggle to even adjust its wrist position to avoid the obstruction of a cup handle, let alone perform actions like untangling shoelaces that involve complex physical deformations.
Another fatal flaw of VLA is weak generalization. The whole point of adopting large models was to stop hand-programming every specific environment and to rely on their generalization ability instead. In practice, however, VLA can barely generalize to any action outside the environments defined by its training data, and it often fails even in environments merely similar to them.
The entire industry attributes the inability to generalize to insufficient data. Major companies have begun to invest billions of dollars, using various methods to collect data, attempting to fill the knowledge gaps of VLA with massive simulated demonstrations.
However, in early 2026, NVIDIA published two papers, "DreamZero: World Action Models are Zero-shot Policies" and "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos," which constructed a completely new paradigm for embodied intelligence foundational models, breaking the deadlock of data competition.

Together, they present the possibility of an embodied model learned entirely from video that can generalize to execute different tasks zero-shot.
What VLA Lacks is Not Data, but a World Model
To understand the disruptive nature of DreamZero and DreamDojo, one must first analyze VLA's systemic flaws at a fundamental level.
The biggest problem with VLA is the lack of a world model. The underlying architecture of VLA limits its cognitive approach. From a lineage perspective, VLA is more closely related to LLMs, while its relation to pure vision and pure physics is weaker. It maps pixel blocks of images to the semantic space of text through a cross-attention mechanism, where it understands the concepts of cups and tables and their relative positions in a two-dimensional image.
However, the physical world is not a two-dimensional semantic slice. The physical world is continuous, filled with mass, friction, gravity, and geometric collisions. VLA has a relatively weak understanding of physical actions and of the world because it is essentially a "translator."
We can explain this using the state transition equations in physics. A complete world model is essentially learning a conditional probability distribution. It can predict what the world will look like in the next second given the current state of the world (visual observations) and the actions the robot is about to take.
VLA has never learned this equation. VLA learns a function that maps static visual observations plus language instructions directly to executable actions; it has never been systematically trained to predict the consequences of actions or to perform counterfactual trial and error. Therefore, once the environment, materials, or constraints are slightly altered, performance can drop sharply.
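To make the contrast concrete, here is a minimal sketch in notation (the symbols are illustrative, not taken from the papers): a world model learns the distribution over the next observation given the current observation and the action, while VLA learns a direct mapping from observation and instruction to action.

```latex
% World model: predict the next observation o_{t+1}
% given the current observation o_t and the action a_t about to be executed.
p_\theta\left(o_{t+1} \mid o_t, a_t\right)

% VLA: map the current observation o_t and the language instruction \ell
% directly to an action, with no model of that action's consequences.
a_t = \pi_\phi\left(o_t, \ell\right)
```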
This is akin to asking a person to memorize the answers to ten thousand geometry problems without understanding the principles of geometry. When faced with the original problems, they can quickly write out perfect answers; when encountering new problems with slightly altered conditions, they completely crash.
The generalization of VLA is essentially just interpolation in a high-dimensional semantic space. When physical conditions fall outside the envelope of the training set, interpolation fails.
In contrast, video generation models have made significant progress. The physical interaction scenes generated by Veo3, Sora 2, and the recently popular Seedance 2 are already quite realistic, with fluid, rigid, and flexible materials moving so coherently that they are almost indistinguishable from the real world. This suggests that large-scale video generation models have likely compressed and internalized the fundamental operating laws of the physical world from vast amounts of internet video, forming something like a world model.
Even with such power, video generation has primarily been used to provide simulated data for VLA rather than being integrated into the robot's workflow.
In fact, the idea of using video generation models to control robots is not new. Before DreamZero, both academia and industry proposed several solutions. However, these methods invariably fell into dead ends in engineering and logic.
For example, LVP (Large-scale Video Planner). Its approach is to generate a future video plan on how to complete a task directly from an image and a sentence. Then, it reconstructs the hand movements in the video into 3D trajectories. It uses video pre-training, rather than language pre-training, as the core capability of the robot.

The second approach is similar to NVIDIA's own DreamGen, which generates video and then retroactively derives actions. This was a previously highly anticipated route. It splits the foundation model architecture into two halves: the upper half is a video model responsible for predicting the future; the lower half is an independently trained IDM (inverse dynamics model) network that observes the predicted video, infers backwards, and outputs actions.
The biggest problem with both of these phased designs is that action and video generation are not aligned. The action part requires extreme accuracy, but video generation is hard to perfect. Once the generated future frames contain slight pixel artifacts or physical hallucinations, whether the back end is an IDM or point tracking, the system gets confused and the errors are amplified dramatically. If the fingers in the generated video are off by even a tiny margin, the real robot may fail to grasp anything at all. Robustness is extremely poor.
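As a rough, hypothetical sketch of what such a phased pipeline looks like (the function and method names here are illustrative, not from any of the cited systems): the video model dreams a future clip, and a separately trained IDM then back-solves actions from those possibly hallucinated frames.

```python
# Hypothetical sketch of a two-stage "generate video, then infer actions" pipeline.
# Because the IDM is never trained jointly with the video model, any pixel artifact
# in the generated clip propagates straight into the actions.

def phased_control_step(video_model, idm, observation, instruction):
    # Stage 1: dream a short future clip conditioned on the current image
    # and the language instruction (it may contain hallucinated physics).
    predicted_frames = video_model.generate(observation, instruction)

    # Stage 2: a separately trained inverse dynamics model reads the predicted
    # frames and back-infers the action sequence between them.
    actions = idm.infer_actions(predicted_frames)

    # The robot executes actions derived from imagined pixels; errors from
    # stage 1 are amplified in stage 2 rather than corrected.
    return actions
```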
The third method is Unified Video-Action (UVA). This is considered the most advanced approach, attempting to learn video and action within the same latent space of a diffusion model, balancing video prediction and action prediction. During inference, it skips video generation through "decoding decoupling" to ensure speed. However, its architecture uses a Bidirectional Diffusion framework. To match the length of language instructions, the generated video sequences must be significantly compressed. This approach completely distorts the native temporal flow of the video. With time distorted, aligning action instructions with visual frames becomes nearly impossible, so the generalization of this method is naturally very poor.

In addition, these methods share a fatal common flaw: they are too slow. Video diffusion models require multiple iterations to denoise, and generating a few seconds of action often takes dozens of seconds of computation. If a robot takes 5 minutes to put a bowl in a cabinet, you would probably go crazy just watching.
Therefore, before 2026, among the newer embodied intelligence companies, 1X Technologies, which recently launched a home robot, was virtually the only one attempting this video-prediction approach. They utilize massive amounts of "Shadow Mode" data, in which the model runs predictions in the background while humans teleoperate, and use this high-quality paired data to train the fragile IDM.
However, a temporary failure does not mean the direction is negated.
At last year's robotics conference, I interviewed many domestic embodied intelligence scholars. At that time, Google’s Veo 3 and Genie 3 had just been released. Most scholars were impressed and realized the world understanding capability of video generation models.
Therefore, in discussions, they almost unanimously suggested that generation might be the most reliable path for future embodied intelligence, and a more promising one than generating data in hand-built simulated environments. Simulators (like Isaac Gym or MuJoCo) are limited by human-coded physics engines and can never exhaust the complexity of real-world materials, the variability of light and shadow, or the non-linearity of contact forces. Only generative models that absorb video data from all of humanity can be considered the true super-simulator that encompasses the physical laws of everything. However, at that time, this thinking was still at the level of "data," and the idea of video generation replacing VLA had barely come into view.
But NVIDIA's research is likely the turning point that makes this idea the first effective engineering path.
DreamZero: Embodied Intelligence Based on World Models
As mentioned earlier, there were three main issues faced when using video generation models to construct robotic movements in the past.
First, the alignment problem caused by phased, step-by-step pipelines. Second, the spatiotemporal distortion of unified architectures like UVA, which renders them nearly unusable. Third, inference that is simply too slow. In response, NVIDIA first provides a solution with DreamZero.
First, DreamZero adopts an end-to-end training method that synchronizes video and action prediction. This resolves the misalignment issue of the previous phased models.
Second, to address the spatiotemporal confusion of UVA, DreamZero completely abandoned the early bidirectional architecture and instead constructed a 14B parameter autoregressive Diffusion Transformer (DiT). This is currently the standard architecture for video generation models. It predicts video and actions in strict chronological order from left to right, just like a language model generates text. It predicts video and actions simultaneously in the same diffusion forward pass.

This brings two benefits. First, it retains the native frame rate, achieving absolute alignment of actions and visuals on the timeline. Second, it utilizes KV Cache technology. The model does not need to recompute historical visuals from scratch each time, greatly saving computational power.
Next, to solve the "error accumulation" and hallucination problems caused by autoregression, DreamZero also introduces real observation injection.
The model predicts the visuals and actions for the next 1.6 seconds, and the robot executes them. The moment the action completes, the system captures the real camera view of the physical world, encodes it directly, and inserts it into the KV Cache, overwriting the imagined visuals the model had just generated.
This step instantly cuts off the causal chain of error accumulation. The model is forced to reason about the next step from a footing of real physical observations.
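Put together, the control loop might look roughly like the sketch below (every class and method name here is an assumption for illustration; it is not the paper's code): the model autoregressively predicts the next chunk of video and actions, the robot executes the actions, and the real camera frame is then encoded and written into the KV Cache in place of the predicted frames.

```python
# Minimal sketch of a DreamZero-style closed control loop, as described above.
# All interfaces (wam, robot, camera) are hypothetical placeholders.

def control_loop(wam, robot, camera, instruction):
    kv_cache = wam.init_cache(camera.read(), instruction)

    while not robot.task_done():
        # Jointly predict the next ~1.6 s of video latents and the action chunk
        # in one left-to-right autoregressive pass over the cached history.
        pred_frames, action_chunk = wam.predict_next_chunk(kv_cache)

        # Execute the predicted action chunk on the real robot.
        robot.execute(action_chunk)

        # Real observation injection: encode the actual camera frame and
        # overwrite the imagined frames in the KV Cache, cutting off the
        # causal chain of accumulated prediction error.
        real_frame = camera.read()
        kv_cache = wam.replace_latest_frames(kv_cache, wam.encode(real_frame))
```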
Finally, and most importantly, there is the problem of slow generation. To reach the control frequency a robot requires, DreamZero introduces DreamZero-Flash. Diffusion models are slow because inference has to traverse a long denoising chain. If the number of steps is forcibly reduced (for example, to a single denoising step), the quality of the generated actions plummets, because the images remain in a blurry, noise-filled state from which the model cannot extract precise actions.
The solution of DreamZero-Flash is "decoupled noise scheduling." During training, it no longer allows the video and actions to be at the same noise level. It forces the model to look at extremely blurry visuals filled with high-intensity noise to predict completely clean and precise action signals. This is equivalent to training the model to make correct responses based on physical intuition without being able to see the future.
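A hedged sketch of what decoupled noise scheduling could look like in a training step (the tensor shapes, the `add_noise` helper, and the model interface are all assumptions, not the paper's implementation): the video stream is corrupted heavily while the action targets stay clean, so the action head learns to commit to precise actions even when the imagined future is still a blur.

```python
import torch
import torch.nn.functional as F

# Illustrative training step for decoupled noise scheduling (not the paper's code).
# video:   (B, T, C, H, W) latent video frames
# actions: (B, T, A)       action targets aligned to those frames

def flash_training_step(model, video, actions):
    batch = video.shape[0]
    # Decoupled schedules: video and action streams get independent noise levels,
    # with the video pushed toward heavy corruption.
    t_video = 0.5 + 0.5 * torch.rand(batch)   # high-noise, blurry futures
    t_action = torch.rand(batch)              # independently sampled level

    noisy_video = model.add_noise(video, t_video)
    noisy_actions = model.add_noise(actions, t_action)

    # The model must recover *clean* actions while only "seeing" noisy video,
    # which is what later permits single-step action denoising at inference time.
    pred_video, pred_actions = model(noisy_video, noisy_actions, t_video, t_action)

    loss = F.mse_loss(pred_video, video) + F.mse_loss(pred_actions, actions)
    return loss
```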

For humans, this is an impossible task; if you can't see, you can't perform the action. But for the model, this seems entirely feasible. After this training, during the inference phase, the model only needs to perform a mere 1 step of denoising to generate accurate actions. The inference time has been compressed from 350 milliseconds to just 150 milliseconds.
This allows the system to output action blocks at a frequency of 7Hz, achieving relatively smooth real-time execution in conjunction with the underlying controller.
After this series of modifications, DreamZero demonstrates the formidable potential of world models built on video generation.
The most prominent feature is its generalization ability. In tests with the AgiBot dual-arm robot, researchers presented tasks that were completely unseen in the training set, such as untangling shoelaces, removing a hat from a mannequin's head, and painting with a brush.
When a VLA trained from scratch attempted these tasks, its progress was nearly zero; it struggled even to get started. DreamZero, however, achieved an average task progress of 39.5%, with certain tasks (like removing the hat) reaching as high as 85.7%.

This is because DreamZero's learning process is revolutionary. By jointly predicting video and actions during training, it is forced to establish causal chains of evolution in the latent space. It knows that if it does not release the gripper, the object being held will not fall; it knows that if it pushes a cup of water forward, the water will spill out.
With a video-based world model built in, the WAM develops physical intuition. When facing an unseen task, it does not search its memory for similar actions; it simulates the physical consequences of candidate actions in its mind. As long as those consequences align with the semantic goal of the language instruction, it can execute the action directly. This is why it can complete a task as complex as untying shoelaces zero-shot.
Even more astonishing is the cross-embodiment capability.
Under the traditional VLA paradigm, to make a new type of robot work, you must hire someone to record exclusive remote operation data for that robot. But in DreamZero, researchers only let the model watch human perspective recordings (pure video, without any motor action parameters) for just 12 minutes. The model achieved a 42% relative improvement in performance on unseen tasks.
Subsequently, they directly transferred the model trained on AgiBot to a completely different YAM robot. After feeding it just 30 minutes of unstructured "play data," the model completed body adaptation and perfectly retained the ability to generalize and execute complex instructions in zero-shot scenarios.

This is the dimensionality-reduction strike of world models. The laws of physics are universal; the model needs only minimal data to fine-tune its understanding of a new body's kinematic boundaries.
VLA's biggest problem is thus neatly solved by DreamZero's WAM (World Action Model), built on top of a pre-trained world model: it achieves strong generalization without massive amounts of robot training data.
However, we must remain clear-headed. The engineering path based on video generation still has many bottlenecks.
Compared to VLA, which can run at astonishing speeds of 20Hz or 30Hz on consumer-grade graphics cards, DreamZero's optimized 7Hz is still slow. Moreover, it has higher hardware requirements, relying on computing clusters composed of top-tier chips like H100 or GB200 for parallel inference. This is unacceptable for independent robots deployed at the edge under current computing cost conditions.
However, the decline in computing costs follows Moore's Law, while the physical cognitive ceiling of algorithm architecture is a hard limit. Using expensive computing power to gain generalization capabilities that originally did not exist is a worthwhile trade-off in the long-term perspective of technological evolution.
The success of DreamZero means that transitioning from VLA to video world models is no longer an academic fantasy, but a feasible possibility that has already been realized.
The data needed for world models is different from that needed for VLA
In the DreamZero experiments, NVIDIA discovered a counterintuitive conclusion.
We usually think that more data is better: if the robot cannot learn, collect another ten thousand hours. But in the context of world models, this rule fails. DreamZero reveals a new rule: data diversity > data redundancy.
Researchers conducted a set of controlled experiments, preparing two datasets, both with a total duration of 500 hours.
● Dataset A (Redundant Group): Contains 70 tasks, each with a large number of repeated demonstrations, with minimal changes in position and environment. This is the "drill" mode favored by traditional VLA.
● Dataset B (Diverse Group): Contains 22 different environments and hundreds of tasks, with extremely chaotic data and almost no repetitions.
The results: DreamZero trained on the chaotic data achieved a 50% generalization success rate on unseen tasks, while the model trained on the finely repeated data reached only 33%.
Why? This is because the learning logic of VLA and WAM is fundamentally different. VLA is about memorization. WAM is about learning physics.
DreamZero proves that for learning physical laws, watching an egg fry on Mars once is more valuable than watching it fry in the kitchen 1,000 times.
This is because the former provides new physical boundary conditions, while the latter merely increases redundancy through repetition. What the world model needs is coverage, not redundancy.
The next step is to train the world model better
The significance of DreamZero is that it proves the path of WAM is completely viable and can generalize very well.
However, to keep enhancing the capabilities of models like DreamZero, further training is needed. Its video-generation-based world model should be strengthened as much as possible, ideally with a stricter posterior judge that guides it to keep improving accuracy during post-training.
This is the role of DreamDojo, the second paper. DreamZero builds the engine, while DreamDojo refines the fuel that keeps optimizing it.
As its name suggests, it acts like a dojo, aiming to transform the training of the world model from a one-time research demo like DreamZero into a repeatable industrial process. This process encompasses the entire lifecycle from data ingestion, representation alignment, to rolling prediction and error diagnosis.
Before DreamDojo, VLA (Vision-Language-Action) models were perpetually blocked on data, facing three major deadlocks.
● Label scarcity: the internet is flooded with videos, but they contain only visuals, without action labels.
● Engineering hell: robots come in all shapes and sizes, with different degrees of freedom (DOF), different control frequencies, and different interface formats; trying to unify such data is an engineer's nightmare.
● Uncontrollability: many model-generated videos look correct but are physically wrong. If actions and consequences are not aligned, the model cannot perform counterfactual reasoning, and without reasoning there can be no planning.

But now, with the advent of video generation models, these deadlocks can be broken. DreamDojo does not build a world model from scratch; it starts from the premise that video foundation models have already, to a certain extent, learned the visual and temporal regularities of the world, and then reinforces the interaction causality and controllability that embodied intelligence crucially needs.

Since human videos contain no motor data, DreamDojo's answer is to stop requiring motor data at all.
DreamDojo is no longer fixated on the readings from sensors but seeks the physical essence of actions. An action is essentially a force that causes a change in the state of the world.
DreamDojo designs a self-supervised encoder that focuses specifically on consecutive frames of a video. It keeps asking itself one question: what change turned the previous frame into the next frame?
The answer it automatically extracts is a continuous latent action.
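A minimal sketch of what such a self-supervised latent-action extractor might look like (module names, sizes, and the use of frame embeddings are assumptions, not the paper's architecture): an encoder compresses a pair of consecutive frames into a continuous latent action, and a decoder must rebuild the next frame from the previous frame plus that latent, so the latent is forced to carry exactly "the change that turned one frame into the next."

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative latent-action model (hypothetical, for explanation only).
class LatentActionModel(nn.Module):
    def __init__(self, frame_dim=512, action_dim=32):
        super().__init__()
        # Encoder: looks at two consecutive frame embeddings and answers
        # "what change turned frame t into frame t+1?"
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.GELU(), nn.Linear(256, action_dim)
        )
        # Decoder: must rebuild frame t+1 from frame t plus the latent action,
        # which forces the latent to encode the action rather than the scene.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 256), nn.GELU(), nn.Linear(256, frame_dim)
        )

    def forward(self, frame_t, frame_t1):
        latent_action = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        pred_t1 = self.decoder(torch.cat([frame_t, latent_action], dim=-1))
        recon_loss = F.mse_loss(pred_t1, frame_t1)
        return latent_action, recon_loss
```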

DreamDojo also no longer records absolute joint poses, because absolute poses are too sparse and hard to learn in a high-dimensional space. It records changes instead: each frame is re-zeroed relative to the current state. This narrows and concentrates the action distribution, making it easier for the model to learn the general physical rule of "move slightly to the left" rather than memorizing coordinates.
It is like not needing to know which muscle a person used (the sensor reading); simply watching them swing a hand at a cup and the cup shatter lets the model extract the entire latent action of swinging and smashing.
At the same time, to enhance controllability, DreamDojo does not treat the entire action trajectory as a global condition to be fed in; instead, it stitches together four consecutive actions into a chunk, injecting it only into the corresponding latent frame. Through this segmentation, the model is forced to understand that this tiny slice of action leads to the change in the next moment's scene, preventing causal confusion in the world model.
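In rough pseudocode (the names here are illustrative assumptions), the chunked injection amounts to conditioning each latent video frame only on the small group of actions responsible for it, instead of feeding the whole trajectory in as one global condition:

```python
# Illustrative chunked conditioning (not the paper's code).
# latent_frames:  list of T latent video frames
# latent_actions: list of T * CHUNK continuous latent actions

CHUNK = 4  # consecutive actions injected per latent frame, as described above

def chunked_conditioning(latent_frames, latent_actions):
    conditioned = []
    for i, frame in enumerate(latent_frames):
        # Only the actions that produced this frame are paired with it,
        # so the model cannot blur cause and effect across the whole clip.
        chunk = latent_actions[i * CHUNK : (i + 1) * CHUNK]
        conditioned.append((frame, chunk))
    return conditioned
```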
In this process, the video model's training objective shifts from "how similar does the predicted future look" to "does the action change the future in a consistent direction and by a consistent magnitude."
This completely bridges the species isolation between different embodiments. When different bodies perform the same action in different scenes, the latent actions tend to be similar. The model no longer needs to know that the elbow motor rotates 30 degrees; it only needs to know that this latent action will result in the cup being picked up.
And because the laws of action in this latent space are the same for every body, there is no embodiment heterogeneity and no incompatible data format.
On the basis of the video generation world model, DreamDojo uses the mathematically universal language of continuous latent actions to convert all of humanity's video assets into experiences that robots can understand. To achieve this goal, the NVIDIA team built a dataset called DreamDojo-HV (along with In-lab and EgoDex), which is a first-person human interaction mixed dataset of approximately 44,711 hours, covering an extremely wide range of daily scenarios and skill distributions. It includes tens of thousands of scenes, thousands of tasks, and a long-tail distribution of tens of thousands of objects.

This scale is 15 times larger than the previous largest robot world model dataset, with a scene richness that is 2000 times higher.
As a result, DreamDojo, without ever having seen a real robot during pre-training, was able to control real robots to complete tasks it had never seen before, relying on pre-training from human videos plus only a small amount of fine-tuning. Through distillation, the team compressed this massive world model to run in real time at 10 FPS.

Thus, with the combination of DreamDojo and DreamZero, the closed loop of embodied intelligence based on world models has finally come together.
Its foundation is a video generation model, because that is what understands physics. Its architecture is DreamZero's World Action Model (WAM), which makes decisions by predicting the future while keeping execution lean and latency low. Its fuel is DreamDojo, which thickens the physics and the verifiability, allowing human videos from across the internet to be converted into robot experience through latent actions.
We no longer need tens of thousands of PhDs to remotely operate robots. Just let the robot sit there, watching videos of humans working day and night, and it can learn everything about the physical world.
This may very well be a paradigm shift in embodied intelligence
The emergence of DreamZero has sounded the death knell for the pure VLA era of embodied intelligence.
This paradigm shift may profoundly reshape the entire industry's ecology.
First is the disruption of the data collection philosophy. Under the VLA paradigm, practitioners fell into the prisoner's dilemma of teleoperation data, believing that only by spending heavily to collect tens of thousands of hours of precisely paired action data could robots become smarter. But DreamZero demonstrated the formidable potential of cross-embodiment learning: merely by watching pure videos of human behavior, the model can absorb physical strategies.
And DreamDojo means that the hundreds of billions of everyday human videos on YouTube and TikTok, which were originally thought to be useless to robots for lack of action labels, will be completely unlocked. Moving from high-cost physical teleoperation to low-cost internet video mining is a dimensionality-reduction strike in acquiring common-sense knowledge.
Most importantly, our understanding of machine intelligence is undergoing a fundamental shift.
In the VLA era, we tried to make machines work by teaching them to read, resulting in a clumsy translator. Now, we are beginning to teach machines to dream, generating, predicting, and simulating the evolution of the physical world in their minds.
When a machine no longer mechanically repeats data but can internally construct a miniature universe that complies with the laws of physics and deduce the consequences of its actions within it, we have already reached the true starting point of general embodied intelligence.
This is a steeper path, but it will undoubtedly lead to a broader future.
