DeepMind reveals astonishing answer: Agents are world models! Coinciding with Ilya's prediction from 2 years ago

Wallstreetcn
2025.06.06 08:16

At ICML 2025, DeepMind scientist Jon Richens presented a paper arguing that agents are essentially world models: achieving AGI requires learning a predictive model of the environment. Shangmin Guo from the University of Edinburgh supports this conclusion and points out that the policy and the world model can be unified into a single LLM. The research also aligns with a view Ilya expressed in 2023, emphasizing the importance of world models for AGI. Although model-free agents exist, whether they have learned implicit world models remains to be explored.

Just now, DeepMind scientist Jon Richens published a paper at ICML 2025, which has stirred up a lot of discussion.

Is a world model necessary for achieving human-level intelligence (i.e., AGI), or is there a shortcut without a model?

Starting from first principles, they revealed a surprising answer—

The agent is the world model!

Specifically, the formal answer to this question is as follows.

Any agent that can generalize to multi-step goal-directed tasks must have already learned a predictive model of its environment.

This model can be extracted from the agent's policy; to improve the agent's performance or enable it to accomplish more complex goal-directed tasks, an increasingly accurate world model must be learned.

Paper link: https://arxiv.org/pdf/2506.01622

Industry: Significant Implications

Shangmin Guo, a PhD student at the University of Edinburgh, stated that he fully agrees with this conclusion from Google DeepMind, and that his group has also been deliberately training policies to perform world modeling.

Coincidentally, they just published an article finding that the policy and the world model can be unified into a single LLM, completely eliminating the need for an external dynamics model!

Moreover, a viewpoint proposed in another article that has been submitted to RLC 2025 corroborates this research.

Some have also found that this research aligns with a statement made by Ilya in 2023—

There exists a deeper underlying principle, a fundamental law governing all agents.

Some have proposed a rather novel research idea: graphs (network graphs) are a very good abstraction for world models, because there is no structure that cannot be described with a graph.

Perhaps the importance of world models for AGI lies precisely in addressing the practical problems of complexity through dimensionality reduction.

Is there a shortcut without a model?

World models are the foundation of human goal-directed behavior, but they are hard to learn in a messy, open world.

However, we have now seen many general, model-free agents, such as Gato, PaLM-E, Pi-0...

So, have these agents learned an implicit world model, or have they found another way to generalize to new tasks?

After investigation, researchers found that any agent capable of generalizing to a wide range of simple goal-oriented tasks must have learned a predictive model that can simulate its environment. Moreover, this model can always be recovered from the agent.

Specifically, they demonstrated that for a sufficiently broad set of simple goals (for example, steering the environment to a desired state), as long as a goal-conditioned policy satisfies a certain upper bound on regret, a bounded-error approximation of the environment's transition function can be recovered from that policy!

In summary, to achieve lower regret or to accomplish more complex goals, agents must learn increasingly accurate world models.

And "goal-conditioned policies" are essentially equivalent to world models in terms of information!

However, this equivalence only applies to goals with multi-step horizons; myopic agents that consider only immediate outcomes do not need to learn world models.

In conclusion, there is fundamentally no such thing as a "shortcut without a model"!

If you want to train an agent capable of completing a wide range of goal-directed tasks, you cannot avoid the challenge of learning a world model. Moreover, to improve performance or generality, agents need to learn increasingly precise and detailed world models.

So, what world knowledge is actually contained within the agents?

To answer this, the researchers derived algorithms that can recover the world model given the agent's policy and its goals.

These algorithms complete a triad with planning and inverse reinforcement learning:

Planning: World Model + Goal → Policy

Inverse Reinforcement Learning: World Model + Policy → Goal

The new direction proposed by the researchers: Policy + Goal → World Model
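To make the triad concrete, here is a minimal sketch in Python of the three mappings as function signatures. The type names (State, Action, WorldModel, Goal, Policy) are placeholders of my own choosing, not notation from the paper.

```python
from typing import Callable

# Hypothetical placeholder types, for illustration only (not from the paper).
State = int
Action = int
WorldModel = Callable[[State, Action, State], float]   # approximates P(s' | s, a)
Goal = Callable[[list], bool]                           # predicate over a trajectory
Policy = Callable[[list, Goal], Action]                 # goal-conditioned: (history, goal) -> action

def planning(model: WorldModel, goal: Goal) -> Policy:
    """World model + goal -> policy (planning / model-based control)."""
    ...

def inverse_rl(model: WorldModel, policy: Policy) -> Goal:
    """World model + policy -> goal (inverse reinforcement learning)."""
    ...

def extract_world_model(policy: Policy, goals: list[Goal]) -> WorldModel:
    """Policy + goals -> world model (the new direction studied here)."""
    ...
```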

In this process, the agent demonstrates astonishing emergent capabilities!

This is because, in order to minimize training loss across multiple goals, the agent must learn a world model that enables it to solve tasks that have not been explicitly trained.

Even simple goal-directedness can give rise to various abilities, such as social cognition, reasoning under uncertainty, and intention recognition.

Additionally, in previous research, they found that achieving robustness requires a causal world model.

However, in fact, task generalization does not require extensive causal knowledge of the environment.

Here, a causal hierarchy does exist, but it concerns the nature and capabilities of agents rather than the inference process.

Now, let us carefully read this wonderful paper and embark on a feast of thought!

The hallmark of human intelligence is the world model

One major characteristic of human intelligence is the ability to complete new tasks with almost no supervision, a capability that can be formalized as "few-shot learning" and "zero-shot learning."

Now, LLMs are beginning to exhibit these abilities, which gives us hope for AGI—a system capable of completing long-sequence, goal-directed tasks in complex real-world environments.

In humans, this flexible goal-directed behavior heavily relies on rich psychological representations of the world, known as the "world model."

However, must one have a world model to achieve AGI?

This question has been a topic of debate in the industry.

In 1991, Brooks proposed the famous viewpoint in "Intelligence Without Representation": the world itself is the best model. All intelligent behavior can be generated through the interactions of the agent in the "perception-action" loop, without the need to learn explicit world representations.

Paper link: https://people.csail.mit.edu/brooks/papers/representation.pdf

However, increasing evidence suggests that model-free agents may in fact be implicitly learning world models, and even implicit planning algorithms.

This raises a fundamental question: Can we achieve human-level AI through "model-free shortcuts"? Or is learning a world model inevitable?

If a world model is necessary, how precise and comprehensive does it need to be to support a certain level of capability?

The answer in this paper is—

For a sufficiently diverse set of simple goal-directed tasks, any agent that satisfies the "regret bound" has necessarily learned an accurate predictive model of its environment.

In other words: The agent's policy already contains all the information needed to accurately simulate the environment.

More importantly, this conclusion holds for any agent that meets the "regret bound," regardless of its training method or architecture, and without even assuming rationality.

Moreover, in Section 3, the researchers propose a new algorithm for extracting world models from general agents.

The results indicate that even if the agent deviates significantly from the stated capability assumptions, these algorithms can still recover an accurate world model!

Experimental Setup

Throughout, uppercase letters denote random variables and lowercase letters denote their realized values, i.e., X = x.

We assume the environment is a controllable Markov process (cMP), i.e., a Markov decision process (MDP) without a specified reward function or discount factor.

Formally, a cMP consists of the following elements:

  • State set S

  • Action set A

  • Transition function P_ss'(a) = P(S_{t+1} = s' | S_t = s, A_t = a)

The sequence of state-action pairs evolving over time is called a trajectory, denoted τ = (s_0, a_0, s_1, a_1, …).

A finite prefix of a trajectory is called a history, denoted h_t = (s_0, a_0, …, s_t).

Definition 1 defines a controllable Markov process.

In Assumption 1, the researchers assume that the environment is described by an irreducible, stationary, finite controllable Markov process (Definition 1) with at least two actions.
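As a toy illustration of this setup (my own sketch, not code from the paper), a finite controllable Markov process can be represented by a transition tensor P[a, s, s'], and trajectories can be sampled by rolling a policy forward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cMP: 3 states, 2 actions; P[a, s, s'] = P(S_{t+1}=s' | S_t=s, A_t=a).
P = np.array([
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],   # action 0: mostly stay put
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],   # action 1: mostly cycle
])

def rollout(P, policy, s0, T):
    """Sample a length-T trajectory (s_0, a_0, s_1, a_1, ...) from the cMP."""
    traj, s = [], s0
    for _ in range(T):
        a = policy(s)
        traj.append((s, a))
        s = rng.choice(P.shape[1], p=P[a, s])
    return traj

# Example: a random policy; a history h_t is simply a prefix of the trajectory.
traj = rollout(P, policy=lambda s: rng.integers(2), s0=0, T=5)
print(traj)
```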

The researchers' aim is to define a class of simple, intuitive goals that agents can reasonably be expected to achieve.

Thus, they proposed Definition 2.

Using Definition 2, complex composite goals can be constructed by combining objectives in a sequential or parallel manner.

  • Sequential composition: for example, first achieve goal φ_A, then achieve goal φ_B;

  • Parallel composition: the composite goal is satisfied as long as either φ_A or φ_B is satisfied.

Then, they proposed Definition 3.

For example, a maintenance robot is assigned the following tasks: either repair a malfunctioning machine or find an engineer and notify him that there is a problem with the machine.

Repairing the machine requires executing a series of predetermined actions a_1, a_2,…,a_N, and achieving the corresponding expected states s_1, s_2,…,s_N at each step.

Finding and notifying the engineer requires the robot to move to the engineer's location S = s_eng and perform a notification action A = a′.

The robot's overall goal can be represented as the composite goal ψ = ψ_1 ∨ ψ_2; that is, completing either the repair task or the notification task is sufficient.
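A rough sketch of how such goal composition might be encoded, with a simplified version of the robot example. The trajectory-predicate encoding and the state/action names here are illustrative assumptions of mine, not the paper's formalism:

```python
# Goals as predicates over finite trajectories [(s_0, a_0), (s_1, a_1), ...].
Traj = list[tuple[str, str]]

def reach(state: str, action: str):
    """Atomic goal: at some step, be in `state` and take `action`."""
    return lambda traj: any(s == state and a == action for s, a in traj)

def then(goal_a, goal_b):
    """Sequential composition: achieve goal_a, then goal_b on the remaining suffix."""
    def combined(traj: Traj) -> bool:
        for t in range(len(traj)):
            if goal_a(traj[: t + 1]) and goal_b(traj[t + 1:]):
                return True
        return False
    return combined

def either(goal_a, goal_b):
    """Parallel composition: satisfied if either sub-goal is satisfied."""
    return lambda traj: goal_a(traj) or goal_b(traj)

# Simplified maintenance-robot goal: repair the machine, or reach the engineer and notify.
repair = then(reach("at_machine", "open_panel"), reach("at_machine", "replace_part"))
notify = reach("at_engineer", "notify")
psi = either(repair, notify)   # psi = psi_1 OR psi_2

print(psi([("hall", "move"), ("at_engineer", "notify")]))  # True: the notify branch succeeds
```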

Agent

The goal of this research is to propose a simplified definition to describe agents that can achieve multiple objectives in their environment.

To this end, the researchers focus on goal-conditioned agents, whose policy maps the history h_t and goal ψ to an action a_t (as shown in Figure 2).

The figure introduces an agent-environment system.

The agent is a function that maps from the current state s_t (or history) and goal ψ to action a_t.

The dashed line in the figure represents Algorithm 1, which recovers the environment's state-transition probabilities from this agent mapping alone.

Note that this definition does not require the agent to use the full environment history to choose actions; any policy (such as a Markov policy) can be represented in this way.

To simplify the analysis, researchers assume:

  • Complete observability: The state of the environment is fully visible to the agent.

  • Deterministic policy: The agent follows a deterministic policy.

Based on this, it is naturally defined that the optimal goal-conditioned agent for a given environment and goal set Ψ is one that maximizes the probability of achieving the goal ψ for all ψ ∈ Ψ, see Definition 4.

In reality, agents are rarely optimal, especially when performing tasks that require coordination of multiple sub-goals over a long time span in complex environments.

Therefore, the researchers relaxed Definition 4 and defined a class of bounded agents, which achieve goals in Ψ_n (goals up to a maximum depth n) with a failure rate that is bounded relative to the optimal agent.

Bounded agents are defined by two parameters (see Definition 5 below):

  • Failure rate δ ∈ [0, 1], which bounds the agent's probability of achieving a goal from below, relative to the optimal agent (akin to a regret bound);

  • Maximum goal depth n, where this regret bound only applies to goals with a depth less than or equal to n.

This definition naturally captures the kind of agents we care about: agents with a certain capability for achieving goals of a certain complexity (parameterized by δ and Ψ_n).

Importantly, Definition 5 only assumes that the agent possesses a certain capability.
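One plausible reading of Definition 5, written as a check (an assumption on my part; the precise formal statement is in the paper): for every goal of depth at most n, the agent's success probability should be at least (1 − δ) times that of the optimal agent.

```python
def is_bounded_agent(success_prob, optimal_success_prob, goals_up_to_depth_n, delta):
    """Check the assumed Definition-5 condition: for every goal psi of depth <= n,
    P(agent achieves psi) >= (1 - delta) * P(optimal agent achieves psi)."""
    return all(
        success_prob(psi) >= (1 - delta) * optimal_success_prob(psi)
        for psi in goals_up_to_depth_n
    )
```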

Agents are World Models

Ultimately, the researchers proved the "equivalence" of goal-conditioned policies and world models:

An approximation of the environment's transition function (a world model), with bounded error, is determined solely by the agent's policy.

Thus, learning such goal-conditioned strategies is informationally equivalent to learning an accurate world model.

This requires a careful proof; the details can be found in the appendix of the original paper.

Specifically, the researchers assume the agent is a goal-conditioned bounded agent (Definition 5), meaning its capability on goal-directed tasks of depth up to some finite n (Definition 3) is bounded from below.

First, the researchers provide the pseudocode for Algorithm 1 used in the proof of Theorem 1.

Given a goal-conditioned policy satisfying the regret bound, Algorithm 1 derives a bounded-error estimate of the transition probabilities.

Subsequently, the researchers present Algorithm 2, an alternative procedure for estimating P̂_ss'(a). Its error bounds are weaker than those of Algorithm 1, but it is considerably simpler to implement.

Properties of the Algorithms

Algorithm 1 can recover a bounded-error world model from a bounded, goal-conditioned agent.

Algorithm 1 is general, meaning that it applies to all agents that satisfy Definition 5 and all environments that satisfy Assumption 1.

It is also unsupervised; the only input to the algorithm is the agent's policy π.

The existence of this algorithm means that π can be transformed into a bounded-error world model; in other words, the world model is encoded in the agent's policy, and learning such a policy is informationally equivalent to learning a world model.
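The following toy sketch is not the paper's Algorithm 1 or 2, but it illustrates why such extraction is plausible. Suppose we can query an optimal goal-conditioned policy with a disjunctive goal of the form "reach s′ by taking action a in state s, or else win an independent reference lottery with known success probability q". An optimal agent pursues whichever branch is more likely to succeed, so its choice reveals whether P_ss′(a) ≥ q, and a binary search over q pins down the probability. Here the policy oracle is simulated by peeking at a hidden ground-truth tensor, purely for demonstration.

```python
import numpy as np

# Hidden ground-truth transition tensor P[a, s, s'] (the "environment").
P_true = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]],
    [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]],
])

def optimal_choice(s, a, s_next, q):
    """Stand-in for querying an optimal goal-conditioned policy on the disjunctive
    goal "reach s_next via (s, a)" OR "win a reference lottery with success prob q".
    An optimal agent pursues whichever branch is more likely to succeed."""
    return "transition" if P_true[a, s, s_next] >= q else "lottery"

def estimate_transition(s, a, s_next, tol=1e-3):
    """Binary-search the reference probability q to locate P(s_next | s, a)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        q = (lo + hi) / 2
        if optimal_choice(s, a, s_next, q) == "transition":
            lo = q   # the transition is at least as likely as q
        else:
            hi = q   # the lottery beats the transition, so P < q
    return (lo + hi) / 2

P_hat = np.array([[[estimate_transition(s, a, sn) for sn in range(3)]
                   for s in range(3)] for a in range(2)])
print(np.max(np.abs(P_hat - P_true)))   # small recovery error
```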

The accuracy of the world model recovered from Theorem 1 improves as the agent approaches optimality (δ→0) and/or as the depth n of achievable sequential goals increases.

A key conclusion from the derived error bounds is that for any δ<1, if n is sufficiently large, we can recover an arbitrarily accurate world model.

Therefore, to achieve long-horizon goals, even an agent with a high failure rate (δ close to 1) must learn a highly accurate world model.

The error bounds also depend on the transition probabilities.

This means that for any δ>0 and/or finite n, there may exist low-probability transitions that the agent does not need to learn.

This matches the intuition that suboptimal or short-horizon agents only need to learn a sparse world model covering the more common transitions.

However, to achieve higher success rates or longer time-span goals, a higher-resolution world model is required.

Figure 3: The average error ⟨ϵ⟩ in the world model recovered by Algorithm 2 and the trend of average error with ⟨δ(n=50)⟩.

Figure 3a shows that as the agent's generalization ability improves, the error (⟨ϵ⟩) of the recovered world model shows a significant downward trend.

This indicates that to maintain stable performance on more complex goals, the agent must construct a higher-precision internal world model.

This experiment validates the expectations regarding error convergence in the theoretical derivation.

n_max(⟨δ⟩ = 0.04) denotes the maximum goal depth the agent can achieve while keeping its average regret at or below 0.04. The error scales as O(n^(−1/2)), consistent with the relationship between the worst-case error ε and the worst-case regret δ in Theorem 1.

Figure 3b shows how the average error varies with ⟨δ(n=50)⟩, the average regret achieved by the agent on goals of depth n = 50. In both figures, the error bars represent 95% confidence intervals of the mean over 10 runs.

Myopic Agents: No Need to Learn World Models

For agents with a maximum goal depth of n = 1, Theorem 1 yields only a trivial error bound on the extracted world model.

It is unclear whether this means that agents optimizing only immediate outcomes (myopic agents) genuinely do not need to learn world models, or whether Theorem 1 simply fails to capture this type of agent.

To address this issue, researchers derived results for myopic agents.

These agents satisfy the regret bound for n=1, and only have a trivial regret bound (δ=1) for any n>1.

Theorem 2 implies that there is no procedure that can even partially determine the transition probabilities from the policies of myopic agents.

The proof of Theorem 2 explicitly constructs optimal myopic agents to illustrate this point; the details are given in Appendix B of the original paper.

Therefore, the policies of such agents provide only trivial bounds on the transition probabilities.

Thus, for myopic agents, learning a world model is not necessary—

A world model is only required when agents pursue tasks that involve multiple sub-goals and require multiple steps to complete.
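A small illustration in the spirit of Theorem 2 (my own construction, not the paper's): for depth-1 goals of the form "reach s′ in one step", an optimal myopic agent only needs argmax_a P_ss′(a). Two environments with different transition probabilities but identical argmaxes therefore induce exactly the same myopic policy, so the transition probabilities cannot be recovered from that policy.

```python
import numpy as np

# Two different environments: the tensors P[a, s, s'] differ, but for every
# (state, target state) pair the best action is the same in both.
P_env1 = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.4, 0.6], [0.7, 0.3]],
])
P_env2 = np.array([
    [[0.6, 0.4], [0.3, 0.7]],
    [[0.1, 0.9], [0.9, 0.1]],
])

def myopic_policy(P):
    """Optimal myopic agent for depth-1 goals "reach s_next in one step":
    pick the action maximizing the immediate transition probability."""
    return {(s, s_next): int(np.argmax(P[:, s, s_next]))
            for s in range(P.shape[1]) for s_next in range(P.shape[2])}

# True: identical policies, so the policy alone cannot distinguish the two
# transition functions.
print(myopic_policy(P_env1) == myopic_policy(P_env2))
```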

Source: New Intelligence. Original title: "DeepMind Reveals Surprising Answer: Agents Are World Models! Coinciding with Ilya's Prediction Two Years Ago"
