The next gateway to superintelligence: Google, Meta, NVIDIA and other tech giants are doubling down on "world models"

AI giants such as Google DeepMind, Meta, and NVIDIA are shifting their research focus towards "world models" in hopes of gaining an edge in the race towards machine "superintelligence." "World models" learn to understand the physical world from video and robotic data, and have broad application prospects. An NVIDIA executive said the potential market size could reach up to $100 trillion, covering fields such as autonomous driving, robotics, and manufacturing.

As progress in large language models slows, a new AI race centered on "world models" is quietly unfolding among tech giants. The trend suggests that the focus of competition in AI may be shifting from language to understanding and simulating the physical world.

According to a report by the Financial Times on September 29, companies such as Google DeepMind, Meta, and NVIDIA are attempting to gain an edge by developing a new type of system. These systems no longer rely solely on language but instead learn from video and robotic data to understand and navigate the physical world.

The potential market for "world models" is considered extremely vast. Rev Lebaredian, Vice President of NVIDIA Omniverse and Simulation Technology, stated that "world models" will bring technology into tangible fields such as manufacturing and healthcare, with a potential market size that could "reach up to $100 trillion."

"World models" are seen as a key step in advancing autonomous driving, robotics, and so-called "AI agents," but their training also faces significant data and computational challenges.

Simulating the Physical World: Latest Technological Breakthroughs

In recent months, several AI companies have announced progress on "world models," a sign that the sector is heating up.

Google DeepMind released Genie 3 last month, a model that generates video frame by frame while taking past interactions into account, a departure from the usual approach of generating an entire video at once. Shlomi Fruchter, co-lead of the Genie 3 project, stated that by constructing environments that simulate the real world, AI can be trained in a more scalable way, "without the consequences of making mistakes in the real world."
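The frame-by-frame idea can be pictured with a toy sketch. This is not Genie 3, whose internals the report does not describe; the class ToyWorldModel, the function rollout, and the constant WIDTH are hypothetical names, and a hand-written rule stands in for a learned model. The only point is that each frame is produced from the latest action plus a memory of earlier interactions, rather than the whole clip being generated in one pass.

```python
from dataclasses import dataclass, field

WIDTH = 8  # hypothetical width of the toy one-dimensional "world"


@dataclass
class ToyWorldModel:
    """Stand-in for a learned model: the 'prediction' is a hand-written rule."""

    position: int = 0
    # Memory of past interactions: every cell the agent has already visited.
    visited: set = field(default_factory=lambda: {0})

    def next_frame(self, action: int) -> str:
        """Produce one frame from the latest action and the remembered history."""
        # Move the agent, clamped to the edges of the world.
        self.position = max(0, min(WIDTH - 1, self.position + action))
        self.visited.add(self.position)
        # 'A' = agent, '.' = cell explored earlier, '#' = unexplored cell.
        return "".join(
            "A" if i == self.position else "." if i in self.visited else "#"
            for i in range(WIDTH)
        )


def rollout(actions):
    """Generate frames one at a time, one per incoming action."""
    model = ToyWorldModel()
    return [model.next_frame(action) for action in actions]


frames = rollout([1, 1, -1, 1, 1])
for frame in frames:
    print(frame)
```

In the third frame the agent has stepped back, yet the cell it left stays marked as explored. That persistence is the toy equivalent of a model "considering past interactions" when it draws the next frame.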

Meta is attempting to mimic the way children learn passively by observing the world, training its V-JEPA model with raw video content. The Facebook Artificial Intelligence Research Lab (FAIR), led by Meta's Chief AI Scientist Yann LeCun, released the second version of this model in June and has begun testing it on robots.

Meanwhile, Jensen Huang, CEO of chip giant NVIDIA, has asserted that the company's next major growth phase will come from "physical AI," with these new models set to revolutionize the robotics field. NVIDIA is using its Omniverse platform to create and run such simulations in support of its expansion into robotics.

One recent application of "world models" is in the entertainment industry. The startup World Labs, founded by AI pioneer Fei-Fei Li, is developing a model that can generate video game-like 3D environments from a single image.

Video generation startup Runway also launched a product last month that creates game scenes using "world models." Its CEO Cristóbal Valenzuela pointed out that, compared to previous models, the "world model" system can better understand and reason about the physical laws in a scene.

Why Are the Giants Betting on a New Track?

Tech giants are turning their attention to "world models" primarily because of a widespread belief in the industry that large language models (LLMs) are approaching their performance ceiling.

Despite heavy investment from major companies, the performance gains of the new generation of LLMs released by organizations like OpenAI, Google, and Elon Musk's xAI have begun to slow.

Yann LeCun, Meta's Chief AI Scientist, who is regarded as one of the "fathers" of modern AI, has consistently warned that LLMs will never achieve reasoning and planning capabilities akin to those of humans.

Building "world models," however, requires vast amounts of physical-world data and computational power, and this remains a significant technical challenge. Companies like NVIDIA and Niantic are attempting to fill the data gaps by using models to generate or predict environments.

Although the prospects are promising, the road to mature "world models" remains long. LeCun and others at Meta believe that building machines with human-level intelligence, driven by next-generation AI systems, may still take another decade.