
Hyper Track | XPeng Robotics Center establishes Intelligent Mimetic Department

Targeting the multimodal field of robotics with three major breakthrough directions
Author: Zhou Yuan / Wall Street News
Recently, it has been reported that XPeng's Robotics Center has established a new Intelligent Mimetic Department, focusing on the multimodal field of robotics, with research directions covering embodied intelligence native multimodal large models, world models, spatial intelligence, and other cutting-edge areas.
Public information shows that Ge Yixiao, who leads this department, has an impressive background.
Ge Yixiao previously served as a technical expert at Tencent ARC Lab and was promoted to Tencent T12 technical expert level at just 28 years old, making significant contributions in the multimodal field. He was awarded the Tencent Technology Breakthrough Award for two consecutive years in 2023 and 2024.
After graduating from the School of Automation at Huazhong University of Science and Technology, Ge Yixiao entered the MMLab at The Chinese University of Hong Kong to pursue a Ph.D., focusing on representation learning in computer vision, and has published several papers at top international conferences such as NeurIPS, ICLR, and ECCV.
Currently, the department has only three members, including Ge Yixiao, but this is just the beginning. It has already begun recruiting through social recruitment, campus recruitment, and internships, for the position of "Research Scientist (Multimodal Direction)." The job description mentions "building industry-leading embodied intelligence native multimodal large models, world models, with the potential for application in general humanoid robots and more embodied scenarios," as well as "creating technological influence and leading international industry development," phrasing that makes XPeng's high expectations for the new department plain.
Three Major Research Directions
In the evolution of robotics technology, traditional robots have significant shortcomings in perception and interaction: they can operate on only one or a few information sources, which greatly limits their capabilities in complex environments.
Firstly, the emergence of embodied intelligence native multimodal large models is expected to fundamentally change this situation.
These models aim to endow robots with comprehensive perception and interaction capabilities, allowing them to process visual, auditory, tactile, and other sensory modalities simultaneously, much like humans.
For example, in home service scenarios, most household robots can currently only perform simple cleaning tasks and often struggle with complex instructions.
If substantial progress is made with embodied intelligence native multimodal large models, robots will be able to accurately recognize their owner's voice commands and hand gestures while perceiving obstacles in the surrounding environment, thus smoothly completing complex and detailed tasks such as tidying up rooms and caring for the elderly.
In industrial production scenarios, robots can integrate visual recognition of component shapes and positions with tactile perception of assembly force, achieving high efficiency and precision in product assembly, significantly enhancing production efficiency and quality.
From a technical standpoint, such models must overcome challenges such as multimodal data fusion and unified representation learning, constructing a framework that can process the various sensory streams in concert, which places extremely high demands on algorithm design and computing power.
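The fusion step described above can be reduced to a few lines. The toy late-fusion sketch below (all encoder weights are random stand-ins, not anything from XPeng's actual models) simply projects each modality's features into a shared embedding space and averages them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders: in a real system these would be
# pretrained networks; here each is just a fixed random linear projection
# mapping modality-specific features into a shared 16-dim space.
D_SHARED = 16

def make_encoder(d_in, d_out=D_SHARED, seed=0):
    w = np.random.default_rng(seed).normal(size=(d_in, d_out)) / np.sqrt(d_in)
    return lambda x: np.tanh(x @ w)

encode_vision = make_encoder(64, seed=1)   # e.g. image features
encode_audio  = make_encoder(32, seed=2)   # e.g. spectrogram features
encode_touch  = make_encoder(8,  seed=3)   # e.g. tactile sensor readings

def fuse(vision, audio, touch):
    """Unified representation: mean of the per-modality embeddings."""
    stacked = np.stack([encode_vision(vision),
                        encode_audio(audio),
                        encode_touch(touch)])
    return stacked.mean(axis=0)

z = fuse(rng.normal(size=64), rng.normal(size=32), rng.normal(size=8))
print(z.shape)  # (16,)
```

Real systems replace the mean with learned fusion (e.g. cross-modal attention), but the core idea of mapping heterogeneous inputs into one shared representation is the same.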
Secondly, the construction of world models aims to enable robots to deeply understand the operational rules of the world through observation and interaction.
In the past, robots heavily relied on preset programs when performing tasks, lacking flexibility in the face of environmental changes or new tasks. World models can help robots infer state information about the world that has not been perceived and make reasonable predictions about future state changes.
In a factory environment, robots utilize world models to gain an in-depth understanding of factory layouts and equipment operating mechanisms, allowing them to anticipate potential issues during execution, such as delays in parts supply and conflicts in operational processes. This enables them to adjust their work pace and methods in advance, enhancing production efficiency and accuracy.
When robots are placed in new environments or face new tasks, world models allow them to reason and experiment based on existing knowledge and experience, freeing them from excessive reliance on preset programs.
For example, in a logistics warehouse, robots can understand storage rules and handling processes based on world models. When the placement of goods changes, they can quickly plan new handling routes to efficiently complete the task of moving goods.
From a technical implementation perspective, world models need to integrate a large amount of environmental data and use methods such as machine learning and reinforcement learning to construct models that accurately reflect dynamic changes in the environment, achieving precise modeling and prediction of complex environments.
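In its simplest form, that modeling loop is: collect (state, action, next-state) transitions from interaction, fit a transition model, then use it to predict states not yet observed. The sketch below (the linear dynamics and every name in it are illustrative assumptions, not XPeng's actual method) does exactly that with least squares:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "world": the true (unknown to the robot) dynamics are linear,
# next_state = A @ state + B @ action, plus a little sensor noise.
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])
B_true = np.array([[0.0], [0.5]])

# Collect interaction data: random states and actions, observed next states.
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 1))
S_next = S @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(500, 2))

# Fit a linear world model  next_state ~ [state, action] @ W  by least squares.
X = np.hstack([S, U])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

# Predict the next state for an unseen state/action pair and compare.
s, a = np.array([1.0, -0.5]), np.array([0.3])
pred = np.concatenate([s, a]) @ W
truth = A_true @ s + B_true @ a
err = np.abs(pred - truth).max()
print(err)  # small prediction error
```

Real world models swap the linear fit for deep networks trained on video and sensor streams, but the predict-from-experience structure is the same.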
Thirdly, spatial intelligence focuses on the robot's precise understanding and efficient use of three-dimensional spatial information.
In practical scenarios such as logistics warehousing and construction, robots need to accurately perceive and operate on objects in three-dimensional space.
Currently, most robots have limited precision in spatial perception and operation, making it difficult to meet the demands of complex tasks.
Robots with strong spatial intelligence can accurately determine the position, shape, size, and spatial relationships of objects, efficiently completing various spatial tasks.
At construction sites, robots can use spatial intelligence to identify the locations of building materials, plan lifting routes, and accurately transport materials, avoiding collisions with construction personnel and other equipment. In logistics warehousing, robots can quickly locate the storage positions of goods, optimize handling routes, and improve warehouse space utilization and the efficiency of goods entering and exiting.
From a technical perspective, spatial intelligence involves several key technological aspects, including three-dimensional visual perception, spatial reasoning, and path planning. It requires the development of advanced sensor technologies, algorithm models, and real-time computing capabilities to ensure that robots can process complex spatial information in real-time and accurately.
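Of the components listed, path planning is the easiest to show concretely. Below is a minimal A* planner on a 2D occupancy grid; real spatial-intelligence stacks plan in 3D with continuous geometry and live sensor updates, so this is only an illustrative toy:

```python
from heapq import heappush, heappop

def astar(grid, start, goal):
    """A* on a grid of 0 = free, 1 = obstacle; returns a list of cells."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, cost, pos, path = heappop(frontier)
        if pos == goal:
            return path
        if pos in seen:
            continue
        seen.add(pos)
        r, c = pos
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                heappush(frontier, (cost + 1 + h((nr, nc)), cost + 1,
                                    (nr, nc), path + [(nr, nc)]))
    return None  # no route exists

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
route = astar(grid, (0, 0), (3, 3))
print(len(route) - 1)  # 6 moves
```

With an admissible heuristic like Manhattan distance, A* returns a shortest route, which is why it remains a standard baseline beneath the 3D perception and reasoning layers described above.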
Strategic Value of Multimodality
He Xiaopeng, chairman of XPeng Motors, revealed in March this year that XPeng Motors has been deeply involved in the humanoid robot industry for five years and may need to invest another 20 years, planning to invest 50 billion or even 100 billion yuan.
He Xiaopeng also mentioned that XPeng Motors plans to mass-produce L3-level humanoid robots for industrial scenarios by 2026, achieving full-dimensional collaborative capabilities of hands, feet, eyes, and brain.
In the May earnings call, He Xiaopeng disclosed that the fifth-generation robots will carry Turing chips, significantly boosting on-robot computing power, and will skip the commonly used small reinforcement-learning models and segmented end-to-end technology routes, directly reusing the VLA architecture of XPeng's physical-world foundation model and making full use of cloud AI infrastructure to raise the robots' level of intelligence.
The establishment of the Intelligent Mimetic Department focusing on multimodality is a key move in XPeng's long-term strategic layout in the field of robotics.
Multimodal technology is considered a core element in enhancing robot intelligence. It breaks the limitations of traditional robot perception and interaction, allowing robots to perceive the world from multiple dimensions, obtain richer and more comprehensive information, and make more reasonable and intelligent decisions, greatly expanding robots' application scenarios and practical value.
From a strategic perspective, XPeng aims to build a differentiated competitive advantage in the robotics field by focusing on multimodal technology, laying a solid foundation for future expansion into areas such as smart mobility, home services, and industrial production.
Research directions such as embodied intelligence native multimodal large models, world models, and spatial intelligence are at the forefront of the industry and pose significant technical challenges.
In terms of algorithm optimization, it is necessary to break through the limitations of existing algorithms and develop new algorithms that can efficiently process multimodal data and achieve accurate predictions and decisions. Regarding computing power support, the current level of computing power is inadequate to meet the demands of massive data processing and complex model calculations, necessitating improvements in hardware performance and optimization of computing architecture.
Additionally, data quality is crucial; high-quality, diverse, and accurately labeled data is the cornerstone of model training. However, obtaining and organizing such data faces numerous challenges, such as high data collection costs and difficulties in ensuring labeling accuracy.
From the perspective of industry competition, the technology route competition in the robotics field is intense, with major enterprises and research institutions actively positioning themselves.
XPeng, by taking multimodal technology as its entry point, avoids direct confrontation with some giants, but the feasibility of this technological path has yet to be fully validated, and the outcomes of development remain uncertain.
However, if XPeng achieves breakthroughs in multimodal technology, it could reshape the industry landscape, driving the robotics industry toward greater intelligence and efficiency and injecting new vitality and ideas into its development.