
Li Auto heavily bets on "Smart Driving Veteran"

The ticket for the next stage
Author | Chai Xuchen
Editor | Wang Xiaojun
A week ago, Li Auto's heavily revamped first pure electric model, the i8, made its debut. Alongside it came Chairman Li Xiang's AI vision, one of whose core elements is the VLA "Driver Large Model."
Two years ago, after deciding to double down on intelligent driving, Li Auto tasted the sweetness of hitting the market trend, with sales continuously climbing and the high-end Max and Ultra versions being particularly popular. Li Auto hopes to take a further step by introducing a new technological architecture to solidify its advantages in the intelligent driving field, and thus launched the VLA (Vision, Language, Action) large model, naming it the "Driver Large Model."
In the past, if one accidentally missed an intersection, they would have to frantically search for a U-turn point, manually steering and checking road conditions, making mistakes easily in a panic. Now, all one has to say is, "Li Auto, make a U-turn ahead," and VLA immediately understands the command and executes it automatically.
"I believe VLA can solve the problem of fully autonomous driving," Li Xiang stated frankly, "The current rules and algorithms for assisted driving still have too large a gap compared to humans. The capability of the Driver Large Model is the closest to human ability and even has the potential to surpass human capabilities in intelligent driving solutions."
Why does VLA possess such powerful potential? In a recent interview, Lang Xianpeng, Senior Vice President of Autonomous Driving R&D at Li Auto, provided a detailed explanation of the principles behind VLA to Wall Street Insights.
Looking back, autonomous driving technology has developed rapidly in recent years, moving from the manual era to the AI era, with the watershed being the shift from rule-based, map-free solutions to end-to-end. The core of the manual era was controlling vehicle operation and motion with rule-based algorithms, so performance in that era hinged on the engineers.
However, according to Lang Xianpeng, humans have limitations, and many scenarios required "stacking people" to develop solutions. Moreover, fixing rules was like a game of whack-a-mole: "as soon as you finish one rule, another rule breaks." Because of this, the industry entered the end-to-end AI era.
Lang Xianpeng pointed out that the core of end-to-end + VLM is to mimic learning using human driving data. "In fact, we don't know how the car drives; we only know that the model we trained can drive." But end-to-end lacks deep logical thinking ability, "it's like a monkey driving a car, at most it's just a reflex."
Li Auto recognized this issue last year and pioneered the end-to-end + VLM approach, incorporating the visual language large model. When deep decision-making is needed, the VLM model can provide better decisions.
However, this is still not the optimal solution, "the reasoning speed of VLM is a bit slow, and the key is that many good decisions from VLM cannot be absorbed by the end-to-end model because the end-to-end model lacks thinking ability and does not understand what VLM is saying."
Thus, VLA was born. All of VLA's modules were newly designed: a spatial encoder feeds a language model that applies logical reasoning to produce sound driving decisions; Diffusion then predicts the trajectories of other vehicles and pedestrians and further refines the ego vehicle's trajectory, selecting the one that most resembles a "seasoned driver." Together, these improve the vehicle's understanding of complex environments and its decision-making within them.
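The pipeline described here can be sketched as a toy loop. Everything below is hypothetical — the function names, dimensions, and scoring rule are invented for illustration, since Li Auto has not published its implementation: a spatial encoder summarizes the scene, a language step emits a coarse decision, a diffusion-style sampler proposes candidate trajectories, and a scorer picks the smoothest, most "seasoned-driver-like" one.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_encoder(camera_frames):
    # Toy stand-in: flatten frames into a fixed-size scene embedding.
    return camera_frames.reshape(-1)[:64]

def language_decision(scene_embedding):
    # Toy stand-in for the language model's reasoning step:
    # emit a coarse maneuver label from the scene embedding.
    return "keep_lane" if scene_embedding.mean() > 0 else "slow_down"

def diffusion_sampler(scene_embedding, n_candidates=16, horizon=10):
    # Toy stand-in for diffusion-based proposal: sample noisy
    # candidate (x, y) trajectories around a straight path.
    base = np.stack([np.linspace(0, 30, horizon), np.zeros(horizon)], axis=-1)
    noise = rng.normal(scale=0.5, size=(n_candidates, horizon, 2))
    return base + noise

def seasoned_driver_score(traj):
    # Prefer smooth trajectories: penalize lateral second differences (jerk).
    return -np.abs(np.diff(traj[:, 1], 2)).sum()

def vla_step(camera_frames):
    z = spatial_encoder(camera_frames)
    decision = language_decision(z)
    candidates = diffusion_sampler(z)
    best = max(candidates, key=seasoned_driver_score)
    return decision, best

decision, traj = vla_step(rng.normal(size=(2, 8, 8)))
```

The key structural point the sketch captures is that the language step and the trajectory sampler are separate stages whose outputs are reconciled by a selection rule.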
"Can think, can communicate, can remember, can self-improve," this is how Lang Xianpeng summarizes the capabilities of VLA. Based on these abilities, the actual experience brought to users is safety, comfort, superb driving skills, and natural interaction capabilities.
The powerful potential of VLA has prompted many competitors and suppliers to follow suit and announce their entry into this new track. Now that Li Auto has chosen to bet heavily on the "smart driving veteran," can this VLA-led technological shift help it cement its position amid fierce market competition and ultimately secure the ticket to fully autonomous driving? The market is watching closely.
The following is a transcript of the conversation with Lang Xianpeng, Senior Vice President of Autonomous Driving R&D at Li Auto, and Senior Algorithm Experts Zhan Kun and Zhan Yifei:
Q: The VLA driver has reasoning capabilities and behaves more like a human, but it requires a few seconds of reasoning time. How does the VLA driver perform quick thinking in sudden scenarios?
Lang Xianpeng: In fact, the current reasoning frame rate of VLA is around 10Hz, which is more than three times the previous VLM (3Hz).
Zhan Kun: The self-developed base model plays a significant role in deploying VLA. VLA is a 4B model, larger than before but with faster reasoning speed. Not every open-source model in the industry can achieve this efficiency; we use a MoE 0.4×8 architecture, which is currently unique and was developed jointly with our base-model team.
The reasoning frame rate of VLA is around 10Hz, and each frame goes through a language model, which includes both quick and longer thinking processes. We have made many optimizations to ensure that the thinking process can be reasoned out as much as possible on the vehicle side.
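As a rough illustration of what a "MoE 0.4×8" might mean — this reading (8 experts of roughly 0.4B parameters each, with only a few active per step) is our assumption, as Li Auto has not published the details — here is a toy mixture-of-experts forward pass: the router scores all 8 experts, but only the top 2 actually run, so per-step compute is a fraction of the total parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 32, 8, 2   # toy sizes; real experts would be ~0.4B params

# One weight matrix per expert, plus a router that scores experts per input.
experts = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(N_EXPERTS)]
router = rng.normal(size=(D, N_EXPERTS))

def moe_forward(x):
    """Route the input to its top-k experts; only those experts run."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]            # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # softmax over selected experts
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out, top

x = rng.normal(size=D)
y, active = moe_forward(x)
```

Sparse activation is what lets a nominally large model keep a high on-vehicle inference rate: total capacity scales with all 8 experts, but latency scales with the 2 that fire.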
Q: How do you determine the timeline for the implementation of autonomous driving, and how will it be commercially monetized?
Lang Xianpeng: Technically, the VLA model can progress to higher levels of autonomous driving, but it is currently at an initial stage, roughly at the ceiling of what end-to-end can achieve, and there is still a long way to go. However, I do not believe the process will be particularly slow: it took about a year to go from 10 MPI (miles per intervention) to the current 100 MPI, and it may iterate to 1,000 MPI next year.
But the premise is having complete foundational capabilities, such as algorithms, computing power, and data, as well as the engineering support to realize them. In particular, VLA's training differs from end-to-end: it requires mature simulation environments for reinforcement learning, which is completely different from the earlier reliance on real vehicle data alone for imitation learning.
There are many factors influencing commercial monetization, with the most critical being national laws and policies. Li Auto is also actively participating in discussion groups on relevant national policies and regulations. From a technical perspective, the implementation of L4 level autonomous driving can happen very quickly, but from a commercial perspective, there are still many issues to consider, such as insurance and compensation after accidents.
Q: What are the challenges of the VLA model, and what challenges would a company face if it wants to implement it?
Lang Xianpeng: Can car companies that want to implement the VLA model skip the earlier rule-based and end-to-end phases? I don't think so. Although the data and algorithms for VLA may differ from before, they still need to be built on the previous foundation. Without a complete data loop collected from real vehicles, there is no data with which to train the world model.
The reason Li Auto can implement the VLA model is that we have 1.2 billion data points. Only by fully understanding this data can we generate better data. Without this data foundation, we cannot train the world model, and we also won't know what kind of data to generate. At the same time, the support for basic training computing power and inference computing power requires substantial funding and technical capability, which cannot be achieved without prior accumulation.
Q: In the future, how does Li Auto plan its computing power reserves and GPU procurement as autonomous driving capabilities advance?
Lang Xianpeng: The growth of computing power is tied to the technical solution. In the era of rule-based algorithms, training cards were only used to train BEV and perception models. In the end-to-end era, however, our training compute grew from less than 1 EFLOPS to 10 EFLOPS last year, roughly a tenfold increase. We believe training compute must keep growing, and inference compute must grow alongside it.
Q: There exists an "impossible triangle" in intelligent driving, where efficiency, comfort, and safety constrain one another. How does Li Auto think about this?
Lang Xianpeng: Driving data from Li Auto owners shows that accidents occur roughly every 600,000 kilometers of human driving, while with assisted driving functions engaged, accidents occur every 3.5 to 4 million kilometers. Our goal is to be 10 times safer than human driving, with an accident every 6 million kilometers, but this can only be achieved after the VLA model matures.
Our analyses show that safety risks can trigger takeovers, but poor comfort, such as sudden or hard braking, also triggers takeovers. If driving comfort is poor, users still won't want to use assisted driving, so we focused on improving the driving comfort of the i8.
Efficiency ranks after safety and comfort. For example, if we take a wrong turn, although efficiency is compromised, we will not immediately correct it through dangerous maneuvers; we still need to pursue efficiency based on safety and comfort.
Q: You just mentioned that this year's real vehicle testing is 20,000 kilometers. What is the basis for so significantly reducing real vehicle testing?
Lang Xianpeng: Cost is one aspect, but the main issue is that real vehicle testing cannot fully reproduce problem scenarios for verification, and its efficiency is too low. Our current simulation results can fully match real vehicle tests. Over 90% of the tests for the current super version and the VLA version on the Li Auto i8 are simulation tests.
Since last year's end-to-end version, we have been validating through simulation testing. We now consider its reliability and effectiveness very high, so we have replaced real vehicle testing with it. Some tests still cannot be replaced, such as hardware durability testing, but for performance-related tests we essentially substitute simulation, and the results are very good: the simulation is effective and the cost is low, so we retain real vehicle testing only where necessary.
Any technological advancement must be accompanied by changes in the R&D process. We have entered the era of VLA large models, and testing efficiency is the core factor in improving capability. If we want to iterate quickly, we must eliminate whatever in the process impedes rapid iteration. If a large amount of real vehicle testing and manual intervention remains, speed will suffer.
Q: You just shared the end-to-end bottlenecks and some unresolved issues. Was VLA the only route considered at that time?
Lang Xianpeng: We have always maintained predictions and explorations of cutting-edge algorithms. When doing end-to-end, we were also considering the next generation of artificial intelligence technology. At that time, the most promising technology in the industry was the VLA technical solution, but it is not only used for autonomous driving; it is a technology in the field of embodied intelligence. We believe it is also a universal technical framework for the future of robotics. In fact, autonomous driving is also a form of robotics. If we hope to develop other robots in the future, we can also base them on a similar VLA framework.
The VLA architecture has many advantages. Compared to VA models or end-to-end models, the VLA model has cognitive ability, which is an undeniable advantage. Without the pre-training and post-training approach of large language models, it is difficult to integrate such knowledge. For autonomous driving to advance to L4 or higher capabilities, VLA is a necessary path. Now, whether in large language models or other models, the field is likewise moving toward this end-to-end VLA approach.
Q: If the quantization accuracy is high, it can achieve double computing power on the Thor chip. Why can Li Auto maximize the chip's capabilities? Based on this capability, will Li Auto still develop its own intelligent driving chips?
Zhan Kun: Since last year, we have been using the Orin chip for large model deployment. At that time, NVIDIA thought it was impossible. Our engineering team and deployment team modified the underlying CUDA and rewrote the PTX low-level instructions to achieve the current results.
The engineering deployment capability of the Li Auto autonomous driving team has been consistent, from the early deployment of high-speed NOA on the Horizon J3 to deploying large models on the Orin chip, and now deploying high-frequency rapid large models on the Thor chip. These are all based on engineering accumulation and practice.
Whether a chip's capabilities can be maximized mainly comes down to low-level analysis. VLA's inference has improved from an initial 500-600 milliseconds per frame to 10Hz, nearly a tenfold gain in efficiency. Much of this came from analyzing, whenever a problem arose, how the current algorithm maps onto the chip, then adjusting operators to better match the chip's capabilities. Commonly used inference models run in FP16; we have reduced this to FP8, significantly improving performance. NVIDIA is also emphasizing FP4 in its latest Blackwell architecture, and we will continue to squeeze out the chip's computing power.
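The FP16-to-FP8 step can be illustrated with a simplified 8-bit quantization sketch. This is an int8-style symmetric scheme chosen for clarity; real FP8 formats such as E4M3/E5M2 encode an exponent and behave differently, and Li Auto's actual deployment pipeline is not public. The point it demonstrates is the trade: weights shrink to a quarter of FP32 storage at the cost of a small, bounded rounding error.

```python
import numpy as np

def quantize_8bit(w):
    """Symmetric per-tensor 8-bit quantization (int8-style illustration;
    hardware FP8 formats like E4M3 work differently)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, scale = quantize_8bit(w)
w_hat = dequantize(q, scale)

bytes_fp32 = w.nbytes          # 4 bytes per weight
bytes_int8 = q.nbytes          # 1 byte per weight
max_err = np.abs(w - w_hat).max()   # bounded by half the quantization step
```

Smaller weights also mean less memory bandwidth per inference, which is often the real bottleneck on an edge chip rather than raw FLOPS.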
Lang Xianpeng: The core reason for self-developing chips is that a dedicated chip can be optimized specifically for our algorithms, yielding high cost-effectiveness and efficiency. We are still using Thor because NVIDIA supports some new operators well and the computing power is relatively ample, and the VLA iteration process may still change. If the algorithms are locked down in the future, we will consider self-developed chips for better efficiency and cost.
Q: Is VLA an innovation that leans towards engineering capabilities?
Zhan Kun: If you focus on embodied intelligence, you will find that this wave is accompanied by the application of large models to the physical world, which essentially proposes VLA. Our VLA model aims to incorporate the ideas and pathways of embodied intelligence into the field of autonomous driving.
VLA is also an end-to-end system, because the essence of end-to-end is scene input and trajectory output, and VLA is the same, but the innovation in algorithms involves more thinking. End-to-end can be understood as VA, without a language model; language corresponds to thinking and understanding. We have added this part in VLA, unifying the paradigm of robots, allowing autonomous driving to also be a type of robot, which is an algorithmic innovation.
For autonomous driving, a major challenge is that engineering innovation is indispensable. Because VLA is a large model, deploying it on edge computing power is very difficult. It is not that teams think VLA is a bad idea; rather, deploying VLA and making it genuinely operational is very challenging, and with insufficient chip computing power it is impossible.
Q: Large language models may lack long-term memory and long-term planning capabilities. What improvements has Li Auto made in this regard?
Zhan Kun: Over the past year, large models and Agents have developed very rapidly, and memory is provided by RAG (retrieval-augmented generation). When we issue a command, it can be stored via RAG; the next time we return to the same place, the system can easily recall that such a command was issued before and add it to the prompt. We perform prompt tuning, which essentially incorporates this knowledge into the VLA input, so the large model gains this capability.
When we view the large model system as an Agent, it is essentially a system built around the large model, which includes tools and RAG external systems to enhance its memory and planning capabilities, allowing it to form a truly complete intelligent entity. We have done a lot of work to achieve this functionality.
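The memory mechanism described here can be sketched as a toy location-keyed store whose hits are prepended to the prompt. This is a deliberately simplified stand-in: real RAG systems retrieve by embedding similarity over a vector index rather than exact keys, and the function names below are invented for illustration.

```python
# Toy location-keyed command memory (a simplified stand-in for RAG:
# production systems use embedding similarity search, not exact keys).
memory = {}

def remember(location, command):
    """Store a user command under the location where it was issued."""
    memory.setdefault(location, []).append(command)

def build_prompt(location, scene_description):
    """Prepend any commands previously issued at this location to the prompt."""
    past = memory.get(location, [])
    recalled = "".join(f"Previously here the user said: {c}\n" for c in past)
    return recalled + f"Current scene: {scene_description}\nDecide the maneuver."

remember("junction_42", "take the second exit")
prompt = build_prompt("junction_42", "approaching roundabout")
```

The design choice worth noting is that memory lives outside the model: the model itself stays stateless, and recalled context is injected through the prompt, which matches the Agent framing in the answer above.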
Q: From an industry perspective, the current intelligent driving experience is relatively homogeneous. Will Li Auto output or open-source its intelligent driving capabilities to the industry, or sell them to other car companies in the future?
Lang Xianpeng: I believe it is possible. We hope to contribute to the industry. But the premise is, first, whether we can validate this system well, because the development of VLA is still in the early stages of the technology cycle; second, whether others have the capability to work with us on this, as they also need their own evaluation methods, simulation environments, and reinforcement learning training capabilities. We may discuss the open-source issue next year.
Q: Lang Bo mentioned that language interaction is a very important part of VLA. When can we achieve a more natural "do as I say" interaction experience?
Zhan Kun: A very important future trend is that the entire vehicle will have a unified brain. As this unified brain iterates, it will understand not only intelligent driving but also the vehicle system and the whole car, and it will distinguish more precisely whether I am commanding the vehicle's driving behavior, the air conditioning, the windows, or the music. This is the direction we will pursue over the long term.
Another aspect is our current interaction and generalized understanding of language. As the amount of data grows, this will improve through rapid iteration. Recall that early large language models also behaved foolishly at times; as feedback and iterations accumulate, progress is very fast. This is a capability that will iterate quickly as usage gradually increases.
Q: VLA is still in its infancy. Will there be more possibilities for personalized customization in driving styles or "driver personalities" in the future?
Lang Xianpeng: We are also considering providing different driving-style experiences for different vehicles and users. Not every car needs the same driving style. End-to-end may not have had this capability before, but reinforcement learning can support the car becoming more like your own style and preferences as you drive it.
Q: VLA is more focused on the brain aspect. What can be improved in terms of perception?
Lang Xianpeng: We still need to keep strengthening our technical capabilities. We have made significant perception upgrades in VLA, allowing the vehicle to see further and more precisely: pure visual range has expanded from 150 meters to 200 meters, and OCC (occupancy network) range has increased from 80 meters to 125 meters. These are technical capability improvements underway in VLA, including improvements in data and inference performance.
Q: Li Auto is the first domestic car company to implement the VLA model. What was the biggest challenge during the R&D process?
Lang Xianpeng: The biggest challenge was iterating the entire R&D process. Each technological innovation is accompanied by iterations in the R&D process or methods. Last year, the end-to-end process required a data-driven workflow, which we did well. This year, we must implement a reinforcement learning process, which requires us to quickly validate the reliability and effectiveness of our world model and rapidly build an efficient simulation environment. This year, we also need to purchase and deploy a large number of inference cards.
Q: Many domestic competitors are also following up on VLA. Can you share the biggest pitfalls Li Auto encountered during the entire R&D process?
Lang Xianpeng: Your judgment of the entire industry or understanding of autonomous driving determines whether you will encounter pitfalls. We continuously iterate our understanding of autonomous driving and even artificial intelligence. Last year, when we were working on end-to-end, we constantly reflected on whether end-to-end was sufficient. If it wasn't enough, what else did we need to do? Last year, we were conducting some preliminary research on VLA, which represents our understanding of artificial intelligence—not merely imitative learning. It must have thinking capabilities like humans and possess its own reasoning abilities. In other words, it must be capable of solving problems it has never encountered or unknown scenarios. While there may be some generalization capabilities in end-to-end, it is not sufficient to say it has thinking.
Just like monkeys, they may occasionally do things that exceed your imagination, but not consistently. Humans, by contrast, can grow and iterate. Therefore, we must develop our artificial intelligence the way human intelligence evolves, which allowed us to switch quickly from the end-to-end approach to the VLA solution.
We have always had a fairly sound understanding. There are certainly small pitfalls, such as how much computing power to reserve and whether delivery is fast or slow, along with other minor engineering details and optimizations, but we must avoid major errors of judgment. I think we have been quite lucky.
Zhan Kun: We believed in the scaling law early on, and the next step is the current test-time scaling law: given more data and longer training time, results keep improving. I think this is something we need to firmly believe in, what the AI community now calls "the bitter lesson." We must have faith in this.
Q: It seems that the process of integrating the Thor chip into vehicles was not easy. How did the two parties work together at the time?
Lang Xianpeng: In fact, we have accumulated a lot of experience cooperating with chip manufacturers and suppliers. Looking back at the J3 chip, it had significant design flaws at the time, but we worked with our partners to optimize and iterate. The adoption of any new chip is accompanied by mutual adjustment and iteration. Our iteration speed is relatively fast; we do not cling rigidly to one solution but adjust and optimize around the characteristics of the chip itself.
The Thor chip is a completely new chip, and it is normal to encounter issues in application and deployment. Companies that dare to adopt first-generation chips will face these problems and resolve them. For example, issues with the J3 were resolved in the J5; Orin-X issues may be resolved in Thor, and Thor's problems will likewise be resolved down the line.
Q: Is a larger cloud-based model always better? What model size is most suitable for car manufacturers?
Lang Xianpeng: Each has its advantages, but what is important is whether you can translate the capabilities trained by the model onto your own chip and convert it into actual value for users.
The larger the model's parameter count, the more resources and consumption it will require during training, which may lead to lower efficiency. If you want to distill a larger model into a very small one, there may also be a loss of capability during the distillation process. This tests the quantization optimization deployment capabilities of engineers in each company. For consumers, we still need to look at the final product experience and the value it brings to users.
Q: In training VLA, how do you prevent the large model from producing counterintuitive instructions that diverge from human understanding?
Zhan Kun: With current technology, the industry already has some preliminary consensus on methods and ideas for this.
First, we need to meticulously clean bad data; the more thoroughly we clean, the better the quality. Second, we need to generate data. Many large language models used to hallucinate, essentially because the model did not understand or had never encountered certain things and responded outside its domain. So we construct, and even generate, a great deal of data to help it fully understand the domain, to know all of the relevant knowledge, and even to know what it does not know. That is a very important capability.
These two ideas can significantly reduce a language model's hallucinations, even for things contrary to common sense. Third, super alignment lets it align better with human values. The case mentioned earlier, that it must not cross into the opposite lane, follows a similar idea.
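The "must not cross into the opposite lane" constraint can be pictured as a hard filter applied to candidate trajectories after the model proposes them. This is a hypothetical sketch: the coordinate convention and threshold are invented, and it shows one common way such rules are enforced, not necessarily Li Auto's.

```python
import numpy as np

LANE_LEFT_EDGE = 0.0   # toy convention: lateral y < 0 is the oncoming lane

def violates_opposite_lane(traj):
    """Reject any trajectory whose lateral position enters the oncoming lane."""
    return bool((traj[:, 1] < LANE_LEFT_EDGE).any())

def filter_candidates(candidates):
    """Keep only trajectories that never cross the lane boundary."""
    return [t for t in candidates if not violates_opposite_lane(t)]

# A trajectory that stays in lane vs. one that drifts across the boundary.
safe = np.stack([np.linspace(0, 10, 5), np.full(5, 1.5)], axis=-1)
unsafe = np.stack([np.linspace(0, 10, 5), np.linspace(1.5, -0.5, 5)], axis=-1)
kept = filter_candidates([safe, unsafe])
```

A hard filter like this complements alignment training: even if the model's learned preferences fail, the constraint is enforced deterministically outside the model.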
Q: Is there any relevant data to support that pure electric users prefer intelligent driving more?
Lang Xianpeng: According to our marketing department's research, yes, it is definitely a factor, ranking among the top three purchase considerations. New car buyers now treat intelligent driving as one of their primary selection criteria.
Q: Starting in the second half of this year, various car companies will promote VLA. What is Li Auto's technological advantage?
Zhan Kun: Our technology stack is continuous; we did not suddenly jump from the previous rule-based era to VLA. What we are doing with VLA is actually a continuation of our technological architecture, even leveraging our original advantages, standing on the shoulders of giants to continue.
We have invested heavily in R&D pre-research. VLA is a relatively new direction in the autonomous driving field, and Li Auto specifically established the TBP project to drive its technological exploration. We have consistently adhered to "pre-research one generation, develop one generation, deliver one generation," which gives us an advantage over peers and competitors.
Lang Xianpeng: Li Auto's core technological barrier is still world model simulation, which is difficult for others to replicate in a short time: without it, iteration speed cannot be guaranteed, and relying on real vehicle testing makes it very hard to catch up with us. Second, it is extensible; we have also established various other robotics departments, and VLA is a good embodied intelligence framework that may extend in other directions.
Q: How does Li Auto understand the barriers to VLA?
Lang Xianpeng: Five years ago, Li Auto did enter the self-developed autonomous driving track as a follower, but our thinking about autonomous driving did not start in 2020. Back then, when Li Xiang interviewed me, he asked what I thought was most important if we were to succeed in autonomous driving, or even be number one.
I said that, seen from today, it is data. Other factors matter too, but data must be prepared in advance. We started building the data closed loop with the Li ONE, although data was still relatively scarce then. In 2020, through our first full delivery year, we accumulated around 15 million effective feedback data points, and our samples have accumulated from there.
Over the past five years, starting from last year, the industry or our competitors have truly taken Li Auto's autonomous driving seriously, but it is too late for them because building these capabilities cannot be fully established or reach our level in a day or two. This year, we started doing VLA; we were the first to propose it and immediately the first to deliver it, while many others are still just talking about it and using an end-to-end approach to do VLA.
If you continue to follow the end-to-end approach to do so-called VLA, your speed will definitely slow down. Even with 100 million Clips, you first have to train on those 100 million Clips, which requires enormous training compute, and your iteration speed slows as well.
VLA may seem slow now, just as end-to-end did last year, yet end-to-end ended up moving very fast. It took us more than three years from 2021 to reach end-to-end, and that was standing on the shoulders of giants; the industry as a whole took about 10 years to move from rule-based algorithms to end-to-end. But iteration from end-to-end onward will be very fast, because by then the engineering and data have matured. I believe VLA will follow the same pace. When you see a 1,000 MPI product in front of you a year from now, I believe everyone will feel that autonomous driving has truly arrived.
I believe that companies with real technology, real capability, and real responsibility will definitely be the first to emerge. I believe Li Auto will definitely be the first to come out of this.
Q: Everyone says that multimodal models have not yet entered the so-called GPT moment. At this time, you need to create a mass production plan to push to the market. Do you think this plan is a good enough solution? How much longer will it take to reach the GPT moment?
Zhan Kun: Currently, VLM has fully met a very innovative GPT moment. If we talk about physical AI, the current VLA, especially in the fields of robotics and embodiment, may not have reached the GPT moment yet because it does not have such good generalization capabilities. However, in the field of autonomous driving, VLA addresses a relatively unified driving paradigm, which has the opportunity to achieve a GPT moment in this way.
We want to use VLA to explore a new path, and there are many points that need to be explored for implementation. It does not mean that if we cannot achieve the GPT moment, we cannot proceed with mass production. Our evaluation and simulation will verify whether it can achieve mass production and provide users with a "better, more comfortable, and safer" experience. Achieving these three points can provide users with better delivery.
The GPT moment refers more to having strong generality and generalization. As we expand autonomous driving toward spatial robotics or other embodied fields, we may develop stronger generalization or more comprehensive coordination abilities. After implementation, we will gradually migrate toward a ChatGPT moment through "user data iteration, richer scenarios, growing logical reasoning, and more voice interaction." It is not necessary to reach the GPT moment before building an autonomous driving model; after we ship VLA, we can still migrate toward it. This is the direction we will take the VLA model after the first version ships, moving toward "richer, more general, and more diverse" capabilities.