
Baidu open-sources Qianfan-VL, trained purely on the home-grown Kunlun chip, with world-class performance

Baidu has open-sourced Qianfan-VL, a new visual understanding model that comes in three versions: 3B, 8B, and 70B, all trained on the self-developed Kunlun Chip P800. Qianfan-VL is a multimodal large model with full-scene OCR capabilities and deep optimization for educational scenarios, enabling it to recognize all kinds of text as well as complex formulas. The 70B version scored 98.76 on the ScienceQA science question-answering benchmark and 80.98 on the Chinese multimodal benchmark CCBench, demonstrating its advantage in Chinese image-text understanding.
Baidu has directly open-sourced its brand-new visual understanding model, Qianfan-VL.

The Qianfan-VL series comes in three versions: 3B, 8B, and 70B, with the different parameter counts targeting different application scenarios.

The model has been trained entirely on Baidu's own Kunlun chip P800.
Model Performance and Applications
Qianfan-VL is a multimodal large model, capable of understanding both images and text. It can, for example, analyze the data and trends in a complex chart.
Its two core abilities are OCR (Optical Character Recognition) and deep optimization for educational scenarios.
When you take a photo of an ID card and the system automatically fills in your name and ID number, that is OCR at work. Qianfan-VL covers this capability across the full range of scenarios: printed text, handwriting, artistic fonts on street signs and product packaging, and even the complex formulas on math exam papers. It can also extract information from invoices and receipts and turn it into structured data.
In educational scenarios, especially at the K12 stage (kindergarten through senior high school), its goal is to be a super student. Photo-based problem solving, geometric reasoning, and function analysis are its strong suits.
Baidu benchmarked Qianfan-VL against several mainstream multimodal models from around the world.

In the ScienceQA test, the 70B version of Qianfan-VL achieved a near-perfect score of 98.76, leaving its competitors behind.
On the Chinese multimodal benchmark CCBench in particular, Qianfan-VL-70B scored 80.98, while competitors of the same scale managed only slightly over 70. This points to a significant advantage in understanding images and text in Chinese contexts.
And in several math problem-solving tests, such as MathVista-mini, MathVision, and MathVerse, Qianfan-VL-70B leads by a crushing margin.
Training on Purely Domestic Chips
Supporting the training of Qianfan-VL is Baidu's self-developed Kunlun Chip P800. In April 2025, Baidu lit up China's first fully self-developed 30,000-card Kunlun Chip P800 cluster, and all of Qianfan-VL's training tasks were completed on a cluster of more than 5,000 Kunlun Chip P800 cards.
Just how good is the Kunlun Chip P800?

Judging from the published specifications, the Kunlun Chip P800's most prominent advantage is its excellent power efficiency: it draws only 150W to 160W, far lower than its competitors. When building large-scale clusters, this translates into lower energy and cooling costs.
The real killer feature of the Kunlun Chip P800 lies in its architectural design.
The P800's XPU-R architecture separates the computing units and communication units at the hardware level. It is like turning a single-lane road into an eight-lane, two-way highway with a dedicated sidewalk alongside: computing and communication each get their own path, never interfere with each other, and can proceed simultaneously.
Baidu calls this technology "Unified Computing and Communication." With clever scheduling, the waiting time for data transmission can be completely masked by computation. For example, while the first piece of data is being computed, the second piece is already on its way; by the time the first is finished, the second seamlessly takes over. This greatly improves chip utilization, as the sketch below illustrates.
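Here is a minimal sketch in Python of the general overlap (double-buffering) idea described above. It is purely illustrative and says nothing about Kunlun's actual hardware or software stack; fetch_chunk and compute are hypothetical stand-ins for the communication and computation steps.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(i):
    """Hypothetical stand-in for communication: fetch the i-th data chunk."""
    time.sleep(0.1)  # simulated transfer latency
    return f"chunk-{i}"

def compute(chunk):
    """Hypothetical stand-in for computation on one chunk."""
    time.sleep(0.1)  # simulated compute time
    return f"result({chunk})"

def run_overlapped(n_chunks):
    """Prefetch chunk i+1 in a background thread while computing chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        future = io.submit(fetch_chunk, 0)                  # start fetching chunk 0
        for i in range(n_chunks):
            chunk = future.result()                         # wait for current chunk
            if i + 1 < n_chunks:
                future = io.submit(fetch_chunk, i + 1)      # prefetch the next chunk...
            results.append(compute(chunk))                  # ...while computing this one
    return results

print(run_overlapped(4))
```

With equal transfer and compute times, the overlapped loop takes roughly half as long as a strictly sequential one, which is exactly the utilization gain the scheduling trick is after.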
Based on this capability, Baidu also launched the "Kunlun Chip Super Node" solution, which can fit 64 Kunlun Chip P800s into a single cabinet. The data exchange between cards has changed from the slower "inter-machine communication" to the fast "intra-machine communication," directly increasing the bandwidth by 8 times and improving single-machine training performance by 10 times.
This Is How the Model Was Refined

Its underlying architecture builds on proven work from across the industry. For the language model, the small 3B version is based on Qwen2.5, while the flagship 8B and 70B versions are based on Llama 3.1. The visual encoder uses InternViT and can process ultra-high-definition images at 4K resolution. The sketch below shows the general shape of this kind of design.
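Conceptually this is the standard three-part VLM layout: a vision encoder feeding a small MLP projector (the "adapter" discussed below), whose output tokens are consumed by a language model. The following PyTorch skeleton is an illustrative assumption of that layout, not Baidu's actual implementation; every class name and dimension is a placeholder.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Illustrative VLM skeleton: vision encoder -> MLP adapter -> LLM.
    The encoder and LLM here are toy stand-ins for InternViT and
    Qwen2.5 / Llama 3.1; all dimensions are made-up placeholders."""

    def __init__(self, vision_dim=64, llm_dim=128):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in ViT
        self.adapter = nn.Sequential(                            # MLP projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = nn.Linear(llm_dim, llm_dim)                   # stand-in decoder

    def forward(self, image_tokens, text_embeddings):
        v = self.adapter(self.vision_encoder(image_tokens))  # project into LLM space
        return self.llm(torch.cat([v, text_embeddings], dim=1))

model = VisionLanguageModel()
out = model(torch.randn(1, 16, 64), torch.randn(1, 8, 128))
print(out.shape)  # torch.Size([1, 24, 128])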
The real essence, though, lies in the training method: Baidu designed an innovative "four-stage training pipeline," like a precise four-step upgrade program.

"Cross-modal alignment." The goal of this stage is simple: to let the model's language part and visual part get to know each other and establish the most basic connection. During training, only the connections between them (something called MLP Adapter) are updated, while the language and visual modules themselves are frozen first to avoid mutual influence "General Knowledge Injection." In this phase, the model is fed a massive amount of data, totaling 2.66 trillion tokens of general knowledge data. At the same time, all parameters of the model are opened up for training. The goal of this phase is to lay a solid foundation of general knowledge for the model, making it a well-rounded "generalist."
"Domain-Specific Knowledge Injection." After becoming a "generalist," it is time to cultivate its "specialty." Baidu has selected a large amount of high-quality data in areas such as OCR, document understanding, and mathematical problem-solving for specialized training of the model. To prevent the model from forgetting general knowledge while learning specialized knowledge (a phenomenon known as "catastrophic forgetting" in AI training), a portion of general data is also mixed in during training.
"Post-Training." After the first three phases, the model is already quite capable, but it may still not be very "obedient." This phase involves a large amount of instruction fine-tuning data to teach the model how to better understand and follow human instructions, making it more like a capable assistant.
The specialized data used in the third stage was "created" by Baidu itself through a high-precision data synthesis pipeline.
Currently, the entire series of Qianfan-VL models has been fully open-sourced on platforms such as GitHub and Hugging Face, allowing enterprises and developers to download and use them freely.
Baidu Intelligent Cloud's Qianfan platform also provides online experience and deployment services.
GitHub:
https://github.com/baidubce/Qianfan-VL
Hugging Face:
https://huggingface.co/baidu/Qianfan-VL-70B
https://huggingface.co/baidu/Qianfan-VL-8B
https://huggingface.co/baidu/Qianfan-VL-3B
ModelScope:
https://modelscope.cn/organization/baidu-qianfan
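For a quick local trial, the Hugging Face checkpoints can typically be loaded with the standard transformers pattern for models that ship their own modeling code. This is a hedged sketch under that assumption; consult the model card for the exact API, prompt format, and image preprocessing.

```python
from transformers import AutoModel, AutoTokenizer

model_id = "baidu/Qianfan-VL-8B"  # also: Qianfan-VL-3B, Qianfan-VL-70B

# trust_remote_code=True lets transformers run the modeling code bundled
# with the checkpoint, a common pattern for vision-language models.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # requires the `accelerate` package
)
```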
