
No sooner had DeepSeek mentioned FP8 than NVIDIA pushed FP4 precision into pre-training, making it faster and cheaper

DeepSeek mentioned its FP8 quantization design alongside the release of V3.1, drawing attention to domestic chips and large-model training. FP8, an ultra-low-precision format, reduces storage and computational overhead. Meanwhile, NVIDIA has extended its NVFP4 strategy to the pre-training phase, claiming to train at 4-bit speed and efficiency and to boost the efficiency of large-scale LLM training. The contrast highlights the diverging development paths of domestic large models and chips.
A few days ago, DeepSeek mentioned the quantization design of UE8M0 FP8 in the comment section of the article announcing DeepSeek V3.1, claiming it is designed for the upcoming next-generation domestic chip.
This has sparked a huge response, not only regarding the design of the next-generation domestic chip and the training of large models on domestic chips but also drawing attention to the quantization strategies for large models.
FP8, which stands for 8-bit floating point, is an ultra-low precision data representation format. Compared to traditional floating-point formats like FP32 (single precision) or FP16 (half precision), FP8 can further reduce storage and computational overhead while maintaining numerical stability and model accuracy.
In addition to NVIDIA, companies like Microsoft, Meta, Intel, and AMD are also researching FP8 training and inference, trending towards becoming the "new gold standard" in the industry.
Now, DeepSeek's adoption of the non-mainstream UE8M0 FP8 quantization strategy subtly reveals a development path for domestic large models and domestic chips that differs from NVIDIA's highly compatible approach.
The UE8M0 FP8 has significant strategic implications. By choosing to be the first to adopt and publicly declare the use of the UE8M0 format on the model side, DeepSeek binds its training and scaling strategies to this precision. This effectively sets a standard proposed by the large model side, forcing hardware and toolchains to adapt, accelerating the ecological construction of integrated domestic software and hardware.
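For readers unfamiliar with the notation, UE8M0 is an unsigned, exponent-only byte: 8 exponent bits, 0 mantissa bits, encoding a pure power-of-two scale. The sketch below is a rough illustration assuming the standard E8M0 convention (bias 127, one code point reserved for NaN); the helper names are hypothetical and this is not DeepSeek's actual implementation.

```python
import numpy as np

def ue8m0_encode(scale: float) -> int:
    """Encode a positive scale as UE8M0: an unsigned, exponent-only byte
    (8 exponent bits, 0 mantissa bits, bias 127) representing 2**(k - 127).
    Illustrative only; real implementations also reserve a NaN encoding."""
    k = int(np.round(np.log2(scale))) + 127   # nearest power-of-two exponent
    return int(np.clip(k, 0, 254))

def ue8m0_decode(k: int) -> float:
    """Decode the biased exponent byte back to its power-of-two value."""
    return 2.0 ** (k - 127)

# Every UE8M0 scale is an exact power of two: applying it to an FP8 value only
# shifts the exponent, which is cheap in hardware but coarser than a scale
# format with mantissa bits (compare the E4M3 scales discussed later).
print(ue8m0_decode(ue8m0_encode(0.37)))       # 0.5, the nearest power of two
```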
Coincidentally or not, shortly after DeepSeek floated the FP8 quantization strategy for domestic chips, NVIDIA has made another move in the low-precision quantization field. This time, however, it is not a further advance in FP8 quantization but a leap to FP4.
NVIDIA has expanded its latest NVFP4 strategy to the pre-training phase, claiming it can train with precision matching 16 bits while operating at the speed and efficiency of 4 bits.
NVIDIA stated: "Using NVFP4 in pre-training can significantly enhance the efficiency of large-scale LLM training and infrastructure performance. This is not just a progressive optimization but a fundamental shift in redefining the way large-scale model training is conducted."
In the era of the "AI factory," computing power is the engine of progress, and numerical precision is no longer a backend detail but a strategic advantage. NVFP4 4-bit pre-training sets a new standard for efficiency and scalability, pushing high-performance AI model development into a new phase.
Currently, NVFP4 training is still in the research phase, exploring and validating the potential of 4-bit precision in large-scale model pre-training. Collaborations and experiments surrounding NVFP4 are actively advancing, with participants including leading organizations such as AWS, Cohere, Google Cloud, Kimi AI, Microsoft AI, Mistral, OpenAI, Perplexity, Reflection, and Runway.
Opinions in the comments regarding NVIDIA's exploration at lower levels are mixed. Some recognize the positive role of NVFP4 in improving training speed and reducing costs and energy consumption, believing it is expected to drive more industries into an efficient and sustainable AI era.
Others note that the combination of NVFP4 and Jetson Thor could have a profound impact on real-world applications. Jetson Thor is NVIDIA's newly released next-generation chip dedicated to robotics; its significantly higher computing power lets it accommodate new embodied-intelligence algorithms and support various form factors such as humanoid robots.
The potential combination of the two may bring higher energy efficiency and speed optimization on the training side, while fully utilizing high-performance, low-power computing capabilities in edge and inference scenarios, ultimately forming an efficient complete closed loop from training to deployment.
However, some are skeptical. Regarding NVIDIA's claim of being greener, they argue that while the new data format brings various optimizations, it does not mean that the overall computing power demand and energy consumption of AI will decrease, nor can it fundamentally change the energy and resource pressures caused by the continuous expansion of AI.
What is 4-bit quantization?
4-bit quantization refers to reducing the precision of weights and activation values in a model to just 4 bits. This is a significant compression in precision compared to the common 16-bit or 32-bit floating-point formats.
Using 4-bit quantization during the pre-training phase is very challenging: gradients and parameter updates must be handled carefully so that the gains in training speed do not come at the cost of model accuracy.
To achieve this, NVIDIA must employ specialized techniques to map originally high-precision tensors onto a much smaller set of quantized values while preserving the model's effectiveness.
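To make the idea concrete, here is a minimal, framework-free sketch of what "mapping a high-precision tensor onto a smaller set of quantized values" can look like: the 16 values representable in FP4 (E2M1) plus one scale shared by the whole tensor. The function names are illustrative, and this is deliberately simplified compared with NVIDIA's recipe, which adds the block-level machinery described later.

```python
import numpy as np

# The 16 values representable in FP4 (E2M1): +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[:0:-1], FP4_GRID])   # drop duplicate -0

def quantize_fp4(x: np.ndarray):
    """Map a high-precision tensor onto the FP4 grid with one scale per tensor.
    A simplified sketch; NVFP4 instead uses one scale per 16-element block."""
    scale = max(np.abs(x).max() / 6.0, 1e-12)      # 6 is the largest FP4 magnitude
    idx = np.abs(x[..., None] / scale - FP4_GRID).argmin(axis=-1)
    return FP4_GRID[idx], scale                    # quantized values + shared scale

def dequantize_fp4(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_fp4(x)
print("max quantization error:", np.abs(x - dequantize_fp4(q, s)).max())
```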
Fewer Bits Unlock Greater Potential for AI Factories
In recent years, AI workloads have exploded, not only in the inference deployment of large language models (LLMs) but also in the scaling of foundation models during the pre-training and post-training phases.
As more institutions expand their computing infrastructure to train and deploy models with billions of parameters, a core metric has emerged: how much token throughput an AI factory can sustain to unlock the next stage of model capabilities.
In the inference stage, precision formats have undergone multiple innovations: from the initial FP32 (32-bit floating point) to FP16, then to FP8, and recently even to NVFP4 released by NVIDIA for AI inference. Practices have shown that methods like post-training quantization (PTQ) can significantly enhance inference throughput using NVFP4 while maintaining accuracy.
However, challenges still exist in the upstream pre-training phase—most foundation models currently rely on BF16 or FP8 to maintain stability and convergence.
Pre-training is precisely the stage where AI factories consume the most computing power, energy, and time. With limited computing budgets and scarce GPU clock cycles, developers must be meticulous—calculating every bit, every token, and every training cycle. Throughput here is not just an abstract metric; it directly determines the scale of models that can be trained, how many experiments can be run, and how quickly new breakthroughs can be achieved.
This is where 4-bit precision truly has disruptive significance.
By reducing memory requirements, enhancing arithmetic operation throughput, and optimizing communication efficiency, 4-bit pre-training allows AI factories to process more tokens under the same hardware conditions. With the right quantization methods, its precision performance can be comparable to FP8 or BF16 while significantly improving throughput.
This means:
- Faster model convergence;
- More experiments can be run per unit of computing power;
- The ability to train unprecedentedly large cutting-edge models.
In other words, fewer bits not only save costs but also expand the capability boundaries of AI factories.
NVFP4 Pre-training Quantization Solution
To achieve 4-bit precision pre-training, NVIDIA has developed a dedicated NVFP4 pre-training solution that addresses the core challenges of dynamic range, gradient fluctuations, and numerical stability in large-scale training.
Blackwell is NVIDIA's first architecture with native support for the FP4 format. The massive FP4 FLOPs throughput of GB200 and GB300 accelerates low-precision matrix operations while preserving the scale and parallelism required for large-model convergence, enabling efficient 4-bit training and making it an ideal choice for the next generation of FP4-based AI factories for pre-training.
Figure 1 shows GEMM performance measured on Blackwell Ultra: roughly a 7x speedup over the Hopper generation. Modern large language models (LLMs) fundamentally rely on matrix multiplication, particularly in their fully connected (linear) layers, where it is the core computational element, so the efficiency of these operations is crucial.
FP4 precision can execute these operations faster and more efficiently, and the observed GEMM acceleration means that the entire pre-training process is significantly accelerated, thereby shortening training time and supporting the rapid development of larger-scale models.
To achieve efficient low-precision training, NVIDIA's NVFP4 pre-training scheme employs several key technologies that are carefully selected based on performance and precision, including:
1. Utilizing micro-block scaling to enhance numerical representation of NVFP4
Blackwell introduces native Tensor Core support for NVFP4. NVFP4 is a 4-bit numerical format used for weights and activation values, employing micro-block scaling technology—where every 16 four-bit elements share a common scaling factor. Compared to MXFP4, which sets the block size to 32 elements, NVFP4 reduces the block size to 16 elements, thereby minimizing the impact of outliers and achieving more precise scaling. Finer-grained scaling reduces quantization error and improves overall model accuracy.
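As a rough illustration of the effect (not NVIDIA's kernel code), the sketch below computes per-block scales for 16- and 32-element blocks and shows how a single outlier inflates only the scale of its own block.

```python
import numpy as np

def block_scales(x: np.ndarray, block: int) -> np.ndarray:
    """One shared scale per `block` consecutive elements (NVFP4: 16, MXFP4: 32).
    Each scale maps the block's largest magnitude onto 6, the largest FP4 value."""
    return np.abs(x.reshape(-1, block)).max(axis=1) / 6.0

x = np.random.randn(1024).astype(np.float32)
x[3] = 50.0                                   # a single outlier in block 0

# With one scale for the whole tensor, the outlier stretches the quantization
# grid for all 1024 values; with 16-element micro-blocks, only the outlier's
# own block pays the price, and the remaining blocks keep fine-grained scales.
print("per-tensor scale:", np.abs(x).max() / 6.0)
print("block-16 scales :", np.round(block_scales(x, 16)[:4], 3))
print("block-32 scales :", np.round(block_scales(x, 32)[:4], 3))
```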
2. High-precision block encoding of NVFP4 using E4M3 scaling factors
The precision of the scaling factors is critical to quantization quality and accuracy. Unlike MXFP4, whose scale factors are limited to powers of two (E8M0) and therefore prone to large rounding errors, NVFP4 uses high-precision E4M3 scale factors with additional mantissa bits. This allows finer-grained scaling, more effective use of the limited quantization range, and more accurate representation of the values within each block.
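The difference can be sketched numerically. Below, an "ideal" block scale is rounded once to a power of two (E8M0-style) and once with three retained mantissa bits (an E4M3-style approximation; the real format also has a limited exponent range, which is ignored here). The helper names are illustrative.

```python
import numpy as np

def round_scale_e8m0(s: float) -> float:
    """E8M0-style scale: powers of two only (no mantissa bits)."""
    return float(2.0 ** np.round(np.log2(s)))

def round_scale_e4m3(s: float) -> float:
    """E4M3-style scale: 3 mantissa bits, simulated by rounding the mantissa
    to 1/8 steps. A simplification of the real format."""
    e = np.floor(np.log2(s))
    m = np.round(s / 2.0 ** e * 8) / 8          # keep 3 fractional mantissa bits
    return float(m * 2.0 ** e)

s = 0.37                                        # an "ideal" block scale
print("E8M0 scale:", round_scale_e8m0(s))       # 0.5   -> ~35% relative error
print("E4M3 scale:", round_scale_e4m3(s))       # 0.375 -> ~1.4% relative error
```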
3. Reshaping tensor distributions to fit low-precision formats
During LLM pre-training, gradients and activation values often exhibit significant outliers, which can affect low-precision quantization. Applying Hadamard transforms to GEMM inputs can reshape their distribution to be closer to a Gaussian distribution, thereby smoothing out outliers and making tensors easier to represent accurately. These transformations are transparent to the model structure and can be applied in the linear layers of both forward and backward propagation.
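A toy sketch of the idea, assuming a standard Sylvester-construction Hadamard matrix: rotating the inputs spreads a single outlier channel across all channels, shrinking the largest magnitude, and because the rotation is orthogonal it can be undone exactly (or folded into the matching GEMM operand). This is illustrative, not the production kernel.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix of size n (a power of two), via Sylvester."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

d = 64
H = hadamard(d)                          # H @ H.T == identity

x = np.random.randn(1000, d)
x[:, 0] += 20.0                          # inject an outlier channel

x_rot = x @ H                            # rotate before quantization

# After rotation, the largest magnitude shrinks sharply, so a shared
# low-precision scale wastes far less of its range on one outlier,
# and the rotation is exactly invertible.
print("max |x|     :", np.abs(x).max())
print("max |x @ H| :", np.abs(x_rot).max())
print("invertible  :", np.allclose(x_rot @ H.T, x))
```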
4. Using quantization techniques to maintain data consistency
To ensure stable and efficient training, NVIDIA adopts quantization methods that keep forward and backward propagation consistent. Techniques such as selective two-dimensional block quantization help keep tensor representations aligned throughout the training cycle. This consistency is crucial for minimizing signal distortion, improving convergence behavior, and enhancing overall robustness, especially in low-precision formats like NVFP4.
5. Reducing Bias through Stochastic Rounding
Unlike traditional (deterministic) rounding, which always rounds to the nearest representable value, stochastic rounding rounds up or down probabilistically, based on where the value falls between the two nearest representable values. This step is crucial for reducing rounding bias, maintaining gradient flow during training, and ultimately improving model accuracy.
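A minimal sketch of the idea on an evenly spaced grid (NVFP4's actual grid is the non-uniform FP4 value set, and the helper below is hypothetical): the probability of rounding up equals the fractional distance to the upper neighbor, so the rounding error is zero in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x: np.ndarray, step: float = 1.0) -> np.ndarray:
    """Round to multiples of `step`, rounding up with probability equal to the
    fractional part, so E[rounded] == x (unbiased, unlike nearest rounding)."""
    scaled = x / step
    lower = np.floor(scaled)
    frac = scaled - lower
    round_up = rng.random(x.shape) < frac
    return (lower + round_up) * step

# A tiny gradient of 0.1 on a grid with step 1.0: nearest rounding erases it
# every time, while stochastic rounding keeps it ~10% of the time, so the
# accumulated update is preserved on average.
g = np.full(100_000, 0.1)
print("nearest    mean:", np.round(g).mean())           # 0.0
print("stochastic mean:", stochastic_round(g).mean())   # ~0.1
```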
Precision and Stability at the Trillion Token Scale
To make low-precision formats practical in large-scale pre-training, both model accuracy and convergence stability must be ensured.
To evaluate the feasibility of 4-bit precision in large-scale model training, NVIDIA conducted experiments with FP8 and NVFP4 on a 12 billion parameter Hybrid Mamba-Transformer architecture model.
This model is similar to NVIDIA Nemotron Nano 2 and was trained on a massive dataset of 10 trillion tokens, using a staged data-mixing strategy: the data mix is switched at the 70% point of training, with a third-phase switch at the 90% point of pre-training.
One version of the 12B Hybrid Mamba-Transformer model was initially trained using 8-bit precision (FP8). Previous research has shown that the accuracy of FP8 is very close to that of 16-bit precision, so FP8 was used as NVIDIA's baseline for comparison.
Subsequently, NVIDIA successfully trained the same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support complete pre-training at the trillion token scale. Moreover, NVFP4 exhibited stable convergence during training, without the instability or divergence issues that typically plague ultra-low precision training.
As shown in Figure 3, the validation loss curve of NVFP4 closely aligns with the loss curve of the high-precision baseline (i.e., FP8) throughout the training process. The aforementioned quantization techniques ensure that even with a significant reduction in bit width, the dynamic performance of 4-bit pre-training remains very close to that of high-precision training.
Subsequently, NVIDIA pre-trained the 12 billion parameter Hybrid Mamba-Transformer model using NVFP4 and compared it with the higher precision FP8 baseline across multiple downstream tasks and intelligence domains.
As shown in Figure 4, NVFP4's accuracy is comparable to FP8 across all domains, and it even surpasses FP8 in the coding domain, demonstrating its effectiveness. This result further reinforces the initial hypothesis: even at the trillion-token scale, NVFP4 remains a robust choice for pre-training large language models, validating its potential for efficient, large-scale frontier-model training.
Smarter training, not just more investment
According to NVIDIA, the NVFP4 format is redefining the landscape of AI training and can set new benchmarks for speed, efficiency, and purposeful innovation. By achieving 4-bit pre-training, NVFP4 allows AI factories to scale faster and more sustainably, laying the foundation for a new era of generative AI.
Additionally, as a dynamic and continuously evolving technology, NVFP4 will continuously create new opportunities for frontier model teams, driving energy-efficient and high-performance AI development. With breakthroughs in computational efficiency, 4-bit pre-training will empower more advanced architectures, larger-scale training, and token processing, injecting new momentum into future intelligent systems.