What is the UE8M0 FP8 that ignites domestic computing power chips?

Wallstreetcn
2025.08.24 01:20

With the expansion of parameter scales in deep learning models, the demand for efficient computing and storage solutions has increased. Reducing the bit width of data types is an effective approach, but maintaining accuracy is a challenge. The Microscaling format introduced with NVIDIA's Blackwell GPUs improves efficiency in exactly this direction. DeepSeek V3.1 uses a UE8M0 FP8 scale, driving a short-term surge in domestic-chip concept stocks. Some domestic GPUs/NPUs claim to support FP8/MX, strengthening the narrative of software-hardware collaboration. OCP released Microscaling v1.0 in 2023, and in 2025 NVIDIA made MXFP8 a native data type on Blackwell to improve training efficiency.

With the expansion of parameter scales in deep learning models (especially large-scale generative models), the demand for more efficient computing and storage solutions has become increasingly urgent. Reducing data type bit width (precision) is an effective approach, but maintaining accuracy while lowering bit width is a significant challenge.

During the pre-training process, representing model parameters and related tensors with fewer bits has become an essential technology for improving GPU efficiency without sacrificing accuracy. The Microscaling (MX) format introduced in NVIDIA's Blackwell generation GPUs combines narrow bit-width floating-point types with finer-grained block scaling factors, marking an important advancement in this direction; it allows more tensors to be quantized and makes operations on these tensors more efficient.

DeepSeek has sparked interest in domestic computing power chips. Does this mark a critical turning point for domestic chips to break through? From an industrial perspective, the work ahead is far from simple, and the road remains long and arduous.

DeepSeek V3.1 publicly named its use of the UE8M0 FP8 scale and hinted at collaboration with "next-generation domestic chips." After concentrated media coverage, the "domestic chip / FP8" concept surged in the A-share and Hong Kong markets, and the topic quickly gained traction. At the same time, some domestic GPUs/NPUs claimed "native FP8 / Block FP8" support or tool stacks that support FP8/MX, further reinforcing the narrative of "software-hardware collaboration → unlocking bandwidth/power dividends."

UE8M0/FP8 (MX) is not a new concept; as early as 2023, OCP released Microscaling (MX) v1.0 (block size K=32, shared scale UE8M0, etc.), establishing "block-level scaling + narrow bit-width floating point" as an industry standard. By 2025, NVIDIA's Blackwell, the reigning AI chip, had made MXFP8/6/4 native data types for its tensor cores, handling the "one 2^k scale for every 32 numbers" (UE8M0) logic directly in hardware rather than in software. Official materials and developer blogs have emphasized this point. With native support, end-to-end MXFP8 training throughput is approximately 2× that of BF16, rather than a "paper speedup" confined to the kernel. (This is stated in both the paper and the official documents.)

I looked up the related papers; there is not much content, just over ten pages. The latest paper spells out a reproducible recipe for stabilizing large-model pre-training: all tensors (including activation gradients) uniformly use E4M3; scales use UE8M0, with log2(amax/destmax) rounded up to avoid divergence caused by overflow, which clearly distinguishes it from OCP v1.0's default rounding recommendation; and it provides empirical comparisons against BF16 and other precisions at the 8B-parameter / 15T-token scale.
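To make the rounding rule concrete, here is a minimal numpy sketch of the per-block scale computation described above. It assumes E4M3 elements (largest finite magnitude 448) and K=32; the function name is ours, and a real kernel would also round the elements to the E4M3 grid.

```python
import numpy as np

E4M3_MAX = 448.0   # destmax: the largest finite E4M3 magnitude

def ue8m0_scale_exponent(block: np.ndarray) -> int:
    """Per-block scale exponent: take the block amax and round log2(amax / destmax)
    *up*, so that amax / 2**e never exceeds the E4M3 range (no overflow)."""
    amax = float(np.max(np.abs(block)))
    e = int(np.ceil(np.log2(max(amax, 2.0**-127) / E4M3_MAX)))
    return int(np.clip(e, -127, 127))   # UE8M0 can only encode 2^-127 ... 2^127

block = np.random.randn(32).astype(np.float32) * 300.0   # one K=32 block
e = ue8m0_scale_exponent(block)
assert np.max(np.abs(block)) / 2.0**e <= E4M3_MAX        # scaled values fit in E4M3
print(e)
```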

However, the most critical aspect still lies in the underlying software and operator ecosystem: Transformer Engine, cuDNN, and cuBLAS have implemented FP8/MX operators and data flows, and NVIDIA NeMo and the TE user manuals provide the engineering path.

There are more and more real cases on the large-model side: public materials for Nemotron-H and the Llama series mention the FP8 route (initially mostly per-tensor scaling, now shifting to finer block scaling/MX), and there is even an online FP8 inference path in vLLM. All of these connect the "training—inference—deployment" chain. The ecosystem is also spreading across vendors (for example, Transformer Engine on the ROCm side), further broadening general awareness.

What specific problems does it solve?

  1. Dynamic range overload: scaling an entire tensor with a single factor often cannot accommodate large and small values at the same time, which easily leads to overflow or values being flushed to zero; block scaling can "hug" the local amplitude, so less information is lost (see the sketch after this list).

  2. Lower bandwidth/VRAM pressure: 8-bit elements, with only 1 byte of scale metadata added for every 32 elements; compared with storing an FP32 scale per block, the scale metadata traffic is reduced by 75%.

  3. Lower hardware cost: UE8M0 only encodes 2^k, requiring only shifting, with a short critical path and low power consumption; for chips without complete FP8 multiply-accumulate units, the implementation threshold is lower.
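A toy numpy demonstration of point 1, under stated assumptions: the real E4M3 cast is approximated by saturation at 448 and by flushing anything below the smallest E4M3 subnormal (2^-9) to zero, and the helper names are ours.

```python
import numpy as np

E4M3_MAX, E4M3_TINY = 448.0, 2.0 ** -9   # largest finite / smallest subnormal E4M3 magnitude

def fake_fp8(x):
    """Rough stand-in for an E4M3 cast: saturate at the top, flush values below the
    smallest subnormal to zero (mantissa rounding is ignored for simplicity)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    return np.where(np.abs(x) < E4M3_TINY, 0.0, x)

def pow2_scale(amax):
    """Power-of-two scale chosen so that amax fits within the E4M3 range."""
    return 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))

rng = np.random.default_rng(0)
big   = rng.normal(0.0, 1000.0, 32)   # one block of large values
small = rng.normal(0.0, 1e-4,  32)    # one block of small values
tensor = np.concatenate([big, small])

# (a) one scale for the whole tensor: the small block lands below the subnormal range
s = pow2_scale(np.max(np.abs(tensor)))
small_per_tensor = fake_fp8(tensor / s)[32:] * s

# (b) one scale per 32-element block: the small block is rescaled near full range
s_small = pow2_scale(np.max(np.abs(small)))
small_per_block = fake_fp8(small / s_small) * s_small

print("per-tensor scaling:", np.count_nonzero(small_per_tensor), "of 32 small values survive")
print("per-block scaling: ", np.count_nonzero(small_per_block), "of 32 small values survive")
```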

Why does this bring benefits to domestic chips? At a stage when most domestic chips still primarily use FP16/BF16 + INT8 pathways, introducing block-level scaling plus native or near-native FP8 storage, access, and operators can significantly reduce bandwidth and increase throughput without sacrificing precision. The hardware cost of UE8M0's "exponent-only scaling" is the lowest, making it a suitable transitional or even long-term solution. Although it cannot match NVIDIA's results and remains a second-best option for now, it is especially suitable for certain smaller edge scenarios.

1) What are UE8M0 / FP8 / MXFP8?

UE8M0 is not "another type of FP8," but rather the "block scaling factor" in the MX (Microscaling) format—8 bits are entirely given to the exponent (E8M0), only encoding powers of 2, used to uniformly scale FP8 elements within the same small block (typically K=32); thus, decoding only requires exponent shifting, without the need for floating-point multiplication, resulting in a shorter hardware critical path and more favorable bandwidth/energy consumption.
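A minimal sketch of what decoding a UE8M0 scale byte involves, assuming the OCP E8M0 convention (bias 127, the all-ones code reserved for NaN); the function name is illustrative.

```python
def decode_ue8m0(byte: int) -> float:
    """Decode one UE8M0 scale byte: 8 exponent bits, no sign, no mantissa.
    0xFF encodes NaN; any other code e maps to 2**(e - 127), so applying the scale
    is just an exponent adjustment (a shift), never a floating-point multiply."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)

print(decode_ue8m0(127))   # 1.0   (2^0)
print(decode_ue8m0(0))     # 2^-127
print(decode_ue8m0(254))   # 2^127
```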

What common misconceptions exist?

  • Mistaking UE8M0 for "a third type of FP8": incorrect. It is the format of the "scaling factor"; the elements are still E4M3/E5M2.

  • Believing that "with UE8M0 there will inevitably be a significant speedup": in fact, the benefit depends on whether the hardware is natively MX, whether the model is bandwidth-bound, and whether communication/memory becomes the new bottleneck.

  • Understanding "75% savings" as "total traffic reduced by 75%": inaccurate. It means the per-block scale metadata shrinks from 32 b (FP32) to 8 b (UE8M0), a 75% reduction in the metadata portion; the reduction in total block data is much smaller, though still beneficial (see the arithmetic below).
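The worked arithmetic behind that distinction, for one K=32 block of FP8 elements:

```latex
\begin{aligned}
\text{payload per block} &= 32 \times 8\,\mathrm{b} = 256\,\mathrm{b} \\
\text{with an FP32 scale: } 256 + 32 &= 288\,\mathrm{b}, \qquad
\text{with a UE8M0 scale: } 256 + 8 = 264\,\mathrm{b} \\
\text{metadata saving} &= \tfrac{32-8}{32} = 75\%, \qquad
\text{total-traffic saving} = \tfrac{288-264}{288} \approx 8.3\%
\end{aligned}
```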

DeepSeek's use of the UE8M0 FP8 scale aims for compatibility with the Microscaling (MX) ecosystem; official sources, foreign media, and community pages have also mentioned the orientation towards compatibility with "next-generation domestic chips."

An MX format is specified by: the block size K, a shared scaling factor X for each block, and the data type of the elements within the block. K=32 for all concrete MX formats. The type of X is UE8M0 (8-bit exponent, no mantissa, unsigned), representing either NaN or a power of 2 in the range 2^(−127) to 2^127.

Given K values V_i in the source format (usually FP32), conversion to MX requires computing X and Q_i such that Q_i × X ≈ V_i; what gets stored are X and the Q_i. Blackwell's tensor cores consume X and the Q_i from both operands' blocks to perform the dot product; the accumulated FP32 output is re-quantized back to MX when a subsequent operator requires the MX format.
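A toy sketch of how such a block dot product can consume the two shared scales (K shortened to 4 for readability; real MX uses K=32, and actual tensor cores do this in fixed-function hardware):

```python
import numpy as np

def mx_block_dot(ea: int, qa: np.ndarray, eb: int, qb: np.ndarray) -> float:
    """Dot product of two MX blocks whose elements are qa_i * 2**ea and qb_i * 2**eb.
    The quantized elements accumulate in FP32; the two UE8M0 scales enter only once,
    as a single exponent add at the end, not as per-element multiplies."""
    acc = float(np.dot(qa.astype(np.float32), qb.astype(np.float32)))
    return acc * 2.0 ** (ea + eb)

# Two blocks already in MX form: elements within the FP8 range, scales as exponents.
qa, ea = np.float32([1.5, -2.0, 0.25, 3.0]), 4    # block A, scale 2^4
qb, eb = np.float32([0.5,  1.0, -4.0, 2.0]), -3   # block B, scale 2^-3
print(mx_block_dot(ea, qa, eb, qb))
# = 2**(4-3) * (1.5*0.5 - 2.0*1.0 + 0.25*(-4.0) + 3.0*2.0) = 2 * 3.75 = 7.5
```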

  • FP8 (E4M3 / E5M2): two commonly used 8-bit floating-point encodings (1 sign bit + exponent + mantissa), widely used in the industry for training/inference. E4M3 has higher precision, while E5M2 has a larger dynamic range.

  • MX (Microscaling): splits a tensor into fixed small blocks (typically K=32); each block shares a "scaling factor X" (stored in exponent form), and the elements within the block are stored in a low bit-width format (such as FP8). This retains the low-bandwidth advantage of 8 bits while achieving a larger usable dynamic range and more stable numerics through finer-grained scaling. The block scale of MX is independent of the element format.

  • UE8M0: the specific format of the scaling factor—unsigned (U), 8-bit exponent (E8), 0-bit mantissa (M0), meaning it has only an exponent, with no sign or mantissa. The "ExMy" notation is clearly defined in the OCP specification: when y=0 (as in E8M0), there is no sign bit. It represents only integer powers of 2, so hardware decoding is done through shifting, with no floating-point multiplication needed.

  • MXFP8: the MX formats whose elements are FP8; all concrete MX formats use a shared scale of type E8M0. The commonly used configuration is "UE8M0 + FP8 (E4M3/E5M2), block size K=32".

MX Formats Supported by Blackwell

  • MXFP8: E4M3 (maximum roughly 1.75×2^8 = 448, smallest subnormal roughly 2^(−9), covering about 17.8 log2 buckets); tensor core throughput ~2× relative to BF16.

  • MXFP8: E5M2 (larger dynamic range, about 31.8 buckets); tensor core throughput ~2× relative to BF16.

  • MXFP6: E2M3/E3M2 (~2× throughput).

  • MXFP4: E2M1 (~4× throughput).

Note: E4M3 has only one NaN bit pattern, while E5M2 follows IEEE-754 special-value semantics. More exponent bits → larger range; more mantissa bits → higher precision within a given range (the short sketch below works out the E4M3/E5M2 ranges).
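A small check of the quoted dynamic ranges, assuming the standard FP8 limits (E4M3 max 448 with smallest subnormal 2^-9; E5M2 max 57344 with smallest subnormal 2^-16):

```python
import math

# Largest finite magnitude and smallest positive subnormal for each FP8 element type.
fp8_types = {
    "E4M3": {"max": 1.75 * 2**8,  "min_subnormal": 2.0**-9},   # max = 448
    "E5M2": {"max": 1.75 * 2**15, "min_subnormal": 2.0**-16},  # max = 57344
}

for name, t in fp8_types.items():
    buckets = math.log2(t["max"] / t["min_subnormal"])          # dynamic range in log2 "buckets"
    print(f"{name}: max={t['max']:.0f}, min={t['min_subnormal']:.1e}, ~{buckets:.1f} log2 buckets")
# E4M3 -> ~17.8 buckets, E5M2 -> ~31.8 buckets, matching the figures quoted above
```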

The paper shows that in an 8-billion-parameter, 15T-token pre-training run, the validation perplexity of MXFP8 matches that of BF16 (overall difference <0.5%). Scores on downstream tasks (MMLU and 9 reasoning benchmarks) are also comparable. The same parity holds for smaller models and datasets, making MXFP8 the more efficient pre-training option.

Model configuration: 32-layer Transformer, 32 heads, hidden 4096, GQA group 8, KV channels 128, pre-training sequence length 8192. Learning rate 6e-4 cosine decay to 6e-6; data mixing in two phases (first diversity, then high quality), switching at 60%.

Training platform: Megatron-LM; 3072 Hopper GPUs; batch size 768. MX operations are simulated by converting BF16 inputs to MXFP8 before each GEMM and converting back to BF16 afterwards (Hopper has no native MX tensor cores).
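A minimal numpy sketch of that simulation path, under stated assumptions: E4M3 elements with max 448, K=32, the ceil rounding from earlier, and a crude 4-significant-bit rounding standing in for the real E4M3 cast; the function names are ours.

```python
import numpy as np

E4M3_MAX, K = 448.0, 32   # assumed element format E4M3, MX block size 32

def fake_mxfp8(x: np.ndarray) -> np.ndarray:
    """Quantize-dequantize x along its last axis in K-element blocks, imitating the
    simulation above: cast to MXFP8 before the GEMM, convert back afterwards."""
    shape = x.shape
    x = x.reshape(-1, K).astype(np.float32)
    amax = np.max(np.abs(x), axis=1, keepdims=True)
    e = np.clip(np.ceil(np.log2(np.maximum(amax, 2.0**-127) / E4M3_MAX)), -127, 127)
    scale = 2.0 ** e
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)     # saturating element cast
    m, p = np.frexp(q)                              # crude rounding to 4 significant bits
    q = np.ldexp(np.round(m * 16) / 16, p)          # (subnormal behaviour is not modelled)
    return (q * scale).reshape(shape)

def simulated_mxfp8_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Both operands take the MXFP8 round trip, blocked along the reduction dimension;
    the matmul and its accumulation stay in FP32, as on hardware without native MX."""
    return fake_mxfp8(a) @ fake_mxfp8(b.T).T

a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 96).astype(np.float32)
err = np.abs(simulated_mxfp8_gemm(a, b) - a @ b)
print(err.max() / np.abs(a @ b).max())              # relative error on the order of a few percent
```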

Evaluation: MMLU (5-shot) and the average score on 9 general reasoning benchmarks (1-shot).

MXFP8 maintains BF16/FP8-level accuracy; on Blackwell, MXFP8 tensor core throughput is ~2× BF16 and end-to-end pre-training is faster; compared with conventional FP8, the MXFP8 recipe is simpler (all layers can be quantized, scaling is handled by hardware), with comparable or better throughput.

2) What numerical and hardware problems does it solve?

On the numerical level, traditional "whole-tensor scaling" is prone to overflow or flushing to zero at low bit widths (8 b and below) or under extreme value distributions; block scaling can closely match the amplitude distribution of each block, covering large and small values better and reducing saturation and underflow. Empirical evidence shows that across multiple tasks, MX can directly replace FP32 for inference and even be used for low-bit training, with precision close to or matching FP32/BF16.

Choosing between E4M3 and E5M2: With fine-grained block scaling in place, it is often practical to uniformly use E4M3 (higher "sampling precision") to achieve more stable training/downstream performance; Blackwell's MX training recipe also provides similar recommendations.

Hardware/System Level

UE8M0 = 2^k → decoding only requires shifting; there is no need for floating-point multiplication, normalization, or rounding, shortening the critical path, which is beneficial for high-frequency design and energy consumption control.

Scaling metadata is lighter: each block adds only 8 bits of scale. Compared with storing one FP32 scale per block (32 bits), the scale metadata traffic is reduced by 75% (per block, 256 b of payload becomes 264 b instead of 288 b, so total traffic is also lower).

Ecosystem alignment: NVIDIA Blackwell has made MXFP8/6/4 a native data type for tensor cores (K=32, X=UE8M0), where MXFP8 has a claimed matrix core throughput of ~2× compared to BF16 on its platform. This sets the standard for a "common language" between upstream models and downstream hardware.

3) Why is it said to "fit the next generation of domestic chips"?

Most mass-produced domestic AI accelerators still rely primarily on FP16/BF16 + INT8 pathways, with uneven support for a complete FP8 FMA hardware stack; UE8M0's shift-only decoding plus block-level FP8 storage and computation is cheaper and easier to implement, and aligns better with a phased evolution path.

In environments more sensitive to bandwidth/capacity constraints, FP8 plus block scaling can significantly reduce HBM/DDR pressure; this is precisely where domestic chips hope to "squeeze more out" of power, energy efficiency, and bandwidth through algorithms and formats.

In reports from domestic media and institutions, Moore Threads' MUSA architecture claims native FP8 tensor acceleration and specifically mentions good support for the UE8M0 FP8 scale; Chipone's VIP9000 NPU has also been mentioned in several industry reports and executive interviews as adding FP8 (E4M3/E5M2) support, with an emphasis on easy deployment with mainstream frameworks/toolchains. DeepSeek's explicit adoption of the UE8M0 FP8 scale aligns the software recipe with the "best working point" of domestic hardware, effectively building a shared coordinate system for software-hardware collaboration and reducing the cost of ecosystem fragmentation.

Note: Whether specific manufacturers/models have "native FP8 tensor cores" or "Block FP8" should be based on official specifications/driver version descriptions; media releases and third-party articles may lag or have discrepancies in expression. The above citation is from public reports and industry interviews.

4) What is its relationship with "conventional FP8" (how to use it in combination)?

Still use E4M3/E5M2 for the elements (E4M3 is usually more stable throughout), with shared scaling in UE8M0 and a typical block size of K=32: this is MXFP8. Common practice for training/inference: weights/activations/gradients use MXFP8 in GEMM/CONV, while normalization/softmax/residuals stay in BF16/FP32; accumulation is generally in FP32, and the main weights are typically kept as an FP32 "master copy." The scaling algorithm takes the per-block amax to determine the exponent, rounds up to avoid overflow, and then performs saturating quantization (clamping anything above the upper limit). Blackwell's MX paper gives the specific steps and comparisons for this kind of recipe.
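A compact sketch of that precision split for a single linear layer, reusing the same blockwise round-trip idea as the GEMM sketch above (scaling and saturation only); all names are illustrative, not from any particular library.

```python
import numpy as np

def fake_mxfp8(x, K=32, dmax=448.0):
    """Blockwise quantize-dequantize along the last axis (scale + saturate only)."""
    shape = x.shape
    x = x.reshape(-1, K).astype(np.float32)
    amax = np.max(np.abs(x), axis=1, keepdims=True)
    s = 2.0 ** np.clip(np.ceil(np.log2(np.maximum(amax, 1e-38) / dmax)), -127, 127)
    return (np.clip(x / s, -dmax, dmax) * s).reshape(shape)

master_w = np.random.randn(256, 256).astype(np.float32) * 0.02   # FP32 "master copy"
x        = np.random.randn(8, 256).astype(np.float32)            # activations

# GEMM path: weights and activations take the MXFP8 round trip; accumulation stays FP32.
y = fake_mxfp8(x) @ fake_mxfp8(master_w.T).T

# Normalization / softmax / residual stay in high precision (FP32 here).
y = (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + 1e-5)

# In the backward pass, the activation gradients feeding the GEMMs would also take the
# MXFP8 round trip, while the optimizer keeps updating master_w in FP32.
```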

5) "Quantified expectations" for model accuracy and throughput

Accuracy: in classification/speech/LLM tasks, MX can reach accuracy close to or on par with FP32/BF16 after direct conversion or fine-tuning; for large-model pre-training, MXFP8 can match BF16 perplexity and downstream scores under a suitable recipe.

Throughput/cost: on hardware that natively supports MX, matrix-core throughput is approximately 2× BF16, with end-to-end training/inference time and memory usage correspondingly reduced (the actual benefit depends on whether operators, bandwidth, or communication are the constraint).

What are the practical implications for the domestic ecosystem?

UE8M0 FP8 (MX) balances the model's numerical recipe against hardware implementation cost at a "compatible and efficient" operating point: more stable accuracy, lower bandwidth, and shorter critical paths. DeepSeek aligning its training/weight format to the MX standard effectively lays the groundwork for alignment on the domestic hardware side. As more chips make MXFP8 a "first-class citizen," the cost-effectiveness of software-hardware collaboration can truly be realized.

Therefore, we can see that UE8M0 FP8 (MX) is a good "format" that can significantly reduce bandwidth/power consumption and expand the range of tensors that can be quantized; however, the "effect" depends on system engineering: whether there are native MX tensor cores, whether transposed-weight quantization and dual-copy overhead are resolved, whether it scales over NVLink-class interconnects, and whether the toolchain implements the recipe properly. In these respects, NVIDIA currently has the more complete end-to-end solution, so the "obvious gap" you see is essentially a platform gap, not a sign that "the UE8M0/MX route does not work."

Therefore, domestic chips are once again heating up, but we still need to remain calm!

"Does having the UE8M0 FP8 (MX) format mean we can immediately achieve the same practical effects as NVIDIA?"

The answer is no! The gap often lies not in the "format itself," but in the operators/kernels, memory and interconnect, framework and toolchain, as well as the consistency of standard details. From an engineering perspective, we can see which shortcomings will directly consume the benefits we see in papers or promotions.

1) Numerical and Algorithm: Standard consistency has not been "fully aligned"

The definition of MX (K=32, each block shares a UE8M0 scale, block elements in FP8/FP6/FP4, etc.) is part of the OCP standard; UE8M0 only encodes powers of 2 (exponents −127…127), which is inherently lightweight. The problem is how to round to a power of 2, which is not completely consistent across implementations. NVIDIA's MXFP8 training recipe explicitly changes the scale rounding to ceiling (ceil(log2)) and provides an ablation: following OCP v1.0's recommendation of rounding down is more prone to overflow/divergence in large-scale pre-training. If hardware/software still follows v1.0, training stability may not line up.
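A small worked example of why the rounding direction matters, taking destmax = 448 (E4M3) and a block whose amax is 1344:

```latex
\log_2\!\frac{1344}{448} = \log_2 3 \approx 1.585
\qquad
\text{floor: } e = 1,\ \ \frac{1344}{2^{1}} = 672 > 448 \ (\text{clipped}),
\qquad
\text{ceil: } e = 2,\ \ \frac{1344}{2^{2}} = 336 \le 448 \ (\text{fits}).
```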

E4M3 "full quantization" choice: NVIDIA's conclusion is that weights/activations/activation gradients all use E4M3 (what is needed after block scaling is precision, not a larger exponent range), which differs from the old experience of "FP8=E5M2 for gradients." A slight difference in the recipe will make the effect "look like MX, but not run like it."

2) Operators and Kernels: No "native MX" incurs implicit overhead

MX requires handling a great deal of "once per block" scaling inside the tensor cores. Doing this scaling repeatedly in software is very costly; Blackwell integrates scale rounding and quantization into the tensor core instruction path in hardware, which eliminates that overhead. Without this hardware "shortcut," using MX on other chips incurs extra read-modify-write and quantization passes at the kernel level, eating into the benefits.

Transpose issue: Blackwell's MX requires block data to be contiguous along the reduction dimension, and during training the reduction dimension changes back and forth; an ordinary FP8 transpose is just a reordering, whereas an MX transpose requires re-quantization, which can be very painful without dedicated hardware/kernel optimization.

Dual-axis dual quantization copies: To simultaneously serve both row/column reduction axes, training frameworks typically need to maintain two quantized versions of each tensor in MX; this consumes video memory and increases data movement. NVIDIA's papers and TE's engineering issues have pointed this out.

3) Memory and Interconnect: System "foundation" differences amplify the effect gap

The scaling advantage of NVLink / NVSwitch: Blackwell brings NVLink bandwidth to 1.8 TB/s per GPU and, through NVLink Switch, integrates 72 GPUs into a single NVLink domain, with the ability to scale further across racks; this directly determines whether the bandwidth dividend of FP8/MX can actually be converted into cluster throughput. If the alternative platform only has PCIe or conventional Ethernet/IB, communication becomes relatively tight, and the same MX/FP8 compute advantage is offset by all-reduce / tensor-parallel communication.

4) Ecosystem and Generality: The Toolchain Is Still in the "Onboarding Phase"

Framework dtype and compiler tool support are not fully mature: The core layer of PyTorch is still advancing the basic types for MX (such as E8M0, FP4); Triton also has open questions about "how to expose MX/transpose mode in the language." Without native first-class support from leading frameworks, generality will be compromised.

Inconsistencies in cross-vendor FP8 "details": For example, AMD's documentation clearly states that the FP8 encoding of MI300 is different from H100; combined with the scaling rounding differences of MX, migrating "similarly named FP8/MX" models between multiple hardware vendors may require re-conversion/re-calibration to stabilize.

Current status of MX on non-NVIDIA platforms:

  • AMD: public information introduces the OCP MX concept and FP8 support at the tutorial/white-paper level, but a "native MX block-scaling hardware pipeline" is not yet standard; most paths are experimental or transitional software ones.

  • Intel Gaudi: Officially emphasizes FP8 training/inference computing power and inference tutorials, but does not claim native block scaling for MX; if it is just conventional FP8 (scaling by tensor/axis), the complexity and benefit curve of MX implementation are different.

5) What are the "most damaging" issues that usually lead to result discrepancies?

  1. Inconsistent numerical details (scaling rounding, gradient format): Unstable training or requires more conservative hyperparameters → Effective throughput decreases.

  2. Lack of "built-in MX" tensor kernels: Scaling processing/transpose quantization falls on software → GEMM bypass overhead increases.

  3. Storage/communication bottlenecks: Dual-copy video memory + edge scaling + insufficient inter-card communication → The bandwidth savings of MX cannot be realized.

  4. Incomplete toolchain and op coverage: Certain layers (embedding/final projection, BMM/softmax, etc.) still require high precision; if the execution plan is not aligned well, end-to-end benefits will be diluted by "non-MX segments."

However, for domestic chips struggling to survive in a tight space, this is also one of the few available paths for change, and there is still a long way to go.

Even without "native FP8 tensor cores," effective results in terms of bandwidth/video memory can still be achieved through the hybrid path of "FP8 storage and access + fast shift decoding → FP16/BF16 multiply-accumulate"; the hardware only needs to add lightweight scale handling and shift units. Under the same memory bandwidth and power budget, the model can be larger and the batch fuller, giving better throughput per unit of TCO. Models like DeepSeek explicitly use the UE8M0 block-scaling paradigm, which makes it easier for the software stack (quantization, calibration, inference engines) to achieve unified adaptation on domestic chips and reduces the fragmentation cost of "everyone doing their own thing." Compared with jumping straight to a fully functional FP8 FMA core, it is more realistic to enable MX first (block scaling + shift decoding), as a progressive evolution (a toy sketch of the hybrid decode path follows the three steps below):

  • Step one: Prioritize inference (weights FP8 + activations BF16/FP16, accumulation FP32);

  • Step two: convert part of the training path to FP8 (FP8 for the GEMM backbone; keep normalization/softmax, etc., in high precision);

  • Step three: with the next hardware generation, move to native MX/FP8 tensor cores.
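A toy numpy sketch of the step-one hybrid path, under stated assumptions: weights live in memory as per-block UE8M0 exponent bytes plus scaled elements (a float16 array stands in for real 8-bit element codes), decoding is just an exponent adjustment, and the multiply-accumulate stays in 16/32-bit; the function names are ours.

```python
import numpy as np

K = 32

def pack_weights(w: np.ndarray):
    """Split w into K-element blocks; store one UE8M0 byte per block plus scaled elements."""
    blocks = w.reshape(-1, K).astype(np.float32)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    e = np.clip(np.ceil(np.log2(np.maximum(amax, 2.0**-127) / 448.0)), -127, 127)
    elems = (blocks / 2.0**e).astype(np.float16)     # stand-in for E4M3 element storage
    return elems, (e + 127).astype(np.uint8)         # UE8M0 byte = exponent + bias 127

def dequant_matmul(x: np.ndarray, elems: np.ndarray, exps: np.ndarray, w_shape):
    """Shift-decode the blocks back to FP32 (an exponent add only), then run a normal GEMM."""
    scale = 2.0 ** (exps.astype(np.int32) - 127)
    w = (elems.astype(np.float32) * scale).reshape(w_shape)
    return x @ w                                     # FP16/BF16/FP32 multiply-accumulate

w = np.random.randn(256, 64).astype(np.float32) * 0.05
x = np.random.randn(4, 256).astype(np.float32)
elems, exps = pack_weights(w)
print(np.max(np.abs(dequant_matmul(x, elems, exps, w.shape) - x @ w)))   # small error
```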

"Unable to achieve NVIDIA's effect, so is it just a fallback, more suitable for edge small scenarios?"

To be fair, there is indeed a gap at present: without "native MX" tensor cores and high-bandwidth interconnects on par with NVLink/NVSwitch, and with incomplete operator/framework support, the theoretical advantages of UE8M0/FP8 are eaten up by kernel overhead and communication bottlenecks. This is the reality for many platforms today.

But it does not mean "only for edge":

  • Data centers can also benefit, provided that block scaling and scale processing are integrated into the kernel to reduce the back-and-forth of "quantization—dequantization"; many domestic solutions have already implemented this hybrid path at the inference end.

  • Edge/endpoint is certainly more "fitting"—in places with narrow memory and tight power, the bandwidth/power benefits of UE8M0+FP8 will be more direct and stable; for example, embedded large language models, speech/vision edge models, and local inference for AI PCs.

  • The strategy is not "settling for less," but rather "first reaping the certainty dividends": first fully capitalize on the access and bandwidth dividends, then gradually FP8-ify the computation path.

When is it "most cost-effective" to use it?

  • Inference first: LLM, ASR, CV large model weights FP8 (block scaling) + activations 16bit + FP32 accumulation; significantly reduces memory usage and weight bandwidth, with noticeable improvements in latency/throughput.

  • Training pilot: Small to medium-scale pre-training/continued training (SFT/distillation/LoRA), using MXFP8 for the GEMM backbone, maintaining high precision for normalization/Softmax, first run stably before scaling up.

  • Bandwidth/power constrained: AI PCs/edge boxes/embedded SoCs, keeping power consumption down while increasing model size.

Therefore, UE8M0 FP8 (MX) = low bandwidth + low implementation threshold + sufficiently stable numerics: a realistic, progressive route for domestic chips that still rely mainly on FP16/BF16 + INT8. It is not only for the edge, but the cost-performance gains are most immediate in edge and power-sensitive scenarios; for data centers to approach top-tier performance, operator-level integration, block scaling pushed down into the kernels, and better interconnect bandwidth are needed. Capture the dividends of weights and memory access first, then advance the compute path and interconnect: this path can be walked, and there are gains to be had in the short term.

This article is sourced from: The Beauty of Bayesian. Original title: "In-Depth: What is the UE8M0 FP8 that Ignites Domestic Computing Power Chips?"

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are suitable for their specific circumstances. Invest accordingly at your own risk.