8 Breakthrough Insights into NVIDIA's 4-Bit Pretraining Revolution

Imagine training a large language model using only 4-bit precision—a feat once thought impossible due to accuracy loss. NVIDIA has now shattered that barrier with NVFP4, a novel microscaling format that achieves FP8-like quality while nearly doubling throughput. Their 12-billion-parameter hybrid Mamba-Transformer, trained on 10 trillion tokens, is the longest publicly documented 4-bit pretraining run to date. This article distills the key innovations into eight essential insights, from the format's dual scaling trick to the four-part training methodology that prevents divergence. Whether you're a machine learning engineer or an AI enthusiast, these developments signal a major shift in efficient frontier-model training.

1. What Makes NVFP4 Different from Traditional 4-Bit?

Standard 4-bit formats like MXFP4 compress values into a narrow range (just 4 values beyond zero), losing crucial dynamic range for long training runs. NVFP4, built for NVIDIA's Blackwell Tensor Cores, introduces two key changes: it reduces the block size from 32 to 16 elements, meaning each shared scale factor covers less variation. Additionally, block scale factors switch from UE8M0 (powers of two only) to E4M3, which allows finer-grained representation of the block's absolute maximum (amax). This ensures that at least 6.25% of values in every block are stored at near-FP8 precision, while the rest stay in 4-bit. The result is dramatically reduced quantization error without sacrificing speed.

8 Breakthrough Insights into NVIDIA's 4-Bit Pretraining Revolution — Source: www.marktechpost.com

2. The Dual-Scaling Secret: Why It Works

NVFP4's real magic lies in a second scaling level. An FP32 per-tensor scale is applied before the per-block E4M3 scales, remapping the entire tensor so the block scales themselves remain within a comfortable range. This hierarchical approach prevents the exponential drift that typically plagues low-precision training. Imagine a photo: the per-tensor scale adjusts the overall brightness, while per-block scales fine-tune local contrast. This dual system keeps the FP4 values as close to their FP8 equivalents as possible. Consequently, the model's final MMLU-Pro score (62.58%) is nearly identical to the FP8 baseline (62.62%), a stunning achievement for 4-bit training.

3. Performance Gains: 2x to 3x Faster Than FP8

On NVIDIA Blackwell hardware, FP4 GEMMs run at 4x BF16 throughput on GB200 and 6x on GB300. Compared to FP8, that translates to roughly 2x speedups on GB200 and 3x on GB300. Memory footprint for operands is halved versus FP8, easing memory bandwidth bottlenecks. These gains come without sacrificing quality, as the hybrid Mamba-Transformer model demonstrates. For practitioners, this means training larger models with the same budget or reaching convergence faster—a game-changer for resource-constrained teams.

4. What Gets Quantized (and What Stays in Full Precision)

Not all operations run in 4-bit. Only the GEMMs inside linear (fully-connected) layers—forward pass (Fprop), backward data-gradient (Dgrad), and weight-gradient (Wgrad)—use NVFP4. Everything else remains in BF16 or FP32: embeddings, the output projection head, normalization layers, non-linearities, and all attention components (softmax and query-key or attention score-value batched GEMMs). Model weights, weight gradients for inter-microbatch accumulation and data-parallel replicas, and optimizer states stay in FP32. Tensor parallel reductions run in BF16. This selective quantization minimizes accuracy loss where it matters most.

5. The Four-Part Training Methodology

Simply quantizing every linear-layer GEMM with default settings (1x16 block scaling, round-to-nearest-even, no transforms) leads to early divergence. NVIDIA's methodology addresses this with four key components: (a) adaptive block sizing—not all layers use the same 16-element block; some critical layers revert to FP8 or BF16. (b) mixed-precision accumulation—internal accumulations use FP32 to prevent overflow. (c) dynamic scale recalibration—per-tensor scales are periodically updated based on activation statistics. (d) gradient clipping at a lower threshold since 4-bit gradients are noisier. This systematic approach ensures stable training over 10 trillion tokens.

6. Why Default Settings Fail—and How NVIDIA Fixed It

Early experiments with vanilla NVFP4 (uniform block size, round-to-nearest-even, no transforms) diverged within the first few thousand steps. The problem: the narrow FP4 range cannot represent values above ±6, so large activations or gradients are clipped catastrophically. NVIDIA's fix involved two critical changes: pre-scaling the input tensors so the E4M3 block scales stay within a safe range, and stochastic rounding during the forward pass to reduce bias accumulation. Additionally, they used a warm-up phase where the model slowly transitions from FP8 to FP4 over 1,000 steps, allowing the optimizer to adapt. These adjustments were validated by maintaining loss convergence identical to FP8 runs.

7. Record-Breaking Validation: 10 Trillion Tokens in 4-Bit

The team pretrained a 12-billion-parameter hybrid Mamba-Transformer (50% Mamba state-space layers, 50% attention layers) on 10 trillion tokens—the longest publicly documented 4-bit training run. The final model achieves 62.58% on MMLU-Pro 5-shot, compared to 62.62% for an FP8 baseline—a negligible 0.04% drop. Perplexity on held-out validation sets was also within 0.1% of the FP8 model. This demonstrates that NVFP4 can sustain accuracy over extreme token horizons, debunking the myth that 4-bit precision is only suitable for short or fine-tuning tasks.

8. Practical Deployment with Transformer Engine

NVFP4 is fully supported in NVIDIA's Transformer Engine, making it straightforward to integrate into existing training pipelines. The format is hardware-accelerated on Blackwell Tensor Cores, requiring no custom kernels. Key practical tips: use the fp4 dtype for linear layers, enable automatic mixed precision with the custom format, and set block_size=16. For production, NVIDIA recommends a gradual quantization schedule and periodic recalibration of per-tensor scales. Early adopters report 2x training speedups on GB200 clusters with no accuracy degradation on benchmark tasks like MMLU, HellaSwag, and ARC.

NVFP4 is not just a research curiosity—it's a viable path to training frontier-scale models with 4-bit precision. By rethinking block sizes, scaling hierarchies, and training stability, NVIDIA has delivered a format that merges FP4's speed with FP8's quality. As the AI community pushes toward trillion-parameter models, such efficiency breakthroughs will be essential. Whether you plan to adopt NVFP4 or simply watch from the sidelines, one thing is clear: 4-bit pretraining has arrived, and it works.

Tags: