Details
ggml : add NVFP4 quantization type support (#19769)
-
WIP: add NVFP4 quantization support
-
tests
-
improve NVFP4 dot product implementation performance and fix bad super call
-
typo
-
Use nvfp4 kvalues
-
vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
-
vulkan and perf fixes
-
wip
-
Fix metal
-
fix vulkan
-
Rename threshold & fix wrong scale
-
Fix MOE
-
Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
- Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
-
quantize: add NVFP4 as a quantization type option
-
Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
- Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
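The generic scalar loop that the NEON version vectorizes can be sketched roughly as follows. The table matches the kvalues_mxfp4 values referenced earlier in this log (E2M1 magnitudes scaled by 2, with the factor folded into the block scale), but the block layout, nibble ordering, and scale handling are illustrative assumptions rather than ggml's exact code:

```c
#include <stdint.h>

// E2M1 magnitudes scaled by 2, low 8 entries positive, high 8 negative.
static const int8_t kvalues[16] = {
    0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12,
};

// One QK=16 block of the NVFP4 x Q8 dot product. qs packs two 4-bit
// codes per byte (assumed layout: low nibble = elements 0..7, high
// nibble = elements 8..15); d4 and d8 are the per-block scales. The
// NEON version replaces the table loads with vqtbl1q_s8 and the
// multiply-accumulate with ggml_vdotq_s32.
static float nvfp4_q8_block_dot(const uint8_t qs[8], const int8_t q8[16],
                                float d4, float d8) {
    int32_t sumi = 0;
    for (int i = 0; i < 8; ++i) {
        sumi += kvalues[qs[i] & 0x0F] * q8[i];      // low nibbles
        sumi += kvalues[qs[i] >> 4]   * q8[i + 8];  // high nibbles
    }
    return d4 * d8 * (float) sumi;
}
```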
- Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
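The 128-entry table mentioned above can be built once at startup from the same decode rule, turning the branchy per-element decode into a single load in the hot loop. A sketch under the same bias-7 UE4M3 assumption (names are illustrative):

```c
#include <math.h>

// All 128 UE4M3 values (4 exponent bits, 3 mantissa bits, bias 7),
// precomputed so the dot-product hot loop does a table load instead
// of calling a branch-heavy decode for every block scale.
static float ue4m3_scale_lut[128];

static void init_ue4m3_scale_lut(void) {
    for (int v = 0; v < 128; ++v) {
        const int exp = (v >> 3) & 0x0F;
        const int man = v & 0x07;
        ue4m3_scale_lut[v] = (exp == 0)
            ? (float) man * 0x1p-9f                                 // subnormals
            : (1.0f + (float) man / 8.0f) * ldexpf(1.0f, exp - 7);  // normals
    }
}
```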
- ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
-
CPU only backend 64 super-block layout
-
cleanup
-
Remove unused LUT
-
int
-
exclude NVFP4 from unsupported ops in metal build
-
remove quantization for now
-
store scales as native UE4M3, preserve original model bits when possible
-
Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
-
correct comment
-
format
-
reduce duplication and cleanup
-
Address comments
-
move detection to prepare_tensors
-
Use math instead of const
-
Move
-
fix comment
-
Shelf quantize tests
-
Rebase and move check
-
cleanup
-
lint
-
Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
-
Use fallback quant config
-
Simplify
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
-
organize
-
Refactor
-
Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
-
add quantize_nvfp4 (required for test_quants.py)
-
add quantize_nvfp4 (required for test_quants.py)
-
add quantize_nvfp4 (required for test_quants.py)
-
fix return type
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: