github ggml-org/llama.cpp b8297

ggml : add NVFP4 quantization type support (#19769)

  • WIP: add NVFP4 quantization support

  • tests

  • improve NVFP4 dot product implementation performance and fix bad super call

  • typo

  • Use nvfp4 kvalues

  • vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table

  • vulkan and perf fixes

  • wip

  • Fix metal

  • fix vulkan

  • Rename threshold & fix wrong scale

  • Fix MOE

  • Shelve backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:

  • ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
    quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
  • ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
    ggml-metal-ops.cpp
  • ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
  • ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

  • Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
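
The aliasing described above can be sketched with a preprocessor rename. This is a minimal illustration only: the names demo_vec_dot, demo_vec_dot_generic, and DEMO_HAVE_ARCH_VEC_DOT are hypothetical and are not taken from arch-fallback.h, and the kernel is a plain float dot product rather than the NVFP4 one.

```c
#include <stddef.h>

/* DEMO_HAVE_ARCH_VEC_DOT is a hypothetical flag that an arch-specific
 * source file would define when it ships its own demo_vec_dot().
 * Without it, the generic definition below is renamed so that it is
 * emitted under the public name. */
#if !defined(DEMO_HAVE_ARCH_VEC_DOT)
#define demo_vec_dot_generic demo_vec_dot
#endif

/* Portable reference kernel; callers always use the name demo_vec_dot. */
float demo_vec_dot_generic(size_t n, const float *x, const float *y) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

With this arrangement, removing an arch-specific kernel only requires restoring the alias for that platform; call sites stay unchanged.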

  • quantize: add NVFP4 as a quantization type option

  • Fix ggml_fp32_to_ue4m3: handle subnormal values

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
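
To make the subnormal handling concrete, here is a minimal sketch of an unsigned E4M3 encode/decode pair. It is not the ggml implementation: the function names are hypothetical, an exponent bias of 7 is assumed (consistent with the man * 2^-9 subnormal step stated above), and rounding is simple round-half-up. Under that assumption the smallest normal value is 2^-6 = 0.015625, so scales in the 0.001-0.01 range can only be represented as subnormals.

```c
#include <stdint.h>

/* Decode: bits 6..3 = exponent (bias 7, assumed), bits 2..0 = mantissa.
 * Code 0x7F (NaN in standard E4M3) is not special-cased in this sketch. */
float ue4m3_decode_sketch(uint8_t code) {
    int exp = (code >> 3) & 0x0F;
    int man = code & 0x07;
    if (exp == 0) {
        return (float) man / 512.0f;                    /* subnormal: man * 2^-9 */
    }
    float v = (1.0f + (float) man / 8.0f) * 0.015625f;  /* (1 + man/8) * 2^-6 */
    for (int e = 1; e < exp; ++e) {
        v *= 2.0f;                                      /* scale up to 2^(exp - 7) */
    }
    return v;
}

/* Encode a non-negative float to the nearest representable code. */
uint8_t ue4m3_encode_sketch(float x) {
    if (!(x > 0.0f)) {
        return 0;                                       /* zero, negative, or NaN */
    }
    if (x >= 448.0f) {
        return 0x7E;                                    /* saturate at 1.75 * 2^8 */
    }
    if (x < 0.015625f) {
        /* Subnormal range (below 2^-6): keep the mantissa instead of
         * flushing to zero. A result of 8 is exactly code 0x08, the
         * smallest normal value. */
        return (uint8_t) (x * 512.0f + 0.5f);
    }
    int exp = 1;
    float base = 0.015625f;                             /* 2^(exp - 7) */
    while (exp < 15 && x >= 2.0f * base) {
        base *= 2.0f;
        ++exp;
    }
    int man = (int) ((x / base - 1.0f) * 8.0f + 0.5f);  /* 0..8 */
    int code = (exp << 3) + man;                        /* man == 8 carries into exp */
    return (uint8_t) (code > 0x7E ? 0x7E : code);
}
```

For example, a scale of 0.005 encodes to code 3 (3 * 2^-9, about 0.00586) in this sketch, whereas clamping every value with a non-positive exponent to 0 would have discarded it.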

  • Restore ARM NEON NVFP4 dot product implementation

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

  • Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

      • Add ue4m3_scale_lut[128] to ggml-common.h, replacing the branch-heavy
        ggml_ue4m3_to_fp32() in the hot loop
      • Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
      • Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
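
The lookup-table idea can be illustrated as follows: because a UE4M3 scale has only 128 possible codes, every decoded value can be computed once and then read with a single indexed load inside the hot loop. This is a sketch, not the table from ggml-common.h; the names are hypothetical, the table is filled at run time for brevity, and an exponent bias of 7 is assumed.

```c
#include <stdint.h>

/* One decoded float per 7-bit UE4M3 code. */
static float ue4m3_scale_lut_sketch[128];

/* Fill the table once, before any dot products run. */
void ue4m3_lut_init_sketch(void) {
    for (int code = 0; code < 128; ++code) {
        int exp = code >> 3;
        int man = code & 0x07;
        float v;
        if (exp == 0) {
            v = (float) man / 512.0f;                    /* subnormal: man * 2^-9 */
        } else {
            v = (1.0f + (float) man / 8.0f) * 0.015625f; /* (1 + man/8) * 2^-6 */
            for (int e = 1; e < exp; ++e) {
                v *= 2.0f;                               /* up to 2^(exp - 7) */
            }
        }
        /* Code 0x7F is NaN in standard E4M3; it is not special-cased here. */
        ue4m3_scale_lut_sketch[code] = v;
    }
}
```

After initialization, the branchy decode in the inner loop reduces to ue4m3_scale_lut_sketch[code & 0x7F].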

  • ARM NEON NVFP4: rearrange q8 to match nibble layout

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

  • CPU only backend 64 super-block layout

  • cleanup

  • Remove unused LUT

  • int

  • exclude NVFP4 from unsupported ops in metal build

  • remove quantization for now

  • store scales as native UE4M3, preserve original model bits when possible

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • correct comment

  • format

  • reduce duplication and cleanup

  • Address comments

  • move detection to prepare_tensors

  • Use math instead of const

  • Move

  • fix comment

  • Shelve quantize tests

  • Rebase and move check

  • cleanup

  • lint

  • Update gguf-py/gguf/scripts/gguf_convert_endian.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • Use fallback quant config

  • Simplify

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • organize

  • Refactor

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • add quantize_nvfp4 (required for test_quants.py)

  • fix return type


Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
