llama.cpp/ggml/include
Richard Davison 5eae9cb1d9
ggml : add NVFP4 quantization type support (#19769)
* WIP: add NVFP4 quantization support

* tests

* improve NVFP4 dot product implementation performance and fix bad super call

* typo

* Use nvfp4 kvalues

* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table

* vulcal and perf fixes

* wip

* Fix metal

* fix vulcan

* Rename threshold & fix wrong scale

* Fix MOE

* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
  quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
  ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.

* quantize: add NVFP4 as a quantization type option

* Fix ggml_fp32_to_ue4m3: handle subnormal values

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).

* Restore ARM NEON NVFP4 dot product implementation

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
  ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)

* ARM NEON NVFP4: rearrange q8 to match nibble layout

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

* CPU only backend 64 super-block layout

* cleanup

* Remove unused LUT

* int

* exclude NVFP4 from unsupported ops in metal build

* remove quantization for now

* store scales as native UE4M3, preserve original model bits when possible

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* correct comment

* format

* reduce duplication and cleanup

* Address comments

* move detection to prepare_tensors

* Use math instead of const

* Move

* fix comment

* Shelf quantize tests

* Rebase and move check

* cleanup

* lint

* Update gguf-py/gguf/scripts/gguf_convert_endian.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use fallback quant config

* Simplify

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* organize

* Refactor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* fix return type

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-11 21:02:54 +01:00
..
ggml-alloc.h llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653) 2025-12-15 09:24:59 +01:00
ggml-backend.h chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
ggml-blas.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-cann.h docs : Minor cleanups (#19252) 2026-02-02 08:38:55 +02:00
ggml-cpp.h ggml : fix ggml_gallocr_ptr type (ggml/1205) 2025-05-01 09:58:44 +03:00
ggml-cpu.h ggml-cpu: FA split across kv for faster TG (#19209) 2026-02-03 01:19:55 +08:00
ggml-cuda.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-hexagon.h Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
ggml-metal.h metal : refactor + optimize v2 (#15995) 2025-09-17 20:38:12 +03:00
ggml-opencl.h Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs (#10693) 2024-12-13 12:23:52 -08:00
ggml-opt.h chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
ggml-rpc.h ggml : bump RPC version (#20330) 2026-03-10 21:36:57 +02:00
ggml-sycl.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-virtgpu.h ggml-virtgpu: make the code thread safe (#19204) 2026-02-04 10:46:18 +08:00
ggml-vulkan.h vulkan: Make Vulkan optional at runtime (#11493). (#11494) 2025-02-10 07:17:21 +01:00
ggml-webgpu.h ggml: Add initial WebGPU backend (#14521) 2025-07-16 18:18:51 +03:00
ggml-zdnn.h zdnn: refactor codebase + add docs (#16178) 2025-09-23 14:53:05 +08:00
ggml-zendnn.h ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690) 2025-12-07 00:13:33 +08:00
ggml.h ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
gguf.h GGUF: C++ refactor, backend support, misc fixes (#11030) 2025-01-07 18:01:58 +01:00