llama.cpp/tests
Richard Davison 5eae9cb1d9
ggml : add NVFP4 quantization type support (#19769)
* WIP: add NVFP4 quantization support

* tests

* improve NVFP4 dot product implementation performance and fix bad super call

* typo

* Use nvfp4 kvalues

* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table

* vulcal and perf fixes

* wip

* Fix metal

* fix vulcan

* Rename threshold & fix wrong scale

* Fix MOE

* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
  quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
  ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.

* quantize: add NVFP4 as a quantization type option

* Fix ggml_fp32_to_ue4m3: handle subnormal values

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).

* Restore ARM NEON NVFP4 dot product implementation

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
  ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)

* ARM NEON NVFP4: rearrange q8 to match nibble layout

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

* CPU only backend 64 super-block layout

* cleanup

* Remove unused LUT

* int

* exclude NVFP4 from unsupported ops in metal build

* remove quantization for now

* store scales as native UE4M3, preserve original model bits when possible

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* correct comment

* format

* reduce duplication and cleanup

* Address comments

* move detection to prepare_tensors

* Use math instead of const

* Move

* fix comment

* Shelf quantize tests

* Rebase and move check

* cleanup

* lint

* Update gguf-py/gguf/scripts/gguf_convert_endian.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use fallback quant config

* Simplify

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* organize

* Refactor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* fix return type

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-11 21:02:54 +01:00
..
peg-parser common: consolidate PEG string parsers (#20263) 2026-03-10 00:29:21 +01:00
.gitignore common : introduce composable PEG parser combinators for chat parsing (#17136) 2025-12-03 12:45:32 +02:00
CMakeLists.txt common/parser: handle reasoning budget (#20297) 2026-03-11 10:26:12 +01:00
get-model.cpp ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
get-model.h ci : add model tests + script wrapper (#4586) 2024-01-26 14:18:00 +02:00
gguf-model-data.cpp tests : model metadata loading from huggingface (#19796) 2026-02-28 10:44:38 +01:00
gguf-model-data.h tests : model metadata loading from huggingface (#19796) 2026-02-28 10:44:38 +01:00
run-json-schema-to-grammar.mjs llama : move end-user examples to tools directory (#13249) 2025-05-02 20:27:13 +02:00
test-alloc.cpp chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
test-arg-parser.cpp ci, tests : use cmake to download models and remove libcurl dependency (#18791) 2026-01-14 07:46:27 +01:00
test-autorelease.cpp docs : Minor cleanups (#19252) 2026-02-02 08:38:55 +02:00
test-backend-ops.cpp ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
test-backend-sampler.cpp tests : fix typos in comments in test-backend-sampler [no ci] (#19824) 2026-02-23 17:12:02 +01:00
test-barrier.cpp Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (#17748) 2025-12-10 12:32:23 -08:00
test-c.c ggml : remove kompute backend (#14501) 2025-07-03 07:48:32 +03:00
test-chat-auto-parser.cpp common : gracefully handle incomplete output (#20191) 2026-03-08 17:17:02 +01:00
test-chat-peg-parser.cpp common: consolidate PEG string parsers (#20263) 2026-03-10 00:29:21 +01:00
test-chat-template.cpp Autoparser - complete refactoring of parser architecture (#18675) 2026-03-06 21:01:00 +01:00
test-chat.cpp common: map developer role to system (#20215) 2026-03-09 14:25:11 +01:00
test-double-float.cpp ggml : minor naming changes (#8433) 2024-07-12 10:46:02 +03:00
test-gbnf-validator.cpp cmake : do not include ./src as public for libllama (#13062) 2025-04-24 16:00:10 +03:00
test-gguf-model-data.cpp tests : model metadata loading from huggingface (#19796) 2026-02-28 10:44:38 +01:00
test-gguf.cpp ggml/gguf : prevent integer overflows (#19856) 2026-02-24 20:17:11 +02:00
test-grammar-integration.cpp llama : add token matching support to llama-grammar (#17816) 2025-12-09 00:32:57 -06:00
test-grammar-llguidance.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-grammar-parser.cpp llama : add token matching support to llama-grammar (#17816) 2025-12-09 00:32:57 -06:00
test-jinja.cpp jinja: correct stats for tojson and string filters (#19785) 2026-02-22 21:08:23 +01:00
test-json-partial.cpp common : handle unicode during partial json parsing (#16526) 2025-10-12 16:18:47 +03:00
test-json-schema-to-grammar.cpp examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968) 2026-03-10 14:38:18 +01:00
test-llama-archs.cpp llama: end-to-end tests (#19802) 2026-03-08 12:30:21 +01:00
test-llama-grammar.cpp llama : add token matching support to llama-grammar (#17816) 2025-12-09 00:32:57 -06:00
test-log.cpp common : use common_ prefix for common library functions (#9805) 2024-10-10 22:57:42 +02:00
test-lora-conversion-inference.sh cli: new CLI experience (#17824) 2025-12-10 15:28:59 +01:00
test-model-load-cancel.cpp llama : update llama_model API names (#11063) 2025-01-06 10:55:18 +02:00
test-mtmd-c-api.c mtmd : add C public API (#13184) 2025-05-04 23:43:42 +02:00
test-opt.cpp tests : fix test-opt with GGML_BACKEND_DL (#15599) 2025-08-26 22:14:38 +02:00
test-peg-parser.cpp Autoparser - complete refactoring of parser architecture (#18675) 2026-03-06 21:01:00 +01:00
test-quantize-fns.cpp ggml : add NVFP4 quantization type support (#19769) 2026-03-11 21:02:54 +01:00
test-quantize-perf.cpp ci: run the x64 and arm ci on the github machines instead (#16183) 2025-09-25 08:06:06 +03:00
test-quantize-stats.cpp server: introduce API for serving / loading / unloading multiple models (#17470) 2025-12-01 19:41:04 +01:00
test-reasoning-budget.cpp common/parser: handle reasoning budget (#20297) 2026-03-11 10:26:12 +01:00
test-regex-partial.cpp common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342) 2026-01-03 16:02:43 -06:00
test-rope.cpp ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805) 2025-11-11 13:33:24 +02:00
test-sampling.cpp sampling : optimize samplers by reusing bucket sort (#15665) 2025-08-31 20:41:02 +03:00
test-state-restore-fragmented.cpp kv-cache: Fix state restore fragmented cache (#17982) 2025-12-15 19:28:35 +02:00
test-thread-safety.cpp server : support unified cache across slots (#16736) 2025-11-02 18:14:04 +02:00
test-tokenizer-0.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-tokenizer-0.py py : logging and flake8 suppression refactoring (#7081) 2024-05-05 08:07:48 +03:00
test-tokenizer-0.sh model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826) 2026-02-26 12:14:09 +01:00
test-tokenizer-1-bpe.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-tokenizer-1-spm.cpp tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
test-tokenizer-random.py requirements : update transformers/torch for Embedding Gemma (#15828) 2025-09-09 06:06:52 +02:00
test-tokenizers-repo.sh devops: add s390x & ppc64le CI (#15925) 2025-09-27 02:03:33 +08:00
testing.h common : implement new jinja template engine (#18462) 2026-01-16 11:22:06 +01:00