llama.cpp

History

Tim Burke ccea34ba41 perf : multiple fixes and enhancements, remove MSE search, expand test coverage * fix: correct tiled flash attention SoA pointer math for multihead MXFP The cleanup refactoring (`c919bc471`) extracted mxfp_dequant_head as a shared helper but failed to update the tiled path's data pointers. The helper expects the full SoA row base (no per-head offset), but the tiled path was passing a pointer that already included ik2nbk2, causing a double head offset that produced NaN during prefill. Add mxfp_row_ptr helper to centralize the multihead-aware pointer calculation across both one_chunk and tiled paths. Verified with 16-chunk perplexity on gpt-oss-20b: all four configs (f16, mxfp4, mxfp6, mxfp8) produce exact matches with the known-good commit (`23e88631c`). perf: reduce E8M0 MSE search range from ±2 to ±1 The base estimate round(log2(amax)) is always within 1 step of optimal. Empirically verified across 30K blocks and 6 distributions: ±1 and ±2 never disagree. This reduces the scale search from 5 to 3 candidates (40% fewer inner loop iterations) with zero quality impact. * perf: eliminate redundant work in MXFP quantize and flash attention - mse_error_mxfp4: use passed inv_scale instead of recomputing 1/d - mxfp_compute_e8m0_mse: hoist loop-invariant traits branch out of inner loop - tiled V path: dequant directly to V32 tile, remove intermediate memcpy and dead buffer * cleanup: fix comments, unify Hadamard condition, simplify E8M0 helpers - EMAX_OFFSET comments: fix ceil/floor labels to match actual values - Hadamard flag: unify write path (llama-kv-cache.cpp) and read path (ops.cpp) to both use DK==DV condition instead of is_mla() - E8M0 helpers in ggml-impl.h: simplify to match ggml-common.h style, add cross-reference comment * fix: MXFP8/6 flash attention tests crash on init The view base tensors for K/V don't get named "k"/"v" but inherit the MXFP type. The name-based filter in initialize_tensors missed them, falling through to init_tensor_uniform which calls quantize_chunk and aborts for KV-cache-only types. Fix by checking ggml_is_type_mxfp() for all tensors, matching the pattern set_rows tests already use. * test: expand MXFP set_rows coverage - Add MXFP8/MXFP6 to all_types for non-Hadamard set_rows coverage - Expand Hadamard set_rows tests: add views, broadcast, and multi-head configs - Coverage: 18 → 51 MXFP set_rows tests * perf: add AVX2 Hadamard for x86 (matches existing ARM NEON path) * cleanup: DRY MXFP4 quantize/dequant with shared per-block helpers Extract quantize_block_mxfp4 and dequantize_block_mxfp4 as shared helpers used by both AoS (quantize_row_mxfp4_ref, dequantize_row_mxfp4) and SoA (quantize_row_mxfp4_soa, dequantize_row_mxfp4_soa) paths. Eliminates duplicated per-block logic while keeping layout-specific pointer arithmetic in the callers. * feat: add MXFP8/MXFP6 AoS quantize/dequant (full type support) Extract quantize_block_mxfp / dequantize_block_mxfp per-block helpers from the SoA generic impl and use them to build AoS row functions for MXFP8 (E4M3) and MXFP6 (E2M3). Register to_float and from_float_ref in type traits, add quantize_chunk dispatch, replacing the GGML_ABORT. MXFP8 and MXFP6 are no longer KV-cache-only — they can now be used as general quantization types. The SoA impl is also DRY'd to delegate to the same per-block helpers. * cleanup: remove dead soa_elems field from mxfp_kv_params Computed but never read — leftover from an earlier design. * feat: add MXFP8/MXFP6 vec_dot and full CPU type support Add scalar vec_dot_mxfp8_q8_0 and vec_dot_mxfp6_q8_0 implementations, register from_float + vec_dot + vec_dot_type in CPU traits, and add fallback remaps for all architectures. MXFP8/6 are now fully tested: AoS quantization error, reference match, and dot product accuracy all pass in test-quantize-fns. * perf: remove E8M0 MSE search — base estimate is perplexity-optimal The MSE search over ±1 candidates around round(log2(amax)) was found to HURT perplexity by 4-37 PPL points across all MXFP configs on gpt-oss-20b. The base estimate alone (no search) produces better attention patterns because minimizing per-block reconstruction error is not the same as minimizing attention score distortion through softmax. Removes mse_error_mxfp4, mse_error field from traits, MSE_RANGE constant, and the entire search loop. E8M0 computation is now a single amax scan + integer bit extraction — no inner loop, no function pointers. This also simplifies future GPU/Metal implementations. * perf: fuse Hadamard rotation into SoA quantize (one pass, no temp buffer) Add quantize_row_mxfp{4,8,6}_soa_hadamard that apply Hadamard and quantize block-by-block with a 32-float stack buffer. Eliminates the std::vector heap allocation and 2 extra memory passes over the full row. set_rows now dispatches to the fused path when Hadamard is enabled, falling through to the unfused quantize for non-Hadamard types. This pattern maps directly to a CUDA kernel: global memory read → register Hadamard → register quantize → global memory write. * cleanup: consistent MXFP type names and variable naming - Rename type_name "mxfp8_e4m3" → "mxfp8", "mxfp6_e2m3" → "mxfp6" to match "mxfp4". Only one variant of each exists — the suffix was unnecessary disambiguation that implied alternatives. - Remove redundant MXFP shortcuts from arg.cpp (fallback loop handles all types via ggml_type_name matching). - Rename kv_is_f32_f16_or_mxfp → k_is_f32_f16_or_mxfp (only checks K). * perf: fuse Q preprocessing round-trip (no SoA buffer needed) Add mxfp{4,8,6}_hadamard_roundtrip and mxfp{4,8,6}_roundtrip functions that apply quantization error to float values without materializing SoA bytes. Replaces the 3-step Q preprocessing (Hadamard → quantize to SoA buffer → dequant from SoA buffer) with a single pass through per-block round-trip helpers. Eliminates the Q_q intermediate buffer and two function pointer calls from the flash attention hot path. Maps directly to CUDA: Q stays in registers, Hadamard butterfly + quantize error applied in-place. * fix: clamp E8M0 = 255 to 254 in decode (fixes CI NaN failures) E8M0 = 255 means NaN per MX spec, but our encode path already clamps to 254. When test data contains random E8M0 = 255 bytes, the decode produces Inf, and Inf * 0.0 = NaN, causing GET_ROWS and CPY tests to fail on MXFP6 (and potentially MXFP4/8). Fix: clamp 255 → 254 in both E8M0 decode functions: - ggml_e8m0_to_fp32 / ggml_e8m0_to_fp32_half (ggml-impl.h) - ggml_mxfp_e8m0_to_fp32 / ggml_mxfp_e8m0_to_fp32_half (ggml-common.h) These are unfortunately duplicated across two headers because ggml-common.h compiles for CUDA (__device__) while ggml-impl.h serves CPU-only callers that don't include ggml-common.h.		2026-03-22 20:12:09 -04:00
..
peg-parser	common: consolidate PEG string parsers (#20263 )	2026-03-10 00:29:21 +01:00
.gitignore	common : introduce composable PEG parser combinators for chat parsing (#17136 )	2025-12-03 12:45:32 +02:00
CMakeLists.txt	test : add testing and fixes	2026-03-22 01:07:55 -04:00
export-graph-ops.cpp	test-backend-ops: allow loading tests from file and parsing model operators into file (#19896 )	2026-03-12 13:26:00 +01:00
get-model.cpp	ci : add model tests + script wrapper (#4586 )	2024-01-26 14:18:00 +02:00
get-model.h	ci : add model tests + script wrapper (#4586 )	2024-01-26 14:18:00 +02:00
gguf-model-data.cpp	tests : model metadata loading from huggingface (#19796 )	2026-02-28 10:44:38 +01:00
gguf-model-data.h	tests : model metadata loading from huggingface (#19796 )	2026-02-28 10:44:38 +01:00
run-json-schema-to-grammar.mjs	llama : move end-user examples to tools directory (#13249 )	2025-05-02 20:27:13 +02:00
test-alloc.cpp	chore : correct typos [no ci] (#20041 )	2026-03-05 08:50:21 +01:00
test-arg-parser.cpp	ci, tests : use cmake to download models and remove libcurl dependency (#18791 )	2026-01-14 07:46:27 +01:00
test-autorelease.cpp	docs : Minor cleanups (#19252 )	2026-02-02 08:38:55 +02:00
test-backend-ops.cpp	perf : multiple fixes and enhancements, remove MSE search, expand test coverage	2026-03-22 20:12:09 -04:00
test-backend-sampler.cpp	tests: enable kv_unified to prevent cuda oom error on rtx 2060 (#20645 )	2026-03-18 17:40:22 +08:00
test-barrier.cpp	Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (#17748 )	2025-12-10 12:32:23 -08:00
test-c.c	ggml : remove kompute backend (#14501 )	2025-07-03 07:48:32 +03:00
test-chat-auto-parser.cpp	common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825 )	2026-03-21 00:19:04 +01:00
test-chat-peg-parser.cpp	common/parser: add proper reasoning tag prefill reading (#20424 )	2026-03-19 16:58:21 +01:00
test-chat-template.cpp	Autoparser - complete refactoring of parser architecture (#18675 )	2026-03-06 21:01:00 +01:00
test-chat.cpp	common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825 )	2026-03-21 00:19:04 +01:00
test-double-float.cpp	ggml : minor naming changes (#8433 )	2024-07-12 10:46:02 +03:00
test-gbnf-validator.cpp	cmake : do not include ./src as public for libllama (#13062 )	2025-04-24 16:00:10 +03:00
test-gguf-model-data.cpp	tests : model metadata loading from huggingface (#19796 )	2026-02-28 10:44:38 +01:00
test-gguf.cpp	ggml/gguf : prevent integer overflows (#19856 )	2026-02-24 20:17:11 +02:00
test-grammar-integration.cpp	common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604 )	2026-03-21 18:43:35 +01:00
test-grammar-llguidance.cpp	tool/ex/tests: consistently free ctx, then model (#18168 )	2025-12-22 11:00:37 +01:00
test-grammar-parser.cpp	common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604 )	2026-03-21 18:43:35 +01:00
test-jinja.cpp	tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] (#20483 )	2026-03-18 10:43:31 +01:00
test-json-partial.cpp	common : handle unicode during partial json parsing (#16526 )	2025-10-12 16:18:47 +03:00
test-json-schema-to-grammar.cpp	examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968 )	2026-03-10 14:38:18 +01:00
test-llama-archs.cpp	model: mistral small 4 support (#20649 )	2026-03-17 00:31:14 +01:00
test-llama-grammar.cpp	common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604 )	2026-03-21 18:43:35 +01:00
test-log.cpp	common : use common_ prefix for common library functions (#9805 )	2024-10-10 22:57:42 +02:00
test-lora-conversion-inference.sh	cli: new CLI experience (#17824 )	2025-12-10 15:28:59 +01:00
test-model-load-cancel.cpp	llama : update llama_model API names (#11063 )	2025-01-06 10:55:18 +02:00
test-mtmd-c-api.c	mtmd : add C public API (#13184 )	2025-05-04 23:43:42 +02:00
test-opt.cpp	tests : fix test-opt with GGML_BACKEND_DL (#15599 )	2025-08-26 22:14:38 +02:00
test-peg-parser.cpp	Autoparser - complete refactoring of parser architecture (#18675 )	2026-03-06 21:01:00 +01:00
test-quantize-fns.cpp	perf : multiple fixes and enhancements, remove MSE search, expand test coverage	2026-03-22 20:12:09 -04:00
test-quantize-perf.cpp	ci: run the x64 and arm ci on the github machines instead (#16183 )	2025-09-25 08:06:06 +03:00
test-quantize-stats.cpp	server: introduce API for serving / loading / unloading multiple models (#17470 )	2025-12-01 19:41:04 +01:00
test-reasoning-budget.cpp	common/parser: handle reasoning budget (#20297 )	2026-03-11 10:26:12 +01:00
test-regex-partial.cpp	common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342 )	2026-01-03 16:02:43 -06:00
test-rope.cpp	ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (#16805 )	2025-11-11 13:33:24 +02:00
test-sampling.cpp	sampling : optimize samplers by reusing bucket sort (#15665 )	2025-08-31 20:41:02 +03:00
test-state-restore-fragmented.cpp	kv-cache: Fix state restore fragmented cache (#17982 )	2025-12-15 19:28:35 +02:00
test-thread-safety.cpp	server : support unified cache across slots (#16736 )	2025-11-02 18:14:04 +02:00
test-tokenizer-0.cpp	tool/ex/tests: consistently free ctx, then model (#18168 )	2025-12-22 11:00:37 +01:00
test-tokenizer-0.py	py : logging and flake8 suppression refactoring (#7081 )	2024-05-05 08:07:48 +03:00
test-tokenizer-0.sh	model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826 )	2026-02-26 12:14:09 +01:00
test-tokenizer-1-bpe.cpp	tool/ex/tests: consistently free ctx, then model (#18168 )	2025-12-22 11:00:37 +01:00
test-tokenizer-1-spm.cpp	tool/ex/tests: consistently free ctx, then model (#18168 )	2025-12-22 11:00:37 +01:00
test-tokenizer-random.py	ci : switch from pyright to ty (#20826 )	2026-03-21 08:54:34 +01:00
test-tokenizers-repo.sh	devops: add s390x & ppc64le CI (#15925 )	2025-09-27 02:03:33 +08:00
testing.h	common : implement new jinja template engine (#18462 )	2026-01-16 11:22:06 +01:00