llama.cpp/common
Tim Burke ccea34ba41
perf : multiple fixes and enhancements, remove MSE search, expand test coverage
* fix: correct tiled flash attention SoA pointer math for multihead MXFP

The cleanup refactoring (c919bc471) extracted mxfp_dequant_head as a
shared helper but failed to update the tiled path's data pointers.
The helper expects the full SoA row base (no per-head offset), but the
tiled path was passing a pointer that already included ik2*nbk2, causing
a double head offset that produced NaN during prefill.

Add mxfp_row_ptr helper to centralize the multihead-aware pointer
calculation across both one_chunk and tiled paths. Verified with 16-chunk
perplexity on gpt-oss-20b: all four configs (f16, mxfp4, mxfp6, mxfp8)
produce exact matches with the known-good commit (23e88631c).

* perf: reduce E8M0 MSE search range from ±2 to ±1

The base estimate round(log2(amax)) is always within 1 step of optimal.
Empirically verified across 30K blocks and 6 distributions: ±1 and ±2
never disagree. This reduces the scale search from 5 to 3 candidates
(40% fewer inner loop iterations) with zero quality impact.

* perf: eliminate redundant work in MXFP quantize and flash attention

- mse_error_mxfp4: use passed inv_scale instead of recomputing 1/d
- mxfp_compute_e8m0_mse: hoist loop-invariant traits branch out of inner loop
- tiled V path: dequant directly to V32 tile, remove intermediate memcpy and dead buffer

* cleanup: fix comments, unify Hadamard condition, simplify E8M0 helpers

- EMAX_OFFSET comments: fix ceil/floor labels to match actual values
- Hadamard flag: unify write path (llama-kv-cache.cpp) and read path
  (ops.cpp) to both use DK==DV condition instead of is_mla()
- E8M0 helpers in ggml-impl.h: simplify to match ggml-common.h style,
  add cross-reference comment

* fix: MXFP8/6 flash attention tests crash on init

The view base tensors for K/V don't get named "k"/"v" but inherit the
MXFP type. The name-based filter in initialize_tensors missed them,
falling through to init_tensor_uniform which calls quantize_chunk and
aborts for KV-cache-only types. Fix by checking ggml_is_type_mxfp() for
all tensors, matching the pattern set_rows tests already use.

* test: expand MXFP set_rows coverage

- Add MXFP8/MXFP6 to all_types for non-Hadamard set_rows coverage
- Expand Hadamard set_rows tests: add views, broadcast, and multi-head configs
- Coverage: 18 → 51 MXFP set_rows tests

* perf: add AVX2 Hadamard for x86 (matches existing ARM NEON path)

* cleanup: DRY MXFP4 quantize/dequant with shared per-block helpers

Extract quantize_block_mxfp4 and dequantize_block_mxfp4 as shared
helpers used by both AoS (quantize_row_mxfp4_ref, dequantize_row_mxfp4)
and SoA (quantize_row_mxfp4_soa, dequantize_row_mxfp4_soa) paths.
Eliminates duplicated per-block logic while keeping layout-specific
pointer arithmetic in the callers.

* feat: add MXFP8/MXFP6 AoS quantize/dequant (full type support)

Extract quantize_block_mxfp / dequantize_block_mxfp per-block helpers
from the SoA generic impl and use them to build AoS row functions for
MXFP8 (E4M3) and MXFP6 (E2M3). Register to_float and from_float_ref
in type traits, add quantize_chunk dispatch, replacing the GGML_ABORT.

MXFP8 and MXFP6 are no longer KV-cache-only — they can now be used
as general quantization types. The SoA impl is also DRY'd to delegate
to the same per-block helpers.

* cleanup: remove dead soa_elems field from mxfp_kv_params

Computed but never read — leftover from an earlier design.

* feat: add MXFP8/MXFP6 vec_dot and full CPU type support

Add scalar vec_dot_mxfp8_q8_0 and vec_dot_mxfp6_q8_0 implementations,
register from_float + vec_dot + vec_dot_type in CPU traits, and add
fallback remaps for all architectures. MXFP8/6 are now fully tested:
AoS quantization error, reference match, and dot product accuracy all
pass in test-quantize-fns.

* perf: remove E8M0 MSE search — base estimate is perplexity-optimal

The MSE search over ±1 candidates around round(log2(amax)) was found to
HURT perplexity by 4-37 PPL points across all MXFP configs on gpt-oss-20b.
The base estimate alone (no search) produces better attention patterns
because minimizing per-block reconstruction error is not the same as
minimizing attention score distortion through softmax.

Removes mse_error_mxfp4, mse_error field from traits, MSE_RANGE constant,
and the entire search loop. E8M0 computation is now a single amax scan +
integer bit extraction — no inner loop, no function pointers. This also
simplifies future GPU/Metal implementations.

* perf: fuse Hadamard rotation into SoA quantize (one pass, no temp buffer)

Add quantize_row_mxfp{4,8,6}_soa_hadamard that apply Hadamard and
quantize block-by-block with a 32-float stack buffer. Eliminates the
std::vector heap allocation and 2 extra memory passes over the full row.

set_rows now dispatches to the fused path when Hadamard is enabled,
falling through to the unfused quantize for non-Hadamard types.

This pattern maps directly to a CUDA kernel: global memory read →
register Hadamard → register quantize → global memory write.

* cleanup: consistent MXFP type names and variable naming

- Rename type_name "mxfp8_e4m3" → "mxfp8", "mxfp6_e2m3" → "mxfp6"
  to match "mxfp4". Only one variant of each exists — the suffix was
  unnecessary disambiguation that implied alternatives.
- Remove redundant MXFP shortcuts from arg.cpp (fallback loop handles
  all types via ggml_type_name matching).
- Rename kv_is_f32_f16_or_mxfp → k_is_f32_f16_or_mxfp (only checks K).

* perf: fuse Q preprocessing round-trip (no SoA buffer needed)

Add mxfp{4,8,6}_hadamard_roundtrip and mxfp{4,8,6}_roundtrip functions
that apply quantization error to float values without materializing SoA
bytes. Replaces the 3-step Q preprocessing (Hadamard → quantize to SoA
buffer → dequant from SoA buffer) with a single pass through per-block
round-trip helpers.

Eliminates the Q_q intermediate buffer and two function pointer calls
from the flash attention hot path. Maps directly to CUDA: Q stays in
registers, Hadamard butterfly + quantize error applied in-place.

* fix: clamp E8M0 = 255 to 254 in decode (fixes CI NaN failures)

E8M0 = 255 means NaN per MX spec, but our encode path already clamps
to 254. When test data contains random E8M0 = 255 bytes, the decode
produces Inf, and Inf * 0.0 = NaN, causing GET_ROWS and CPY tests to
fail on MXFP6 (and potentially MXFP4/8).

Fix: clamp 255 → 254 in both E8M0 decode functions:
  - ggml_e8m0_to_fp32 / ggml_e8m0_to_fp32_half (ggml-impl.h)
  - ggml_mxfp_e8m0_to_fp32 / ggml_mxfp_e8m0_to_fp32_half (ggml-common.h)

These are unfortunately duplicated across two headers because
ggml-common.h compiles for CUDA (__device__) while ggml-impl.h serves
CPU-only callers that don't include ggml-common.h.
2026-03-22 20:12:09 -04:00
..
jinja jinja : fix heap OOB read in value equality comparison (#20782) 2026-03-20 07:15:17 +01:00
CMakeLists.txt common/parser: handle reasoning budget (#20297) 2026-03-11 10:26:12 +01:00
arg.cpp perf : multiple fixes and enhancements, remove MSE search, expand test coverage 2026-03-22 20:12:09 -04:00
arg.h vendor : update cpp-httplib to 0.30.0 (#18660) 2026-01-08 13:53:54 +01:00
base64.hpp llava : expose as a shared library for downstream projects (#3613) 2023-11-07 00:36:23 +03:00
build-info.cpp.in cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167) 2025-06-13 10:38:52 +02:00
chat-auto-parser-generator.cpp chat : handle tool calls with no required args in TAG_WITH_TAGGED format (#20764) 2026-03-19 17:53:11 +01:00
chat-auto-parser-helpers.cpp common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825) 2026-03-21 00:19:04 +01:00
chat-auto-parser-helpers.h common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
chat-auto-parser.h common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
chat-diff-analyzer.cpp common : fix typo in debug log ('extracft' -> 'extract') (#20807) 2026-03-20 18:23:18 +01:00
chat-peg-parser.cpp common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
chat-peg-parser.h common/parser: use nlohmann::ordered_json to preserve parameter order (#20385) 2026-03-11 10:26:51 +01:00
chat.cpp common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777) 2026-03-20 02:37:22 +01:00
chat.h common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
common.cpp llama : re-enable manual LoRA adapter free (#19983) 2026-03-18 12:03:26 +02:00
common.h common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
console.cpp cli : add command and file auto-completion (#19985) 2026-03-05 10:47:28 +01:00
console.h cli : add command and file auto-completion (#19985) 2026-03-05 10:47:28 +01:00
debug.cpp debug: make common_debug_print_tensor readable (#19331) 2026-02-04 17:55:31 +01:00
debug.h chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
download.cpp build : remove LLAMA_HTTPLIB option (#19623) 2026-02-15 15:38:50 +01:00
download.h preset: allow named remote preset (#18728) 2026-01-10 15:12:29 +01:00
http.h server: Parse port numbers from MCP server URLs in CORS proxy (#20208) 2026-03-09 17:47:54 +01:00
json-partial.cpp common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932) 2025-11-18 18:54:15 +01:00
json-partial.h cli : fix reasoning responses in CLI (#18961) 2026-01-20 18:23:25 +01:00
json-schema-to-grammar.cpp common : fix incorrect uses of stoul (#20313) 2026-03-10 11:40:26 +01:00
json-schema-to-grammar.h common : add nemotron 3 parsing (#18077) 2025-12-16 04:05:23 -06:00
llguidance.cpp sampling : add support for backend sampling (#17004) 2026-01-04 22:22:16 +02:00
log.cpp cli: new CLI experience (#17824) 2025-12-10 15:28:59 +01:00
log.h cli: new CLI experience (#17824) 2025-12-10 15:28:59 +01:00
ngram-cache.cpp spec : add self‑speculative decoding (no draft model required) + refactor (#18471) 2026-01-28 19:42:42 +02:00
ngram-cache.h spec : add self‑speculative decoding (no draft model required) + refactor (#18471) 2026-01-28 19:42:42 +02:00
ngram-map.cpp llama : correct typos 'occured' and 'occurences' (#19414) 2026-02-11 07:05:31 +01:00
ngram-map.h llama : correct typos 'occured' and 'occurences' (#19414) 2026-02-11 07:05:31 +01:00
ngram-mod.cpp spec : add ngram-mod (#19164) 2026-01-30 18:21:48 +02:00
ngram-mod.h ngram-mod : fix build [no ci] (#19216) 2026-01-30 21:27:27 +02:00
peg-parser.cpp common: consolidate PEG string parsers (#20263) 2026-03-10 00:29:21 +01:00
peg-parser.h common: consolidate PEG string parsers (#20263) 2026-03-10 00:29:21 +01:00
preset.cpp preset: allow named remote preset (#18728) 2026-01-10 15:12:29 +01:00
preset.h common: support remote preset (#18520) 2026-01-08 22:35:40 +01:00
reasoning-budget.cpp common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
reasoning-budget.h common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
regex-partial.cpp common : fix iterator::end() dereference (#20445) 2026-03-16 08:50:38 +02:00
regex-partial.h `common`: add partial regex support (#12808) 2025-05-14 19:50:57 +01:00
sampling.cpp common/parser: add proper reasoning tag prefill reading (#20424) 2026-03-19 16:58:21 +01:00
sampling.h sampling : add support for backend sampling (#17004) 2026-01-04 22:22:16 +02:00
speculative.cpp spec : remove check rate (#19377) 2026-02-09 15:30:50 +02:00
speculative.h common : add common_speculative_is_compat() (#19270) 2026-02-06 16:47:22 +02:00
unicode.cpp common/parser: handle reasoning budget (#20297) 2026-03-11 10:26:12 +01:00
unicode.h common/parser: handle reasoning budget (#20297) 2026-03-11 10:26:12 +01:00