llama.cpp/ggml
Tim Burke c2f2ff7814 ggml: optimize CPU MXFP flash attention hot loop
- Per-head dequant: multihead MXFP now extracts only the needed head's
  SoA blocks (e.g. 20 bytes for mxfp4 DK=128) into a stack buffer and
  dequantizes DK elements, instead of dequantizing all heads (nek2*DK).
  For 8 KV heads this is 8x less dequant work per KV position.

- Hoist loop invariants: base pointer offsets (k_base, v_base),
  per-head SoA byte offsets, and multihead row bases are computed once
  per query row instead of per KV position in the inner loop.

- Precompute SoA addressing in mxfp_fa_params_init: qs_per_block,
  blocks_per_head, head_qs_bytes, and head_e8m0_offset are calculated
  once at init rather than derived per iteration.

- Move thread-local buffer pointers (VKQ32, V32, VKQ16, Q_q) and
  v_is_f16 check outside the ir loop.
2026-03-15 19:49:27 -04:00
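The per-head extraction and init-time addressing described above can be sketched as follows. This is a minimal illustration, not the actual ggml code: the SoA row layout (all heads' packed codes, then all heads' per-block E8M0 scales), the struct fields (named after `qs_per_block`, `blocks_per_head`, `head_qs_bytes`, `head_e8m0_offset` from the commit message), and the toy 4-bit lookup table are all assumptions.

```c
// Hypothetical sketch of per-head SoA dequant; layout and LUT are
// assumptions, not the real ggml MXFP4 implementation.
#include <math.h>
#include <stdint.h>
#include <string.h>

#define DK 128                 // head dimension in elements
#define N_KV_HEADS 8           // nek2 in the commit message
#define QK 32                  // elements per MXFP-style block
#define QS_PER_BLOCK (QK / 2)  // 4-bit codes -> 16 bytes per block

typedef struct {
    int qs_per_block;      // packed-code bytes per block
    int blocks_per_head;   // DK / QK
    int head_qs_bytes;     // packed-code bytes per head per KV row
    int head_e8m0_offset;  // byte offset of the E8M0 scales in a row
} mxfp_fa_params;

// Computed once at init, not re-derived per KV iteration.
static void mxfp_fa_params_init(mxfp_fa_params *p) {
    p->qs_per_block     = QS_PER_BLOCK;
    p->blocks_per_head  = DK / QK;
    p->head_qs_bytes    = p->blocks_per_head * p->qs_per_block;
    p->head_e8m0_offset = N_KV_HEADS * p->head_qs_bytes;
}

// Toy 4-bit code -> float table (NOT the real MXFP4 value table).
static float lut[16];

// Copy only one head's SoA bytes into small stack buffers, then
// dequantize DK elements -- rather than dequantizing all nek2 heads.
static void dequant_head(const mxfp_fa_params *p, const uint8_t *row,
                         int head, float *dst) {
    uint8_t qs[DK / 2];
    uint8_t e8m0[DK / QK];
    memcpy(qs,   row + head * p->head_qs_bytes, p->head_qs_bytes);
    memcpy(e8m0, row + p->head_e8m0_offset + head * p->blocks_per_head,
           p->blocks_per_head);
    for (int b = 0; b < p->blocks_per_head; ++b) {
        // E8M0 scale: a bare power-of-two exponent, bias 127 assumed.
        float scale = ldexpf(1.0f, (int)e8m0[b] - 127);
        for (int i = 0; i < p->qs_per_block; ++i) {
            uint8_t byte = qs[b * p->qs_per_block + i];
            dst[b * QK + 2 * i]     = scale * lut[byte & 0x0F];
            dst[b * QK + 2 * i + 1] = scale * lut[byte >> 4];
        }
    }
}
```

With this layout the inner flash-attention loop touches only `head_qs_bytes + blocks_per_head` bytes per KV position for the active head, which is where the claimed 8x reduction in dequant work for 8 KV heads comes from.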
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include ggml: MXFP flash attention with SoA layout (CPU scalar reference) 2026-03-15 17:33:19 -04:00
src ggml: optimize CPU MXFP flash attention hot loop 2026-03-15 19:49:27 -04:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt ggml : add OpenVINO backend (#15307) 2026-03-14 07:56:55 +02:00