- Per-head dequant: multihead MXFP now extracts only the needed head's
SoA blocks (e.g. 20 bytes for mxfp4 DK=128) into a stack buffer and
dequantizes DK elements, instead of dequantizing all heads (nek2*DK
elements). For 8 KV heads this is 8x less dequant work per KV position.
- Hoist loop invariants: base pointer offsets (k_base, v_base),
per-head SoA byte offsets, and multihead row bases are computed once
per query row instead of per KV position in the inner loop.
- Precompute SoA addressing in mxfp_fa_params_init: qs_per_block,
blocks_per_head, head_qs_bytes, and head_e8m0_offset are calculated
once at init rather than derived per iteration.
- Move the thread-local buffer pointers (VKQ32, V32, VKQ16, Q_q) and
the v_is_f16 check outside the ir loop.
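
A minimal sketch of the precomputed SoA addressing and per-head extraction
described above, not the actual implementation. It assumes a row layout of
all heads' packed quant bytes followed by all heads' e8m0 scale bytes (one
scale byte per block). The struct and both function names are hypothetical,
and head_e8m0_offset is assumed to mean the start of the scale region in a
row; byte counts here follow from that assumed layout, not from the figure
quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the fields named in the list above. */
typedef struct {
    size_t qs_per_block;     /* packed quant bytes per block */
    size_t blocks_per_head;  /* DK / elements per block */
    size_t head_qs_bytes;    /* blocks_per_head * qs_per_block */
    size_t head_e8m0_offset; /* assumed: row offset where the scale bytes start */
} soa_params;

/* Computed once at init, so the inner KV loop only adds and multiplies. */
static soa_params soa_params_init(size_t dk, size_t n_heads,
                                  size_t block_elems, size_t bits) {
    assert(block_elems > 0 && dk % block_elems == 0);
    soa_params p;
    p.qs_per_block     = block_elems * bits / 8;
    p.blocks_per_head  = dk / block_elems;
    p.head_qs_bytes    = p.blocks_per_head * p.qs_per_block;
    p.head_e8m0_offset = n_heads * p.head_qs_bytes;
    return p;
}

/* Copy one head's quant bytes, then its scale bytes, into dst.
 * dst must hold head_qs_bytes + blocks_per_head bytes.
 * Returns the number of bytes written. */
static size_t soa_extract_head(const soa_params *p, const uint8_t *row,
                               size_t head, uint8_t *dst) {
    const uint8_t *qs   = row + head * p->head_qs_bytes;
    const uint8_t *e8m0 = row + p->head_e8m0_offset + head * p->blocks_per_head;
    memcpy(dst, qs, p->head_qs_bytes);
    memcpy(dst + p->head_qs_bytes, e8m0, p->blocks_per_head);
    return p->head_qs_bytes + p->blocks_per_head;
}
```

In a kernel shaped like this, soa_params_init would run once per op and the
per-head offsets (head * head_qs_bytes, head * blocks_per_head) once per
query row, leaving only the two copies and the DK-element dequant in the
per-KV-position loop.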