llama.cpp

Tim Burke d8c9f9c7f6 ggml: MXFP flash attention with SoA layout (CPU scalar reference)	2026-03-15 17:33:19 -04:00

Add MXFP KV cache quantization for flash attention using Struct-of-Arrays (SoA) memory layout exclusively. Three MX types: MXFP4 (E2M1), MXFP8 (E4M3), MXFP6 (E2M3), implementing the OCP Microscaling v1.0 spec.

SoA layout stores [qs contiguous][e8m0 contiguous] per row, enabling aligned memory access patterns for GPU backends. All functions in the flash attention pipeline (set_rows quantization, Q preprocessing, K/V dequantization) use SoA end-to-end. The existing AoS block layout remains for MUL_MAT weight quantization (untouched).

Q preprocessing applies Walsh-Hadamard rotation (block-32) before quantize/dequant round-trip, distributing outlier energy across the shared exponent group. This is essential for perplexity:
  MXFP8: +0.22 PPL without rotation
  MXFP6: +3.34 PPL without rotation

Hadamard is skipped for MLA models (DK != DV) where V is a view of K.

Shared infrastructure in ggml-common.h:
- Block structures (block_mxfp8: 33B, block_mxfp6: 25B per 32 elements)
- E8M0 MSE-optimal scale search with ±1 range
- Canonical element converters (FP8 E4M3/E5M2, FP6 E2M3/E3M2)
- FP6 tight packing (4 six-bit values in 3 bytes, 25% savings)
- IEEE-754 bit reconstruction constants for SIMD backends
- SoA layout macros, portable bit cast, type property queries

CPU implementation:
- Scalar reference + ARM NEON + x86 AVX2 optimized paths
- Both FA paths supported: one_chunk (scalar) and tiled (SIMD GEMM)
- Split-KV path extended for single-query decode
- Generic vec_dot via dequant-to-float for MUL_MAT compatibility
- Arch fallbacks for loongarch, powerpc, riscv, s390, wasm

KV cache integration:
- set_rows writes SoA with optional Hadamard (op_params[0] flag)
- K cache block-aligned to 16 for CUDA cp.async compatibility
- CLI: --cache-type-k/v with short aliases (mxfp4, mxfp6, mxfp8)

Tests:
- Flash attention: all 3 types at D=64/128, mixed K/V (mxfp8+mxfp4)
- SET_ROWS: Hadamard rotation for all types
- SoA-aware test initialization and comparison for MXFP tensors
- Quantize functions coverage for all types

Rename GGML_TYPE_MXFP4 → GGML_TYPE_MXFP4_E2M1 across all backends (CPU, OpenCL, SYCL) for consistency with the MX type family naming.
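The "FP6 tight packing" item in the commit message can be illustrated with a minimal sketch. It assumes a little-endian bit order in which the first value occupies the lowest six bits; the actual layout in ggml-common.h may differ, and the names fp6_pack4, fp6_unpack4, and fp6_pack_block32 are hypothetical helpers, not ggml functions. The arithmetic also accounts for the stated block size: 32 elements × 6 bits = 24 bytes, plus one E8M0 scale byte, gives 25 bytes.

```c
#include <stdint.h>

/* Hypothetical helper (not a ggml function): pack four 6-bit codes,
 * each in 0..63, into 3 bytes. ASSUMPTION: v[0] lands in the lowest bits. */
static inline void fp6_pack4(const uint8_t v[4], uint8_t out[3]) {
    uint32_t w = ((uint32_t)(v[0] & 0x3F))
               | ((uint32_t)(v[1] & 0x3F) << 6)
               | ((uint32_t)(v[2] & 0x3F) << 12)
               | ((uint32_t)(v[3] & 0x3F) << 18);
    out[0] = (uint8_t)(w & 0xFF);
    out[1] = (uint8_t)((w >> 8) & 0xFF);
    out[2] = (uint8_t)((w >> 16) & 0xFF);
}

/* Inverse of fp6_pack4: recover the four 6-bit codes from 3 bytes. */
static inline void fp6_unpack4(const uint8_t in[3], uint8_t v[4]) {
    uint32_t w = (uint32_t)in[0]
               | ((uint32_t)in[1] << 8)
               | ((uint32_t)in[2] << 16);
    for (int i = 0; i < 4; ++i) {
        v[i] = (uint8_t)((w >> (6 * i)) & 0x3F);
    }
}

/* Pack one 32-element block: 8 groups of 4 codes -> 8 * 3 = 24 bytes.
 * Storing one code per byte would take 32 bytes, hence the 25% saving. */
static inline void fp6_pack_block32(const uint8_t codes[32], uint8_t qs[24]) {
    for (int g = 0; g < 8; ++g) {
        fp6_pack4(codes + 4 * g, qs + 3 * g);
    }
}
```

Packing through a single 24-bit word keeps the pack and unpack paths symmetric and easy to check; a SIMD backend would likely use shuffles and shifts instead.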
ggml-alloc.h	llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)	2025-12-15 09:24:59 +01:00
ggml-backend.h	chore : correct typos [no ci] (#20041)	2026-03-05 08:50:21 +01:00
ggml-blas.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-cann.h	docs : Minor cleanups (#19252)	2026-02-02 08:38:55 +02:00
ggml-cpp.h	ggml : fix ggml_gallocr_ptr type (ggml/1205)	2025-05-01 09:58:44 +03:00
ggml-cpu.h	ggml: MXFP flash attention with SoA layout (CPU scalar reference)	2026-03-15 17:33:19 -04:00
ggml-cuda.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-hexagon.h	Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)	2025-10-22 13:47:09 -07:00
ggml-metal.h	metal : refactor + optimize v2 (#15995)	2025-09-17 20:38:12 +03:00
ggml-opencl.h	Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs (#10693)	2024-12-13 12:23:52 -08:00
ggml-openvino.h	ggml : add OpenVINO backend (#15307)	2026-03-14 07:56:55 +02:00
ggml-opt.h	chore : correct typos [no ci] (#20041)	2026-03-05 08:50:21 +01:00
ggml-rpc.h	ggml : bump RPC version (#20330)	2026-03-10 21:36:57 +02:00
ggml-sycl.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-virtgpu.h	ggml-virtgpu: make the code thread safe (#19204)	2026-02-04 10:46:18 +08:00
ggml-vulkan.h	vulkan: Make Vulkan optional at runtime (#11493). (#11494)	2025-02-10 07:17:21 +01:00
ggml-webgpu.h	ggml: Add initial WebGPU backend (#14521)	2025-07-16 18:18:51 +03:00
ggml-zdnn.h	zdnn: refactor codebase + add docs (#16178)	2025-09-23 14:53:05 +08:00
ggml-zendnn.h	ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)	2025-12-07 00:13:33 +08:00
ggml.h	ggml: MXFP flash attention with SoA layout (CPU scalar reference)	2026-03-15 17:33:19 -04:00
gguf.h	GGUF: C++ refactor, backend support, misc fixes (#11030)	2025-01-07 18:01:58 +01:00