llama.cpp/ggml
CaffeinatedBits 41e6c8caf4 ggml-cuda : add flash attention support for head size 88
Llama 4 vision models use a head dimension of D=88.
Previously, attention for this head size fell back to
unoptimized operations, causing excessive VRAM usage
and slow inference.

This adds D=88 to the CUDA flash attention tile backend.
Head size 88 is explicitly excluded from the Turing/Volta
WMMA and MMA Tensor Core checks, since those paths would
hit memory misalignment and segfaults at this size; the
exclusion forces dispatch to the TILE kernel instead.
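
The dispatch change described above can be modeled roughly as follows. This is a minimal Python sketch, not the actual ggml-cuda C++/CUDA code; the head-size sets and the function name are illustrative assumptions:

```python
# Hypothetical model of the flash-attention kernel dispatch described in the
# commit message; the real logic lives in ggml-cuda and is written in C++.

# Head sizes handled by the Tensor Core (WMMA/MMA) paths. 88 is deliberately
# absent (assumed set): those kernels would hit misaligned accesses at D=88.
TENSOR_CORE_HEAD_SIZES = {64, 80, 96, 112, 128, 256}

# Head sizes the generic tile kernel supports; this change adds 88.
TILE_HEAD_SIZES = {64, 80, 88, 96, 112, 128, 256}

def select_flash_attn_kernel(head_size: int, has_tensor_cores: bool) -> str:
    """Pick a kernel for the given head size (illustrative logic only)."""
    if has_tensor_cores and head_size in TENSOR_CORE_HEAD_SIZES:
        return "mma"       # fast Tensor Core path
    if head_size in TILE_HEAD_SIZES:
        return "tile"      # generic CUDA-core tile kernel
    return "fallback"      # unoptimized non-flash-attention ops

# D=88 (Llama 4 vision) now takes the tile kernel instead of the slow fallback,
# even on hardware with Tensor Cores.
assert select_flash_attn_kernel(88, has_tensor_cores=True) == "tile"
```

The point of the exclusion is that even on Tensor Core hardware, D=88 must skip the WMMA/MMA checks entirely so dispatch lands on the tile kernel rather than crashing.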

Also updates generate_cu_files.py to dynamically generate
the required template instance.
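
A generator of this kind could look roughly like the sketch below. It is written in the spirit of generate_cu_files.py, but the macro name, file-naming scheme, and head-size list are assumptions for illustration, not the real script:

```python
# Sketch of a generator that emits one .cu file per head size, each containing
# an explicit template instantiation of the tile kernel. The macro name
# "DECL_FATTN_TILE_CASE" and the file names are illustrative assumptions.

HEAD_SIZES = [64, 80, 88, 96, 112, 128, 256]  # 88 newly added

TEMPLATE = """// This file has been autogenerated, do not edit manually.
#include "../fattn-tile.cuh"

DECL_FATTN_TILE_CASE({d});
"""

def generate_cu_files() -> dict:
    """Return a mapping of file name -> generated source (illustrative)."""
    return {
        f"fattn-tile-instance-d{d}.cu": TEMPLATE.format(d=d)
        for d in HEAD_SIZES
    }

files = generate_cu_files()
assert "fattn-tile-instance-d88.cu" in files
```

Generating one translation unit per head size keeps compile times manageable and lets the build compile instantiations in parallel, which is why adding D=88 only requires extending the generator's head-size list.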
2026-03-10 22:03:27 -04:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include ggml: add GATED_DELTA_NET op (#19504) 2026-03-07 15:41:10 +08:00
src ggml-cuda : add flash attention support for head size 88 2026-03-10 22:03:27 -04:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt ggml : bump version to 0.9.7 (ggml/1425) 2026-02-15 22:24:29 +02:00