llama.cpp/ggml
CaffeinatedBits 41e6c8caf4 ggml-cuda : add flash attention support for head size 88
Llama 4 vision models use a head dimension of D=88.
Previously, attention for this head size fell back to
unoptimized operations, causing excessive VRAM usage
and slow inference.

This adds D=88 to the CUDA flash attention tile backend.
Head size 88 is explicitly excluded from the Turing/Volta
WMMA and MMA Tensor Core checks, since those paths would
hit memory misalignment and segfaults at this size; the
exclusion forces dispatch to the TILE kernel instead.
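
The dispatch change described above can be modeled roughly as follows. This is a minimal Python sketch, not the actual ggml-cuda C++/CUDA code; the head-size sets and the function name are illustrative assumptions:

```python
# Hypothetical model of the flash-attention kernel dispatch described in the
# commit message; the real logic lives in ggml-cuda and is written in C++.

# Head sizes handled by the Tensor Core (WMMA/MMA) paths. 88 is deliberately
# absent (assumed set): those kernels would hit misaligned accesses at D=88.
TENSOR_CORE_HEAD_SIZES = {64, 80, 96, 112, 128, 256}

# Head sizes the generic tile kernel supports; this change adds 88.
TILE_HEAD_SIZES = {64, 80, 88, 96, 112, 128, 256}

def select_flash_attn_kernel(head_size: int, has_tensor_cores: bool) -> str:
    """Pick a kernel for the given head size (illustrative logic only)."""
    if has_tensor_cores and head_size in TENSOR_CORE_HEAD_SIZES:
        return "mma"       # fast Tensor Core path
    if head_size in TILE_HEAD_SIZES:
        return "tile"      # generic CUDA-core tile kernel
    return "fallback"      # unoptimized non-flash-attention ops

# D=88 (Llama 4 vision) now takes the tile kernel instead of the slow fallback,
# even on hardware with Tensor Cores.
assert select_flash_attn_kernel(88, has_tensor_cores=True) == "tile"
```

The point of the exclusion is that even on Tensor Core hardware, D=88 must skip the WMMA/MMA checks entirely so dispatch lands on the tile kernel rather than crashing.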

Also updates generate_cu_files.py to dynamically generate
the required template instance.
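
A generator of this kind could look roughly like the sketch below. It is written in the spirit of generate_cu_files.py, but the macro name, file-naming scheme, and head-size list are assumptions for illustration, not the real script:

```python
# Sketch of a generator that emits one .cu file per head size, each containing
# an explicit template instantiation of the tile kernel. The macro name
# "DECL_FATTN_TILE_CASE" and the file names are illustrative assumptions.

HEAD_SIZES = [64, 80, 88, 96, 112, 128, 256]  # 88 newly added

TEMPLATE = """// This file has been autogenerated, do not edit manually.
#include "../fattn-tile.cuh"

DECL_FATTN_TILE_CASE({d});
"""

def generate_cu_files() -> dict:
    """Return a mapping of file name -> generated source (illustrative)."""
    return {
        f"fattn-tile-instance-d{d}.cu": TEMPLATE.format(d=d)
        for d in HEAD_SIZES
    }

files = generate_cu_files()
assert "fattn-tile-instance-d88.cu" in files
```

Generating one translation unit per head size keeps compile times manageable and lets the build compile instantiations in parallel, which is why adding D=88 only requires extending the generator's head-size list.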
2026-03-10 22:03:27 -04:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include ggml: add GATED_DELTA_NET op (#19504) 2026-03-07 15:41:10 +08:00
src ggml-cuda : add flash attention support for head size 88 2026-03-10 22:03:27 -04:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt ggml : bump version to 0.9.7 (ggml/1425) 2026-02-15 22:24:29 +02:00