llama.cpp/ggml
Jeff Bolz 6efcd65945
vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num (the number of split_k partitions) can get rather large. Use the whole workgroup to reduce the M/L (softmax max and exp-sum) values.

Launch a thread for each element in the HSV (V head size) dimension of the output. This helps a
lot for large HSV (e.g. DeepSeek).
2025-07-08 20:11:42 +02:00
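
For background on what the reduce step does: split_k flash attention partitions the KV sequence into k_num chunks, each producing a partial output row plus its running softmax max M and exp-sum L, and the reduce pass merges the partials with the usual log-sum-exp rescaling. Below is a minimal scalar sketch of that reduction, with hypothetical C++ types and names and assuming each split stores an unnormalized partial output; the actual work happens in a Vulkan compute shader over packed GPU buffers.

```cpp
// Scalar reference for a split_k reduce (a sketch, not the actual shader).
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical per-split state; the real shader reads these from GPU buffers.
struct Split {
    float M;              // running softmax max for this split
    float L;              // running sum of exp(score - M) for this split
    std::vector<float> O; // unnormalized partial output, length = HSV
};

std::vector<float> split_k_reduce(const std::vector<Split> &splits, size_t hsv) {
    // Pass 1: global max over all splits. When k_num is large, this is the
    // loop the commit spreads across the whole workgroup.
    float M = -std::numeric_limits<float>::infinity();
    for (const Split &s : splits) M = std::max(M, s.M);

    // Pass 2: rescale each split's exp-sum to the common max.
    float L = 0.0f;
    for (const Split &s : splits) L += std::exp(s.M - M) * s.L;

    // Pass 3: combine partial outputs. On the GPU, the commit launches one
    // thread per element of this HSV-sized inner loop.
    std::vector<float> out(hsv, 0.0f);
    for (const Split &s : splits) {
        const float scale = std::exp(s.M - M);
        for (size_t d = 0; d < hsv; ++d) out[d] += scale * s.O[d];
    }
    for (size_t d = 0; d < hsv; ++d) out[d] /= L;
    return out;
}
```

The commit's two changes map onto this structure: passes 1 and 2 correspond to "use the whole workgroup to reduce the M/L values", and the per-element work in pass 3 is what gets one thread per HSV element.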
Name            Last commit message                                        Last commit date
cmake           ggml-cpu : rework weak alias on apple targets (#14146)     2025-06-16 13:54:15 +08:00
include         CUDA: add bilinear interpolation for upscale (#14563)      2025-07-08 10:11:18 +08:00
src             vulkan: optimize flash attention split_k_reduce (#14554)   2025-07-08 20:11:42 +02:00
.gitignore      vulkan : cmake integration (#8119)                         2024-07-13 18:12:39 +02:00
CMakeLists.txt  ggml : remove kompute backend (#14501)                     2025-07-03 07:48:32 +03:00