llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	fb349848f3	vulkan: Handle FA with all -inf mask values (#16447 )	2025-10-20 22:16:08 -05:00
Acly	e29acf74fe	vulkan : incremental shader builds (#16341 ) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-04 11:42:56 +02:00
Jeff Bolz	e308efda8e	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316 )	2025-10-03 10:33:08 +02:00
Jeff Bolz	e6d65fb02d	vulkan: support arbitrary KV dimension in flash attention (#16160 ) The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).	2025-09-27 22:43:39 +02:00
Jeff Bolz	94e82c7ead	vulkan: clamp matmul and FA results to the max finite value (#15652 ) * vulkan: clamp matmul and FA results to the max finite value * only clamp for fp16	2025-08-31 08:27:57 +02:00
Jeff Bolz	c4f53563df	vulkan: support fattn sinks (#15126 )	2025-08-07 22:44:20 +02:00
Jeff Bolz	a0374a67e2	vulkan: Handle updated FA dim2/3 definition (#14518 ) * vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1	2025-07-05 09:26:04 +02:00
Jeff Bolz	2b72bedec1	vulkan: support mixed/deepseekR1 FA head sizes (#14509 ) * vulkan: better parameterize FA by head sizes * vulkan: support mixed/deepseekR1 FA head sizes	2025-07-03 20:21:14 +02:00
Jeff Bolz	8875523eb3	vulkan: support softmax/FA batch and broadcast (#14449 )	2025-07-02 15:48:33 +03:00
Jeff Bolz	2f5a4e1e09	vulkan: move common FA code to flash_attn_base.comp (#13556 ) * vulkan: move common FA code to flash_attn_base.comp * vulkan: move common FA index/stride setup code to flash_attn_base.comp * build fix	2025-05-17 09:14:55 +02:00
Jeff Bolz	ab3971f2a0	vulkan: workaround FA compile failures on macos (#13517 )	2025-05-14 06:15:50 +02:00
Jeff Bolz	dc1d2adfc0	vulkan: scalar flash attention implementation (#13324 ) * vulkan: scalar flash attention implementation * vulkan: always use fp32 for scalar flash attention * vulkan: use vector loads in scalar flash attention shader * vulkan: remove PV matrix, helps with register usage * vulkan: reduce register usage in scalar FA, but perf may be slightly worse * vulkan: load each Q value once. optimize O reduction. more tuning * vulkan: support q4_0/q8_0 KV in scalar FA * CI: increase timeout to accommodate newly-supported tests * vulkan: for scalar FA, select between 1 and 8 rows * vulkan: avoid using Float16 capability in scalar FA	2025-05-10 08:07:07 +02:00

12 Commits