* vulkan: change gated_delta_net to shard a column across a subgroup
This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).
This fixes a perf regression from the transposing of the values in memory
(!20443).
* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
* vulkan: avoid graphics queue on non-RADV AMD drivers
* avoid graphics queues on small GPUs
* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE
* reenable transfer queue if graphics queue is not used
* ggml : transpose fused GDN state access for coalesced memory reads (#20436)
The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.
Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.
All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
path lacks device support, disable both to prevent state layout mismatch
between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* llama : rever fgdn argument changes
* graph : remove GDN state transposes
* vulkan : adapt
* cuda : remove obsolete smem code
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
* vulkan: add GATED_DELTA_NET op support
Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.
Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: optimize GATED_DELTA_NET shader (Phase 1)
- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops
KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: address review feedback for GATED_DELTA_NET
Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros,
scale in push constants, supports_op fix, dispatch restructuring.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: add explicit FLOAT_TYPE casts for buffer loads
Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts
to ensure correct behavior across all Vulkan configurations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: fix Q/K broadcast for interleaved head layout
Adapt to the interleaved broadcast convention from #20340:
head_id / rq1 → head_id % neq1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: optimize SSM_CONV workgroup dispatch for large ubatch
Tile tokens into 2D workgroups (32x16) to reduce workgroup launch
overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common
d_conv size). Fixes PP performance degradation with ubatch > 512.
Ref: ggml-org/llama.cpp#18725
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: remove unused shared memory declaration in SSM_CONV
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* ggml-Vulkan: add ELU support
* ggml-Vulkan: remove extra spaces and variables
* ggml-Vulkan: fix format issue
* ggml-Vulkan: fix format issue
* fix whitespace issue
* Update Vulkan.csv and ops.md
* vulkan: Fix data races in coopmat1 mul_mat(_id)
Add barriers between coopmat store and regular loads. We sort of got away with
this because it was the same subgroup accessing the values, but it's still a
race and may not work.
* switch to subgroup control barriers
* vulkan: fix and enable cpy_tensor_async function
* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore
* update offload_op logic
* fix missing transfer submission
* disable async transfer queue on AMD GCN
* revert op batch size change
* fix cpy_tensor_async checks
* vulkan: allow using fp16 in scalar flash attention shader
* split rows inside of subgroups for faster synchronization
* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1
* use f32 scalar FA if f16 is not supported by device
* fix amd workgroup size issue
* optimize masksh use
* add medium rows FA shader Br size
* fixes
* add padding to mask shmem buffer
* cache q values into registers for KQ
* fuse lf accumulation, pf and v accumulation into a loop
* stage K loads through shmem
* stage V loads through shmem
* only stage through shmem on Nvidia
* default to Bc 32
* also stage V through shmem when this is done for K
* dynamic subgroups for intel
* use vectorized stores
* use float_type for dequantize4 functions
* use smaller scalar rows size for smaller rows count
* relax flash attention split_k condition to allow non-gqa use
* use minimal subgroup size on Intel
* fix shmem support function
* fix rebase issues
* fixes
* Bc 4 for scalar FA is not a valid configuration
* Use wave32 on AMD RDNA for scalar FA
* add Intel shader core count lookup-table
* fix regressions
* device tuning
* tmpsh size fix
* fix editorconfig
* refactor fa tuning logic into a single place
* fix gqa opt logic
* fix block_rows with small n_rows
* amd tuning
* fix hsk=72/80 issue
* tuning
* allow condition skipping for column check
* use float16 for Of if available
* address feedback
* fix bad RDNA performance on head size <= 128 by limiting occupancy
* allow printing pipeline stats
* cleanup and fixes
* limit occupancy for GCN for small batch FA with large HSK
* disable f16 FA for GCN AMD GPUs on the proprietary driver
* vulkan: split mul_mat into multiple dispatches to avoid overflow
The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.
Fall back for the less common p021 and nc variants.
* address feedback
* fix vulkan ggml_acc only works in 3d but not 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check
* version without boundary check
* revert back to boundary check version
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.
I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.
I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.
Apply this optimization when the mask is relatively large (i.e. prompt
processing).
* vulkan: fix GPU deduplication logic.
As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.
Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.
Verified that MacOS + 4xVega still reports 4 GPUs with this version.
* vulkan: only skip dedup when both drivers are moltenVk
* vulkan: use coopmat for flash attention p*v matrix multiplication
* fix P loading issue
* fix barrier position
* remove reduction that is no longer needed
* move max thread reduction into loop
* remove osh padding
* add bounds checks and padding
* remove unused code
* fix shmem sizes, loop duration and accesses
* don't overwrite Qf, add new shared psh buffer instead
* add missing bounds checks
* use subgroup reductions
* optimize
* move bounds check, reduce barriers
* support other Bc values and other subgroup sizes
* remove D_split
* replace Of register array with shared memory Ofsh array
* parallelize HSV across the rowgroups
* go back to Of in registers, not shmem
* vectorize sfsh
* don't store entire K tile in shmem
* fixes
* load large k tiles to shmem on Nvidia
* adapt shared memory host check function to shader changes
* remove Bc 32 case
* remove unused variable
* fix missing mask reduction tmspsh barrier
* fix mask bounds check
* fix rowmax f16 under/overflow to inf
* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
Deduplication here relied on the fact that vulkan would return unique
UUID for different physical GPUs. It is at the moment not always the case.
On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total),
MotlenVK would assign same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.
See more details here: KhronosGroup/MoltenVK#2683.
The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such configuration.
The deduplication logic here is changed to only filter GPUs if UUID is
same but driver is different.
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.