* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
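The arithmetic behind this crash can be sketched as follows. This is an illustration only, not the actual ggml code: the helper split_rows and the choice of 2 GPUs are assumptions, and 32 is used as the finer granularity because that is the Q4_0 block size.

```python
def split_rows(n_rows: int, n_devs: int, granularity: int) -> list[int]:
    """Split n_rows across n_devs in whole blocks of `granularity` rows.

    Hypothetical helper for illustration; not part of ggml.
    """
    if n_rows % granularity != 0:
        raise ValueError("n_rows must be a multiple of granularity")
    n_blocks = n_rows // granularity
    base, extra = divmod(n_blocks, n_devs)
    # The first `extra` devices receive one additional block.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devs)]


coarse = split_rows(768, 2, 256)  # 3 blocks over 2 GPUs: cannot be even
fine = split_rows(768, 2, 32)     # 24 blocks over 2 GPUs: even split
print(coarse, fine)
```

With a granularity of 256 the 768 rows form 3 blocks, so two devices cannot receive equal shares; with 32 they form 24 blocks and split evenly.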
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
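A minimal sketch of what set/get with a non-zero offset involves when a tensor's bytes live on several devices. The ShardedBuffer class and its contiguous byte-range layout are assumptions made for illustration; the real meta backend may split tensors differently.

```python
class ShardedBuffer:
    """Toy model: tensor bytes split contiguously across devices (hypothetical)."""

    def __init__(self, shard_sizes):
        self.shards = [bytearray(n) for n in shard_sizes]
        self.total = sum(shard_sizes)

    def _pieces(self, offset, size):
        """Yield (shard index, local offset, length) covering [offset, offset + size)."""
        if offset < 0 or size < 0 or offset + size > self.total:
            raise ValueError("range out of bounds")
        start = 0
        for idx, shard in enumerate(self.shards):
            end = start + len(shard)
            lo, hi = max(offset, start), min(offset + size, end)
            if lo < hi:
                yield idx, lo - start, hi - lo
            start = end

    def set(self, data, offset=0):
        """Write `data` at a global byte offset, crossing shard boundaries if needed."""
        pos = 0
        for idx, local, n in self._pieces(offset, len(data)):
            self.shards[idx][local:local + n] = data[pos:pos + n]
            pos += n

    def get(self, offset, size):
        """Read `size` bytes starting at a global byte offset."""
        out = bytearray()
        for idx, local, n in self._pieces(offset, size):
            out += self.shards[idx][local:local + n]
        return bytes(out)


buf = ShardedBuffer([4, 4])
buf.set(b"abcd", offset=2)  # straddles the boundary between the two shards
print(buf.get(2, 4))
```

The point of the sketch is that a global offset has to be translated into a per-shard local offset before each copy.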
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for MoE for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVINO, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
* vulkan: change gated_delta_net to shard a column across a subgroup
This is based on https://github.com/ggml-org/llama.cpp/pull/20391. I used an
LLM to port the CUDA code to Vulkan and guided it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).
This fixes a perf regression from the transposing of the values in memory
(!20443).
* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
* vulkan: avoid graphics queue on non-RADV AMD drivers
* avoid graphics queues on small GPUs
* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE
* reenable transfer queue if graphics queue is not used
* ggml : transpose fused GDN state access for coalesced memory reads (#20436)
The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.
Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
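The effect of the index change can be checked with a small sketch. It uses the CUDA expressions quoted above with S_v = 128 and 4-byte floats; the helper names and the flat Python lists standing in for device memory are assumptions for illustration.

```python
S_V = 128          # state matrix side length, as stated in the commit message
FLOAT_BYTES = 4    # size of one float


def old_index(i, col):
    """Original access: curr_state[i*S_v + col]."""
    return i * S_V + col


def new_index(i, col):
    """Transposed access: curr_state[col*S_v + i]."""
    return col * S_V + i


def byte_stride(index_fn, col=5):
    """Distance in bytes between the elements read for consecutive values of i."""
    return (index_fn(1, col) - index_fn(0, col)) * FLOAT_BYTES


# Build the transposed storage so that both index functions see the same values.
row_major = [float(k) for k in range(S_V * S_V)]
transposed = [0.0] * (S_V * S_V)
for i in range(S_V):
    for col in range(S_V):
        transposed[new_index(i, col)] = row_major[old_index(i, col)]

print(byte_stride(old_index), byte_stride(new_index))
```

The old expression steps 512 bytes per increment of i, while the transposed one steps 4 bytes, which is what makes the reads contiguous.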
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.
All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
path lacks device support, disable both to prevent state layout mismatch
between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* llama : revert fgdn argument changes
* graph : remove GDN state transposes
* vulkan : adapt
* cuda : remove obsolete smem code
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
* vulkan: add GATED_DELTA_NET op support
Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.
Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: optimize GATED_DELTA_NET shader (Phase 1)
- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops
KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: address review feedback for GATED_DELTA_NET
Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros,
scale in push constants, supports_op fix, dispatch restructuring.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: add explicit FLOAT_TYPE casts for buffer loads
Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts
to ensure correct behavior across all Vulkan configurations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: fix Q/K broadcast for interleaved head layout
Adapt to the interleaved broadcast convention from #20340:
head_id / rq1 → head_id % neq1
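The difference between the two mappings can be sketched as below. The interpretation of neq1 as the number of Q/K heads and rq1 as the ratio of total heads to Q/K heads is an assumption based on the names; the head counts are made up for the example.

```python
def blocked_qk_head(head_id, rq1):
    """Old mapping: consecutive groups of rq1 heads share one Q/K head."""
    return head_id // rq1


def interleaved_qk_head(head_id, neq1):
    """New mapping: Q/K heads repeat cyclically across the heads."""
    return head_id % neq1


n_heads = 8            # hypothetical number of heads
neq1 = 2               # hypothetical number of Q/K heads
rq1 = n_heads // neq1  # assumed meaning: broadcast ratio

blocked = [blocked_qk_head(h, rq1) for h in range(n_heads)]
interleaved = [interleaved_qk_head(h, neq1) for h in range(n_heads)]
print(blocked)
print(interleaved)
```

Under these assumptions the old mapping groups heads as 0,0,0,0,1,1,1,1 while the interleaved convention yields 0,1,0,1,0,1,0,1.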
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: optimize SSM_CONV workgroup dispatch for large ubatch
Tile tokens into 2D workgroups (32x16) to reduce workgroup launch
overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common
d_conv size). Fixes PP performance degradation with ubatch > 512.
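A rough sketch of why tiling reduces the number of workgroups. Only the 32x16 tile shape comes from the message above; which axis covers channels and which covers tokens, the baseline of one workgroup per token, and the example sizes are all assumptions.

```python
TILE_X, TILE_Y = 32, 16  # tile shape from the commit message; axis assignment is assumed


def ceil_div(a, b):
    return (a + b - 1) // b


def workgroups_per_token(n_channels, n_tokens):
    """Hypothetical baseline: one workgroup per (32-channel slice, token)."""
    return ceil_div(n_channels, TILE_X) * n_tokens


def workgroups_tiled(n_channels, n_tokens):
    """Tiled dispatch: each workgroup covers 32 channels and 16 tokens."""
    return ceil_div(n_channels, TILE_X) * ceil_div(n_tokens, TILE_Y)


baseline = workgroups_per_token(4096, 1024)
tiled = workgroups_tiled(4096, 1024)
print(baseline, tiled)
```

For these made-up sizes the tiled dispatch launches 16 times fewer workgroups, and the saving grows with the ubatch size.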
Ref: ggml-org/llama.cpp#18725
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: remove unused shared memory declaration in SSM_CONV
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* ggml-Vulkan: add ELU support
* ggml-Vulkan: remove extra spaces and variables
* ggml-Vulkan: fix format issue
* ggml-Vulkan: fix format issue
* fix whitespace issue
* Update Vulkan.csv and ops.md
* vulkan: Fix data races in coopmat1 mul_mat(_id)
Add barriers between coopmat store and regular loads. We sort of got away with
this because it was the same subgroup accessing the values, but it's still a
race and may not work.
* switch to subgroup control barriers
* vulkan: fix and enable cpy_tensor_async function
* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore
* update offload_op logic
* fix missing transfer submission
* disable async transfer queue on AMD GCN
* revert op batch size change
* fix cpy_tensor_async checks
* vulkan: allow using fp16 in scalar flash attention shader
* split rows inside of subgroups for faster synchronization
* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1
* use f32 scalar FA if f16 is not supported by device
* fix amd workgroup size issue
* optimize masksh use
* add medium rows FA shader Br size
* fixes
* add padding to mask shmem buffer
* cache q values into registers for KQ
* fuse lf accumulation, pf and v accumulation into a loop
* stage K loads through shmem
* stage V loads through shmem
* only stage through shmem on Nvidia
* default to Bc 32
* also stage V through shmem when this is done for K
* dynamic subgroups for intel
* use vectorized stores
* use float_type for dequantize4 functions
* use smaller scalar rows size for smaller rows count
* relax flash attention split_k condition to allow non-gqa use
* use minimal subgroup size on Intel
* fix shmem support function
* fix rebase issues
* fixes
* Bc 4 for scalar FA is not a valid configuration
* Use wave32 on AMD RDNA for scalar FA
* add Intel shader core count lookup-table
* fix regressions
* device tuning
* tmpsh size fix
* fix editorconfig
* refactor fa tuning logic into a single place
* fix gqa opt logic
* fix block_rows with small n_rows
* amd tuning
* fix hsk=72/80 issue
* tuning
* allow condition skipping for column check
* use float16 for Of if available
* address feedback
* fix bad RDNA performance on head size <= 128 by limiting occupancy
* allow printing pipeline stats
* cleanup and fixes
* limit occupancy for GCN for small batch FA with large HSK
* disable f16 FA for GCN AMD GPUs on the proprietary driver
* vulkan: split mul_mat into multiple dispatches to avoid overflow
The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.
Fall back for the less common p021 and nc variants.
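The splitting scheme can be sketched on the host side as follows. The function name is hypothetical, and 65535 is used as the limit because it is the minimum maxComputeWorkGroupCount the Vulkan specification guarantees; real devices may report larger values.

```python
def split_dispatches(n_batches, max_workgroups):
    """Return (base_index, count) pairs that together cover range(n_batches).

    Each pair stands for one dispatch; base_index is what would be passed
    to the shader through a push constant.
    """
    if max_workgroups <= 0:
        raise ValueError("max_workgroups must be positive")
    return [
        (base, min(max_workgroups, n_batches - base))
        for base in range(0, n_batches, max_workgroups)
    ]


chunks = split_dispatches(150_000, 65_535)
print(chunks)
```

Inside the shader, the effective batch index would then be the base index plus the workgroup id along that dimension.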
* address feedback
* fix Vulkan ggml_acc, which only worked in 3D but not in 4D
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion, except keep the boundary check
* version without boundary check
* revert back to boundary check version
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change
does the same for Vulkan. This helps particularly with large head sizes, which
are very register-limited.
I tried this for the coopmat1 path and it slowed down a bit. I didn't try it for
the scalar path.
I applied the softmax bias that the CUDA backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.
Apply this optimization when the mask is relatively large (i.e. prompt
processing).
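A minimal sketch of the idea, under a loud assumption: the text above does not name the two common cases, so this example guesses that they are "block is entirely zero" and "block is entirely -inf". The code values, the 1D mask, and the packing order are likewise hypothetical.

```python
import math

# Hypothetical code values; each fits in 2 bits.
MIXED, ALL_ZERO, ALL_NEG_INF = 0, 1, 2


def classify_block(block):
    """Return the code for one mask block (assumed cases: all 0 or all -inf)."""
    if all(v == 0.0 for v in block):
        return ALL_ZERO
    if all(v == -math.inf for v in block):
        return ALL_NEG_INF
    return MIXED


def classify_mask(mask, block_size):
    """Classify consecutive blocks of a flattened mask."""
    return [classify_block(mask[i:i + block_size])
            for i in range(0, len(mask), block_size)]


def pack_codes(codes):
    """Pack 2-bit codes four per byte, lowest bits first."""
    out = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        out[i // 4] |= code << (2 * (i % 4))
    return bytes(out)


NEG = -math.inf
mask = [0.0] * 4 + [NEG] * 4 + [0.0, NEG, 0.0, 0.0]
codes = classify_mask(mask, block_size=4)
packed = pack_codes(codes)
print(codes, packed)
```

A consumer would load the real mask values only for blocks coded MIXED and use the known constant for the other two codes.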
* vulkan: fix GPU deduplication logic.
As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for Windows + Intel iGPU.
Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.
Verified that macOS + 4xVega still reports 4 GPUs with this version.
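The resulting rule can be sketched as below. The dictionary layout, the driver strings, and the choice to keep the first device seen are assumptions for illustration; the actual Vulkan backend works on physical-device properties and may prefer one driver over another.

```python
def dedup_devices(devices):
    """Drop devices whose UUID was already seen, except among MoltenVK devices.

    `devices` is a list of dicts with "uuid" and "driver" keys (hypothetical layout).
    """
    kept = []
    for dev in devices:
        duplicate = any(
            dev["uuid"] == other["uuid"]
            and not (dev["driver"] == "MoltenVK" and other["driver"] == "MoltenVK")
            for other in kept
        )
        if not duplicate:
            kept.append(dev)
    return kept


# One GPU exposed by two different drivers: only one entry survives.
same_gpu = [
    {"uuid": "A", "driver": "driver-x"},
    {"uuid": "A", "driver": "driver-y"},
]
# Four MoltenVK devices that report the same UUID: all four are kept.
vegas = [{"uuid": "V", "driver": "MoltenVK"} for _ in range(4)]

print(len(dedup_devices(same_gpu)), len(dedup_devices(vegas)))
```

This mirrors the verification mentioned above, where a machine with four Vega GPUs under MoltenVK must still report four devices.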
* vulkan: only skip dedup when both drivers are MoltenVK