llama.cpp/ggml/src/ggml-sycl
Johannes Gäßler d6f3030047
ggml: backend-agnostic tensor parallelism (experimental) (#19378)
* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the segfault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.
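
The selection logic described above can be sketched as follows. This is an illustrative Python sketch, not the actual ggml implementation; the `Device` type and `common_host_buffer_type` helper are hypothetical stand-ins:

```python
from collections import namedtuple

# Hypothetical stand-in for a backend device and the host buffer type it
# exposes; illustrative only, not the actual ggml API.
Device = namedtuple("Device", ["name", "host_buffer_type"])

def common_host_buffer_type(devices):
    # Use a device-specific host buffer type (e.g. CUDA pinned memory) only
    # if every underlying backend exposes the same one; otherwise fall back
    # to plain pageable host memory (represented here as None).
    types = {dev.host_buffer_type for dev in devices}
    return next(iter(types)) if len(types) == 1 else None

# Two CUDA devices agree -> pinned memory can be used
assert common_host_buffer_type([Device("cuda0", "cuda_pinned"),
                                Device("cuda1", "cuda_pinned")]) == "cuda_pinned"
# Mixed backends disagree -> fall back to pageable host memory
assert common_host_buffer_type([Device("cuda0", "cuda_pinned"),
                                Device("vulkan0", "vk_host")]) is None
```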

* partial Qwen 3 Next support

* Fix qwen3 30b (#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type
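
The arithmetic behind this fix can be sketched as follows; a minimal illustrative Python helper (the function name is hypothetical, not the actual implementation):

```python
def even_split(n: int, n_devs: int, granularity: int):
    # Per-device share if n splits evenly into granularity-aligned chunks,
    # else None (the unsupported case that used to crash). Illustrative only.
    per_dev = n // n_devs
    if per_dev * n_devs != n or per_dev % granularity != 0:
        return None
    return per_dev

# Qwen-30B-A3B Q4_0 has an intermediate dimension of 768:
assert even_split(768, 2, 256) is None  # granularity 256 forces an uneven split
assert even_split(768, 2, 32) == 384    # Q4_0 block size (32) splits evenly
```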

* Fix crashes due to KV cache serialization (#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend for setting/getting a tensor with a non-zero offset.
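
What non-zero-offset support entails can be sketched as mapping a byte range on the full tensor to per-shard sub-ranges. An illustrative Python sketch, assuming the tensor is laid out as contiguous per-device shards (the helper is hypothetical, not the meta backend's actual code):

```python
def scatter_range(offset, size, shard_sizes):
    # Map the byte range [offset, offset + size) on the full tensor to
    # (shard_index, local_offset, local_size) chunks. Illustrative only.
    out = []
    start = 0
    for i, shard_size in enumerate(shard_sizes):
        lo = max(offset, start)
        hi = min(offset + size, start + shard_size)
        if lo < hi:
            out.append((i, lo - start, hi - lo))
        start += shard_size
    return out

# 1024-byte tensor split 512/512; write 256 bytes starting at offset 384:
# the range straddles the shard boundary, 128 bytes to each device.
assert scatter_range(384, 256, [512, 512]) == [(0, 384, 128), (1, 0, 128)]
```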

* metal : fix build (#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of synchronous copies.

* Enable the previous allreduce implementation. It is better in both performance and stability (#12)

* delay AllReduce for MoE for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (#17)

* meta : formatting, naming, indentation (#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-09 16:42:19 +02:00
dpct [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190) 2026-03-08 12:00:07 +08:00
template-instances sycl : add flash-attn support for head size 512 (#21654) 2026-04-09 09:36:48 +03:00
CMakeLists.txt [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190) 2026-03-08 12:00:07 +08:00
add-id.cpp sycl : fix wrong variable check by assert (#20903) 2026-03-25 11:48:37 +02:00
add-id.hpp [SYCL] Support gpt-oss by OPs add-id, mul_mat for mxfp4, swiglu_oai (#17826) 2025-12-15 10:35:15 +08:00
backend.hpp [SYCL] enhance UPSCALE to support all UT cases (#20637) 2026-03-17 10:01:52 +08:00
binbcast.cpp support permuted, remove check s0/s10 (#19889) 2026-02-26 10:27:20 +08:00
binbcast.hpp [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521) 2025-10-12 21:53:35 +08:00
common.cpp SYCL: Remove misleading ggml_sycl_op_flatten function (#12387) 2025-03-31 11:25:24 +02:00
common.hpp sycl : support nvfp4 type in mul_mat (#21227) 2026-04-01 13:54:15 +03:00
concat.cpp sycl: add CONCAT operator support (#16047) 2025-11-06 11:02:33 +01:00
concat.hpp SYCL: Refactor ggml_sycl_compute_forward (#11121) 2025-01-10 08:13:03 +08:00
conv.cpp Revert "sycl: add usage of enqueue_functions extension (#14244)" (#15910) 2025-09-12 09:15:12 +08:00
conv.hpp SYCL: Refactor ggml_sycl_compute_forward (#11121) 2025-01-10 08:13:03 +08:00
convert.cpp sycl : support nvfp4 type in mul_mat (#21227) 2026-04-01 13:54:15 +03:00
convert.hpp fix op rope, add rope_back (#20293) 2026-03-11 09:53:34 +08:00
count-equal.cpp [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190) 2026-03-08 12:00:07 +08:00
count-equal.hpp [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521) 2025-10-12 21:53:35 +08:00
cpy.cpp sycl : support to malloc memory on device more than 4GB, update the doc and script (#17566) 2025-11-29 14:59:44 +02:00
cpy.hpp SYCL: Add set_rows support for quantized types (#14883) 2025-07-28 20:32:15 +05:30
dequantize.hpp [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527) 2026-04-07 16:12:49 +08:00
dmmv.cpp [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527) 2026-04-07 16:12:49 +08:00
dmmv.hpp llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
element_wise.cpp [SYCL] enhance UPSCALE to support all UT cases (#20637) 2026-03-17 10:01:52 +08:00
element_wise.hpp [SYCL] enhance UPSCALE to support all UT cases (#20637) 2026-03-17 10:01:52 +08:00
fattn-common.hpp [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190) 2026-03-08 12:00:07 +08:00
fattn-tile.cpp sycl : add flash-attn support for head size 512 (#21654) 2026-04-09 09:36:48 +03:00
fattn-tile.hpp sycl : add flash-attn support for head size 512 (#21654) 2026-04-09 09:36:48 +03:00
fattn-vec.hpp sycl : add flash-attn support for head size 512 (#21654) 2026-04-09 09:36:48 +03:00
fattn.cpp sycl : add flash-attn support for head size 512 (#21654) 2026-04-09 09:36:48 +03:00
fattn.hpp [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190) 2026-03-08 12:00:07 +08:00
gated_delta_net.cpp sycl : fix for untransposed GDA recurrent state (#20583) 2026-03-15 19:10:15 +01:00
gated_delta_net.hpp add op gated_delta_net (#20455) 2026-03-14 22:01:57 +08:00
gemm.hpp sycl: Batched mulmat rework for oneDNN dispatch (#14617) 2025-07-14 10:37:35 +01:00
getrows.cpp Revert "sycl: add usage of enqueue_functions extension (#14244)" (#15910) 2025-09-12 09:15:12 +08:00
getrows.hpp SYCL: Remove misleading ggml_sycl_op_flatten function (#12387) 2025-03-31 11:25:24 +02:00
ggml-sycl.cpp ggml: backend-agnostic tensor parallelism (experimental) (#19378) 2026-04-09 16:42:19 +02:00
gla.cpp Revert "sycl: add usage of enqueue_functions extension (#14244)" (#15910) 2025-09-12 09:15:12 +08:00
gla.hpp SYCL: Add gated linear attention kernel (#11175) 2025-01-15 11:20:17 +08:00
im2col.cpp Revert "sycl: add usage of enqueue_functions extension (#14244)" (#15910) 2025-09-12 09:15:12 +08:00
im2col.hpp SYCL: Remove misleading ggml_sycl_op_flatten function (#12387) 2025-03-31 11:25:24 +02:00
mmq.cpp Revert "sycl: add usage of enqueue_functions extension (#14244)" (#15910) 2025-09-12 09:15:12 +08:00
mmq.hpp llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
mmvq.cpp [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527) 2026-04-07 16:12:49 +08:00
mmvq.hpp llama : reorganize source code + improve CMake (#8006) 2024-06-26 18:33:02 +03:00
norm.cpp fix for failed UT case: ACC, L2_NORM, UPSCALE, fused_glu, unary (#20283) 2026-03-11 09:53:05 +08:00
norm.hpp sycl: add RMS_NORM_BACK operation support (#16808) 2025-10-29 14:14:39 +08:00
outprod.cpp Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels no longer work. (#19246) 2026-02-02 21:06:21 +08:00
outprod.hpp SYCL: Refactor ggml_sycl_compute_forward (#11121) 2025-01-10 08:13:03 +08:00
pad.cpp [SYCL] Support gpt-oss by OPs add-id, mul_mat for mxfp4, swiglu_oai (#17826) 2025-12-15 10:35:15 +08:00
pad.hpp [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521) 2025-10-12 21:53:35 +08:00
pad_reflect_1d.cpp refactor pad_reflect_1d to make the UT case pass (#17204) 2025-11-28 08:50:56 +08:00
pad_reflect_1d.hpp refactor pad_reflect_1d to make the UT case pass (#17204) 2025-11-28 08:50:56 +08:00
presets.hpp [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190) 2026-03-08 12:00:07 +08:00
quantize.hpp sycl: refactor quantization to q8_1 (#14815) 2025-07-28 11:05:53 +01:00
quants.hpp [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527) 2026-04-07 16:12:49 +08:00
repeat_back.cpp SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) (#16869) 2025-11-03 09:35:33 +08:00
repeat_back.hpp sycl: add REPEAT_BACK operation support (#16734) 2025-10-27 09:19:50 +08:00
roll.cpp sycl: add ROLL operation support (#16665) 2025-10-27 09:20:24 +08:00
roll.hpp sycl: add ROLL operation support (#16665) 2025-10-27 09:20:24 +08:00
rope.cpp fix op rope, add rope_back (#20293) 2026-03-11 09:53:34 +08:00
rope.hpp fix op rope, add rope_back (#20293) 2026-03-11 09:53:34 +08:00
set.cpp SYCL SET operator optimized for F32 tensors (#16350) 2025-10-17 10:36:40 +08:00
set.hpp SYCL SET operator optimized for F32 tensors (#16350) 2025-10-17 10:36:40 +08:00
set_rows.cpp ggml : implement set_rows with i32 index (#16159) 2025-09-22 19:13:00 +02:00
set_rows.hpp SYCL: Initial set_rows kernel implementation (#14562) 2025-07-10 09:29:38 +01:00
softmax.cpp [SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190) 2026-03-08 12:00:07 +08:00
softmax.hpp [SYCL] refactor soft_max, add soft_max_back (#16472) 2025-10-09 10:25:11 +03:00
ssm_conv.cpp [SYCL] Support gpt-oss by OPs add-id, mul_mat for mxfp4, swiglu_oai (#17826) 2025-12-15 10:35:15 +08:00
ssm_conv.hpp sycl: add SSM_CONV operation support (#16800) 2025-10-28 09:50:33 +08:00
sycl_hw.cpp sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973) 2025-06-25 18:09:55 +02:00
sycl_hw.hpp sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973) 2025-06-25 18:09:55 +02:00
tsembd.cpp ggml : fix padding in timestep embedding kernels (#15932) 2025-09-16 15:25:57 +02:00
tsembd.hpp SYCL: Refactor ggml_sycl_compute_forward (#11121) 2025-01-10 08:13:03 +08:00
type.hpp sycl : support nvfp4 type in mul_mat (#21227) 2026-04-01 13:54:15 +03:00
upscale.cpp [SYCL] enhance UPSCALE to support all UT cases (#20637) 2026-03-17 10:01:52 +08:00
upscale.hpp [SYCL] enhance UPSCALE to support all UT cases (#20637) 2026-03-17 10:01:52 +08:00
vecdotq.hpp [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527) 2026-04-07 16:12:49 +08:00
wkv.cpp Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels no longer work. (#19246) 2026-02-02 21:06:21 +08:00
wkv.hpp llama: Add support for RWKV v7 architecture (#12412) 2025-03-18 07:27:50 +08:00