llama.cpp

Commit Graph

Author	SHA1	Message	Date
Johannes Gäßler	d6f3030047	ggml: backend-agnostic tensor parallelism (experimental) (#19378 ) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (#8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation. * Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (#9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (#7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. It is better in both perf and stability (#12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (#16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (#17) * meta : formatting, naming, indentation (#18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-09 16:42:19 +02:00
lhez	95a6ebabb2	opencl: fix leak in Adreno q8_0 path (#21212 )	2026-04-01 12:54:58 -07:00
shaofeiqi	08f21453ae	opencl: add q4_K gemm and gemv kernels for Adreno (#20919 ) * opencl: add q4_K gemm and gemv kernels for Adreno * opencl: fix whitespace * opencl: add workarounds for compiler bugs on older devices * opencl: handle fp16 denorm on X Elite * opencl: fix kernel build error * opencl: fix whitespace * opencl: make q4_K cvt kernels signature consistent --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-03-30 12:19:16 -07:00
lhez	ded446b34c	opencl: allow large buffer for adreno (#20997 )	2026-03-26 08:52:21 -07:00
lhez	1772701f99	opencl: add q6_K gemm and gemv kernels for Adreno (#20089 ) * opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code * opencl: add q6_K transpose * opencl: fix cvt kernel name * opencl: add call to q6_K gemv * opencl: fix q6_K scale transpose * opencl: fix loading for gemv q6_K, refactor * opencl: fix transpose_8_buf kernel assignment, refactor * opencl: refactor q6_K transpose * opencl: add gemm_noshuffle_q6_k_f32 * opencl: fix qh loading * opencl: refactor q6_K gemv host side, release bufs and imgs * opencl: refactor * opencl: fix q6_K dequant and scale selection * opencl: workaround compiler bug, fix dump_tensor * opencl: refactor q6_K convert kernels * opencl: unpack transformed q6_K in get_tensor * opencl: refactor, handle non-uniform workgroups * opencl: support non-vector subgroup bcast	2026-03-23 12:44:18 -07:00
shaofeiqi	84ffd0c192	opencl: add flattened Q4_K mv and general Q4_K mm (#20773 )	2026-03-22 22:45:11 -07:00
lhez	0516e04bf9	opencl: use larger workgroup size for get_rows (#20316 )	2026-03-11 22:03:27 -07:00
shaofeiqi	3d9ab225e7	opencl: add cumsum op (#18981 ) * OpenCL: add CUMSUM op support * remove unused argument * opencl: refactor cumsum * opencl: refactor * opencl: refactor tmp buffer * opencl: adjust max number of subgroups * opencl: fix whitespace * opencl: fix global size when cumsum the tmp buffer --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-03-11 22:03:07 -07:00
lhez	6fce5c6a7d	opencl: add l2_norm (#20160 )	2026-03-06 18:03:05 -08:00
Aaron Teo	ba2ff79e43	ggml: update comments for backends which have no memory to report (#20157 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-06 23:24:38 +08:00
lhez	6c97bffd65	opencl: add neg, exp and diag (#20127 ) * opencl: add `neg` * opencl: add `exp` * opencl: add `diag`	2026-03-05 21:16:39 -08:00
Marcel Petrick	92f7da00b4	chore : correct typos [no ci] (#20041 ) * fix(docs): correct typos found during code review Non-functional changes only: - Fixed minor spelling mistakes in comments - Corrected typos in user-facing strings - No variables, logic, or functional code was modified. Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> * Update docs/backend/CANN.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8" This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256. * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-05 08:50:21 +01:00
lhez	69fd345335	opencl: add `SET`, support i32 for `CPY`, minor refactor for cpy (#20101 )	2026-03-04 21:32:26 -08:00
shaofeiqi	24350fdf9b	opencl: add optimized q4_1 mm kernel for adreno (#19840 ) * Add Q4_1 OpenCL Kernels * opencl: refactor transpose * opencl: format * opencl: refactor q4_1 unpack * opencl: move `ggml_cl_mul_mat_q4_1_f32_adreno` * opencl: refactor `ggml_cl_mul_mat_q4_1_f32_adreno` and kernels * opencl: rename kernel files and kernes * opencl: fix build for non adreno * opencl: move code around and format --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-03-02 19:49:41 -08:00
shaofeiqi	e2f19b320f	opencl: refactor expm1 and softplus (#19404 ) * opencl: refactor expm1 * opencl: refactor softplus * opencl: use h for half literals --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-02-17 14:47:18 -08:00
shaofeiqi	983559d24b	opencl: optimize mean and sum_row kernels (#19614 ) * opencl: optimize mean and sum_row kernels * opencl: add comment for max subgroups * opencl: format --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-02-17 13:56:09 -08:00
lhez	79cc0f2daf	opencl: add basic support for q4_1 (#19534 ) * opencl: add q4_1 mv * opencl: clean up * opencl: add flattened q4_1 mv * opencl: clean up * opencl: add basic q4_1 mm * opencl: fix whitespace * opencl: add general q4_0 mm	2026-02-12 14:52:37 -08:00
lhez	4d3daf80f8	opencl: add general Q6_K mm and Q4_K mv (#19347 ) * opencl: add general q6_k mm * opencl: refine condition for q6_K mm * opencl: add general q4_K mv * opencl: fix whitespace	2026-02-11 10:33:13 -08:00
lhez	91ea44e89b	opencl: refactor some ops, concat, repeat, tanh and scale (#19226 ) * opencl: refactor concat * opencl: refactor repeat * opencl: refactor tanh * opencl: enable fp16 for tanh * opencl: refactor scale * opencl: fix unused variables	2026-02-02 15:54:43 -08:00
Christian Kastner	7a4ca3cbd9	docs : Minor cleanups (#19252 ) * Update old URLs to github.com/ggml-org/ * Bump copyrights	2026-02-02 08:38:55 +02:00
shaofeiqi	971facc38e	opencl: add optimized q8_0 mm kernel for adreno (#18871 ) * Add Q8_0 OpenCL kernel Co-authored-by: yunjie <yunjie@qti.qualcomm.com> * opencl: fix build for non-adreno * opencl: refactor q8_0 * opencl: enforce subgroup size of 64 for adreno for q8_0 * For A750 and older generations, subgroup size can be 64 or 128. This kernel assumes subgroup size 64. * opencl: suppress warning when adreno kernels are disabled --------- Co-authored-by: yunjie <yunjie@qti.qualcomm.com> Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-01-30 10:19:27 -08:00
lhez	94eeb5967c	opencl: add flattened q6_K mv (#19054 ) * opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat` * opencl: clean up * opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat` * opencl: tweak the workgroup size a bit * opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat` * opencl: proper alignment for q6_K * opencl: boundary handling for flattened q6_K mv * opencl: rename q6_K mv kernel file * opencl: put flattened q6_K mv in its own file * opencl: use lower k in file name * opencl: use K in variable names	2026-01-26 19:36:24 -08:00
lhez	9c96465f99	opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (#18970 ) * opencl: add `copy_to_contiguous` and utilize mm kernels * opencl: only copy to cont for f32 and f16 tensors * opencl: use cont mm for fallback when dst is large * opencl: use nb local to copy-to-cont * opencl: use local offset as well	2026-01-22 10:29:25 -08:00
shaofeiqi	5516b9c16a	opencl: add TRI op support (#18979 )	2026-01-21 22:05:54 -08:00
Georgi Gerganov	365a3e8c31	ggml : add ggml_build_forward_select (#18550 ) * ggml : add ggml_build_forward_select * cuda : adapt CUDA graph compat to new feature * vulkan : update logic to handle command buffer closing * ggml : check compute for fusion * ggml : add comment	2026-01-19 20:03:19 +02:00
shaofeiqi	785a710085	OpenCL: add SOLVE_TRI op support (#18846 )	2026-01-15 11:17:17 -08:00
shaofeiqi	707cbafcaa	opencl: add SOFTPLUS op support (#18726 )	2026-01-10 21:57:44 -08:00
shaofeiqi	593da7fa49	opencl: add EXPM1 op (#18704 )	2026-01-09 10:13:13 -08:00
Aaron Teo	046d5fd44e	llama: use host memory if device reports 0 memory (#18587 )	2026-01-09 05:34:56 +08:00
shaofeiqi	568371a726	opencl: add FILL op support (#18682 )	2026-01-07 22:04:50 -08:00
lhez	08566977a7	opencl: allow resizing transpose buffers (#18384 ) * opencl: allow resizing transpose buffers instead of using fixed sizes * opencl: remove commented code	2025-12-27 15:51:14 -08:00
lhez	eb492bf43f	opencl: unpack q4_0 for adreno in get_tensor (#18278 )	2025-12-22 10:19:01 -08:00
Phylliida Dev	09c7c50e64	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985 ) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-06 15:07:02 +01:00
Tarek Dakhran	2ba719519d	model: LFM2-VL fixes (#17577 ) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-11-30 21:57:31 +01:00
lhez	7cba58bbea	opencl: add sqr, sqrt, mean and ssm_conv (#17476 ) * opencl: add sqr * opencl: add sqrt * opencl: add mean * opencl: add ssm_conv * opencl: add missing cl_khr_fp16 * opencl: do sqrt in f32 then convert to f16 for better precision	2025-11-26 13:29:58 -08:00
lhez	8e9ddba610	opencl: refine condition for kqv mm (#17392 )	2025-11-21 14:34:48 -08:00
lhez	52e5d421f1	opencl: fix rms_norm_mul (#17250 ) * opencl: use subgrroup reduce for reduction in rms_norm_mul * opencl: add comment about workgroup size	2025-11-15 17:40:14 -08:00
shaofeiqi	4db5641210	opencl: add kernel to handle mat mul in attention to improve encoding speed (#17181 ) * Add mul_mm_f16_f32_kq_kqv kernel * Add ggml_cl_mul_mat_kq_kqv_adreno func * fix whitespace * remove unused variable * remove redundant * refactor and clean up * remove trailing whitespace	2025-11-15 17:33:10 -08:00
lhez	ece0f5c177	opencl: add fastdiv and use it in set_rows, ported from cuda (#17090 ) * opencl: add fastdiv for mm q8_0 * opencl: use uint4 for fastdiv vals * opencl: use fastdiv for set_rows * opencl: do not use fastdiv for q8_0 mm	2025-11-10 15:00:13 -08:00
Acly	1032256ec9	cuda/vulkan : bicubic interpolation (#17022 ) * vulkan : implement upscale with bicubic interpolation * cuda : implement upscale with bicubic interpolation * tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests * adapt OpenCL backend to not support the OP in that case so tests don't fail * print scale mode & flags in test-backend-ops	2025-11-10 10:19:39 +01:00
lhez	c5023daf60	opencl: support imrope (#16914 ) * opencl: support imrope * opencl: fix whitespace	2025-11-03 11:47:57 -08:00
Acly	10640e31aa	ggml : fix interpolate with align-corners and ne=1 (#16700 ) * ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning	2025-10-27 21:50:22 +01:00
lhez	6ea37f5739	opencl: fix warnings and clean up profiling (#16688 ) * opencl: remove unused headers, fix warnings * opencl: clean up profiling, only keep kernel time	2025-10-20 22:26:17 -07:00
Shawn Gu	81387858f1	opencl: transposed gemm/gemv moe kernel with mxfp4,f32 (#16602 ) * opencl: transposed gemm/gemv moe kernel with mxfp4,f32 * add restore kernel for moe transpose * fix trailing whitespaces * resolve compilation warnings	2025-10-17 17:55:32 -07:00
lhez	0cb7a0683b	opencl: add q8_0 mm support (#16469 ) * opencl: add mm_q8_0_f32 * opencl: fix data loading for incomplete tile * opencl: use q8_0 mm for larger matrix * opencl: add some tests to cover the path	2025-10-15 10:51:04 -07:00
Aman Gupta	120bf7046d	CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (#16577 )	2025-10-14 07:48:08 -07:00
lhez	5016b72862	opencl: fix build targeting CL 2 (#16554 )	2025-10-13 11:50:37 -07:00
lhez	7c156df414	opencl: support pad_ext (#15888 )	2025-09-30 10:45:45 -07:00
lhez	d1c84a662d	opencl: support ne3 in get_rows (#15866 )	2025-09-30 09:55:13 -07:00
Sigbjørn Skjæret	3ecb2f671a	ggml : implement set_rows with i32 index (#16159 ) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (#16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-22 19:13:00 +02:00

113 Commits