llama.cpp

Commit Graph

Author	SHA1	Message	Date
Aman Gupta	c1b187688d	CUDA: skip fusion for repeating adds in bias (#17080)	2025-11-08 16:58:05 +08:00
SavicStefan	b8a5cfd11a	vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (#16636) Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>	2025-11-08 09:28:22 +01:00
Jeff Bolz	b4e335d8dc	vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (#16977) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.	2025-11-08 08:52:15 +01:00
Jeff Bolz	d6fe40fa00	vulkan: Fix test-thread-safety crashes (#17024) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.	2025-11-08 08:39:45 +01:00
Johannes Gäßler	e14e842e87	CUDA: fix MMQ stream-k fixup ne1 indices (#17089)	2025-11-08 08:26:18 +01:00
Reese Levine	647b960bd8	ggml webgpu: faster matrix multiplication/matrix-vector multiplication (#17031) * Faster tensors (#8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings	2025-11-07 19:27:20 -08:00
bssrdf	299f5d782c	CUDA: properly handle nb00=nb02 case for cpy (#17081)	2025-11-07 23:41:58 +01:00
Acly	ac76d36201	vulkan : refactor buffer handling in vk_op_f32 (#16840) * vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer	2025-11-07 21:08:50 +01:00
Johannes Gäßler	6515610506	CUDA: fix should_use_mmvf for ne11 == 1 (#17085) * CUDA: fix should_use_mmvf for ne11 == 1 * Apply suggestion from @am17an Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-11-07 20:53:14 +01:00
Adrien Gallouët	9eb9a1331d	Revert "ggml-cpu: detect correct cpu flags for arm64 (#16229) (#16239)" (#17084) This reverts commit `7c23f3f0d4`.	2025-11-07 18:34:05 +02:00
iron	7c23f3f0d4	ggml-cpu: detect correct cpu flags for arm64 (#16229) (#16239) When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004, the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags, which results in compilation failures for certain extended instructions, but the correct CPU flags can be obtained by using gcc -march. Signed-off-by: lizhenneng <lizhenneng@kylinos.cn> Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>	2025-11-07 08:18:14 -08:00
xctan	7f09a680af	ggml-cpu : optimize RVV q2_k and q3_k kernels (#16887)	2025-11-06 18:12:45 +02:00
Johannes Gäßler	aa374175c3	CUDA: fix crash on uneven context without FA (#16988)	2025-11-06 14:05:47 +01:00
Georgi Gerganov	5b180c3d60	metal : initial Metal4 tensor API support (#16634 ) * metal : rework mat-mat multiplication * metal : initial Metal4 support * cont * metal : detect tensor support * cont : better ifdefs * metal : support tensors in mul_mm_id * metal : add env for disabling tensor API * tests : restore * metal : remove unused constants * metal : fix check for bfloat tensor support * cont : handle API incompatibilities * cont : handle even more incompatibilities * metal : use tensor API only on M5 and later	2025-11-06 14:45:10 +02:00
YehuditE	9d7c518d64	sycl: add CONCAT operator support (#16047 ) * sycl: add CONCAT operator support * cleanup: remove stray lines added by mistake * fix: code format issues in concat.cpp and tests/test-backend-ops.cpp * chore: fix editorconfig violations * cleanup: drop unnecessary i16 type support * docs: update sycl-csv and regenerate ops.md * update docs/ops.md * fix: adapt to upstream master changes after rebase * fix: remove empty files * fix: drop whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-06 11:02:33 +01:00
l3utterfly	6db3d1ffe6	ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 and FASTRPC_GET_URI is unsupported (#16987 ) * support older socs where FASTRPC_GET_URI is unsupported * added graceful fallback when FASTRPC_GET_URI call fails * use weak symbols instead of loading libcdsprpc.so dynamically * Add weak pragma for rpcmem_alloc2 * Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp Removed weak declaration for rpcmem_alloc2. * Enforce ndev to 1 for archs below v75 Force ndev to 1 for SoCs architectures lower than v75.	2025-11-05 21:46:38 -08:00
bssrdf	230d1169e5	improve CUDA cpy memory bandwidth when copying transposed tensor (#16841) * WIP * added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth * added BF16 support * more strict check to make sure src0 is a transpose * reformulated to handle more complicated transpose cases * bring back 2D transpose for higher performance * allow build on windows * transpose copy more shapes * minor tweak * final clean up * restore some test cases * keep only the kernel for true transposed case; updated with review suggestions * make CI happy * remove headers not needed * reduced bank conflicts for fp16 and bf16 * add missing const* * now bank conflicts free * use padding instead of swizzling --------- Co-authored-by: bssrdf <bssrdf@gmail.com>	2025-11-05 21:55:04 +01:00
Jeff Bolz	a44d77126c	vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (#16919)	2025-11-05 19:51:03 +01:00
Reese Levine	03ea04175d	ggml webgpu: minor set rows optimization (#16810 ) * Add buffer label and enable dawn-specific toggles to turn off some checks * Minor set_rows optimization (#4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Remove some comments * Implement overlap binary operators * Revert "Implement overlap binary operators" This reverts commit `ed710b36f5`. * Disable support for non-contiguous binary_op tensors and leave note for future support --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com> Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>	2025-11-05 10:27:42 +01:00
Georgi Gerganov	852ce5180a	ggml : fix conv2d_dw SVE path (ggml/1380) * Fix test-conv2d-dw failure on ARM SVE by using runtime vector length The ggml_compute_forward_conv_2d_dw_cwhn function was using a hardcoded GGML_F32_EPR (8) for SIMD vectorization, but on ARM SVE the actual vector length varies by hardware. This caused incorrect computation when processing CWHN layout tensors on ARM machines. Fix by using svcntw() to get the runtime SVE vector length instead of the compile-time constant. Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com> * ci : reduce sam score threshold * ci : update bbox checks for sam test --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>	2025-11-05 10:41:51 +02:00
nullname	a5c07dcd7b	refactor: replace sprintf with snprintf for safer string handling in dump functions (#16913)	2025-11-04 12:25:39 -08:00
Jeff Bolz	ad51c0a720	vulkan: remove the need for the dryrun (#16826) * vulkan: remove the need for the dryrun Allocate pipelines and descriptor sets when requested. Reallocate the prealloc buffers when needed, and flush any pending work before reallocating. For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed. * remove dryrun parameters	2025-11-04 13:28:17 -06:00
Acly	cc98f8d349	ggml-cpu : bicubic interpolation (#16891)	2025-11-04 13:12:20 +01:00
Noah	1f5accb8d0	Fix garbled output with REPACK at high thread counts (#16956 ) * Fix garbled output with REPACK at high thread counts Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps. * Update ggml/src/ggml-cpu/repack.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/repack.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-03 21:04:59 -08:00
Aman Gupta	2759ccdb4a	CUDA: avoid mul + bias fusion when doing fusion (#16935)	2025-11-04 10:53:48 +08:00
lhez	c5023daf60	opencl: support imrope (#16914) * opencl: support imrope * opencl: fix whitespace	2025-11-03 11:47:57 -08:00
theo77186	622cd010ff	ggml: CUDA: add head size 72 for flash-attn (#16962)	2025-11-03 14:29:11 +01:00
Jinyang He	fcfce040e8	ggml : LoongArch fixes (#16958) * Fix test-quantize-fns f16 and q4_0 failed when use LSX * Fix LoongArch set float intrinsic when use LSX/LASX	2025-11-03 08:40:02 +02:00
shani-f	7e994168b1	SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) Feature/sycl repeat back opt (#16869) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * SYCL: optimize repeat_back kernel * Remove Hebrew comment from repeat_back.cpp * Remove comments for code clarity Removed comments to clean up the code. * Fix formatting in ggml-sycl.cpp * Formatted lambda according to legacy style. No logic changes * Remove blank line in repeat_back.cpp Remove unnecessary blank line before assigning acc to dst_dd.	2025-11-03 09:35:33 +08:00
Georgi Gerganov	2f966b8ed8	clip : use FA (#16837 ) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-02 21:21:48 +01:00
mnehete32	7db35a7958	CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (#16917)	2025-11-02 11:12:57 +08:00
Aaron Teo	d38d9f0877	ggml: add s390x cpu-feats (#16774)	2025-11-02 08:48:23 +08:00
Jeff Bolz	5d8bb900bc	vulkan: Fix multi_add invalid descriptor usage (#16899)	2025-11-01 06:52:14 +01:00
Jeff Bolz	2e76e01360	vulkan: fuse mul_mat+add and mul_mat_id+add_id (#16868 ) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-01 06:45:28 +01:00
Oliver Simons	d3dc9dd898	CUDA: Remove unneeded bias/gate dims in fused mmvq (#16858) * CUDA: Remove unneeded bias/gate dims in fused mmvq Pointed out [here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989) that only a single value is needed per target col per thread * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix "Error 991-D: extra braces are nonstandard" during compilation --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-01 13:13:26 +08:00
Johannes Gäßler	31c511a968	CUDA: Volta tensor core support for MMF (#16843 ) * CUDA: Volta tensor core support for MMF * more generic checks for hardware support * Update ggml/src/ggml-cuda/mmf.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-10-31 15:57:19 +01:00
Aman Gupta	4146d6a1a6	CUDA: add expert reduce kernel (#16857) * CUDA: add expert reduce kernel * contiguous checks, better formatting, use std::vector instead of array * use vector empty instead of size Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-10-31 20:05:07 +08:00
Jeff Bolz	d2d931f173	vulkan: disable spirv-opt for rope shaders (#16872)	2025-10-31 08:34:47 +01:00
Masato Nakasaka	2976b0374d	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796 ) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines	2025-10-31 08:18:59 +01:00
Ruben Ortlam	d2a2673dd1	vulkan: fix shmem overrun in mmq id shader (#16873 ) * vulkan: fix shmem overrun in mmq id shader * metal : fix mul_mm_id --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-31 08:14:49 +01:00
l3utterfly	13002a0896	ggml-hexagon: respect input size when getting/setting tensor data (#16836 ) * respect input size when getting/setting tensor data allows partial repacking/copying when get tensor size is smaller than the actual tensor * Removed duplicate repack_mxfp4_mxfp4x4x2 function	2025-10-30 21:46:31 -07:00
lhez	9984cbb61d	opencl: fix boundary handling for mul_mm (#16875)	2025-10-30 16:00:20 -07:00
Max Krasnyansky	517b7170e1	cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 (#16833) Very similar implementation to the flash-attention chunking, with similar benefits.	2025-10-30 09:06:13 -07:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit `f321b9fdf1`. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 16:19:14 +01:00
Max Krasnyansky	dcca0d3ab8	cpu: introduce chunking for flash attention (#16829) Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.	2025-10-30 14:26:05 +02:00
Sigbjørn Skjæret	229bf68628	cuda : fix argsort with 64k+ rows (#16849)	2025-10-30 08:56:28 +01:00
Jeff Bolz	052df28b0e	vulkan: Handle argsort with a large number of rows (#16851)	2025-10-30 07:27:41 +01:00
Oliver Simons	8b11deea46	Hide latency of bias and gate-loading (#16847) This is realised by loading them into registers before computation of the dot-product, effectively batching them together with said dot-product. As a lot of threads are alive here, the warp scheduler has enough threads available to effectively hide the cost of additionally loading those two floats.	2025-10-30 11:34:15 +08:00
Jeff Bolz	b9ce940177	vulkan: Fuse rope+set_rows (#16769 ) This pattern appears in a lot of models, the rope operation is applied right before storing into the KV cache (usually on the K tensor). Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices. Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently. Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph. Add new backend tests.	2025-10-29 15:13:10 -05:00
Jeff Bolz	10fcc41290	vulkan: Update topk_moe fusion to handle gpt's late softmax (#16656 ) * vulkan: Update topk_moe fusion to handle gpt's late softmax Based on #16649. * Add ggml_check_edges * Add sync logging to show fusion effects * handle clamp added in #16655 * Update ggml/src/ggml-impl.h Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-29 14:44:29 +01:00
Ruben Ortlam	bcf5bda6f5	Vulkan MMQ Integer Dot Refactor and K-Quant support (#16536 ) * vulkan: add mmq q2_k integer dot support * Refactor mmq caching * Reduce mmq register use * Load 4 quant blocks into shared memory in one step * Pack q2_k blocks into caches of 32 * Use 32-bit accumulators for integer dot matmul * Add q4_k mmq * Add q3_k mmq * Add q5_k mmq * Add q6_k mmq * Add mxfp4 mmq, enable MMQ MUL_MAT_ID * Fix mmv dm loads	2025-10-29 14:39:03 +01:00
Max Krasnyansky	3eb2be1ca5	Hexagon Op queue & dispatch optimizations (#16820) * hexagon: remove dspqueue callbacks and do all read processing inplace * hexagon: there is no need to ref/deref the buffers at this point We're not going to release the buffers without flushing the session queue. So there is no need to inc/dec the refcounts for every request. We also don't need to include those bufs in the response. * hexagon: bump the thread count in the adb wrapper scripts We can use more CPU cores now that the dedicated dspqueue polling threads are not used (ie no contention). Also enable more aggressive polling for now since we still map Flash Attention (and a few other kernels) to the CPU and those dspqueue threads were keeping the CPU cores at higher clock freqs. * hexagon: add lhez as the second code owner	2025-10-29 06:29:12 -07:00
Aman Gupta	e41bcce8f0	CUDA: use fastdiv in set-rows (#16834) * CUDA: use fastdiv in set-rows * add assert about value fitting in u32	2025-10-29 21:11:53 +08:00
Jeff Bolz	f549b0007d	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (#16793) This lets the copy to the destination device use the host-visible vidmem optimization.	2025-10-29 09:53:04 +01:00
Aman Gupta	9a3ea685b9	CUDA: Fix bug in topk-moe for gpt-oss (#16821) * CUDA: Fix bug in topk-moe for gpt-oss When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph), but it actually doesn't fuse in the actual gpt-oss * fix for qwen3 too * change ifndef to ifdef	2025-10-29 15:55:06 +08:00
YaelLogic	338074c383	sycl: add RMS_NORM_BACK operation support (#16808 ) * sycl: add RMS_NORM_BACK operation support * sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes * sycl: add RMS_NORM_BACK support Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction. Minimal docs updates (ops.md / SYCL.csv). * revert: restore .gitignore and tools/run/CMakeLists.txt to upstream * revert: restore tests/CMakeLists.txt to upstream * sycl: optimize rms_norm_back * fix: restore SYCL.csv to correct state with RMS_NORM_BACK support * Update ggml/src/ggml-sycl/norm.cpp Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * fix: remove trailing whitespace and add missing newline (EditorConfig) --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2025-10-29 14:14:39 +08:00
YaelGitAccount	851553ea6b	cuda: add SET operation support (#16804 ) * feat(cuda): add GGML_OP_SET support Implement CUDA kernel for SET operation with f32 support. All tests passing (14598/14598). * cuda(set): add I32 support; keep F32 * refactor(cuda): use ggml_cuda_cpy to unify SET operator logic and remove code duplication * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-cuda/set.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-28 20:10:28 +01:00
l3utterfly	8284efc35c	initialise buffer.device in ggml_hexagon_session (#16816)	2025-10-28 08:16:20 -07:00
Chenguang Li	3479efd112	CANN: Improve device ID handling and aclnnArange checks (#16752) * cann: improve device ID handling and aclnnArange checks - Stop relying on CANN's internal device ID retrieval; use a global variable instead. - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions. * cann: use thread local var	2025-10-28 10:54:53 +08:00
Aman Gupta	463bbf20bf	CUDA: add unused vars to mmvf and mmvq (#16807)	2025-10-28 10:31:21 +08:00
tamarPal	ad8d36beff	sycl: add SSM_CONV operation support (#16800 ) * feat: Add SYCL backend support for SSM_CONV operator * Implement State Space Model Convolution 1D for SYCL backend * Add optimized GPU kernel with parallel work distribution * Support various tensor dimensions and batch sizes * Full integration with existing SYCL infrastructure * All tests pass with CPU backend equivalence verification * feat: Implement SYCL backend support for SSM_CONV operation - Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp - Implement SYCL kernel for state space model convolution - Ensure numerical correctness matches CPU implementation exactly - Add proper type checking for F32 tensors in backend support - All test-backend-ops SSM_CONV tests pass (14490/14490) * Perfect SSM_CONV SYCL implementation - 100% CPU parity ✅ Flawless numerical accuracy - matches CPU bit-for-bit ✅ Optimal SYCL kernel design - efficient parallel execution ✅ Complete tensor layout compatibility - handles all strides correctly ✅ Robust error handling - comprehensive assertions and validation ✅ All official tests pass - 14,490/14,490 backend operations verified ✅ Production-ready code - clean, documented, maintainable Implements state-space model 1D convolution with sliding window algorithm. Eliminates blocking queue.wait() for better async performance. * Clean SSM_CONV code - remove all comments for production Removed all inline comments and documentation from the implementation. Clean, minimal code ready for production merge. * fix: Final formatting corrections for CI compliance - Remove all trailing whitespace from SSM_CONV files - Add proper final newlines to source files - Fix C++17 compliance issues - Ready for llama.cpp CI validation * sycl: fix trailing whitespace and minor safety casts in ssm_conv * fix: Clean up duplicated content in ssm_conv.hpp header file --------- Co-authored-by: tamarPal <tamarPal@example.com>	2025-10-28 09:50:33 +08:00
Acly	10640e31aa	ggml : fix interpolate with align-corners and ne=1 (#16700) * ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning	2025-10-27 21:50:22 +01:00
Johannes Gäßler	80d28f104c	HIP: fix AMDGPU_TARGETS, update documentation (#16803)	2025-10-27 21:39:49 +01:00
tamarPal	2b9bd9bf4e	sycl: add ROLL operation support (#16665 ) * sycl: add ROLL operation support - Implement ggml_sycl_roll function for F32 tensors - Add multi-axis roll operation with SYCL kernel - Support all 4 tensor dimensions with proper shift normalization - Add roll.cpp and roll.hpp to SYCL backend - Update backend dispatch and supports_op for GGML_OP_ROLL - Tests: 17662/17662 pass with identical CPU reference results * fix: remove trailing whitespace from roll.cpp - Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp - Remove trailing spaces from lines 6, 11, 28, 47, 58, 60 * ci: retrigger * sycl: remove wait() calls from ROLL operation * fix: editorconfig — LF endings + final newline for roll.hpp --------- Co-authored-by: tamarPal <tamarPal@example.com>	2025-10-27 09:20:24 +08:00
shani-f	59fc1ec8e8	sycl: add REPEAT_BACK operation support (#16734 ) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * Update ggml/src/ggml-sycl/repeat_back.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/repeat_back.hpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-27 09:19:50 +08:00
Aman Gupta	75d33b9302	CUDA: support for weight clamp in top-k norm (#16702)	2025-10-27 09:06:16 +08:00
Acly	3470a5c891	ggml-alloc : make gallocr prefer chunks that allow memory reuse (#16788)	2025-10-26 23:19:03 +01:00
Sigbjørn Skjæret	bd562fe4f7	cuda : use fast copy when src and dst are of different type and contiguous (#16789) * use fast copy when src and dst are contiguous and same shape * use int64_t ne and ignore shape	2025-10-26 21:31:41 +01:00
leejet	bbac6a26b2	ggml: fix cuda kernel launch configuration for k_compute_batched_ptrs to support large batch (#16744 ) * fix k_compute_batched_ptrs * add backend ops test * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * reduce the batch size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-10-26 19:13:31 +01:00
Aman Gupta	f77c13b91f	CUDA: General GEMV fusion (#16715)	2025-10-26 19:28:04 +08:00
Gilad S.	3cfa9c3f12	vulkan: deduplicate Microsoft Direct3D12 devices (#16689 ) * fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `\|\|`	2025-10-26 05:37:38 +01:00
Giuseppe Scrivano	f90b4a8efe	vulkan: delete dead code (#16732) ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-25 10:59:54 +02:00
Jeff Bolz	8423d01931	vulkan: Optimize SSM_SCAN (#16645)	2025-10-25 07:04:12 +02:00
leejet	55945d2ef5	ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (#16742) * Fix CUDA grid launch condition for large block_nums.y * add backend ops test * reduce test repetitions	2025-10-24 21:39:37 +02:00
Aman Gupta	0bcb40b48c	CUDA: use CUB for arbitrary size argsort (#16754)	2025-10-24 20:46:19 +08:00
Aman Gupta	061f0eff02	ggml-cuda: use passed ops instead of hardcoded ops (#16712)	2025-10-23 19:14:06 +08:00
Matthew Michel	9de9672adb	sycl: use async memory allocation to fix crashes during graph recording (#16644 ) * sycl: use async memory allocation to fix graph recording failures GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because: - Host waits are currently unsupported in graph recording mode. - SYCL malloc / free calls are unsupported in graph recording mode. The following changes are made to fix SYCL graph functionality: - When graphs are enabled, use the SYCL async memory extension for temp buffers which is supported with SYCL graphs. - For compiler versions that do not support this extension, skip graphs with the affected op. - Switch from USM shared to device memory as the async extension currently just supports device allocations. * Address reviewer feedback * Use global async variable to decide path in sycl_ext_[malloc_device\|free]	2025-10-23 09:05:15 +08:00
Max Krasnyansky	63d2fc46e1	Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) * model: add support for extra bufs for all devices * hexagon: add experimental ggml-hexagon backend for the Hexagon NPU This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU. Highlights: - Supports Hexagon versions: v73, v75, v79, and v81 - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5 - Supports Q4_0, Q8_0, MXFP4, and FP32 data types - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX Note: This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from llama.cpp/ggml developer and user community. Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com> * hexagon: fix format checker errors * hexagon: update readme and cmake presets * ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions * hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input * hexagon: move ADB helper scripts into scripts/snapdragon/adb * hexagon: replace all f/printfs with GGML_LOG_... * readme: add hexagon to the list supported backends * hexagon: stack malmuts with quantized inputs only * hexagon: add TODO for fixing issues in hexagon_graph_optimize * hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC * scripts: fix lint errors * scripts: update qdc pytest script to make linter happy * hexagon: add reduce sum in fp32 * hexagon: reduce number of vector stores in matmul output * hexagon: remove the need for vdelta in reduce-multiply-x8 * hexagon: consistent use of reduce_sum_fp32 for row_sums * hexagon: some more matmul optimizations and comments Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models). We've handled those cases already but at a higher overhead. * hexagon: update cmake presets * hexagon: add OPMASK support for run-bench.sh wrapper * hexagon: update to use GGML_BACKEND_API * hexagon: remove unused logic for setting tensor flags for the views * hexagon: add asserts to set/get_tensor to make sure we handle complete tensors Same asserts as the CPU backend. * hexagon: use cpy_tensor slow path for non-host buffers * hexagon: error checks in the buffer allocator * cmake: move include(extProj) under ggml-hexagon * hexagon: don't forget to delete the backend on free * hexagon: set/get_tensor size assert apply only to quantized tensors * hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit more finer log levels. * docs: typos in hexagon developer docs (libggm-...) * hexagon: overhaul error handling in the session/device allocation this should handle all failure paths in the session allocation. * hexagon: update cmake presets to enable fp16 vectors * hexagon: remove unused time_usec function * hexagon: don't forget to release buffer contexts * hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure) * hexagon: remove custom can_repeat function and use ggml_can_repeat --------- Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2025-10-22 13:47:09 -07:00
Diego Devesa	a2e0088d92	Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_v…" (#16723) This reverts commit `19a5a3edfd`.	2025-10-22 20:20:55 +02:00
sirus20x6	19a5a3edfd	ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_vec_set_f32 for faster fills (#16522 ) * Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds. * Vectorize additional f32 helper loops * Normalize f32 helper tails for ggml vec ops --------- Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-10-22 12:14:14 +02:00
Aman Gupta	9285325ce0	CUDA: fix bug in topk-moe softmax (#16711 )	2025-10-22 12:33:08 +08:00
Aman Gupta	03792ad936	CUDA: topk-moe: add optional parameter for gpt-oss (#16649 )	2025-10-21 22:40:38 +08:00
Johannes Gäßler	51d1a8c997	CUDA: better error for FA kernel with 0 occupancy (#16643 )	2025-10-21 15:27:53 +02:00
Aman Gupta	4926419c4d	ggml: add ggml_can_fuse_subgraph (#16662 ) * ggml: add ggml_can_fuse_subgraph * ggml-cuda: use ggml_can_fuse_subgraph for topk-moe * format * 1. remove inputs from signature as they are transient nodes 2. add check for views: view_src should be part of the subgraph * - combine check into one loop - check all view_src parents - other minor review comments * remove redundant if test * - rename and other minor review comments * add assert about count < 32	2025-10-21 16:43:14 +08:00
lhez	6ea37f5739	opencl: fix warnings and clean up profiling (#16688 ) * opencl: remove unused headers, fix warnings * opencl: clean up profiling, only keep kernel time	2025-10-20 22:26:17 -07:00
Jeff Bolz	fb349848f3	vulkan: Handle FA with all -inf mask values (#16447 )	2025-10-20 22:16:08 -05:00
YehuditE	6de8ed7519	sycl : add PAD_REFLECT_D1 operator support (#16145 ) * sycl: add PAD_REFLECT_D1 operator support * docs(ops): regenerate docs/ops.md * remove trailing whitespaces * style: fix editorconfig issues — trim trailing spaces and normalize EOLs * fix: move PAD_REFLECT_1D case outside of fall-through block	2025-10-21 00:21:12 +02:00
Diego Devesa	b617cfd289	ggml-alloc : fix leak when reusing a tensor with a larger size (#16679 )	2025-10-20 14:53:50 +02:00
safranowith	2330de7b84	SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators (#16613 ) * SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators Clean up unrelated changes from previous commit * Chore: remove empty lines and fix indentation * Clean up: remove leftover blank lines and fix spacing * chore: fix trailing whitespace and ensure final newline * Cleanup: remove redundant declarations already defined in header * Sync docs/ops.md with updated backend operation support * docs: update ops.md after rebase * docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN	2025-10-20 11:08:32 +03:00
Aaron Teo	4f73d0a951	ci : fix binaries release failure for s390x (binaries may not work yet) (#16664 ) * devops: initial patch Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: forgot the z15 suffix Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at impl GGML_CPU_ALL_VARIANTS for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: rm baseline version Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-10-19 23:06:39 +02:00
Johannes Gäßler	ee09828cb0	HIP: fix GPU_TARGETS (#16642 )	2025-10-18 14:47:32 +02:00
Jeff Bolz	e56abd2098	vulkan: Implement topk_moe fused shader, ported from CUDA (#16641 ) This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.	2025-10-18 12:22:57 +02:00
Aman Gupta	38355c6c8e	CUDA: use registers instead of smem in topk-moe (#16647 ) Uses the technique used in the vulkan PR #16641. Neat trick!	2025-10-18 11:52:53 +02:00
Shawn Gu	81387858f1	opencl: transposed gemm/gemv moe kernel with mxfp4,f32 (#16602 ) * opencl: transposed gemm/gemv moe kernel with mxfp4,f32 * add restore kernel for moe transpose * fix trailing whitespaces * resolve compilation warnings	2025-10-17 17:55:32 -07:00
Radoslav Gerganov	41386cf365	rpc : report actual free memory (#16616 ) * rpc : report actual free memory Start reporting the free memory on every device instead of using fixed values. Now llama-cli users can get a nice memory breakdown when using RPC devices. * drop --mem in rpc-server	2025-10-17 18:02:52 +03:00
Giuseppe Scrivano	3d4e86bbeb	vulkan: Add State Space Model (SSM) Operations Support (#16463 ) * vulkan: implement SSM scan operation Add State Space Model scan operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> * vulkan: implement SSM conv operation Add State Space Model conv operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-17 14:23:47 +02:00
muggle-stack	342c728d03	ggml : fix SpaceMit IME array out-of-bounds in task assignment (#16629 ) Fix incorrect task-to-batch index calculation in the quantization phase. The bug caused out-of-bounds access to qnbitgemm_args array when compute_idx exceeded per_gemm_block_count_m, leading to invalid pointer dereferences and SIGBUS errors. Correctly map tasks to batches by dividing compute_idx by per_gemm_block_count_m instead of block_size_m. Example: batch_feature=1, gemm_m=30, block_size_m=4 per_gemm_block_count_m = 8, task_count = 8 Old: gemm_idx = 4/4 = 1 (out of bounds) New: gemm_idx = 4/8 = 0 (correct) Tested on SpaceMit K1 RISC-V64 with qwen2.5:0.5b model. Co-authored-by: muggle <mingjun.rong@spacemit.com>	2025-10-17 13:01:23 +03:00
Jeff Bolz	b19491599d	vulkan: fix debug build (add_rms_len/data not found) (#16624 )	2025-10-17 09:31:04 +02:00
Ilia Ilmer	9ad4f1931e	metal : add `CONV_TRANSPOSE_2D` (#16542 ) * initial: headers and metal-device.cpp updates * adding conv_transpose_2d * fix type * fix type: int32->int64 * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add checks for src[0] and src[1]; add type checks * Update ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add more tests, add optimization to threading * add dynamic memory allocation in metal --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-17 09:33:58 +03:00
GittyBurstein	ceff6bb253	SYCL SET operator optimized for F32 tensors (#16350 ) * SYCL/SET: implement operator + wire-up; docs/ops updates; element_wise & ggml-sycl changes * sycl(SET): re-apply post-rebase; revert manual docs/ops.md; style cleanups * move SET op to standalone file, GPU-only implementation * Update SYCL SET operator for F32 * ci: fix editorconfig issues (LF endings, trailing spaces, final newline) * fixed ggml-sycl.cpp --------- Co-authored-by: Gitty Burstein <gitty@example.com>	2025-10-17 10:36:40 +08:00
GittyBurstein	b22572e97d	sycl : add ARANGE operator (#16362 ) * SYCL: update element-wise ops and presets * clean arange * Re-trigger CI --------- Co-authored-by: Gitty Burstein <gitty@example.com>	2025-10-16 15:26:21 +02:00
Chenguang Li	7a50cf388a	CANN: format code using .clang-format (#15863 ) This commit applies .clang-format rules to all source files under the ggml-cann directory to ensure consistent coding style and readability. The .clang-format option `SortIncludes: false` has been set to disable automatic reordering of include directives. No functional changes are introduced. Co-authored-by: hipudding <huafengchun@gmail.com>	2025-10-16 16:41:11 +08:00
takuya kodama	adc9b60f19	ggml-cpu: replace putenv with setenv for const-correctness (#16573 ) ## Why it failed When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers), the build fails with the following error: ``` cmake \ -S . \ -B ../llama.cpp.build \ --preset=x64-linux-gcc-debug \ -DCMAKE_INSTALL_PREFIX=/tmp/local \ -DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \ cmake --build ../llama.cpp.build/ ... /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’: /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers] 3572 \| putenv("KMP_BLOCKTIME=200"); // 200ms \| ^~~~~~~~~~~~~~~~~~~ In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10, from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6, from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3, from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6: /usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’ 786 \| extern int putenv (char *__string) __THROW __nonnull ((1)); \| ~~~~~~^~~~~~~~ cc1: some warnings being treated as errors ninja: build stopped: subcommand failed. ``` The issue is that putenv() expects a non-const char * but receives a string literal (const char *). ## How to fix This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0). Benefits of setenv(): - Accepts const char * parameters (no qualifier warnings) - Makes copies of the strings (safer memory handling) - The third parameter (0) ensures we don't overwrite if already set	2025-10-16 08:10:32 +03:00
yael-works	ee50ee1ead	SYCL: Add GGML_OP_MEAN operator support (#16009 ) * SYCL: Add GGML_OP_MEAN operator support * SYCL: Fix formatting for GGML_OP_MEAN case * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-16 12:21:28 +08:00
safranowith	466c1911ab	cpu : add FLOOR, CEIL, ROUND and TRUNC unary operators (#16083 ) * CPU: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators - Added the operators to unary op enum - Implemented API functions - Implemented forward and unary-op logic in CPU backend - Updated ggml_get_n_tasks - Updated operators names array and static_assert - Updated docs and enabled automatic tests * docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h * chore: remove trailing whitespace from ggml.h * Remove unresolved merge markers * Apply review suggestions: cleanup formatting, enum order and leftover artifacts * Regenerate ops.md using create_ops_docs.py	2025-10-15 21:24:51 +02:00
lhez	0cb7a0683b	opencl: add q8_0 mm support (#16469 ) * opencl: add mm_q8_0_f32 * opencl: fix data loading for incomplete tile * opencl: use q8_0 mm for larger matrix * opencl: add some tests to cover the path	2025-10-15 10:51:04 -07:00
lhez	d93f8439b0	opencl: fix FA for f32 (#16584 )	2025-10-15 10:48:28 -07:00
Sam/Samuel	f4ce81c45e	metal: optimise `GGML_OP_SUM` (#16559 ) * optimise GGML_OP_SUM * add non-contiguous tests by permuting the input * change tests to require full contiguity of OP_SUM * cuda : add check GGML_OP_SUM --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-15 17:05:56 +03:00
Julius Tischbein	5acd455460	CUDA: Changing the CUDA scheduling strategy to spin (#16585 ) * CUDA set scheduling strategy to spinning for cc121 * Using prop.major and prop.minor, include HIP and MUSA * Exclude HIP and MUSA * Remove trailing whitespace Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Remove empty line Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-10-15 14:54:15 +03:00
Georgi Gerganov	fa882fd2b1	metal : avoid using Metal's gpuAddress property (#16576 ) * metal : avoid using Metal's gpuAddress property * metal : fix rope kernels buffer check	2025-10-14 20:33:05 +03:00
SavicStefan	ffa059034c	vulkan: Add ACC_TYPE_VEC2 implementation (#16203 ) Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>	2025-10-14 19:18:05 +02:00
Aman Gupta	120bf7046d	CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (#16577 )	2025-10-14 07:48:08 -07:00
Jeff Bolz	4258e0cfe7	vulkan: Support FA with K/V in F32 (#16543 )	2025-10-14 15:53:37 +02:00
Jeff Bolz	7ea15bb64c	vulkan: Improve build time for MSVC (#16545 ) Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel. Enable /MP so source files are compiled in parallel.	2025-10-14 14:51:36 +02:00
Johannes Gäßler	9c7185dd28	CUDA: enable FA for FP32 KV cache (#16546 )	2025-10-14 14:22:47 +02:00
Aman Gupta	1ee9d0b415	CUDA: use fastdiv + ggml_cuda_mad for mmvf (#16557 ) * CUDA: use fastdiv + ggml_cuda_mad for mmvf * use bf16 directly + fix formatting * Add exception for HIP code	2025-10-14 13:16:21 +02:00
Aman Gupta	48e2fa9fb7	CUDA: add fp kernel for larger batch size MoE (#16512 ) * CUDA: kernel for larger batch sizes for MoE * WIP * WIP * WIP * WIP * WIP * WIP * fixup * tests * Move mmq_ids_helper to mmid * cleanup * Remove redundant checks	2025-10-14 13:15:15 +02:00
Anav Prasad	5b6913c47b	cuda : remove legacy copy-op pointer indirection code (#16485 ) * remove legacy copy-op pointer indirection code * further removal of copy-op indirection code * renamed check_node_graph_compatibility_and_refresh_copy_ops function	2025-10-14 11:53:49 +02:00
Georgi Gerganov	e60f241eac	metal : FA support F32 K and V and head size = 32 (#16531 ) * metal : FA support F32 K and V and head size = 32 * graph : remove obsolete comment [no ci]	2025-10-13 23:07:57 +03:00
lhez	5016b72862	opencl: fix build targeting CL 2 (#16554 )	2025-10-13 11:50:37 -07:00
Johannes Gäßler	7049736b2d	CUDA: fix numerical issues in tile FA kernel (#16540 )	2025-10-13 17:29:45 +03:00
Jie Fu (傅杰)	01d2bdc2bc	ggml : fix build broken with -march=armv9-a on MacOS (#16520 ) * ggml : fix build broken with -march=armv9-a on MacOS Signed-off-by: Jie Fu <jiefu@tencent.com> * Add #pragma message Signed-off-by: Jie Fu <jiefu@tencent.com> * Address review comment. Signed-off-by: Jie Fu <jiefu@tencent.com> * Update ggml/src/ggml-cpu/ggml-cpu.c --------- Signed-off-by: Jie Fu <jiefu@tencent.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-13 15:48:47 +03:00
Chenguang Li	56fc38b965	CANN: fix CPU memory leak in CANN backend (#16549 ) This commit fixes a CPU-side memory leak issue in the CANN backend, which occurred when intermediate aclTensorList objects were not properly released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), leading to increasing host memory usage over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.	2025-10-13 17:01:24 +08:00
Sam/Samuel	3f750f8d76	metal: add support for opt_step_sgd (#16539 ) * metal: add support for opt_step_sgd * add newline to pass EditorConfig check	2025-10-13 11:25:02 +03:00
Georgi Gerganov	c515fc5771	ggml : fix scalar path for computing norm (#16558 )	2025-10-13 11:22:27 +03:00
hipudding	f9bc66c3eb	CANN: Update several operators to support FP16 data format (#16251 ) Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation, and then cast back to FP32 after computation, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less workload compared to FP32, leading to noticeable efficiency improvements. In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5b model shows correct accuracy and about 10% performance gain in concurrent scenarios. Co-authored-by: noemotiovon <757486878@qq.com>	2025-10-13 08:52:22 +08:00
Sam/Samuel	a31cf36ad9	metal : add opt_step_adamw and op_sum (#16529 ) * scaffold to support opt step adamw on metal (not written so far) * add opt-step-adamw kernel for metal * pass op->src[4] as a separate buffer to the pipeline * add bounds check to opt-step-adamw kernel * complete scaffold for GGML_OP_SUM * naive GGML_OP_SUM kernel * remove unwanted comment * change OP_SUM capability gate * Add has_simdgroup_reduction to both ops to pass CI	2025-10-12 21:43:14 +03:00
Neo Zhang Jianyu	c7be9febcb	[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521 ) * fix/refactor OP argsort, pad * fix count-equal op * update SYCL OP list * fix format issue --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-12 21:53:35 +08:00
sirus20x6	41aac5c69b	ggml : Fix FP16 ELU positive branch (#16519 ) Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-10-12 08:25:37 +03:00
sirus20x6	20cc625edc	ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (#16518 ) The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion. Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-10-12 08:15:00 +03:00
Johannes Gäßler	11f0af5504	CUDA: faster tile FA, add oob checks, more HSs (#16492 )	2025-10-11 20:54:32 +02:00
Georgi Gerganov	a3cb04744f	metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494 )	2025-10-11 16:54:10 +03:00
Diego Devesa	97870e6497	cuda : avoid initializing unused devices (#16510 )	2025-10-11 13:02:26 +02:00
Prajwal B Mehendarkar	6d69ab3f26	cmake : Dont define XOPENSOURCE on AIX (#16481 )	2025-10-10 11:15:46 +03:00
duduta	1deee0f8d4	cpu : optimize the ggml NORM operation (#15953 ) * ggml-cpu: optimize norm operation to use intrinsics or Accelerate rename function add endif macro comment Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com> * implement s390x SIMD suggested by @taronaeo * add TODO comment * tidy up spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2025-10-09 21:11:15 +02:00
Chenguang Li	aa4711d369	CANN: Improve ACL graph matching (#16166 ) * CANN: improve ACL graph matching Record `ne` and `nb` information for src tensors and include them in the graph matching check. This enhances the robustness of ACL graph matching by preventing incorrect matches when src tensors share the same data address but differ in shape or stride. * CANN: add op_params match	2025-10-09 15:50:25 +08:00
Charles Xu	d80d6d2400	kleidiai: kernel interface refactoring (#16460 )	2025-10-09 10:29:17 +03:00
Neo Zhang Jianyu	b260213755	[SYCL] refactor soft_max, add soft_max_back (#16472 ) * refactor to support soft_max_ext * fix error and support soft_max_back * rm unused functions * fix format issue --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-09 10:25:11 +03:00
ai-fonsi	9d0882840e	Disable CUDA host buffers on integrated GPUs (#16308 )	2025-10-08 20:21:46 +02:00
Georgi Gerganov	b2c08c9ec4	metal : mark FA blocks (#16372 ) * metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic	2025-10-08 10:57:53 +03:00
Reese Levine	74b8fc17f9	ggml webgpu: profiling, CI updates, reworking of command submission (#16452 ) * Add profiling * More detailed profiling * Rework command submission to avoid global locks * Update wait handling * try new method of waiting on futures * Add serializing of command submission in some cases * Add new pool for timestamp queries and clean up logging * Serialize command submission in CI and leave a TODO note * Update webgpu CI * Add myself as WebGPU codeowner * Deadlock avoidance * Leave WebGPU/Vulkan CI serialized * Fix divide by 0 * Fix logic in division by inflight_threads * Update CODEOWNERS and remove serialize submit option	2025-10-07 13:48:56 -07:00
Georgi Gerganov	0a319bb75e	metal : add support for non-padded FA KV (#16148 ) * metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement	2025-10-07 08:23:30 +03:00
Georgi Gerganov	1d6092fc72	tests : add -INF blocks to the KQ mask in the FA tests (#16380 ) * tests : add -INF blocks to the KQ mask in the FA tests * cont : bump -INF block size to 64 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * ggml : prevent division by zero in FA CPU op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-07 08:22:35 +03:00
Georgi Gerganov	8ae32dc9ec	metal : various optimizations + refactoring (#16446 ) * metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt	2025-10-07 08:21:40 +03:00
Georgi Gerganov	a23b9bdbd3	ggml : fix unaligned access in AMX code (#16315 )	2025-10-06 16:05:27 +03:00
Daniel Bevenius	a80ff183ab	ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (#16443 ) This commit updates the leftover handling in ggml_vec_scale_f32. The motivation for this is that the code currently incorrectly assumes there would be fewer than ggml_f32_epr leftover elements. However, since the main loop processes 2*ggml_f32_epr elements per iteration, there can be up to (2*ggml_f32_epr - 1) leftover elements. The original single-pass leftover code could only process ggml_f32_epr elements, leaving some elements unscaled. Example scenario with 256-bit SVE: ``` ggml_f32_epr = 8 (elements per register) ggml_f32_step = 16 (two registers per iteration) n = 25 np = 16 leftovers = 9 elements (16-24) Original : processes only elements 16-23, misses element 24 This commit : loop processes elements 16-23, then element 24 ``` Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630	2025-10-06 14:17:12 +02:00
Reese Levine	35266573b9	ggml webgpu: actually add softmax, fix rms_norm offset (#16400 ) * implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit	2025-10-04 20:59:31 -07:00
Eve	86df2c9ae4	vulkan: use a more appropriate amount of threads when generating shaders (#16418 ) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax	2025-10-04 22:04:27 +02:00
Radoslav Gerganov	f39283960b	rpc : check src buffer when copying tensor (#16421 ) Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.	2025-10-04 16:22:45 +03:00
Radoslav Gerganov	898acba681	rpc : add support for multiple devices (#16276 ) * rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order	2025-10-04 12:49:16 +03:00
Acly	e29acf74fe	vulkan : incremental shader builds (#16341 ) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-04 11:42:56 +02:00
Georgi Gerganov	606a73f531	metal : fix loop bound in ggml_mem_ranges (#16412 )	2025-10-03 19:18:56 +03:00
Acly	638d330246	ggml : fix graph reallocation with multiple chunks (#16396 ) reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower	2025-10-03 13:49:08 +02:00
Jeff Bolz	2aaf0a2a20	vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#16354 ) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull	2025-10-03 12:50:46 +02:00
Jeff Bolz	0e1f838556	vulkan: Fix FA coopmat1 invalid array indexing (#16365 ) When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.	2025-10-03 11:52:46 +02:00
Jeff Bolz	e308efda8e	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (#16316 )	2025-10-03 10:33:08 +02:00
Reese Levine	ef07a40906	ggml webgpu: add support for soft_max, optimize rms_norm (#16357 ) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-02 11:00:31 -07:00
Piotr Wilkin (ilintar)	34fcc5a4ac	model : Apertus model implementation (#15852 ) * First attempt * No permute during convert (fixes qk tensors), proper norm application. * RoPE = NeoX * Coherence! * Migrate xielu params from tensors to hyperparameters * Simple CUDA kernel * Revert stupid LLM refactorings * Chat template support * configchecker / flake8 errors * Reorder unary.cu * I do conclude that LLMs are, in fact, stupid. * Fix after merge * Final newline * Make xIELU an UNARY_OP * Final newline * Correctly account for parameter shift * Argh. * Update ggml/src/ggml-cpu/unary-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Refactor: remove unused methods, inline and factorize softplus, add const modifiers * Revert CUDA changes, implement xIELU as a separate OP * Pesky newline * Add float2half / half2float for F16 inputs/outputs * CUDA variants, attempt 2 * Actually, attempt 3 * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Missing convert header * Proper formula and reference for xIELU in the comments. * Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add tensor mappings for Apertus to global list instead * Fix lazy on scalars * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Add comment about the constraints on positive/negative alpha * Change `softplus` to `ggml_softplus` --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-02 20:43:22 +03:00
R0CKSTAR	91a2a56556	musa: update compile flags (#16265 ) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2025-10-02 16:29:56 +03:00
uvos	e95fec640f	HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 (#16221 ) * HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0 rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA * CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn	2025-10-01 23:09:25 +02:00
Eve	132d673554	vulkan: make ggml_vk_default_dispatcher support older vulkan headers (#16345 ) * make ggml_vk_default_dispatcher support older vulkan headers * simplify with using	2025-10-01 09:56:36 +02:00
lhez	7c156df414	opencl: support pad_ext (#15888 )	2025-09-30 10:45:45 -07:00
Reese Levine	8d78cd2613	ggml webgpu: support for rope,div,sub,glu,scale,cont operators (#16187 ) * Work on rope * Simplify inplace operation generation and combine mul/add generation * Work on rope variants * implement neox rope * rope complete * Add sub,div,glu operators * implement scale op * Update cpy shader to handle cont/more types * formatting * Update test vars printing for rope,rms_norm * Avoid ROPE hardcoded constants * Add TODO to change ROPE constants to enum Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix TODO comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-30 09:57:51 -07:00
lhez	d1c84a662d	opencl: support ne3 in get_rows (#15866 )	2025-09-30 09:55:13 -07:00
anavp-nvidia	a014310374	cuda : Enable CUDA Graph usage for Nemotron Nano v2 (NemotronH) (#16328 ) * Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs * fix to ensure test-backend-ops check passes	2025-09-30 11:13:22 +03:00
Georgi Gerganov	35fb82497e	metal : dynamic simdgroups for MV kernels (#16340 ) * metal : dynamic simdgroups for MV kernels * cont : minor	2025-09-30 11:03:23 +03:00
Charles Xu	f1eb1cb1eb	kleidiai : fix work size and threads sync for fp16 (#16246 )	2025-09-30 10:07:20 +03:00
alex-spacemit	b77e6c18e1	ggml: riscv: add riscv spacemit backend (#15288 ) * ggml: add spacemit backend Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23 * add new line at end of file Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2 * add riscv zba extension limit Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce * fixed for review comments, file renamed and format Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce * fixed for code format, after clang-format Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2 * use _Float16 instead of __fp16 Change-Id: I039fb02bb95270e641bc4442204e658735859d43 * add ci for riscv64-spacemit-ime-native Change-Id: I711c1033061df1a289ea77891b2997599dfe8279 * update debian-13-riscv64-spacemit-ime-native ci label Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a * remove license comment for spacemit ime Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3 * upgrade binutils for gcc ime Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45 * add spacemit ime cross jobs Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6 * remove native compile for riscv64-spacemit-ime Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e * ci : add caching for spacemit ime cross toolchain Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de * ci: bug fixed for cache path and env Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a * Update .github/workflows/build-linux-cross.yml for cache path Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * bugfixed for build-linux-cross.yml, syntax error Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: cailinxi <linxi.cai@spacemit.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-29 17:50:44 +03:00
Georgi Gerganov	4d3d455d3c	sync : whisper.cpp (ggml/1359) * ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426) * sync : whisper.cpp	2025-09-29 17:43:58 +03:00
Rafal Lewczuk	02463ab27b	ggml-backend : add root cause in error message if loading backend library fails (#16172 ) This PR adds additional information to an error message when loading backend library via ld_load_library() fails. This helps spotting why backend library did not load (missing library, missing dependency or unresolved symbol etc.).	2025-09-29 13:17:09 +02:00
Sigbjørn Skjæret	adc76347d7	ggml : check cuda and metal argsort limits and add test (#16323 ) * check cuda argsort limits and add test * add metal check	2025-09-29 11:09:00 +02:00
Georgi Gerganov	a4a0aa5ea2	ggml : fix dependencies for ggml_set_rows (#16318 )	2025-09-29 08:41:28 +03:00
Jeff Bolz	92cd103f62	vulkan: Fix validation failure in quantized flash attention (#16292 )	2025-09-29 06:50:37 +02:00
Sigbjørn Skjæret	b887d2f341	ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (#16307 ) * fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 * add test that fails on simd	2025-09-28 23:15:03 +02:00
Jeff Bolz	d8359f5fde	vulkan: 64-bit im2col (#16135 ) * vulkan: 64-bit im2col Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp. * fix validation error for large im2col	2025-09-28 08:38:37 +02:00
Georgi Gerganov	6a2c6145a0	metal : extend mat-mat multiplication support (#16225 ) * metal : support mul_mm with src1->type == GGML_TYPE_F16 * metal : support mul_mm_id with src1->type == GGML_TYPE_F16 [no ci] * metal : mul_mm support ne00 % 32 != 0 * metal : support mul_mm_id with ne00 % 32 != 0 * cont : remove unnecessary unrolls * cont : simplify data loading * metal : optimize mul_mm when output bounds checks are not needed	2025-09-28 09:34:44 +03:00
Georgi Gerganov	3b53634fe3	metal : fuse non-sequential nodes (#16102 ) * metal : fuse non-sequential nodes * cont : add comment * cont : simplify bounds checks	2025-09-28 09:34:05 +03:00
Jeff Bolz	1384abf8b8	vulkan: handle mat_mul with A matrix > 4GB (#16176 ) * vulkan: handle mat_mul with A matrix > 4GB This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M. Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after. * build fixes	2025-09-27 20:36:34 -05:00
Jeff Bolz	e6d65fb02d	vulkan: support arbitrary KV dimension in flash attention (#16160 ) The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).	2025-09-27 22:43:39 +02:00
Acly	8656f5de68	vulkan : make the vulkan.hpp dynamic dispatcher instance private (#16224 ) * don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same	2025-09-27 22:41:03 +02:00
Aman Gupta	c0bfc57af4	CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (#16277 ) * CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 This commit adds mul_mat_id support for ncols_dst >= 16. It does this by packing ncols_dst tiles into the blockDim.y. My tests on a RTX 3090 show that this is faster than the cuBLAS fallback for f16 till bs=64, and for f32 till bs=32 * Review: refactor if statement	2025-09-27 18:49:32 +02:00
Johannes Gäßler	75a3a6c2cd	CUDA: refactor and deduplicate vector FA kernels (#16208 ) * CUDA: refactor and deduplicate vector FA kernels	2025-09-27 18:45:07 +02:00
Dmytro Minochkin	0499b29c6f	vulkan: throw system error instead of SIGABRT during init on older devices (#16156 ) * Throw system error on old Vulkan driver rather than SIGABRT * Optionally handle any potential error in vulkan init	2025-09-27 18:26:46 +02:00
Jeff Bolz	3f81b4e91c	vulkan: support GET_ROWS for k-quants (#16235 ) The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few changes - add a_offset and divide iqs by 2. It's probably possible to call these functions from mul_mm_funcs and avoid the duplication, but I didn't go that far in this change.	2025-09-27 12:36:11 +02:00
Aaron Teo	624207e676	devops: add s390x & ppc64le CI (#15925 ) * devops: move s390x and ppc64le ci build we have access to ubuntu-24.04-s390x and ppc64le images now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le for now since they have compiler errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: stop warnings as errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: switch to non-macro flag Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: going the llama macro route Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add big-endian gguf test models Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le to test s390x, check test build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dup .gguf.inp files for big-endian tests Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dup .gguf.out files for big-endian too Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add python setup and endian byteswap Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: pooring thing does not have s390x python3 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add missing rust compiler for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: try rust actions runner Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "devops: try rust actions runner" This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: try a different path for rust Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dump home directory and user info Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: install gguf-py only Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: missed relative path Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove big-endian files since local swapping is working Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: revert test-tokenizer-0 cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix unicode flags conversion from and to uint16_t Bitfields are allocated in different order on s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Simplify byteswap command Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix endianness detection in vocab loader Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Disable test-thread-safety on s390x In this test a model is downloaded, then immediately loaded to check if more downloads are needed, and then used for test. There is no clean way to separate all those steps to add byteswapping between them, so just skip this test. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix q8_0 test in test-quantize-fns vec_signed uses unexpected rounding mode. Explicitly use different rounding function. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add big-endian stories260K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add s390x test-eval-callback Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix test does not exist Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix model not found llama-eval-callback Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix q3_K dot product error in test-quantize-fns on s390x Array q8bytes had only 4 elements allocated, but 8 elements accessed. 
This led to out-of-bounds writes, later out-of-bounds reads of overwritten values, and an incorrect result. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: re-enable ppc64le for testing Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: activate test-thread-safety for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le tests for some reason it keeps failing test-thread-safety tests and I do not have a machine that is able to replicate the tests. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: LLAMA_FATAL_WARNINGS=ON Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Correct repository URL for s390x for test-thread-safety model Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix fs_get_cache_directory Ensure it works even if both XDG_CACHE_HOME and HOME are unset. This might happen in containers. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Re-enable CI for ppc64le Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fortify ggml_rope_impl Only memcpy data from sections argument if it's non-NULL. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way * Update URL for big-endian model * Update .github/workflows/build.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update remaining mentions of BE models to ggml-org/models repo --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com> Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-27 02:03:33 +08:00
Georgi Gerganov	54dbc37053	metal : report OOM errors (#16274 )	2025-09-26 14:14:28 +03:00
Aaron Teo	9b26511857	ggml-cpu: implement MXFP4 SIMD for s390x (#16193 ) * ggml-cpu: impl mxfp4 s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: missing s = sumf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix incorrect kval_mxfp4 type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rework mxfp4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: missing delta calc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo for vec_splats Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: expand to 2 blocks per loop Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add unroll to boost perf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: back to 1 block per loop to test perf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: back to 1 block per loop to test perf" This reverts commit `1fe55724e2`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rm unroll from single block Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-26 13:27:25 +03:00
R0CKSTAR	0f7c69689f	musa: fix build warnings (#15611 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-09-26 02:56:10 +02:00
Aman Gupta	077c94d0ca	CUDA: add a fused top-K MoE kernel (#16130 ) * CUDA: add a fused top-K MoE kernel This kernel does the following: 1. softmax over the logits per token [n_experts, n_tokens] 2. argmax reduce over the top-k (n_experts_used) logits 3. write weights + ids to global memory It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models * Refactor into ggml_cuda_should_use_topk_moe * Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before * Review: format + micro-optimizations * Fix bug: fix tie breakers * Add optional norm + clean-up code * Use smem for final write * Add bounds check * Use better memory pattern for writeback	2025-09-25 16:35:05 +02:00
junchao-zhao	aa719c2f88	ggml : fix loongarch lsx compilation error (#15864 )	2025-09-25 12:22:55 +03:00
Georgi Gerganov	dfcd53f7ec	metal : fuse NORM + MUL + ADD, support non-multiples of 4 (#16220 ) * metal : fuse NORM + MUL + ADD * metal : support norms of non-multiple of 4 * cont : fix comment [no ci]	2025-09-25 11:30:16 +03:00
Georgi Gerganov	4ea00794b8	metal : relax reorder conditions (#16216 )	2025-09-25 11:29:42 +03:00
Georgi Gerganov	02a6a82ae7	metal : restore im2col perf (#16219 )	2025-09-25 11:29:08 +03:00
Radoslav Gerganov	c498fc82fe	rpc : use ggml logging facilities Use RPC_DEBUG environment variable to enable debug messages. Add helper macro LOG_DBG() which does an early check of the env var before calling GGML_LOG_DEBUG(). Make sure we log a debug message for every server function.	2025-09-25 07:20:02 +00:00
Johannes Gäßler	e789095502	llama: print memory breakdown on exit (#15860 ) * llama: print memory breakdown on exit	2025-09-24 16:53:48 +02:00
Acly	f2a789e334	ggml : split graph allocations according to backend max buffer size (#15815 ) * ggml : make gallocr respect the backend's max buffer size * if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers * vulkan: report the actual max allocation size in buffer type interface * fix missing newline, apple-clang warning * track size of individual chunks in ggml_dyn_tallocr and raise max chunks. revert to use suballocation_block_size as max chunk size for vulkan. * track (chunk, offset) pairs instead of "global" offsets through gallocr. * simpler, don't need loops to map between local/global offsets * touches more code * fix dyn_tallocr_max_size and initialization * fix memory leak when buffers are reused due to same buffer type appearing multiple times * make vbuffer allocation follow the same logic as backend_buffer did before * continue to use leftover unallocated space of previous chunks after a new one has been created * treat free blocks of each chunk as separate list * they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges * exhaust freed blocks of all chunks before considering their last blocks with unallocated space * start with 0 chunks/blocks and create chunks as needed * allow the last chunk to grow beyond max size * refactor: move adding new free block and new chunk into separate functions * allocate chunks individually with a separate free-blocks list for each one * needs a bit more memory/allocations/indirections, but code is simpler * fix warnings (missing static) & debug checks	2025-09-24 16:17:49 +02:00
Xiangyan Sun	4e29084ba4	ggml-cpu: Respect cpumask settings (#16164 )	2025-09-23 11:58:12 +03:00
Sigbjørn Skjæret	f6b4af3d04	ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928 ) * fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl * change initialization to true	2025-09-23 10:25:20 +02:00
Aaron Teo	264f1b5187	zdnn: refactor codebase + add docs (#16178 ) * zdnn: initial matmul refactor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm static from funcs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update ggml-zdnn.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: change header files to hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch to common.hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move mulmat forward around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm inline from utils Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: add zDNN docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-23 14:53:05 +08:00
Daniel Bevenius	85e72271ba	ggml-cpu : fix typo in gemm comments [no ci] (#16189 )	2025-09-23 05:59:03 +02:00
Sigbjørn Skjæret	3ecb2f671a	ggml : implement set_rows with i32 index (#16159 ) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (#16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-22 19:13:00 +02:00
Georgi Gerganov	4f324a556c	ggml : extend ggml_can_fuse to work with non-sequential nodes (#16123 ) * ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph * cont : fix wrong bounds check condition * cont : remove unnecessary overload	2025-09-22 11:12:37 +03:00
Georgi Gerganov	a71ae3ba7a	ggml : add ggml_op_is_empty (#16122 ) * ggml : add ggml_op_is_empty * ggml : move to ggml-impl.h	2025-09-22 11:12:09 +03:00
Shin-myoung-serp	96fdca043b	Vulkan: add conv_transpose_2d operation (#16022 ) * Vulkan: add conv_transpose_2d operation * Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L) * Vulkan: fix incorrect indentation in conv_transpose_2d shader * Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation * Vulkan: revert the order of the index calculation and bound check in conv_2d shader * Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation. * Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.	2025-09-22 10:04:01 +02:00
Jeff Bolz	a20d810d79	vulkan: add RTE variants of exp shader (#16165 ) This fixes some failures on Turing where "round to zero" rounds to the max f16 value but the CPU reference value is infinite.	2025-09-22 07:37:17 +02:00
Ruben Ortlam	9073a73d82	vulkan: vec dot matrix multiplication fix (#16151 ) * vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching * add odd m/n + odd k test with batching	2025-09-22 07:22:43 +02:00
lhez	51f5a45fbe	opencl: fix concat crash on win arm64 with Adreno (#15944 )	2025-09-21 16:42:10 -07:00
lhez	c4510dc937	opencl: initial `q8_0` mv support (#15732 )	2025-09-21 14:48:44 -07:00
Giuseppe Scrivano	1eeb523c3e	vulkan: optimize UMA buffer operations and fix driver hangs (#16059 ) * vulkan: optimize UMA buffer operations and fix driver hangs The previous implementation was blocking the GPU for extended periods, causing the i915 driver to reset the context due to the hangcheck protection. [32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114] [32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang * vulkan: implement deferred_memset on UMA --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-09-21 08:31:55 +02:00
Jeff Bolz	5bb4a3edec	vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (#16086 )	2025-09-21 08:23:37 +02:00
Gregor Jasny	fa6383ca7e	CUDA : conditionally add cuda architectures (ggml/1341)	2025-09-20 13:02:14 +03:00
Ruben Ortlam	803dac2e48	vulkan: use vec dot for matrix matrix multiplications (#16056 ) * vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions * use fma instead of dot to fix Nvidia and Apple performance issues	2025-09-20 10:42:56 +02:00
Xuan-Son Nguyen	0dd58b6877	ggml : refactor forward_dup for cpu backend (#16062 ) * ggml : refactor forward_dup for cpu backend * clean up a bit * add quant/dequant perf test	2025-09-19 06:31:56 +02:00
Adrien Gallouët	69ffd89163	ggml-amx : fix ggml_amx_init() on generic Linux (#16049 ) Generalize Linux check to `__linux__` to support non-glibc systems (like musl). Also, return `false` on unknown/untested OS. Without this commit, the code compiles (with warnings) but fails: register_backend: registered backend CPU (1 devices) register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C) build: 6487 (`51c4cac6`) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug) system info: n_threads = 8, n_threads_batch = 8, total_threads = 16 .... print_info: n_ctx_orig_yarn = 262144 print_info: rope_finetuned = unknown print_info: model type = 4B Illegal instruction (core dumped) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-18 23:07:26 +02:00
Adrien Gallouët	246c0d9c79	cmake : fix static linking for OpenMP on Unix-like systems (#16031 ) When compiling with GGML_STATIC=ON, the build process would produce a binary that was still dynamically linked to OpenMP. This defeats the purpose of a static build: $ cmake -B build \ -DBUILD_SHARED_LIBS=OFF \ -DLLAMA_CURL=OFF \ -DGGML_CCACHE=OFF \ -DGGML_NATIVE=OFF \ -DGGML_STATIC=ON $ ldd llama-server linux-vdso.so.1 (0x0000e1a434e3b000) libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000) libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000) libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000) libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000) libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000) /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000) This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES` to prioritize `.a` files, forcing CMake to link the static version of the library. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-18 23:07:18 +02:00
Shawn Gu	3edd87cd05	opencl: optimize mxfp4 kernels (#16037 ) - flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut) - MoE kernel optimizations --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2025-09-18 12:03:34 -07:00
Jeff Bolz	c0b45097c3	rename optimize_graph to graph_optimize (#16082 )	2025-09-18 13:46:17 -05:00
Bowen Han	38dbdf4c05	CUDA: Optimize PAD_REFLECT_1D (#15957 ) * CUDA: Optimize PAD_REFLECT_1D feat: add more test cases for PAD_REFLECT_1D * use fast_div to improve performance * Apply suggestion from JohannesGaessler Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Apply suggestion from JohannesGaessler Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * optimize * use a concise expression to further speedup the cuda kernel --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-18 20:26:03 +02:00
Johannes Gäßler	368560a1e3	CUDA: fix compilation on CC 6.0 (#16091 )	2025-09-18 19:28:32 +02:00
Georgi Gerganov	703f9e32c4	metal : use function constants for mul_mv_ext kernels (#16074 ) * metal : use function constants for mul_mv_ext kernels ggml-ci * metal : remove NW template argument ggml-ci * metal : adjust constants ggml-ci	2025-09-18 16:28:41 +03:00
Sigbjørn Skjæret	ad6bd9083b	cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (#16060 )	2025-09-18 13:28:22 +02:00
Georgi Gerganov	b213fce89b	metal : improve F32, F16 and BF16 mat-vec multiplication (#16057 ) * metal : improve F32, F16 and BF16 mat-vec multiplication ggml-ci * metal : make the NSG a function constant in mul_mv kernels ggml-ci	2025-09-18 12:33:45 +03:00
Jhen-Jie Hong	e00f3fd8ff	metal : avoid call free for non-owned buffer (#16067 )	2025-09-18 10:06:48 +03:00
Georgi Gerganov	f2f28380ea	metal : handle nil cv during pipeline creation (#16065 ) ggml-ci	2025-09-18 10:03:24 +03:00
Chenguang Li	62c3b645c5	CANN: Remove print (#16044 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-18 09:26:33 +08:00
Reese Levine	d304f459d8	GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (#16018 ) * Add parameter buffer pool, batching of submissions, refactor command building/submission * Add header for linux builds * Free staged parameter buffers at once * Format with clang-format * Fix thread-safe implementation * Use device implicit synchronization * Update workflow to use custom release * Remove testing branch workflow * some f32 tests passing * Disable set_rows until it's implemented * f32 add all tests passing * Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Add templated addition, clean up code * Get addition and multiplication working * Implement rms_norm * Add get_rows implementation * Add new get_rows files * Refactor use of wg size entry * Fix compilation * Try manually unrolled q4_0 quant * Revert "Try manually unrolled q4_0 quant" This reverts commit `77f8b96515`. * Move to constant max wg size * Check for tensor size in supports_op * Vectorize f32 and change default workgroup size * Move f32 get_rows from < 4 to % 4 != 0 * fix linter errors * Add in-place tests --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>	2025-09-17 13:09:40 -07:00
Georgi Gerganov	0320ac5264	metal : refactor + optimize v2 (#15995 ) * metal : improve naming * metal : refactor device ggml-ci * cont : props ggml-ci * metal : apply ggml_mem_ranges_t ggml-ci * metal : remove GGML_METAL_USE_BF16 ggml-ci * metal : refactor device buffer ggml-ci * cont : fix naming * metal : sync before destroying the backend ggml-ci * metal : refactor context ggml-ci * metal : migrate ggml-metal.m to ggml-metal.cpp ggml-ci * metal : adjust ops API ggml-ci * metal : use C++ to store pipelines ggml-ci * metal : migrate ops to separate functions ggml-ci * metal : add ggml_metal_library_t ggml-ci * metal : improve naming ggml-ci * metal : cleanup ggml-ci * metal : add support for GGML_OP_LOG ggml-ci * metal : fix error handling ggml-ci	2025-09-17 20:38:12 +03:00
Johannes Gäßler	c959b676be	CUDA: fix FA occupancy, optimize tile kernel (#15982 )	2025-09-17 15:32:42 +02:00
Eve	cb5bb6cc05	vulkan: automatically remove unsupported devices (#15976 ) * remove unsupported vulkan devices * make this happen during selection instead * pass by reference	2025-09-17 09:35:37 +02:00
Chenguang Li	d5fabe3682	CANN: Optimize ggml_cann_set_device (#15935 ) * CANN: Fix ggml_cann_set_device to avoid redundant device switches - Added a check to skip aclrtSetDevice if the current device is already set. - Prevents unnecessary context switches while keeping thread/device consistency. * CANN: add device default id	2025-09-17 14:33:08 +08:00
Daniel Bevenius	3913f8730e	ggml : fix padding in timestep embedding kernels (#15932 ) * ggml : remove adding extra dim timestep embedding This commit updates the ggml_timestep_embedding function to no longer add an extra dimension when the specified dimension is odd. The motivation for this change is that this introduces an unnecessary dimension when the dimension is odd, which caused an issue in the kernels which were not expecting this extra dimension and it resulted in uninitialized memory for the second to last dimension. * ggml-cuda : fix padding in timestep embedding kernel This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension. * ggml-metal : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel * ggml-opencl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-sycl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-vulkan : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-cpu : fix padding in timestep embedding function This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension.	2025-09-16 15:25:57 +02:00
Jake Karnes	3d4053f77f	CUDA: fix im2col_3d to respect non-contiguous inputs (views) (#15956 ) * fix im2col_3d to respect non-contiguous inputs (views) The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides. This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged. * use ggml_element_size() for src strides Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-16 00:28:31 +02:00
yael-works	b907255f4b	SYCL: Add COUNT_EQUAL operator support (#15991 ) * SYCL: Add COUNT_EQUAL operator support (rebased on master) * SYCL: remove duplicate op_count_equal definition * tests: remove test_count_equal_typed and use test_count_equal for all cases * tests: keep only I32 case for COUNT_EQUAL as suggested * tests: keep only I32 case for COUNT_EQUAL as requested	2025-09-15 18:51:35 +02:00
Aman Gupta	106220562a	CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (#15926 )	2025-09-15 17:35:11 +08:00
Georgi Gerganov	9dcd200d57	metal : remove memory pools (#15966 ) * metal : remove mem pool usage ggml-ci * metal : remove mem pool implementation ggml-ci * metal : take into account the actual allocated memory of the tensor ggml-ci * cont : use ggml_backend_buft_get_alloc_size ggml-ci * cont : improve, comments ggml-ci * cont : add functions for the extra tensor sizes * metal : add comments ggml-ci * metal : implement .get_alloc_size for the rest of the buffer types ggml-ci * metal : remove ggml_metal_heap ggml-ci	2025-09-14 22:02:32 +03:00
Ruben Ortlam	261e6a20ff	Vulkan: Clean up mul_mm shader (#15987 ) * vulkan: move mul_mm dequantization steps into a separate file and functions * improve mul_mm vector load code * fix debug mode issues and warnings	2025-09-14 16:56:28 +02:00
Georgi Gerganov	a14bd35014	metal : fix kernel requirements (#15983 ) * metal : fix kernel requirements ggml-ci * cont : fix supports_op * cont : fix supports_op for ARGMAX	2025-09-14 15:33:22 +03:00
Aaron Teo	6380d6a3e7	ggml-zdnn: rm user mapped buffers (#15965 ) * ggml-zdnn: rm user mapped buffers Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm dead code Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt to fix missing extra data buffer free Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-14 13:37:03 +08:00
Jeff Bolz	aa0c461efe	vulkan: fix failing dequant shaders (#15862 ) * vulkan: fix failing dequant shaders * add missing const	2025-09-13 17:29:43 +02:00
Jeff Bolz	b9c9c9f789	vulkan: initialize vulkan-hpp to allow using extension function pointers (#15705 ) Use this to query register count for shader compiles on NVIDIA. Currently this is only for performance debug, but it could eventually be used in some heuristics like split_k.	2025-09-13 17:23:30 +02:00
Georgi Gerganov	55758b00ca	metal : refactor kernel loading (#15964 ) * metal : refactor bin kernels loading ggml-ci * metal : refactor rms kernel loading ggml-ci * ci : try to add memory leaks check ggml-ci * ci : try to enable memory leak detection for Mac * cont : seems to be working	2025-09-13 16:24:22 +03:00
Georgi Gerganov	f161463a54	metal : allow ops to run concurrently (#15929 ) * metal : run graphs ops concurrently ggml-ci * cont : add flags for debugging and disabling concurrency ggml-ci * cont : refactor and handle fusing ggml-ci * cont : simplify - no need to use GPU address ggml-ci * cont : prepare mem ranges for reuse + add ggml-metal-common.cpp ggml-ci * cont : avoid redundant keywords in cpp [no ci] * metal : reorder graph for better concurrency ggml-ci * metal : fix race on mem pool buffers ggml-ci * cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE ggml-ci * cont : refactor, optimize, add comments ggml-ci * cont : refactor ggml-metal.m ggml-ci * minor : update logs [no ci]	2025-09-13 13:54:28 +03:00
Georgi Gerganov	84d7b2fca1	metal : fix memory leaks (#15962 ) ggml-ci	2025-09-13 12:45:04 +03:00
Aaron Teo	40be51152d	ggml-zdnn: fix #15414 , activate FP16 and BF16 acceleration and incorrect zTensor free (#15839 )	2025-09-13 02:39:52 +08:00
Ruben Ortlam	304ac5693d	Vulkan iGPU device selection overhaul and PCI ID API support (#15947 ) * vulkan: implement ggml igpu device type, implement pci id support * fix compiler warning * prevent printf overflow warning	2025-09-12 13:24:21 +02:00
Mathieu Baudier	6c88ad8fa7	vulkan: Make device memory check more portable (#15939 )	2025-09-12 09:06:20 +02:00
Neo Zhang Jianyu	704d90c987	Revert "sycl: add usage of enqueue_functions extension (#14244 )" (#15910 ) * Revert "sycl: add usage of enqueue_functions extension (#14244)" This reverts commit `8308f98c7f`. * fix missed revert code, format the code	2025-09-12 09:15:12 +08:00
Diego Devesa	360d6533db	ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797 ) * ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type ggml-backend : add device id to device props llama : only use iGPU devices if there are no GPU devices llama : do not use multiple devices from different backends with the same device id	2025-09-11 22:47:38 +02:00
Johannes Gäßler	0e6ff0046f	CUDA: larger SRAM reads for tile FA, AMD FP16 dot (#15927 ) * CUDA: larger SRAM reads for tile FA, AMD FP16 dot * fix logic for availability of v_dot2_f32_f16	2025-09-11 21:19:58 +02:00
Daniel Bevenius	24a6734daf	ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (#15922 ) This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when the target hardware actually supports them. The motivation for this is to fix ggml CI build failures where the feature detection correctly identifies that i8mm is not supported, adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are still enabled, causing the compiler to attempt to use vmmlaq_s32 intrinsics without i8mm support. Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499	2025-09-11 14:39:12 +01:00


1734 Commits